Cryptographic Hardware and Embedded Systems CHES 2016
Lecture Notes in Computer Science 9813
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7410
Benedikt Gierlichs · Axel Y. Poschmann (Eds.)
Cryptographic Hardware
and Embedded Systems –
CHES 2016
18th International Conference
Santa Barbara, CA, USA, August 17–19, 2016
Proceedings
Editors
Benedikt Gierlichs, KU Leuven, Leuven, Belgium
Axel Y. Poschmann, NXP Semiconductors Germany GmbH, Hamburg, Germany
the review process, and the editing of the final proceedings were greatly simplified by
the software written by Shai Halevi and we thank him for his kind and immediate
support throughout the whole process.
We would also like to thank the General Chairs, Çetin Kaya Koç and Erkay Savaş,
local organizers Sally Vito and Whitney Morris (of UCSB Conference Services), Juan
Manuel Escalante, who designed the CHES 2016 memorabilia, and the webmaster,
Thomas Eisenbarth. Our thanks also go out to Matt Robshaw and Jonathan Katz, the
Program Chairs of CRYPTO 2016, for the successful collaboration and alignment
of the programs of CHES and CRYPTO. We are very grateful for the financial support
received from our many generous sponsors.
Finally, among the numerous people that contributed to the success of CHES 2016,
above all others are the authors who submitted their research papers to the conference.
Without them, this conference would not exist. We enjoyed chairing the Program
Committee and we hope you will enjoy these proceedings.
General Chairs
Çetin Kaya Koç University of California at Santa Barbara, USA
Erkay Savaş Sabanci University, Turkey
Program Chairs
Benedikt Gierlichs KU Leuven, Belgium
Axel Y. Poschmann NXP Semiconductors, Germany
Program Committee
Josep Balasch KU Leuven, Belgium
Lejla Batina Radboud University, The Netherlands
Daniel J. Bernstein University of Illinois at Chicago, USA and Technische
Universiteit Eindhoven, The Netherlands
Guido Bertoni STMicroelectronics, Italy
Chen-Mou Cheng National Taiwan University, Taiwan
Hermann Drexler Giesecke & Devrient, Germany
Orr Dunkelman University of Haifa, Israel
Junfeng Fan Open Security Research, China
Sebastian Faust Ruhr-Universität Bochum, Germany
Viktor Fischer Jean Monnet University Saint-Etienne, France
Wieland Fischer Infineon Technologies, Germany
Henri Gilbert ANSSI, France
Christophe Giraud Oberthur Technologies, France
Daniel Holcomb University of Massachusetts Amherst, USA
Naofumi Homma Tohoku University, Japan
Michael Hutter Cryptography Research, USA
Kimmo Järvinen Aalto University, Finland
Marc Joye Technicolor, France
Lars R. Knudsen Technical University of Denmark, Denmark
Kerstin Lemke-Rust Bonn-Rhein-Sieg University of Applied Sciences,
Germany
Tancrède Lepoint CryptoExperts, France
External Reviewers
Automotive Security
Invasive Attacks
New Directions
Software Implementations
Cache Attacks
Hardware Implementations
Fault Attacks
1 Introduction
State of the Art of Timing Attacks. Any cryptographic algorithm in an embedded
system is vulnerable to side-channel attacks. Timing attacks on the RSA
Straightforward Method (RSA-SFM) were pioneered by Kocher [12]. The attack
consists of building “templates” whose distributions are compared to that of the
response. The cryptographic parameters must be known, since the attack
is profiled.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 3–22, 2016.
DOI: 10.1007/978-3-662-53140-2 1
4 M. Dugardin et al.
Schindler [16] extended timing attacks to RSA with Chinese Remainder The-
orem (RSA-CRT) using chosen messages. This attack exploits a conditional
extra-reduction at the end of modular multiplications. Schindler and co-authors
carried out numerous improvements [1, 2, 17–20] in the case where the exponen-
tiation uses windows or exponent randomization.
Walter and Thompson [21] remarked that even when data is blinded, the
distribution of extra-reductions is different for a square and for a multiply. They
assumed that side-channel measurements such as power or timing during expo-
nentiation are sufficiently clean to detect the presence or absence of an extra-
reduction at each individual operation. Schindler [17] improved this attack by
also distinguishing multiplications by a constant from squarings and multiplica-
tions by non-fixed parameters.
Contributions of This Paper. We show that despite message blinding and regular
exponentiation, it is still possible for an attacker to take advantage of extra-
reductions: a new bias is found, namely a strong negative correlation between
the extra-reductions of two consecutive operations. As shown in this paper, this
bias can easily be leveraged to recover which registers are written to (at line 5 of
Algorithm 1 or at lines 4 and 5 of Algorithm 2), which eventually allows the
secret key to be retrieved. The advantages of this method are the following:
– messages are unknown; this captures general situations such as RSA with
OAEP or PSS padding and RSA input blinding [11, Sect. 10];
– RSA parameters can be unknown; hence RSA-CRT is also vulnerable;
Correlated Extra-Reductions Defeat Blinded Regular Exponentiation 5
– all binary exponentiation algorithms are vulnerable, even the regular ones like
Square and Multiply Always, Montgomery Ladder, etc.;
– our attack can also be applied to Elliptic Curve Cryptography (ECC).
From a mathematical viewpoint, we also provide a comprehensive framework for
studying the joint probabilities of extra-reductions in a sequence of multiplies
and squares.
Outline. The rest of the paper is organized as follows. Section 2 recalls known
biases induced by extra-reductions in modular multiplication algorithms such
as the Montgomery modular multiplication. Our contribution starts in Sect. 3,
where the theoretical rationale for the strong negative correlation between extra-
reductions of two chained operations is presented. Section 4 shows how this bias
can be turned into a key recovery attack. Experimental validations for synthetic
and practical traces are in Sect. 5. Section 6 concludes.
Fig. 1. Comparison of the output value of the multiplication with the input of
the following square in the Square-and-Multiply-Always exponentiation algorithm
(Algorithm 1).
For the ML algorithm (Algorithm 2), the Mi and Si−1 operations depend
directly on the two consecutive key bit values ki and ki−1 . If the bit value ki−1
differs from the previous bit value ki , then the output of the multiplication
Mi and the input of the square Si−1 are equal and yield strongly correlated extra-
reductions; in the opposite case they yield uncorrelated extra-reductions.
The guess value Gi is linked to the key value depending on the regular expo-
nentiation algorithm. For SMA and for a bit ki , an attacker is able to estimate
the probabilities P̂(XMi , XSi−1 ). This probability can be used to find the bit ki
as illustrated in Fig. 1 and explained in Sect. 4 below. For ML, Gi depends on
two consecutive key bits as explained also in Sect. 4.
We have estimated the joint probabilities P(XMi , XSi−1 |Gi ) using 1,000,000
random values for both the SMA and ML algorithms, with the example modulus
RSA-1024-p defined in [8, Sect. 2.2], for which the ratio p/R ≈ 0.800907.
The values of the obtained probabilities are shown in Table 1.
It is important to notice that for each (xMi , xSi−1 ) ∈ {0, 1}², the condi-
tional joint probabilities are distinct: P(XMi = xMi , XSi−1 = xSi−1 |Gi = F ) ≠
P(XMi = xMi , XSi−1 = xSi−1 |Gi = T ). Also, for Gi = F in ML, it can be observed
that P(XMi = 1, XSi−1 = 1 |Gi ) = p/(4R) × p/(3R) = P(XMi = 1) × P(XSi−1 = 1),
which is consistent with the fact that the two operations XMi and XSi−1 should
be independent since they are completely unrelated.
It should be emphasized that the leakage identified in Table 1 is fairly large,
since the Pearson correlations ρ of the two random variables are²:
To the best of our knowledge, such correlations have not been observed previ-
ously. A few observations are in order:
– when a square follows a multiply, and if there has been an extra-reduction
in the multiplication, the result should be short, hence there is less chance
for an extra-reduction to occur in the following square. This accounts for the
negative correlation ρ(XMi , XSi−1 |Gi = T );
– from Fig. 1, at iteration i = l − 2 where ki = 0, we can see that one input of
the multiplication Mi equals the input of the following squaring Si−1 . Since the
square and the multiplication share a common operand, provided it is sufficiently
large, both operations are likely to have an extra-reduction at the same time,
which accounts for the positive correlation ρ(XMi , XSi−1 |Gi = F ) for SMA;
² ρ(XMi , XSi−1 ) = Cov(XMi , XSi−1 ) / (σXMi σXSi−1 ) = [P(XMi = 1, XSi−1 = 1) − P(XMi = 1) × P(XSi−1 = 1)] / √[P(XMi = 1)(1 − P(XMi = 1)) · P(XSi−1 = 1)(1 − P(XSi−1 = 1))].
In order to estimate the probability P(XMi , XSi−1 |Gi ), we first determine the
distribution of the output value after one MMM (following the method described
by Sato et al. [15]) and then compute the joint probability for each case.
Let A, B be two independent random variables uniformly distributed in [0, p[
(represented in Montgomery form); let C be the MMM product of A and B, and
let U be the MMM product of A and B before the eXtra-reduction (if any).
The variables C and U coincide with those of Algorithm 3. As a matter
of fact, an attacker cannot observe these values, only the extra-reductions which
occur during the Montgomery reduction (at line 4 of Algorithm 3). We use the
notation P for probabilities and f for probability density functions (p.d.f.’s).
Figure 2 shows histograms for C and U obtained from one million simulations;
the binning consists of 100 bins over the interval [0, 2p[.
Fig. 2. Distribution of the output value of Montgomery multiplication (left) and square
(right) for RSA-1024-p.
The corresponding p.d.f. for the square is also in four pieces with the same inter-
vals for u, and differs from that of the multiplication only in that it is equal to
√(Ru)/p² when 0 ≤ u ≤ p²/R, and to 1/p − √(R(u − p))/p² when p ≤ u ≤ p + p²/R.
[Closed-form joint probabilities P(XMi , XSi−1 ) in terms of the ratio p/R; the table entries (involving terms such as p/(4R), p/(3R) and fractions in p/(48R)) are garbled in this extraction — see [8] for the exact expressions.]
Proof of this theorem is given in [8].
When Gi = F in SMA:
When Gi = F in ML:
[Fig. 3: correlation ρ(XMi , XSi−1 ) as a function of the ratio p/R (1/2 ≤ p/R ≤ 1), plotted for RSA-1024-q, RSA-1024-p and P-256; the curves range from about +0.1 down to −0.3.]
When the guess is correct, ρ(XMi , XSi−1 |Gi = T ) is negative, and increasingly
negative as p/R increases; for 1/2 ≤ p/R ≤ 1,

−3/(4√6) ≈ −0.306 ≤ ρ(XMi , XSi−1 |Gi = T ) ≤ −(3/16)√(5/7) ≈ −0.158.
When the guess is incorrect, either the correlation is null (in the case of ML), or
it is positive and increasing with p/R, where for 1/2 ≤ p/R ≤ 1,

1/(2√35) ≈ 0.085 ≤ ρ(XMi , XSi−1 |Gi = F ) ≤ 1/(2√6) ≈ 0.204.
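The four numerical constants can be rechecked directly. The closed-form expressions below are my reconstruction from the printed approximate values (−0.306, −0.158, 0.085, 0.204), so treat them as a sketch:

```python
from math import sqrt

lo_T = -3 / (4 * sqrt(6))        # rho(X_Mi, X_S(i-1) | G=T) at p/R = 1
hi_T = -(3 / 16) * sqrt(5 / 7)   # rho(X_Mi, X_S(i-1) | G=T) at p/R = 1/2
lo_F = 1 / (2 * sqrt(35))        # rho(X_Mi, X_S(i-1) | G=F) at p/R = 1/2
hi_F = 1 / (2 * sqrt(6))         # rho(X_Mi, X_S(i-1) | G=F) at p/R = 1
print(round(lo_T, 3), round(hi_T, 3), round(lo_F, 3), round(hi_F, 3))
# → -0.306 -0.158 0.085 0.204
```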
The variations of the correlation coefficients between XMi and XSi−1 in the three
scenarios of Corollary 1 are plotted in Fig. 3.
Figure 3 shows that the correlation difference between guesses True/False
is greater for the SMA algorithm than for the ML algorithm. Thus our attack
on SMA should outperform that on ML. Also notice that the larger the ratio
p/R, the larger the correlation difference; hence, we expect P-256 to be easier
to break than brainpoolP256r1 with our attack.
The attacker then computes the Pearson correlation⁴ ρ̂(XMi , XSi−1 ) using the
estimated probabilities P̂(XMi = xMi , XSi−1 = xSi−1 ) for each pair
(xMi , xSi−1 ) ∈ {0, 1}². Finally,
⁴ ρ̂(XMi , XSi−1 ) = Ĉov(XMi , XSi−1 ) / (σ̂XMi σ̂XSi−1 ) = [P̂(XMi = 1, XSi−1 = 1) − P̂(XMi = 1) × P̂(XSi−1 = 1)] / √[P̂(XMi = 1)(1 − P̂(XMi = 1)) · P̂(XSi−1 = 1)(1 − P̂(XSi−1 = 1))].
she estimates the exponent bit ki using her knowledge of the threshold
T and the decision function FALG .
– In RSA-SFM and ECC, the attacker knows the parameters p and R defined
in Sect. 2.1. In RSA-SFM, p is equal to the public modulus nRSA . In ECC,
p equals the characteristic of the finite field over which the elliptic curve is
defined. The attacker can compute the Pearson correlations ρ(XMi , XSi−1 |Gi =
T ) and ρ(XMi , XSi−1 |Gi = F ) using Corollary 1. The threshold for a success-
ful attack is defined by:
In fact, the threshold value T computed in (7) or (9) does not depend on i. The
index i was kept as a reminder that the multiplication Mi is done
in the iteration which precedes that of the square Si−1 .
⁵ Notice that in some cases, e.g., if the key bits happen not to be balanced,
Êi [P̂(XMi , XSi−1 )] can be estimated in a less biased way using maxi {P̂(XMi , XSi−1 )} −
mini {P̂(XMi , XSi−1 )}.
– In the SMA algorithm, the scalar bit ki decides whether the output of Mi
is the input of Si−1 or not (see Fig. 1). If the bit value ki equals 1, then the
square Si−1 depends on Mi (Gi = T ); otherwise the output value of Mi
differs from the input value of Si−1 (Gi = F ). Using Sect. 3, we see
that ρ(XMi , XSi−1 |Gi = T ) < ρ(XMi , XSi−1 |Gi = F ), so the decision function
FSMA is defined by:

   k̂i = FSMA (ρ̂, T ) = 0 if ρ̂(XMi , XSi−1 ) ≥ T , and 1 otherwise.   (10)
– For the Montgomery Ladder (ML) algorithm, the Mi and Si−1 operations do
not depend directly on the key bit value ki . The dependence comes from the
bit value ki−1 and the previous bit value ki . If the two bit values ki−1 and
ki are different, then the output of the multiplication Mi and the input of the
square Si−1 are equal (Gi = T ); otherwise this output/input pair differs (Gi = F ).
Using Sect. 3, we see that ρ(XMi , XSi−1 |Gi = T ) < ρ(XMi , XSi−1 |Gi = F ), so
the decision function FML , using the previously estimated bit k̂i−1 , is defined
for each i (l − 1 > i ≥ 1) by:

   k̂i = FML (k̂i−1 , ρ̂, T ) = k̂i−1 if ρ̂(XMi , XSi−1 ) ≥ T , and ¬k̂i−1 otherwise.   (11)
Regarding the second most significant bit kl−1 of the exponent, either both
values kl−1 = 0 and kl−1 = 1 are tested to recover the full secret key, or
our attack can be applied between the first square FS (defined at line 2 of
Algorithm 2) and the square Sl−1 (line 5 of Algorithm 2).
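Decision functions (10) and (11) are one-line threshold rules. A minimal sketch of mine (function names are illustrative, and the threshold value −0.06 is only indicative, of the order of the one estimated in the simulations of Sect. 5):

```python
def f_sma(rho_hat, T):
    """Eq. (10): decide k_i = 0 when the estimated correlation is above the threshold."""
    return 0 if rho_hat >= T else 1

def f_ml(k_prev, rho_hat, T):
    """Eq. (11): keep the previous bit estimate above the threshold, flip it below."""
    return k_prev if rho_hat >= T else 1 - k_prev

T = -0.06  # illustrative threshold between rho(.|G=T) < 0 and rho(.|G=F) > 0
print(f_sma(0.15, T), f_sma(-0.25, T))      # → 0 1
print(f_ml(1, 0.15, T), f_ml(1, -0.25, T))  # → 1 0
```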
Algorithm 4. ρ-attack
Input: (xMi , xSi−1 ), a set of Q pairs of (l − 1) bits
Output: An estimation k̂ ∈ {0, 1}^(l−1) of the secret exponent
1: for i = l − 1 downto 1 do
2:   P̂(XMi , XSi−1 ) ← 0
3:   for q = 1 to Q do
4:     P̂(XMi = x^q_Mi , XSi−1 = x^q_Si−1 ) ← P̂(XMi = x^q_Mi , XSi−1 = x^q_Si−1 ) + 1
5:   end for
6:   P̂(XMi , XSi−1 ) ← P̂(XMi , XSi−1 ) / Q    ▷ Normalization
7:   Compute ρ̂(XMi , XSi−1 ) using P̂(XMi , XSi−1 )
8: end for
9: Compute T depending on the attacker’s knowledge
10: for i = l − 1 downto 1 do
11:   k̂i ← FALG (ρ̂(XMi , XSi−1 ), T )    ▷ Threshold
12: end for
13: return k̂
Summary of the Attack. To estimate the exponent k by k̂, we define two attacks:
– The attack named “ρ-attack-Hard”, knowing the values of P(XMi , XSi−1 |Gi =
T ) and P(XMi , XSi−1 |Gi = F ), using the threshold T computed by (7).
– The attack named “ρ-attack-Soft”, when the theoretical value P(XMi ,
XSi−1 |Gi ) is unknown. It uses the estimated probability P̂(XMi , XSi−1 ) to
compute the threshold T by (9).
Algorithm 4 describes the attack to recover a full key. Lines 1–8 correspond
to the computation of the estimated probabilities for each bit ki defined by (6).
Line 9 is the computation of the threshold: if the attack is ρ-attack-Hard, the
attacker uses (7); otherwise the attack is ρ-attack-Soft and she uses (9).
Lines 10–12 compute the full estimated key using the decision function FALG
defined by Eq. (10) or (11).
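As a sanity check, the whole pipeline — a square-and-multiply-always model that records extra-reduction flags, followed by the ρ-attack — fits in a few dozen lines. This is my simplified model (toy 17-bit prime with p/R ≈ 0.80, threshold fixed at T = 0, which works here since the two correlations have opposite signs), not the authors' implementation:

```python
import random
from math import sqrt

P, RLOG = 104729, 17            # toy prime with p/R ~ 0.80 (stand-in for RSA-1024-p)
R = 1 << RLOG
NP = (-pow(P, -1, R)) % R       # -p^-1 mod R

def mont_mul(a, b):
    """Montgomery product a*b*R^-1 mod p; also returns the eXtra-reduction flag."""
    t = a * b
    m = (t * NP) % R
    u = (t + m * P) >> RLOG
    return (u - P, 1) if u >= P else (u, 0)

def sma_flags(mbar, key_bits):
    """Square-and-Multiply-Always model: record (X_S, X_M) flags per iteration."""
    acc, xs, xm = mbar, [], []
    for bit in key_bits[1:]:                       # MSB processed implicitly
        acc, f = mont_mul(acc, acc); xs.append(f)  # square S_i
        t, f = mont_mul(acc, mbar); xm.append(f)   # multiply M_i (dummy when bit = 0)
        if bit:
            acc = t
    return xs, xm

def rho_attack(traces, T=0.0):
    """Correlate X_Mi with X_S(i-1) over all queries; decide each bit by threshold."""
    n = len(traces[0][0]) - 1                      # last bit has no following square
    bits = []
    for j in range(n):
        xm = [t[1][j] for t in traces]
        xs = [t[0][j + 1] for t in traces]         # the square of the *next* iteration
        q = len(traces)
        p1, q1 = sum(xm) / q, sum(xs) / q
        p11 = sum(a & b for a, b in zip(xm, xs)) / q
        rho = (p11 - p1 * q1) / sqrt(p1 * (1 - p1) * q1 * (1 - q1))
        bits.append(1 if rho < T else 0)           # Eq. (10): rho(.|G=T) < T < rho(.|G=F)
    return bits

random.seed(1)
KEY = [1, 0, 1, 1, 0, 0, 1]                        # toy secret exponent, MSB first
traces = [sma_flags(random.randrange(1, P), KEY) for _ in range(2000)]
print(rho_attack(traces))                          # → [0, 1, 1, 0, 0] = KEY[1:-1]
```

Each query uses a fresh random message, mimicking blinding: the attack needs only the extra-reduction flags, not the message values.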
5 Experimental Results
In the first part of this section, we detail a simulated attack which exploits
the bias (explained in Corollary 1) to determine the number of queries neces-
sary for the success of the attack. Then, we detail the side-channel part (local
timing analysis using power consumption and electromagnetic analysis to dis-
tinguish functional vs dummy subtractions) in order to detect whether an eXtra-
reduction is performed (X = 1) or not (X = 0) during the Montgomery reduction
(Algorithm 3).
5.1 Simulations
Let RSA-1024-p, defined in [8, Sect. 2.2], be the modulus p used in the SMA algo-
rithm (Algorithm 1). We generated one thousand random queries and saved,
for every MMM, whether an extra-reduction is performed or not. The
length of the static key k is 512 bits. As detailed in the ρ-attack (Algorithm 4),
we computed the estimated probabilities P̂(XMi , XSi−1 ) and the estimated
Pearson correlation ρ̂(XMi , XSi−1 ) to retrieve each ki . The estimated threshold
T computed by (9) in our simulation is equal to −0.06076, which is an excel-
lent approximation of the theoretical threshold (7). To retrieve each bit of the
exponent, we used the decision function FSMA described for the ρ-attack on SMA
by (10).
Figure 4 shows the estimated Pearson correlation values ρ̂(XMi , XSi−1 ) for
the first iterations. It can easily be seen that the key value estimated from this
sequence corresponds to (1000111110101110111010011 . . .)₂ = 0x11f5dd3 . . .
Our ρ-attack retrieves the 511 bits of the exponent using 1,000 random queries
with a success rate of 100%.
Success Rate Curves. We implemented ρ-attack-Hard and ρ-attack-Soft in
the ideal case, i.e., without noise. The success rate to recover one bit of the
exponent is represented in Fig. 5, for both SMA and ML cases. Interestingly,
16 M. Dugardin et al.
Fig. 4. Estimated Pearson correlations using 1,000 random queries for RSA-1024-p,
for the first 20 iterations.
Fig. 5. Evolution of the success rate for the ρ-attack-Soft and the ρ-attack-Hard as
a function of the number Q of queries (upper bound is the maximum likelihood), for
RSA-1024-p.
ρ-attack-Hard and ρ-attack-Soft yield the same success rate, which happens to
be (very close to) the optimal value. This optimal value is that obtained with
the maximum likelihood distinguisher derived in [8].
The reason for the hard and soft attacks to have similar success probability is
that the online estimation of the threshold is very good. Indeed, in the example
of Fig. 5, the threshold T (Eq. (9)) is estimated based on 512Q traces, which
is huge (one needs only to estimate 4 probabilities to get the estimation of T ).
So, in the rest of this section, we make no difference between the hard and soft
versions of the attacks from a success rate point of view.
The ρ-attacks are very close to the maximum likelihood attack for a similar
reason. Estimating the difference between two random variables of very little
dimensionality (recall that (XMi , XSi−1 ) lives in {0, 1}²) can be done almost
optimally.
Fig. 6. Evolution of the success rate for the ρ-attack as a function of the number Q
of queries, using p = RSA-1024-p, for four increasing noise values.
1. How can the local timing be exploited to distinguish the eXtra-reduction using
power consumption measurements, on OpenSSL v1.0.1k-3⁶?
2. How can the difference between a real and a dummy final subtraction be
exploited using electromagnetic (EM) emanations, on mbedTLS v2.2.0⁷?
(1a) Experiment Setup in Power. The target is a dual-core LPC43S37 micro-
controller fabricated in a CMOS 90 nm Ultra Low Leakage process, soldered on
an LPCXpresso4337 board and running at its maximum frequency (208 MHz).
The side-channel traces were obtained by measuring the instantaneous power con-
sumption with a PicoScope 6402C featuring 256 MB of memory, 500 MHz
bandwidth and a 5 GS/s sampling rate. We executed the private function of RSA
⁶ Latest stable version at the time of submission.
⁷ Latest version at the time of submission.
(1b) OpenSSL Experiment. In OpenSSL (see Listing 1.1 in Appendix A), the
final subtraction is made when U is greater than p, as described in Algorithm 3.
A simple power analysis using the delay (referred to as “SPA-Timing”) between
two MMM operations reveals whether the extra-reduction is present (X = 1)
or not (X = 0). On the Cortex-M4 core, the delay between Mi and Si−1
when XMi = 1 is 41.4952 μs, whereas the delay when XMi = 0 is 41.1875 μs.
For the square operation Si−1 , the delay is 41.5637 μs when XSi−1 = 1 and
41.2471 μs when XSi−1 = 0. All in all, the observable timing differences are
respectively 308 ns and 317 ns. When OpenSSL is run on the Cortex-M0
core of the LPC43S37, the timing differences are respectively 399 ns and 411 ns.
The success rate of this detection attack is 100 %, hence Pnoise = 0.
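In effect, the SPA-Timing distinguisher is a one-dimensional nearest-mean classifier on the inter-operation delay. A minimal sketch (the two reference delays are the Cortex-M4 multiplication timings quoted above; the classifier itself is my illustration):

```python
def detect_extra_reduction(delay_us, t_without=41.1875, t_with=41.4952):
    """Return X = 1 if the measured delay is closer to the with-extra-reduction timing."""
    return 1 if abs(delay_us - t_with) < abs(delay_us - t_without) else 0

print(detect_extra_reduction(41.48), detect_extra_reduction(41.20))  # → 1 0
```

With a 308 ns separation between the two classes and no measurement noise, the midpoint decision is error-free, which is why the reported success rate is 100 %.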
Fig. 7. Electromagnetic acquisition focus on one real subtraction (left) and pattern of
one dummy subtraction (right) between two consecutive MMM operations.
Table 2. Summary of the number of queries (see Fig. 6(b)) to retrieve all key bits of a
secret exponent, as a function of side-channel detection method and regular exponen-
tiation algorithm.
6 Conclusion
This paper has presented a new theoretical and practical attack against asym-
metric cryptographic computations that use regular exponentiation, exploiting
extra-reductions as a side channel. The enabling factor is the existence of a strong
bias between the extra-reductions of two consecutive Montgomery Modular
Multiplications. This new bias can be exploited in every regular binary algorithm,
because each iteration consists of a square and a multiply whose inputs depend
on the outputs of an operation from the previous iteration.
The new attacks have been detailed on RSA but are also applicable to ECC,
with appropriate customizations for the various ECC implementations. For exam-
ple, in [5], for the addition formula madd-2004-hmv, the Z-coordinate in the output
of the addition is computed by a multiplication Z3 = Z1 × T1, and for the doubling
formula dbl-2007-bl, the Z-coordinate in the input of the doubling is squared:
ZZ = Z1 × Z1.
Acknowledgements. The authors would like to thank the anonymous reviewers for
their useful comments that improved the quality of the paper. The first author would
also like to thank François Dassance, Jean-Christophe Courrège and her colleagues for
the suggestion of the main idea of this paper and their valuable insights.
References
1. Acıiçmez, O., Schindler, W.: A vulnerability in RSA implementations due to
instruction cache analysis and its demonstration on OpenSSL. In: Malkin, T. (ed.)
CT-RSA 2008. LNCS, vol. 4964, pp. 256–273. Springer, Heidelberg (2008)
2. Aciiçmez, O., Schindler, W., Koç, Ç.K.: Improving Brumley and Boneh timing
attack on unprotected SSL implementations. In: Atluri, V., Meadows, C., Juels,
A. (eds.) CCS 2005, pp. 139–146. ACM, New York (2005)
3. Bauer, A., Jaulmes, É., Prouff, E., Reinhard, J.-R., Wild, J.: Horizontal collision
correlation attack on elliptic curves - extended version. Cryptogr. Commun. 7(1),
91–119 (2015)
4. Belgarric, P., Bhasin, S., Bruneau, N., Danger, J.-L., Debande, N., Guilley, S.,
Heuser, A., Najm, Z., Rioul, O.: Time-frequency analysis for second-order attacks.
In: Francillon, A., Rohatgi, P. (eds.) CARDIS 2013. LNCS, vol. 8419, pp. 108–122.
Springer, Heidelberg (2014)
5. Bernstein, D.J., Lange, T.: Explicit formulas database. http://www.hyperelliptic.
org/EFD/
6. Clavier, C., Feix, B., Gagnerot, G., Roussellet, M., Verneuil, V.: Horizontal corre-
lation analysis on exponentiation. In: Soriano, M., Qing, S., López, J. (eds.) ICICS
2010. LNCS, vol. 6476, pp. 46–61. Springer, Heidelberg (2010)
7. Courrège, J.-C., Feix, B., Roussellet, M.: Simple power analysis on exponentiation
revisited. In: Gollmann, D., Lanet, J.-L., Iguchi-Cartigny, J. (eds.) CARDIS 2010.
LNCS, vol. 6035, pp. 65–79. Springer, Heidelberg (2010)
8. Dugardin, M., Guilley, S., Danger, J.-L., Najm, Z., Rioul, O.: Correlated extra-
reductions defeat blinded regular exponentiation - extended version. Cryptology
ePrint Archive, Report 2016/597 (2016). http://eprint.iacr.org/2016/597
9. Fouque, P.-A., Réal, D., Valette, F., Drissi, M.: The carry leakage on the ran-
domized exponent countermeasure. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008.
LNCS, vol. 5154, pp. 198–213. Springer, Heidelberg (2008)
10. Hanley, N., Kim, H.S., Tunstall, M.: Exploiting collisions in addition chain-based
exponentiation algorithms using a single trace. In: Nyberg, K. (ed.) CT-RSA 2015.
LNCS, vol. 9048, pp. 429–446. Springer, Heidelberg (2015)
11. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp.
104–113. Springer, Heidelberg (1996)
12. Kocher, P.C.: On certificate revocation and validation. In: Hirschfeld, R. (ed.) FC
1998. LNCS, vol. 1465, pp. 172–177. Springer, Heidelberg (1998)
13. Menezes, A.J., van Oorschot, P.C., Vanstone, S.A.: Handbook of Applied Cryptog-
raphy. CRC Press, Boca Raton (1996). http://www.cacr.math.uwaterloo.ca/hac/
14. Montgomery, P.L.: Modular multiplication without trial division. Math. Com-
put. 44(170), 519–521 (1985)
15. Sato, H., Schepers, D., Takagi, T.: Exact analysis of Montgomery multiplication.
In: Canteaut, A., Viswanathan, K. (eds.) INDOCRYPT 2004. LNCS, vol. 3348,
pp. 290–304. Springer, Heidelberg (2004)
16. Schindler, W.: A timing attack against RSA with the Chinese Remainder Theorem.
In: Paar, C., Koç, Ç.K. (eds.) CHES 2000. LNCS, vol. 1965, pp. 109–124. Springer,
Heidelberg (2000)
17. Schindler, W.: A combined timing and power attack. In: Naccache, D., Paillier, P.
(eds.) PKC 2002. LNCS, vol. 2274, pp. 263–279. Springer, Heidelberg (2002)
18. Schindler, W.: Exclusive exponent blinding may not suffice to prevent timing
attacks on RSA. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015. LNCS, vol.
9293, pp. 229–247. Springer, Heidelberg (2015)
19. Schindler, W., Koeune, F., Quisquater, J.-J.: Improving divide and conquer attacks
against cryptosystems by better error detection/correction strategies. In: Honary,
B. (ed.) Cryptography and Coding 2001. LNCS, vol. 2260, pp. 245–267. Springer,
Heidelberg (2001)
20. Schindler, W., Walter, C.D.: More detail for a combined timing and power attack
against implementations of RSA. In: Paterson, K.G. (ed.) Cryptography and Cod-
ing 2003. LNCS, vol. 2898, pp. 245–263. Springer, Heidelberg (2003)
21. Walter, C.D., Thompson, S.: Distinguishing exponent digits by observing modular
subtractions. In: Naccache, D. (ed.) CT-RSA 2001. LNCS, vol. 2020, pp. 192–207.
Springer, Heidelberg (2001)
22. Whitnall, C., Oswald, E.: A comprehensive evaluation of mutual information analy-
sis using a fair evaluation framework. In: Rogaway, P. (ed.) CRYPTO 2011. LNCS,
vol. 6841, pp. 316–334. Springer, Heidelberg (2011)
23. Whitnall, C., Oswald, E., Standaert, F.-X.: The myth of generic DPA… and the
magic of learning. In: Benaloh, J. (ed.) CT-RSA 2014. LNCS, vol. 8366, pp. 183–
205. Springer, Heidelberg (2014)
24. Witteman, M.F., van Woudenberg, J.G.J., Menarini, F.: Defeating RSA multiply-
always and message blinding countermeasures. In: Kiayias, A. (ed.) CT-RSA 2011.
LNCS, vol. 6558, pp. 77–88. Springer, Heidelberg (2011)
Horizontal Side-Channel Attacks
and Countermeasures
on the ISW Masking Scheme
1 Introduction
Side-channel analysis is a class of cryptanalytic attacks that exploit the physical
environment of a cryptosystem to recover some leakage about its secrets. To
secure implementations against this threat, security developers usually apply
techniques inspired from secret sharing [Bla79,Sha79] or multi-party computation
[CCD88]. The idea is to randomly split a secret into several shares such that the
adversary needs all of them to reconstruct the secret. For these schemes, the
E. Prouff—Part of this work has been done at Safran Identity and Security, and
while the author was at ANSSI, France.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 23–39, 2016.
DOI: 10.1007/978-3-662-53140-2 2
24 A. Battistello et al.
number of shares n in which the key-dependent data are split plays the role of
a security parameter.
A common countermeasure against side-channel attacks consists of using
the masking scheme originally introduced by Ishai, Sahai and Wagner (ISW)
[ISW03]. The countermeasure achieves provable security in the so-called probing
security model [ISW03], in which the adversary can recover a limited number of
intermediate variables of the computation. This model has been argued to be
practically relevant to address so-called higher-order side-channel attacks and it
has been the basis of several efficient schemes to protect block ciphers.
More recently, it has been shown in [DDF14] that the probing security of an
implementation actually implies its security in the more realistic noisy leakage
model introduced in [PR13]. More precisely, if an implementation obtained by
applying the compiler in [ISW03] is secure at order n in the probing model,
then [DFS15, Theorem 3] shows that the success probability of distinguishing
the correct key among |K| candidates is bounded above by |K| · 2^(−n/9) if the
leakage Li on each intermediate variable Xi satisfies:
where I(·; ·) denotes the mutual information and where the index i ranges from
1 to the total number of intermediate variables.
In this paper we investigate what happens when the above condition is not
satisfied. Since the above mutual information I(Xi ; Li ) can be approximated
by k/(8σ²) in the Hamming weight model in F2k , where σ is the noise in the
measurement (see the full version of this paper [BCPZ16]), this amounts to
investigating the security of Ishai-Sahai-Wagner’s (ISW) implementations when
the number of shares n satisfies:
n > c · σ²
As already observed in previous works [VGS14,CFG+10], the fact that the same
share (or more generally several data depending on the same sensitive value) is
manipulated several times may open the door to new attacks which are not
taken into account in the probing model. Those attacks, sometimes called hori-
zontal [CFG+10] or (template) algebraic [ORSW12,VGS14], exploit the algebraic
dependency between several intermediate results to discriminate key hypotheses.
In this paper, we exhibit two (horizontal) side channel attacks against the
ISW multiplication algorithm. These attacks show that the use of this algorithm
(and its extension proposed by Rivain and Prouff in [RP10]) may introduce a
weakness with respect to horizontal side channel attacks if the sharing order n
is such that n > c · σ², where σ is the measurement noise. While the first attack
is too costly (even for low noise contexts) to make it applicable in practice, the
second attack, which essentially iterates the first one until a satisfying
likelihood is achieved, performs very well. For instance, when the leakages are
simulated by noisy Hamming weights computed over F28 with σ = 1, it recovers
all the shares of a 21-sharing. We also confirm the practicality of our attack with
a real life experiment on a development platform embedding the ATMega328
Horizontal Side-Channel Attacks and Countermeasures 25
processor (see the full version of this paper [BCPZ16]). Actually, in this context,
where the leakages are multivariate and not univariate as in our theoretical
analyses and simulations, the attack turns out to be more efficient than expected
and recovers all the shares of an n-sharing for n ≤ 40.
Eventually, we describe a variant of Rivain-Prouff’s multiplication that is still
provably secure in the original ISW model, and also heuristically secure against
our new attacks. Our new countermeasure is similar to the countermeasure in
[FRR+10], in that it can be divided into two steps: a “matrix” step, in which,
starting from the input shares xi and yj, one obtains a matrix xi · yj with n² elements,
and a “compression” step, in which one uses some randomness to get back to an
n-sharing ci. Assuming a leak-free component, the countermeasure in [FRR+10]
is proven secure in the noisy leakage model, in which the leakage function reveals
all the bits of the internal state of the circuit, perturbed by independent
binomial noise. Our countermeasure does not use any leak-free component, but is
only heuristically secure in the noisy leakage model (see Sect. 8.2 for our security
analysis).
2 Preliminaries
For two positive integers n and d, an (n, d)-sharing of a variable x defined over
some finite field F2k is a random vector (x1 , x2 , . . . , xn ) over F2k such that
x = x1 + x2 + · · · + xn holds (completeness equality) and any tuple of d − 1 shares
xi is a uniform random vector over (F2k)^(d−1). If n = d, the terminology
simplifies to n-sharing.
An algorithm with domain (F2k )n is said to be (n − 1)th-order secure in the
probing model if on input an n-sharing (x1 , x2 , . . . , xn ) of some variable x, it
admits no tuple of n − 1 or fewer intermediate variables that depends on x.
We refer to the full version of this paper [BCPZ16] for the definitions of Signal
to Noise Ratio (SNR), Gaussian distribution, entropy and differential entropy.
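To make the sharing notion concrete, here is a minimal Python sketch (the function name `share` is ours, not from the paper) that builds an n-sharing of a value over F_2^k by drawing n − 1 uniform shares and completing with the XOR of all of them:

```python
import random

def share(x, n, k=8, rand=random):
    # Random n-sharing of x over F_{2^k}: n-1 uniform shares, plus one
    # completion share so that the XOR of all shares equals x.
    xs = [rand.randrange(2 ** k) for _ in range(n - 1)]
    last = x
    for s in xs:
        last ^= s
    return xs + [last]
```

By construction any n − 1 of the shares are uniformly distributed, while XORing all n shares recovers x (the completeness equality).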
Algorithm 1. SecMult
Input: the n-sharings (xi )i∈[1..n] and (yj )j∈[1..n] of x and y respectively
Output: the n-sharing (ci )i∈[1..n] of x · y
1: for i = 1 to n do
2: for j = i + 1 to n do
3: ri,j ←$ F2k
4: rj,i ← (ri,j + xi · yj ) + xj · yi
5: end for
6: end for
7: for i = 1 to n do
8: ci ← xi · yi
9: for j = 1 to n, j ≠ i do ci ← ci + ri,j
10: end for
11: return (c1 , c2 , . . . , cn )
2^(2kn) · px,y = Σ_{x1,...,xn ∈ F2k, x = x1+···+xn} Σ_{y1,...,yn ∈ F2k, y = y1+···+yn} Π_{i,j=1}^{n} fLi|Xi(ℓi , xi) · fLj|Yj(ℓj , yj) · fLij|XiYj(ℓij , xi · yj).
Unfortunately, even if the equation above shows how to deduce the pdfs
fL|(X,Y)(·, (x, y)) from characterizations of the shares’ manipulations, a
direct processing of the probability has complexity O(2^(2nk)). By representing the
sum over the xi’s as a sequence of convolution products, and thanks to Walsh
transform processing, the complexity can easily be reduced to O(n · 2^(n(k+1))).
However, the latter complexity remains too high, even for small values of n and k,
which led us to look at alternatives to this attack.
1 In (1)–(3), it is assumed that the observations (ℓij)j∈[1..n] and (ℓij)i∈[1..n] are
averaged to build a single observation with noise divided by √n. This assumption is
not made here in order to stay as general as possible.
and in choosing the candidate x̂i which maximizes the probability. We refer to
the full version of this paper [BCPZ16] for the derivation of each score fL|Xi(ℓ, x̂i)
in (7); we obtain:
f(Lj,Lij)|Xi((ℓj , ℓij), x̂i) = Σ_{y∈F2k} f(Lj,Lij)|(Xi,Yj)((ℓj , ℓij), (x̂i , y)) · pYj(y) ,   (8)
and similarly:
f(Li,Lij)|Yj((ℓi , ℓij), ŷj) = Σ_{x∈F2k} f(Li,Lij)|(Xi,Yj)((ℓi , ℓij), (x, ŷj)) · pXi(x) .   (9)
Table 1. First attack: number of shares n as a function of the noise σ to succeed with
probability > 0.5

σ (SNR) | 0 (+∞) | 0.2 (25) | 0.4 (6.25) | 0.6 (2.77) | 0.8 (1.56) | 1 (1)
n       | 12     | 14       | 30         | 73         | 160        | 284
being replaced by the likelihood probability new-pYj (y) which has been previ-
ously computed. The scheme is afterwards repeated until the maximum taken
by the pdfs of each share Xi and Yj is greater than some threshold β. In order
to have better results, we perform the whole attack a second time, by starting
with the computation of the likelihood probability for each hypothesis Yj = y
instead of starting by Xi = x.
We give the formal description of the attack in Algorithm 2 (to obtain
the complete attack, one should perform the while loop a second time,
starting instead with the computation of new-pYj(y) rather than
new-pXi(x)).
σ            | 0        | 0.2         | 0.4          | 0.6          | 0.8          | 1
(SNR4, SNR8) | (+∞, +∞) | (25, 17.67) | (6.25, 4.41) | (2.77, 1.96) | (1.56, 1.10) | (1, 0.7071)
n (for F2^4) | 2        | 2           | 3            | 6            | 13           | 25
n (for F2^8) | 5        | 6           | 8            | 11           | 16           | 21
7 Practical Results
In the full version of this paper [BCPZ16], we describe the results of practical
experiments of our attack against a development platform embedding the
ATMega328 processor.
1: for i = 1 to n do
2: for x ∈ F2k do # Initialize the likelihood of each candidate for Xi
3: pXi(x) = fLi|Xi(ℓi , x)
4: end for
5: for y ∈ F2k do # Initialize the likelihood of each candidate for Yi
6: pYi(y) = fLi|Yi(ℓi , y)
7: new-pYi(y) = pYi(y)
8: end for
9: end for
Algorithm 3. RefSecMult
Input: n-sharings (xi )i∈[1..n] and (yj )j∈[1..n] of x and y respectively
Output: an n-sharing (ci )i∈[1..n] of x · y
1: Mij ← MatMult((x1 , . . . , xn ), (y1 , . . . , yn ))
2: for i = 1 to n do
3: for j = i + 1 to n do
4: ri,j ←$ F2k
5: rj,i ← (ri,j + Mij ) + Mji
6: end for
7: end for
8: for i = 1 to n do
9: ci ← Mii
10: for j = 1 to n, j ≠ i do ci ← ci + ri,j
11: end for
12: return (c1 , c2 , . . . , cn )
probing model, and heuristically secure against the horizontal side-channel attacks
described in the previous sections.
As observed in [FRR+10], the ISW and Rivain-Prouff countermeasures can
be divided into two steps: a “matrix” step, in which, starting from the input shares
xi and yj, one obtains a matrix xi · yj with n² elements, and a “compression”
step, in which one uses some randomness to get back to an n-sharing ci. Namely,
the matrix elements (xi · yj)1≤i,j≤n form an n²-sharing of x · y :
x · y = (Σⁿᵢ₌₁ xi) · (Σⁿⱼ₌₁ yj) = Σ_{1≤i,j≤n} xi · yj   (10)
and the goal of the compression step is to securely go from such an n²-sharing of
x · y to an n-sharing of x · y .
Our new countermeasure (Algorithm 3) uses the same compression step as
Rivain-Prouff, but with a different matrix step, called MatMult (Algorithm 4),
so that the shares xi and yj are not used multiple times (as they are when
computing the matrix elements xi · yj in Rivain-Prouff). Eventually, the MatMult
algorithm outputs a matrix (Mij)1≤i,j≤n which is still an n²-sharing of x · y , as in (10);
therefore, using the same compression step as Rivain-Prouff, Algorithm 3 outputs
an n-sharing of x · y , as required.
As illustrated in Fig. 1, the MatMult algorithm is recursive and computes the
n × n matrix in four sub-matrix blocks. This is done by splitting the input shares xi
and yj in two parts, namely X (1) = (x1 , . . . , xn/2 ) and X (2) = (xn/2+1 , . . . , xn ),
and similarly Y (1) = (y1 , . . . , yn/2 ) and Y (2) = (yn/2+1 , . . . , yn ), and recursively
processing the four sub-matrix blocks corresponding to X (u) × Y (v) for 1 ≤ u, v ≤
2. To prevent the same share xi from being used twice, each input block X (u)
and Y (v) is refreshed before being used a second time, using a mask refreshing
algorithm. An example of such a mask refreshing, hereafter called RefreshMasks,
can for instance be found in [DDF14]; see Algorithm 5. Since the mask refreshing
Algorithm 4. MatMult
Input: the n-sharings (xi )i∈[1..n] and (yj )j∈[1..n] of x and y respectively
Output: the n2 -sharing (Mij )i∈[1..n],j∈[1..n] of x · y
1: if n = 1 then
2: M ← [x1 · y1 ]
3: else
4: X (1) ← (x1 , . . . , xn/2 ), X (2) ← (xn/2+1 , . . . , xn )
5: Y (1) ← (y1 , . . . , yn/2 ), Y (2) ← (yn/2+1 , . . . , yn )
6: M (1,1) ← MatMult(X (1) , Y (1) )
7: X (1) ← RefreshMasks(X (1) ), Y (1) ← RefreshMasks(Y (1) )
8: M (1,2) ← MatMult(X (1) , Y (2) )
9: M (2,1) ← MatMult(X (2) , Y (1) )
10: X (2) ← RefreshMasks(X (2) ), Y (2) ← RefreshMasks(Y (2) )
11: M (2,2) ← MatMult(X (2) , Y (2) )
12: M ← [ M (1,1) M (1,2) ; M (2,1) M (2,2) ]
13: end if
14: return M
does not modify the xor of the input n/2-vectors X (u) and Y (v) , each sub-matrix
block M (u,v) is still an n²/4-sharing of (⊕X (u) ) · (⊕Y (v) ), and therefore the
output matrix M is still an n²-sharing of x · y , as required. Note that without
the RefreshMasks, we would have Mij = xi · yj as in Rivain-Prouff.
Algorithm 5. RefreshMasks
Input: a1 , . . . , an
Output: c1 , . . . , cn such that Σⁿᵢ₌₁ ci = Σⁿᵢ₌₁ ai
1: for i = 1 to n do ci ← ai
2: for i = 1 to n do
3: for j = i + 1 to n do
4: r ←$ {0, 1}k
5: ci ← ci + r
6: cj ← cj + r
7: end for
8: end for
9: return c1 , . . . , cn
Since the RefreshMasks algorithm has complexity O(n²), it is easy to see that
the complexity of our RefSecMult algorithm is O(n² log n) (instead of O(n²)
for the original Rivain-Prouff countermeasure in Algorithm 1). Therefore, for
a circuit of size |C|, the complexity is O(|C| · n² log n), instead of O(|C| · n²)
for Rivain-Prouff. The following lemma shows the soundness of our RefSecMult
countermeasure.
[Fig. 1 depicts the n × n matrix over the shares x1 , . . . , xn and y1 , . . . , yn split into four n/2 × n/2 blocks, with RefreshMasks (R) applied to the halves between the four recursive MatMult calls (⊗).]
Fig. 1. The recursive MatMult algorithm, where R represents the RefreshMasks Algo-
rithm, and ⊗ represents a recursive call to the MatMult algorithm.
Proven Security in the ISW Probing Model. We prove that our RefSecMult
algorithm achieves at least the same level of security as Rivain-Prouff, namely it
is secure in the ISW probing model against t probes for n ≥ t + 1 shares. For this
we use the refined security model against probing attacks recently introduced
in [BBD+15], called t-SNI security. This stronger definition of t-SNI security
makes it possible to prove that a gadget can be used in a full construction with n ≥
t + 1 shares, instead of n ≥ 2t + 1 for the weaker definition of t-NI security
(corresponding to the original ISW security proof). The authors of [BBD+15]
show that the ISW (and Rivain-Prouff) multiplication gadget does satisfy this
stronger t-SNI security definition. They also show that, with some additional
mask refreshing satisfying the t-SNI property (such as RefreshMasks), the Rivain-
Prouff countermeasure for the full AES can be made secure with n ≥ t + 1 shares.
The following lemma shows that our RefSecMult countermeasure achieves
the t-SNI property; we provide the proof in Appendix A. The proof is essentially
the same as in [BBD+15] for the Rivain-Prouff countermeasure; namely the
compression step is the same, and for the matrix step, in the simulation we can
assume that all the randoms in RefreshMasks are given to the adversary. The
t-SNI security implies that our RefSecMult algorithm is also composable, with
n ≥ t + 1 shares.
Lemma 2 (t-SNI of RefSecMult). Let (xi )1≤i≤n and (yi )1≤i≤n be the input
shares of the RefSecMult operation, and let (ci )1≤i≤n be the output shares. For any
set of t1 intermediate variables and any subset |O| ≤ t2 of output shares such
that t1 + t2 < n, there exist two subsets I and J of indices with |I| ≤ t1 and
|J| ≤ t1 , such that those t1 intermediate variables as well as the output shares
c|O can be perfectly simulated from x|I and y|J .
A Proof of Lemma 2
Our proof is essentially the same as in [BBD+15]. We construct two sets I and
J corresponding to the input shares of x and y respectively. We denote by
Mij the result of the subroutine MatMult((x1 , . . . , xn ), (y1 , . . . , yn )). From the
definition of MatMult and RefreshMasks, it is easy to see that each Mij can be
perfectly simulated from xi and yj ; more generally any internal variable within
MatMult can be perfectly simulated from xi and/or yj for some i and j; for this
it suffices to simulate the randoms in RefreshMasks exactly as they are generated
in RefreshMasks.
We divide the internal probes into four groups. The four groups are processed
separately and sequentially; that is, we start with all probes in Group 1 and
finish with all probes in Group 4.
We have |I| ≤ t1 and |J| ≤ t1 , since for every probe we add at most one index
to I and J. The simulation of probed variables in Groups 1 and 4 is
straightforward. Note that for i < j, the variable rij is used in all partial sums cik for
k ≥ j; moreover rij is used in rij ⊕ Mij , which is used in rji , which is used in all
partial sums cjk for k ≥ i. Therefore if i ∉ U , then rij is not probed and does
not enter into the computation of any probed cik ; symmetrically, if j ∉ U , then rji
is not probed and does not enter into the computation of any probed cjk .
For any pair i < j, we can now distinguish four cases:
• Case 1: i, j ∈ U . In that case, we can perfectly simulate all variables rij ,
Mij , Mij ⊕ rij , Mji and rji . In particular, we let rij ←$ F2k , as in the real
circuit.
• Case 2: i ∈ U and j ∉ U . In that case we simulate rij ←$ F2k , as in the real
circuit. If Mij ⊕ rij is probed (Group 3), we can also simulate it since i ∈ U
and j ∈ J by definition of the processing of Group 3 variables.
• Case 3: i ∉ U and j ∈ U . In that case rij has not been probed, nor any variable
cik , since otherwise i ∈ U . Therefore rij is not used in the computation of any
probed variable (except rji , and possibly Mij ⊕ rij ). Therefore we can simulate
rji ←$ F2k ; moreover, if Mij ⊕ rij is probed, we can perfectly simulate it using
Mij ⊕ rij = Mji ⊕ rji , since j ∈ U and i ∈ J by definition of the processing of
Group 3 variables.
• Case 4: i ∉ U and j ∉ U . If Mij ⊕ rij is probed, since rij is not probed
and does not enter into the computation of any other probed variable, we can
perfectly simulate such a probe with a random value.
From Cases 1, 2 and 3, we obtain that for any i ≠ j, we can perfectly simulate
any variable rij such that i ∈ U . This implies that we can also perfectly simulate
all partial sums cik when i ∈ U , including the output variables ci . Finally, all
probed variables are perfectly simulated.
We now consider the simulation of the output variables ci . We must show
how to simulate ci for all i ∈ O, where O is an arbitrary subset of [1, n] such
that t1 + |O| < n. For i ∈ U , such variables are already perfectly simulated,
as explained above. We now consider the output variables ci with i ∉ U . We
construct a subset of indices V as follows: for any probed Group 3 variable
Mij ⊕ rij such that i ∉ U and j ∉ U (this corresponds to Case 4), we put j in V
if i ∈ O, otherwise we put i in V . Since we have only considered Group 3 probes,
we must have |U | + |V | ≤ t1 , which implies |U | + |V | + |O| < n. Therefore there
exists an index j′ ∈ [1, n] such that j′ ∉ U ∪ V ∪ O. For any i ∈ O, we can
write:
ci = Mii ⊕ (⊕_{j≠i} rij) = ri,j′ ⊕ ( Mii ⊕ (⊕_{j≠i, j≠j′} rij) )
We claim that neither ri,j′ nor rj′,i enters into the computation of any
probed variable or of any other ci for i ∈ O. Namely, i ∉ U , so neither ri,j′ nor any
partial sum cik was probed; similarly, j′ ∉ U , so neither rj′,i nor any partial
sum cj′,k was probed, and the output cj′ does not have to be simulated since
by definition j′ ∉ O. Finally, if i < j′ then Mi,j′ ⊕ ri,j′ was not probed since
otherwise j′ ∈ V (since i ∈ O); similarly, if j′ < i then Mj′,i ⊕ rj′,i was not
probed since otherwise j′ ∈ V . Therefore ri,j′ plays the role of a one-time pad,
and each such output variable ci can be perfectly simulated by a uniformly
random value.
References
[BBD+15] Barthe, G., Belaïd, S., Dupressoir, F., Fouque, P.-A., Grégoire, B.: Compositional
verification of higher-order masking: application to a verifying
masking compiler. Cryptology ePrint Archive, Report 2015/506 (2015).
http://eprint.iacr.org/
[BCPZ16] Battistello, A., Coron, J.-S., Prouff, E., Zeitoun, R.: Horizontal side-channel
attacks and countermeasures on the ISW masking scheme. Cryptology ePrint
Archive, Report 2016/540 (2016). Full version of this paper:
http://eprint.iacr.org/
[BJPW13] Bauer, A., Jaulmes, E., Prouff, E., Wild, J.: Horizontal and vertical side-
channel attacks against secure RSA implementations. In: Dawson, E. (ed.)
CT-RSA 2013. LNCS, vol. 7779, pp. 1–17. Springer, Heidelberg (2013)
[Bla79] Blakely, G.R.: Safeguarding cryptographic keys. In: National Computer
Conference, vol. 48, pp. 313–317. AFIPS Press, New York, June 1979
[CCD88] Chaum, D., Crépeau, C., Damgård, I.: Multiparty unconditionally secure
protocols (extended abstract). In: Simon, J. (ed.) Proceedings of 20th
Annual ACM Symposium on Theory of Computing, Chicago, Illinois, USA,
pp. 11–19. ACM, 2–4 May 1988
[CFG+10] Clavier, C., Feix, B., Gagnerot, G., Roussellet, M., Verneuil, V.: Horizontal
correlation analysis on exponentiation. In: Soriano, M., Qing, S., López, J.
(eds.) ICICS 2010. LNCS, vol. 6476, pp. 46–61. Springer, Heidelberg (2010)
[CJRR99] Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches
to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999.
LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999)
[DDF14] Duc, A., Dziembowski, S., Faust, S.: Unifying leakage models: from probing
attacks to noisy leakage. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT
2014. LNCS, vol. 8441, pp. 423–440. Springer, Heidelberg (2014)
[DFS15] Duc, A., Faust, S., Standaert, F.-X.: Making masking security proofs con-
crete. In: Oswald, E., Fischlin, M. (eds.) EUROCRYPT 2015. LNCS, vol.
9056, pp. 401–429. Springer, Heidelberg (2015)
[FRR+10] Faust, S., Rabin, T., Reyzin, L., Tromer, E., Vaikuntanathan, V.: Protect-
ing circuits from leakage: the computationally-bounded and noisy cases.
In: Gilbert, H. (ed.) EUROCRYPT 2010. LNCS, vol. 6110, pp. 135–156.
Springer, Heidelberg (2010)
[GHR15] Guilley, S., Heuser, A., Rioul, O.: A key to success - success expo-
nents for side-channel distinguishers. In: Biryukov, A., Goyal, V. (eds.)
INDOCRYPT 2015. LNCS, vol. 9462, pp. 270–290. Springer, Heidelberg
(2015)
[ISW03] Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware against
probing attacks. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp.
463–481. Springer, Heidelberg (2003)
[ORSW12] Oren, Y., Renauld, M., Standaert, F.-X., Wool, A.: Algebraic side-
channel attacks beyond the hamming weight leakage model. In: Prouff, E.,
Schaumont, P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 140–154. Springer,
Heidelberg (2012)
[PR13] Prouff, E., Rivain, M.: Masking against side-channel attacks: a formal secu-
rity proof. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013.
LNCS, vol. 7881, pp. 142–159. Springer, Heidelberg (2013)
[RP10] Rivain, M., Prouff, E.: Provably secure higher-order masking of AES. In:
Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp.
413–427. Springer, Heidelberg (2010)
[Sha79] Shamir, A.: How to share a secret. Commun. ACM 22(11), 612–613 (1979)
[SVO+10] Standaert, F.-X., Veyrat-Charvillon, N., Oswald, E., Gierlichs, B., Medwed,
M., Kasper, M., Mangard, S.: The world is not enough: another look on
second-order DPA. In: Abe, M. (ed.) ASIACRYPT 2010. LNCS, vol. 6477,
pp. 112–129. Springer, Heidelberg (2010)
[VGS14] Veyrat-Charvillon, N., Gérard, B., Standaert, F.-X.: Soft analytical side-
channel attacks. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014. LNCS,
vol. 8873, pp. 282–296. Springer, Heidelberg (2014)
Towards Easy Leakage Certification
1 Introduction
Side-channel attacks are an important threat against the security of modern
embedded devices. As a result, the search for efficient approaches to secure cryp-
tographic implementations against such attacks has been an ongoing process over
the last 15 years. Sound tools for quantifying physical leakages are a central ingre-
dient for this purpose, since they are necessary to balance the implementation
cost of concrete countermeasures with the security improvements they provide.
Hence, while early countermeasures came with proposals of security evaluations
that were sometimes specialized to the countermeasure, more recent works have
investigated the possibility to consider evaluation methods that generally apply
to any countermeasure. The unified evaluation framework proposed at Eurocrypt
2009 is a popular attempt in this direction [23]. It suggests analyzing
cryptographic implementations with a combination of information theoretic and
security metrics. The first ones aim at measuring the (worst-case) information
leakage independent of the adversary exploiting it, and are typically instantiated
with the Mutual Information (MI). The second ones aim at quantifying how effi-
ciently an adversary can take advantage of this leakage in order to turn it into
(e.g.) a key recovery, and are typically instantiated with a success rate.
In this context, an important observation is that most side-channel attacks,
and in particular any standard Differential Power Analysis (DPA) attack, require
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 40–60, 2016.
DOI: 10.1007/978-3-662-53140-2 3
a leakage model [13]. This model usually corresponds to an estimation of the leak-
age Probability Density Function (PDF), possibly simplified to certain statistical
moments. Since the exact distribution of (e.g.) power consumption or electro-
magnetic radiation measurements is generally unknown, it raises the problem
that any physical security evaluation is possibly biased by model errors. In other
words, security evaluations ideally require a perfect leakage model (so that all
the information is extracted from the measurements). But in practice models are
never perfect, so that the quality of the evaluation may highly depend on the
quality of the evaluator. This intuition can be captured with the notion of
Perceived Information (PI), which is nothing other than an estimation of the MI biased
by the side-channel evaluator’s model [19]. Namely, the MI captures the worst-case
security level of an implementation, as it corresponds to a (hypothetical)
adversary who can perfectly profile the leakage PDF. By contrast, the PI captures
its practical counterpart, where actual (statistical) estimation procedures
are used by an evaluator in order to profile the leakage PDF.
Picking up on this problem, Durvaux et al. introduced the first “leakage certification”
methods at Eurocrypt 2014 [8]. Intuitively, leakage certification starts from
the fact that actual leakage models are obtained via PDF estimation, which may
lead to both estimation and assumption errors. As a result, and since it seems
hard to enforce that such estimated models are perfect, the best one can
hope for is to guarantee that they are “good enough”. For estimation errors, this is
easily verified using standard cross-validation techniques (in general, estimation
errors can anyway be made arbitrarily small by measuring more). For assumption
errors, things are more difficult, since detecting them requires finding out whether
the estimated model is close to an (unknown) perfect model. Interestingly, the
Eurocrypt 2014 paper showed that indirect approaches allow determining if this
condition is respected, essentially by comparing the model errors caused by incor-
rect assumptions to estimation errors. That is, let us assume that an evaluator is
given a set of leakage measurements to quantify the security of a leaking imple-
mentation. As long as the assumption errors measured from these traces remain
small in front of the estimation errors, the evaluator is sure that any improvement
of his (possibly imperfect) assumptions will not lead to noticeable degradations
of the estimated security level – since the impact of improved assumptions will
essentially be hidden by the estimation errors. By contrast, once the assumption
errors become significant in front of estimation ones, it means that an improved
model is required to extract all the information from the measurements. Hence,
leakage certification allows ensuring that the modeling part of an evaluation is
sound (i.e. only depends on the implementation – not the evaluator).
In practice, the leakage certification test in [8] requires a number of technical
ingredients. Namely, the evaluator first has to characterize the leakages
of the target implementation with a sampled (cumulative) distance distribution,
and to characterize his model with a simulated (cumulative) distance
distribution. Working with distances allows exploiting a univariate goodness-of-fit
test even for leakages of large dimensionalities (i.e. it allows comparing
the univariate distances between multivariate leakages rather than comparing
independent of whether it can be exploited (e.g. how many traces do you need to
attack). By contrast, leakage certification aims to guarantee that a leakage model
that can be exploited in an attack (and, e.g. can be used to determine a key recov-
ery success rate) is close enough to the true leakage model. That is, it aims to make
evaluators confident that their attacks are close enough to the worst-case ones. So
leakage detection and certification are essentially complementary. Note that leak-
age models (and certification) are needed in any attempt to connect side-channel
analysis with cryptographic security guarantees (e.g. in leakage resilience [10]),
where we will always need an accurate evaluation of the security level, or to build
security graphs such as introduced in [27].
2 Background
2.1 Measurement Setup
We will consider both software and hardware experiments.
Our software experiments are based on measurements of an AES Furious
implementation1 run by an 8-bit Atmel AVR (ATMega644P) microcontroller at
a 20 MHz clock frequency. We monitored the voltage variations across a 22 Ω
resistor introduced in the supply circuit of our target chip. Acquisitions were per-
formed using a Lecroy WaveRunner HRO 66 oscilloscope running at 625 Msam-
ples/second and providing 8-bit samples. In practice, our evaluations focused on
the leakage of the first AES master key byte (but would apply identically to
any other enumerable target). Leakage traces were produced according to the
following procedure. Let x and s be our target input plaintext byte and subkey,
and y = x ⊕ s. For each of the 256 values of y, we generated 1000 encryption
traces, where the rest of the plaintext and key was random (i.e. we generated
256 000 traces in total, with plaintexts of the shape p = x||r1 || . . . ||r15 , keys of
the shape κ = s||r16 || . . . ||r30 , and the ri ’s denoting uniformly random bytes). In
order to reduce the memory cost of our evaluations, we only stored the leakage
corresponding to the 2 first AES rounds (as the dependencies in our target byte
y = x ⊕ s typically vanish after the first round, because of the strong diffusion
properties of the AES). We will denote the 1000 encryption traces obtained from
a plaintext p including the target byte x under a key κ including the subkey s as
AESκs(px) ⇝ lyi (with i ∈ [1; 1000]). Eventually, whenever accessing the points
of these traces, we will use the notation lyi(τ) (with τ ∈ [1; 10 000], typically).
Subscripts and superscripts are omitted when clear from the context.
Our hardware experiments are based on a similar setup, but consider a
threshold implementation of PRESENT similar to the Profile-4 design described
in [17]. The leakage in such hardware implementations is mostly determined
by the distance between two consecutive values in a target register R. Hence,
we generated traces lti (with i ∈ [1; 100 000]) for the 256 possible transitions
t := R(x1 ⊕ s) → R(x2 ⊕ s) between 4-bit intermediate results of the PRESENT
S-box computations. This larger evaluation set was motivated by the protected
1 Available at http://point-at-infinity.org/avraes/.
For convenience, we again express these metrics for software (value-based)
profiling. But they can be straightforwardly adapted to the transition-based case.
Gaussian templates, that only include estimates for the first-order moments of
the leakages. That is, for any time sample τ, we have model̂cpa(y) = m̂1y(τ) =
Êi(Liy(τ)), with m̂1y a first-order moment and Ê the sample mean operator.
When summing over all s and x values, and a sufficiently large number of leakages,
the estimation tends to the correct MI. Yet, as mentioned in the introduction,
the chip distribution Prchip[lyi |s, x] is generally unknown to the evaluator. So in
practice, the best that we can hope is to compute the following PI:
P̂I(S; X, L) = H[S] + Σ_{s∈S} Pr[s] Σ_{x∈X} Pr[x] Σ_{lyi ∈ Lt} Prchip[lyi |s, x] · log2 P̂rmodel[s|x, lyi],
where P̂rmodel ← LpY is typically obtained using the previous Gaussian templates
or LR-based models. Under the assumption that the model is properly estimated,
it is shown in [13] that the CPA and PI metrics are essentially equivalent in
the context of standard univariate side-channel attacks (i.e. exploiting a single
leakage point lyi (τ ) at a time). By contrast, only the PI naturally extends to
multivariate attacks. It can be interpreted as the amount of information leakage
that will be exploited by an adversary using an estimated model. So just as the
MI is a good predictor for the success rate of an ideal TA exploiting the perfect
model Prchip , the PI is a good predictor for the success rate of an actual TA
exploiting the “best available” model P̂rmodel obtained thanks to profiling.
to test the relevance of a model “moment by moment”. That is, for a number
of traces N in an evaluation set, one could verify that the moments estimated
from actual leakage samples are hard to tell apart from the moments estimated
from the model (with the same number of samples N ). Based on this idea, our
simplified method to detect assumption errors will be based on the following two
hypotheses (one strictly necessary and the other optional but simplifying).
2 Note that theoretical approaches to guarantee that a distribution is well characterized
by its moments (such as Carleman’s condition [22]) typically apply when considering
an infinite number of them and, in general, no distribution is determined by a finite
number of moments. So the restriction of our reasoning to specific classes of meaningful
distributions is in fact necessary for our approach to be sound. Besides, note
also that non-parametric PDF estimations may not suffer from assumption errors
(at the cost of a significantly increased estimation cost), so they are out of scope here.
As for the Gaussian assumption, our motivation is even more pragmatic, and
relates to the observation that simple t-tests are becoming de facto standards in
the preliminary evaluation of leaking devices [11,14,21]. So we find it appealing
to rely on statistical tools that are already widespread in the CHES community,
and to connect them with leakage certification. As will be clear next, this allows
us to use the same evaluation method for statistical moments of different orders.
However, we insist that it is perfectly feasible to refine our approach by using a
well-adapted test for each statistical moment (e.g. an F-test for variances).
The main idea behind our new leakage certification method is to compare
(actual) dth-order moments m̂dy estimated from the leakages with (simulated)
dth-order moments m̃dy estimated from the evaluator’s model P̂rmodel (by sampling
this model). Thanks to our second assumption, this comparison can simply
be performed based on Student’s t-test. For this purpose, we need multiple
estimations of the moments m̂dy and m̃dy, which we will obtain thanks to an approach
inspired from Sect. 2.4 (although there is no cross-validation involved here).
More precisely, we start by splitting the full set of evaluation traces L into
k (non overlapping) sets L^{(j)} of approximately the same size, with 1 ≤ j ≤ k.
From these k subsets, we produce k estimates of (actual) dth-order moments
m̂_y^{d,(j)}, each of them from a set L^{(j)}. We then produce a set of simulated traces
L̃ that has the same size and corresponds to the same intermediate values as the
real evaluation set L, but where the leakages are sampled according to the model
that we want to evaluate. In other words, we first build the model P̂rmodel ← L,
and then generate a simulated set of traces L̃ ← P̂rmodel. Based on L̃, we produce
k estimates of (simulated) dth-order moments m̃_y^{d,(j)}, each of them from a set
L̃^{(j)}, as done for the real set of evaluation traces. From these real and simulated
moment estimates, we compute:

  μ̂_y^d = Ê_j(m̂_y^{d,(j)}),   σ̂_y^d = √(var̂_j(m̂_y^{d,(j)})),
  μ̃_y^d = Ê_j(m̃_y^{d,(j)}),   σ̃_y^d = √(var̂_j(m̃_y^{d,(j)})),
where var̂ is the sample variance operator. Eventually, we simply estimate the t
statistic (next denoted with Δ_y^d) as follows:

  Δ_y^d = (μ̂_y^d − μ̃_y^d) / √( ((σ̂_y^d)² + (σ̃_y^d)²) / k ).
The p-value of this t statistic within the associated Student's distribution³ returns
the probability that the observed difference is the result of estimation issues.
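As a rough illustration, this per-moment comparison fits in a few lines of Python. The sketch below is ours (the paper prescribes no implementation): the function name `moment_t_test`, the use of numpy/scipy, and the simulated Gaussian leakages are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def moment_t_test(real, simulated, k=10, d=1):
    """Welch-style t-test comparing the d-th order sample moments of
    real and model-simulated leakages, estimated on k disjoint subsets."""
    def moment_estimates(samples):
        out = []
        for chunk in np.array_split(samples, k):
            m = np.mean(chunk)
            # raw moment for d = 1, central moments for d > 1
            out.append(m if d == 1 else np.mean((chunk - m) ** d))
        return np.array(out)

    m_hat, m_tilde = moment_estimates(real), moment_estimates(simulated)
    mu_hat, mu_tilde = m_hat.mean(), m_tilde.mean()
    var_hat, var_tilde = m_hat.var(ddof=1), m_tilde.var(ddof=1)
    delta = (mu_hat - mu_tilde) / np.sqrt((var_hat + var_tilde) / k)
    # degrees of freedom as in footnote 3 (with var = sigma^2)
    df = (k - 1) * (var_hat + var_tilde) ** 2 / (var_hat**2 + var_tilde**2)
    return delta, 2 * stats.t.sf(abs(delta), df)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)    # "true" leakages
model = rng.normal(0.5, 1.0, 10_000)   # model with an error in the mean
delta, p = moment_t_test(real, model, d=1)
```

With an error in the mean, the first-order p-value collapses towards zero, which is the behavior described for Fig. 1.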
5 Simulated Experiments
In order to validate our moment-based certification method, we first analyze a
couple of simulated experiments, where we can control the assumption errors. In
particular, and in order to keep these simulations reasonably close to concrete
attacks, we consider four distinct scenarios. In the first one (reported in Fig. 1)
we investigate errors in the mean of the model distribution. The upper part of the
figure represents a non-parametric estimate of the true leakage distribution (with
histograms) and a leakage model P̂rmodel following a Gaussian distribution. The
middle part of the figure represents the estimated moments m̂_y^{d,(j)} (in blue) and
m̃_y^{d,(j)} (in red), as a function of the number of traces used for their estimation, from
which we clearly see the error in the mean. The lower part of the figure represents
the evolution of our test's p-value as a function of the number of traces used for
certification. As expected, we directly detect an error in the mean (reflected
by a very small p-value for this moment), whereas the p-values of the other
moments remain erratic, reflecting the fact that (hypothetical) assumption errors
are not significant in front of estimation errors (i.e. do not lead to significant
information losses) for those moments. Similar figures corresponding to model
errors in the variance, skewness and kurtosis are reported in the ePrint version of
this work [9]. The last two cases typically correspond to the setting of a masked
implementation for which the true distribution is a mixture [25].
These results confirm the simplicity of the method. That is, as the num-
ber of measurements in the evaluation set increases, we are able to detect the
assumption errors in all cases. The only difference between the applications to
different moments is that errors on higher-order moments may be more difficult
to detect as the noise increases. This difference is caused by the same argument
that justifies the relevance of the higher-order masking countermeasure. Namely,
the sampling complexity when estimating the moments of a sufficiently noisy dis-
tribution increases exponentially in d. However, this is not a limitation of the
certification test: if such errors are not detected for a given evaluation set, it just
means that their impact is still small in front of estimation errors at this stage
of the evaluation. Besides, we note that the respective relevance of the model
errors on different moments will be further discussed in Sect. 7.
³ Student's t distribution is a parametric probability density function whose only
parameter is its number of degrees of freedom, which can be directly derived from k and
the previous σ estimates as: df = (k − 1) × [(σ̂_y^d)² + (σ̃_y^d)²]² / [(σ̂_y^d)⁴ + (σ̃_y^d)⁴].
[Figure: top, the true leakage pdf and the model; middle, the estimated mean, standard deviation, skewness and kurtosis as a function of the number of traces; bottom, the corresponding p-values.]
Fig. 1. Gaussian leakages, Gaussian model, error in the estimated mean. (Color figure
online)
6 Software Experiments
In order to obtain a fair comparison with the results provided in [8], we first
applied our new leakage certification method to the same case-study. Namely,
we used the measurement setup from Sect. 2.1 and evaluated the relevance of
two important profiling methods, namely the Gaussian TA and LR, for the most
informative time sample in our leakage traces (i.e. with maximum PI).
The main difference with the previous simulated experiments is that we now
have to test 256 models independently (each of them corresponding to a target
intermediate value y = x ⊕ s). Our results are represented in Fig. 2, where we
plot the p-values output by our different t-tests in greyscale, for four statistical
moments (i.e. the mean, variance, skewness and kurtosis). That is, each line in
this plot corresponds to the lower part of the previous Fig. 1. A look at the first
two moments essentially confirms the conclusions of Durvaux et al. More pre-
cisely, the Gaussian templates capture the measured leakages quite accurately
(for the 256,000 traces in our evaluation set). By contrast, the linear regression
quickly exhibits inconsistencies. Interestingly, assumption errors appear both in
the means and in the variances, which corresponds to the expected intuition.
That is, errors in the means are detected because for most target intermediate
values, the actual leakage cannot be accurately predicted by a linear combination
of the S-box output bits.⁴ And errors in the variances appear because the
⁴ This happens for the selected time sample because of pipelining effects in the AVR
microcontroller. Note that as in [8], the linear model did not exhibit any assumption
error for other time samples given the amount of measured traces.
[Figure: p-values in greyscale for the 256 models, for the mean, variance, skewness and kurtosis, for Gaussian templates and linear regression.]
Fig. 2. Results of the new leakage certification test for software measurements.
LR-based models rely on the homoscedastic error assumption and capture both
physical noise and noise due to assumption errors in a single term.
By contrast, and quite intriguingly, a look at the last two moments
(i.e. skewness and kurtosis) also shows some differences with the results in [8].
That is, we remark that even for Gaussian templates, small model errors appear
in these higher-order moments. This essentially corresponds to the fact that our
measured leakages do not have perfectly key-independent skewness and kurtosis,
as we assume in Gaussian PDF estimations. This last observation naturally raises
the question whether these errors are significant, i.e. do they contradict the
results of the Eurocrypt 2014 leakage certification test? In the next section, we
show that this is not the case, and reconcile both approaches by investigating
the respective informativeness of the four moments in our new test.
the resulting estimated correlation features a “metric intuition”: the higher the
value of the MCP-DPA distinguisher computed for an order d, the more efficient
the MCP-DPA attack exploiting this statistical order of the leakage distribution.
Hence, computing the value of the MCP-DPA distinguisher for different values
of d should solve our problem, i.e. determine whether the moments for which
assumption errors are detected are (among) the most informative ones.
Concretely, we start by applying MCP-DPA in the traditional sense and
exploit cross–validation for this purpose, this time following exactly Sect. 2.4.
That is, the set of evaluation traces L is again split into k (non overlapping) sets
L^{(i)} of approximately the same size, and we use profiling sets L_p^{(j)} = ∪_{i≠j} L^{(i)}
and test sets L_t^{(j)} = L \ L_p^{(j)}. We then repeatedly compute the dth-order moments
m̂_y^{d,(j)} ← L_p^{(j)}, and the dth-order MCP-DPA distinguisher:

  MCP-DPA^{(j)}(d) = ρ̂(M̂_Y^{d,(j)}, (L_y)^d) ← L_t^{(j)}.
[Figure: moments-correlating DPA values for the mean, variance, skewness and kurtosis.]
Fig. 3. MCP-DPA results for software measurements (with 256 × 1000 traces).
Our main observations are as follows. First, the upper part of the figure
suggests that the most informative moments in our leakage traces are the mean
and variance. There is indeed a small amount of information in the skewness
and kurtosis. But by considering the classical rule–of–thumb that the number of
samples Ns required to perform a successful correlation-based attack is inversely
proportional to the square of its correlation coefficient, that is:

  Ns ≈ c / ρ̂(M̂_Y^d, (L_y)^d)²,
with c a small constant, we can see that the additional information gain in these
higher-order moments is very limited in our context. For example the value of
the mean-based MCP-DPA distinguisher (for which no assumption errors are
detected) is worth ≈ 0.74 in the figure, and the value of the kurtosis-based
MCP-DPA distinguisher (for which assumption errors are detected) is worth
≈ 0.02. Considering these two moments as independent information channels,
the loss caused by the assumption errors on the kurtosis can be approximated as
0.74² / (0.74² + 0.02²) ≈ 0.999, meaning that improving the model so that the kurtosis is well
characterized could only (and ideally) lead to an attack requiring this fraction of
Ns to succeed (that is close to 1). This observation backs up the conclusions of the
generic leakage certification test in [8] that Gaussian templates are sufficiently
accurate for our evaluation set. Next, we see that TA-based and LR-based MCP-
DPA yield no information in the higher-order moments, which trivially derives
from the fact that they rely on a Gaussian assumption. Eventually, and quite
interestingly, we note that the information loss between LR-based models and
TA-based models can be approximated thanks to the correlation between their
moments. For example, and considering the means in Fig. 3, we can compute the
value of the LR-based MCP-DPA distinguisher – worth ≈ 0.48 in the figure – by
multiplying the value of the TA-based MCP-DPA distinguisher – worth ≈ 0.74 –
by ρ̂(M̂_Y^{d,ta}, M̂_Y^{d,lr}) – worth ≈ 0.65 in our experiments (i.e. by taking advantage
of the “product rule” for the correlation coefficient in [24]).
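These back-of-the-envelope computations are easy to reproduce; in the sketch below, the numerical values are the approximate distinguisher readings quoted above from Fig. 3, not exact experimental data.

```python
# Approximate MCP-DPA distinguisher values read from Fig. 3.
rho_mean_ta = 0.74    # TA-based, mean (no assumption errors detected)
rho_kurt_ta = 0.02    # TA-based, kurtosis (assumption errors detected)

# Treating the two moments as independent information channels, the
# fraction of Ns that a kurtosis-perfect model could (ideally) still need:
loss = rho_mean_ta**2 / (rho_mean_ta**2 + rho_kurt_ta**2)
print(round(loss, 3))          # close to 1: fixing the kurtosis barely helps

# "Product rule" for the correlation coefficient [24]: the LR-based
# distinguisher as the TA-based one times the TA/LR moment correlation.
rho_models = 0.65
rho_mean_lr = rho_mean_ta * rho_models
print(round(rho_mean_lr, 2))   # matches the ~0.48 read from Fig. 3
```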
Those last tools are admittedly informal. Yet, we believe they provide a useful
variety of heuristics allowing evaluators to analyze the results of their certifica-
tion tests. In particular, they lead to easy–to–exploit intuitions regarding the
impact of model errors detected in moments of a given order. As discussed in
the beginning of Sect. 4, further formalizing these findings, and possibly putting
forward relevant scenarios where our simplified approach leads to significant
shortcomings, is an interesting scope for further research. Meanwhile, the next
section describes an open source code to demonstrate the implementation effi-
ciency of our new certification tests, and Sect. 9 complements these findings by
showing that the proposed certification method applies too in the more chal-
lenging context of (unprotected and) masked hardware implementations.
9 Hardware Experiments
As usual in the evaluation of masked implementations, we first ran a preliminary
test by setting the masks to constant null values, which actually corresponds to
the case of an unprotected FPGA implementation of PRESENT. As mentioned
in Sect. 2, the main difference between this hardware case study and the previous
software one is that the leakages now depend on transitions between consecutive
values in a target register. For the rest, the details about such attacks and their
relation with the underlying architecture (that can be found in [16,17]) are not
necessary to understand our following discussions.
As expected, the results of this preliminary test were essentially similar to
the ones of the unprotected software case. That is, we did not detect assumption
errors for the Gaussian templates with up to 256,000 measurements, while some
errors could be detected in the LR-based attacks. The only interesting bit of
information from this context is the lower MCP-DPA values observed (see the
Appendix in [9]), which can be associated with a higher noise level.
We next moved to the more meaningful case with random masks activated,
for which the leakage certification results are given in Fig. 4. Two main obser-
vations can be extracted from these plots. First, and as previously, LR-based
attacks exhibit model errors in the first two moments, that are not detected
with Gaussian templates. Second, and more importantly, we see that strong
errors are detected for the skewness and kurtosis, already quite early in our
evaluation set. This is expected since these two moments are not captured at all,
neither by our Gaussian templates, nor by LR-based attacks. However, since the
information in a (first-order) threshold implementation should lie in higher-order
(at least > 1) statistical moments, it naturally raises the question whether this
model imperfection is critical from a security evaluation point–of–view.
[Figure: p-values in greyscale for the mean, variance, skewness and kurtosis, for Gaussian templates and linear regression.]
Fig. 4. Results of the new leakage certification test for masked hardware.
indeed information in all the other moments. So we are actually in a case where
the leakage certification test suggests improvements, and tells the evaluator that
his (Gaussian) templates are not sufficient to extract all the information, while
LR-based attacks could not succeed at all (since they do estimate a single vari-
ance for all the profiled transitions). This raises interesting scopes for further
research, since profiling methods that easily incorporate such higher-moments
have not been much explored in the side-channel literature so far [3].
[Figure: moments-correlating DPA values for the mean, variance, skewness and kurtosis, and the corresponding relative distinguishing margins.]
Fig. 5. MCP-DPA results for masked hardware (with 256 × 50,000 traces)
Besides, another interesting observation arises if, rather than simply plotting
the asymptotic MCP-DPA values, we also plot the Relative Distinguishing Margin
(RDM), defined in [28] as the distance between the correct key distinguisher value
and the value for the highest ranked alternative. As illustrated by the lower plot
of Fig. 5, this RDM is larger for the skewness than for the variance. This means
that while the variance is the most informative moment overall (i.e. assuming some
enumeration is possible as a post-processing after the attack [26]), the skewness is
more useful in case the adversary has to recover the key thanks to side-channel
measurements exclusively (since the nearest rival captured by the RDM is usually
the most difficult to distinguish from the good key).
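As a toy illustration, the RDM of the text (taken here literally as the distance to the highest ranked alternative; the exact normalization in [28] may differ) can be computed as follows, on hypothetical distinguisher vectors chosen to mimic the variance/skewness situation of Fig. 5:

```python
def distinguishing_margin(scores, correct):
    """Distance between the correct key's distinguisher value and the
    value of the highest ranked alternative (cf. the RDM of [28])."""
    best_rival = max(s for i, s in enumerate(scores) if i != correct)
    return scores[correct] - best_rival

# Hypothetical distinguisher vectors over 8 key candidates (index 3 is
# correct): the variance scores higher overall, but the skewness
# separates the correct key further from its nearest rival.
variance_scores = [0.10, 0.12, 0.11, 0.30, 0.28, 0.09, 0.13, 0.10]
skewness_scores = [0.02, 0.03, 0.01, 0.15, 0.05, 0.02, 0.04, 0.03]
margin_var = distinguishing_margin(variance_scores, 3)
margin_skew = distinguishing_margin(skewness_scores, 3)
assert margin_skew > margin_var   # skewness wins on nearest-rival margin
```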
Summarizing, these experiments confirm the applicability of our easy leakage
certification tests in a practically-relevant case study (i.e. a threshold implemen-
tation that is representative of state–of–the–art masking schemes). They also
put forward that combining MCP-DPA evaluations with the estimation of a
RDM metric allows extracting additional intuitions regarding the information
vs. computation tradeoff that is inherent to any side-channel attack.
10 Conclusion
The evaluation of leaking devices against DPA attacks exploiting statistical mod-
els of leakage distributions implies answering two orthogonal questions:
1. Is the model used in the attack/evaluation correct?
2. How informative is the model used in the attack/evaluation?
The second question is highly investigated. It relates to the concrete security level
of an implementation given a model, e.g. measured with a number of samples
needed to recover the key. The first question is much less investigated and relates
to the risk of a “false sense of security”, i.e. evaluations based on non-informative
models despite informative leakages. Leakage certification allows evaluators to
guarantee that the models used in their DPA attacks are sufficiently accurate.
The simple tests we provide in this paper make it possible to easily integrate
leakage certification in actual toolchains. We hope these results open the way
towards globally sound evaluations of leaking devices, where one first guarantees
that the models used in the attacks are correct, and then evaluates their
informativeness, which boils down to computing their corresponding PI [7].
Interesting scopes for further research include the extension of the tools in
this paper to more case studies of protected implementations with higher-order
and multivariate leakages, and the investigation of the profiling errors due to the
characterization of different devices, possibly affected by variability [19].
References
1. http://perso.uclouvain.be/fstandae/PUBLIS/171.zip
2. http://satoh.cs.uec.ac.jp/sakura/index.html
3. Batina, L., Gierlichs, B., Prouff, E., Rivain, M., Standaert, F.-X., Veyrat-
Charvillon, N.: Mutual information analysis: a comprehensive study. J. Cryptol.
24(2), 269–291 (2011)
4. Brier, E., Clavier, C., Olivier, F.: Correlation power analysis with a leakage model.
In: Joye, M., Quisquater, J.-J. (eds.) CHES 2004. LNCS, vol. 3156, pp. 16–29.
Springer, Heidelberg (2004)
5. Chari, S., Rao, J.-R., Rohatgi, P.: Template attacks. In: Kaliski Jr., B.S., Kaya
Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 13–28. Springer,
Heidelberg (2003)
6. Dabosville, G., Doget, J., Prouff, E.: A new second-order side channel attack based
on linear regression. IEEE Trans. Comput. 62(8), 1629–1640 (2013)
7. Duc, A., Faust, S., Standaert, F.-X.: Making masking security proofs concrete -
or how to evaluate the security of any leaking device. In: Oswald, E., Fischlin, M.
(eds.) EUROCRYPT 2015. LNCS, vol. 9056, pp. 401–429. Springer, Heidelberg
(2015)
8. Durvaux, F., Standaert, F.-X., Veyrat-Charvillon, N.: How to certify the leakage
of a chip? In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol.
8441, pp. 459–476. Springer, Heidelberg (2014)
9. Durvaux, F., Standaert, F.-X., Del Pozo, S.M.: Towards easy leakage certification.
Cryptology ePrint Archive, Report 2015/537 (2015). http://eprint.iacr.org/
10. Dziembowski, S., Pietrzak, K.: Leakage-resilient cryptography. In: 49th Annual
IEEE Symposium on Foundations of Computer Science, FOCS 2008, 25-28 October
2008, Philadelphia, PA, USA, pp. 293–302. IEEE Computer Society (2008)
11. Goodwill, G., Jun, B., Jaffe, J., Rohatgi, P.: A testing methodology for side
channel resistance validation. NIST Non-invasive Attack Testing Workshop
(2011). http://csrc.nist.gov/news_events/non-invasive-attack-testing-workshop/papers/08_Goodwill.pdf
12. Heuser, A., Rioul, O., Guilley, S.: Good is not good enough - deriving optimal dis-
tinguishers from communication theory. In: Batina, L., Robshaw, M. (eds.) CHES
2014. LNCS, vol. 8731, pp. 55–74. Springer, Heidelberg (2014)
13. Mangard, S., Oswald, E., Standaert, F.-X.: One for all - all for one: unifying stan-
dard differential power analysis attacks. IET Inf. Secur. 5(2), 100–110 (2011)
14. Mather, L., Oswald, E., Bandenburg, J., Wójcik, M.: Does my device leak infor-
mation? An a priori statistical power analysis of leakage detection tests. In: Sako,
K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 486–505.
Springer, Heidelberg (2013)
15. Mather, L., Oswald, E., Whitnall, C.: Multi-target DPA attacks: pushing DPA
beyond the limits of a desktop computer. In: Sarkar, P., Iwata, T. (eds.) ASI-
ACRYPT 2014. LNCS, vol. 8873, pp. 243–261. Springer, Heidelberg (2014)
16. Moradi, A., Standaert, F.-X.: Moments-correlating DPA. IACR Cryptology ePrint
Archive 2014:409 (2014)
17. Poschmann, A., Moradi, A., Khoo, K., Lim, C.-W., Wang, H., Ling, S.: Side-
channel resistant crypto for less than 2,300 GE. J. Cryptol. 24(2), 322–345 (2011)
18. Prouff, E., Rivain, M., Bevan, R.: Statistical analysis of second order differential
power analysis. IEEE Trans. Comput. 58(6), 799–811 (2009)
19. Renauld, M., Standaert, F.-X., Veyrat-Charvillon, N., Kamel, D., Flandre, D.: A
formal study of power variability issues and side-channel attacks for nanoscale
devices. In: Paterson, K.G. (ed.) EUROCRYPT 2011. LNCS, vol. 6632, pp. 109–
128. Springer, Heidelberg (2011)
20. Schindler, W., Lemke, K., Paar, C.: A stochastic model for differential side channel
cryptanalysis. In: Rao, J.R., Sunar, B. (eds.) CHES 2005. LNCS, vol. 3659, pp.
30–46. Springer, Heidelberg (2005)
21. Schneider, T., Moradi, A.: Leakage assessment methodology. In: Güneysu, T.,
Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 495–513. Springer, Heidel-
berg (2015)
22. Spanos, A.: Probability Theory and Statistical Inference: Econometric Modeling
with Observational Data. Cambridge University Press, Cambridge (1999)
23. Standaert, F.-X., Malkin, T.G., Yung, M.: A unified framework for the analysis of
side-channel key recovery attacks. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS,
vol. 5479, pp. 443–461. Springer, Heidelberg (2009)
24. Standaert, F.-X., Peeters, E., Rouvroy, G., Quisquater, J.-J.: An overview of power
analysis attacks against field programmable gate arrays. Proc. IEEE 94(2), 383–
394 (2006)
25. Standaert, F.-X., Veyrat-Charvillon, N., Oswald, E., Gierlichs, B., Medwed, M.,
Kasper, M., Mangard, S.: The world is not enough: another look on second-order
DPA. In: Abe, M. (ed.) ASIACRYPT 2010. LNCS, vol. 6477, pp. 112–129. Springer,
Heidelberg (2010)
26. Veyrat-Charvillon, N., Gérard, B., Renauld, M., Standaert, F.-X.: An optimal key
enumeration algorithm and its application to side-channel attacks. In: Wu, H.,
Knudsen, L.R. (eds.) SAC 2012. LNCS, vol. 7707, pp. 390–406. Springer, Heidel-
berg (2013)
27. Veyrat-Charvillon, N., Gérard, B., Standaert, F.-X.: Security evaluations beyond
computing power. In: Nguyen, P.Q., Johansson, T. (eds.) EUROCRYPT 2013.
LNCS, vol. 7881, pp. 126–141. Springer, Heidelberg (2013)
28. Whitnall, C., Oswald, E.: A fair evaluation framework for comparing side-channel
distinguishers. J. Cryptogr. Eng. 1(2), 145–160 (2011)
Simple Key Enumeration (and Rank Estimation)
Using Histograms: An Integrated Approach
1 Introduction
as [3,7,8] typically allow estimating the rank of a 128- or 256-bit key with an
accuracy of less than one bit, within seconds of computation. By contrast, effi-
ciency remained a concern for key enumeration algorithms for some time, in
particular due to the inherently serial nature of the optimal algorithm of Veyrat-
Charvillon et al. [10]. This situation evolved with the recent (heuristic) work of Bogdanov
et al. [4] and the more formal solution of Martin et al. [8]. In these papers, the
authors exploit the useful observation that by relaxing (a little bit) the opti-
mality requirements of enumeration algorithms (as one actually does in rank
estimation), it is possible to significantly improve their efficiency, and to make
them parallelizable. Since this relaxation is done by rounding the key (log) probabilities
(or scores) output by a side-channel attack, it directly suggests trying to adapt
the histogram-based rank estimation algorithm from Glowacz et al. to
the case of key enumeration based on similar principles.
In this paper, we follow this track, and describe a new enumeration algo-
rithm based on histogram convolutions. As for rank estimation, using such simple
tools brings conceptual simplicity as an important advantage. Interestingly, we
show next that this simplicity also leads to several convenient features and nat-
ural optimizations of the enumeration problem. First, it directly leads to simple
bounds on the rounding errors introduced by our histograms (hence on the addi-
tional workload needed to guarantee optimal enumeration up to a certain rank).
Second, it allows straightforward parallelization between cores, since the work-
load of each core is directly available as the number of elements in each bin of our
histograms. Third, it outputs the keys as factorized lists, such that by adequately
tuning the enumeration parameters (i.e. the number of bins, essentially), we are
able to use our enumeration algorithm for distributed key testing with minimum
bandwidth (which is typically desirable if hardware/FPGA implementations are
used). In this respect, our experiments show that the best strategy is not always
to maximize the accuracy of the enumeration (especially when enumerating up to
large key ranks). We note that such features could also be integrated to other
recent enumeration algorithms (i.e. [8], and to some extent [4]). Yet, this would
require some adaptations while it naturally comes for free in our histogram-based
case. Eventually, the same observation essentially holds for the performances of
our algorithm, which slightly improve the state-of-the-art.
In view of the consolidating nature of this work, an important additional
contribution is an open source implementation of our key enumeration algorithm,
combined with the histogram-based rank estimation algorithm of FSE 2015, that
we make available with this paper in order to facilitate the dissemination of these
tools for evaluation laboratories [1].
2 Background
2.1 Algorithms Inputs
Details on how a side-channel attack extracts information from leakage traces are
not necessary to understand the following analysis. We only assume that for an
n-bit master key k, an attacker recovers information on Ns subkeys k0, ..., kNs−1 of
length a = n/Ns bits (for simplicity, we assume that a divides n). The side-channel
adversary uses the leakages corresponding to a set of q inputs Xq leading to a set
of q leakages Lq. As a result of the attack, he obtains Ns lists of 2^a probabilities
Pi = Pr[ki* | Xq, Lq], where i ∈ [0, Ns − 1] and ki* denotes a subkey candidate
among the 2^a possible ones. TA (Template Attacks) and LR (Linear Regression)-
based attacks directly output such probabilities. For other attacks such as DPA
(Differential Power Analysis) or CPA (Correlation Power Analysis), one can use
Bayesian extensions [10] or perform the enumeration directly based on the scores.
Note that in this last case, the enumeration result will be correct with respect
to the scores, but the corresponding side-channel attack is not guaranteed to be
optimal [9]. For simplicity, our following analyses are based only on the optimal
case where we enumerate based on probabilities. We leave the investigation of
the overheads due to score-based enumeration as an interesting scope for further
investigation. Eventually, the lists of probabilities are turned into lists of log
probabilities, denoted as LPi = log(Pi ). This final step is used to get an additive
relation between probabilities instead of a multiplicative one.
2.2 Preprocessing
Key enumeration (and rank estimation) algorithms generally benefit from a
preprocessing step which consists of merging m lists of probabilities Pi of size 2^a
in order to generate a larger list P′i = merge(P0, P1, ..., Pm−1), such that P′i
contains the 2^{m·a} products of probabilities of the lists P0, P1, ..., Pm−1. Taking
again the previous notations where the n bits of the master key are split into Ns
subkeys of a bits, this allows splitting them into N′s = Ns/m subkeys of m · a bits
(or close to it when m does not divide Ns). We denote the preprocessing merging
m lists as merge_m, with merge_1 meaning no merging. In the following, we assume
that such a preprocessing is performed by default and therefore always use the
notation Ns for the number of subkeys.
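A minimal Python sketch of this merging step (function and list names are ours, and we operate on plain probabilities here, before the switch to log probabilities described next):

```python
import itertools
import numpy as np

def merge(*lists):
    """Merge m lists of 2^a probabilities into one list of the
    2^(m*a) products, one per concatenated subkey candidate."""
    return np.array([np.prod(combo) for combo in itertools.product(*lists)])

# Two 2-bit subkeys (2^2 = 4 probabilities each) merged into one
# 4-bit chunk with 2^4 = 16 probabilities.
P0 = np.array([0.4, 0.3, 0.2, 0.1])
P1 = np.array([0.25, 0.25, 0.25, 0.25])
P01 = merge(P0, P1)
assert P01.size == 16 and np.isclose(P01.sum(), 1.0)
```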
2.3 Toolbox
We now introduce a couple of tools that we use to describe our algorithms, using
the following notations: H will denote a histogram, Nb a number of bins, b a bin,
and x a bin index.
Linear histograms. The function H = hist_lin(LP, Nb) creates a standard histogram
from a list of (e.g.) log probabilities LP and Nb linearly-spaced bins.
This is the same function as introduced in [9].
Convolution. This is the usual convolution algorithm which, from two histograms
H1 and H2 of sizes n1 and n2, computes H1,2 = conv(H1, H2), where
H1,2[k] = Σ_{i=0}^{k} H1[i] × H2[k − i]. It is efficiently implemented with an FFT
in time O(n log n). In the rest of the paper we consider that the indexes start at 0.
Getting the size of a histogram. We denote by size_of(H) the function that
returns the number of bins of a histogram H.
64 R. Poussier et al.
3 Enumeration Algorithm
In this section, we describe our new key enumeration algorithm. Since we join an
open source code of this algorithm to the paper, our primary goal is to explain
its main intuition. For this purpose, we combine a specification of the different
enumeration steps with simple examples to help their understanding.
Concretely, our new key enumeration algorithm is an adaptation of the rank
estimation algorithm of Glowacz et al. [7]. As in this previous work, we use
histograms to efficiently represent the key log probabilities, and the first step
of the key enumeration is a convolution of histograms modeling the distribution
of our Ns lists of log probabilities. This step is detailed in Algorithm 1. In the
rest of the paper we will denote the initial histograms H0 , ..., HNs −1 and the
convoluted histograms H0:1 , ..., H0:Ns −1 as written in the output of Algorithm 1.
For illustration, Fig. 1 shows an example of its application in the case of two 4-
bit subkeys of which the log probabilities are represented by a 7-bin histogram,
which are convoluted in the lower part of the figure.
Algorithm 1. Convolution.
Input. Ns lists of log probabilities LPi, and number of bins Nb.
Output. Histograms of the log probabilities of each subkey: H0, ..., HNs−1,
and their convolutions H0:1, ..., H0:Ns−1.
  H0 ← hist_lin(LP0, Nb);
  H1 ← hist_lin(LP1, Nb);
  H0:1 ← conv(H0, H1);
  for i = 2 to Ns − 1 do
    Hi ← hist_lin(LPi, Nb);
    H0:i ← conv(Hi, H0:i−1);
  end for
  return H = [H0, ..., HNs−1, H0:1, ..., H0:Ns−1].
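With NumPy, Algorithm 1 is a direct transcription (a sketch with names of our choosing; the one practical point, implicit in hist_lin's linearly-spaced bins, is that all histograms must share a common range so that the bin indexes of the convolutions remain comparable):

```python
import numpy as np

def hist_lin(log_probs, n_bins, lo, hi):
    # standard histogram over n_bins linearly-spaced bins; the shared
    # (lo, hi) range keeps bin indexes comparable across subkeys
    counts, _ = np.histogram(log_probs, bins=n_bins, range=(lo, hi))
    return counts

def convolve_all(log_prob_lists, n_bins):
    lo = min(lp.min() for lp in log_prob_lists)
    hi = max(lp.max() for lp in log_prob_lists)
    H = [hist_lin(lp, n_bins, lo, hi) for lp in log_prob_lists]
    H_conv = [np.convolve(H[0], H[1])]        # H_{0:1}
    for h in H[2:]:                           # H_{0:i} = conv(H_i, H_{0:i-1})
        H_conv.append(np.convolve(h, H_conv[-1]))
    return H, H_conv

# four 4-bit subkeys: 2^4 = 16 log probabilities each
rng = np.random.default_rng(1)
lps = [np.log(rng.dirichlet(np.ones(16))) for _ in range(4)]
H, H_conv = convolve_all(lps, n_bins=64)
assert H_conv[-1].sum() == 16 ** 4   # one entry per full key candidate
```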
Based on this first step, our algorithm allows enumerating keys that are
ranked between two bounds Bstart and Bstop. In the standard situation where
the adversary wants to enumerate starting from the most likely key, we set
Bstart = 0. However, there are at least two cases where other starting bounds
can be useful. First, it is possible that one wishes to continue an enumeration
that has been started previously. Second, and more importantly, the selection of
these bounds directly allows efficient parallel key enumeration, where the amount
of computation performed by each core is well balanced.
In order to enumerate all keys ranked between the bounds Bstart and Bstop ,
the corresponding indexes of H0:Ns −1 have to be computed, as described in
Algorithm 2. It simply sums the number of keys contained in the bins starting
Fig. 1. Histograms representing the log probabilities of two 4-bit subkeys and their
convolution. Upper left: H0 = [0, 3, 2, 1, 7, 2, 1]. Upper right: H1 = [3, 0, 4, 5, 0, 3, 1].
Bottom: H0:1 = [0, 9, 6, 15, 44, 20, 45, 52, 19, 27, 13, 5, 1].
from the most likely one, until we exceed Bstart and Bstop , and returns the
corresponding indexes xstart and xstop . That is, xstart (resp. xstop ) refers to the
index of the bin where Bmin (resp. Bmax ) is achieved (thus xstart ≥ xstop ).
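This index search can be sketched in Python as follows (a hypothetical rendering of Algorithm 2; bin indexes grow with likelihood, so the walk goes from the highest index downwards):

```python
def bounds_to_indexes(hist, b_start, b_stop):
    """Algorithm 2 (sketch): walk the convolved histogram H_{0:Ns-1} from
    its most likely bin (highest index) downwards, accumulating the key
    counts, and return the bin indexes x_start and x_stop at which the
    ranks b_start and b_stop are reached (so x_start >= x_stop)."""
    x_start = x_stop = 0
    count = 0
    found_start = False
    for x in range(len(hist) - 1, -1, -1):
        count += hist[x]
        if not found_start and count > b_start:
            x_start, found_start = x, True
        if count >= b_stop:
            x_stop = x
            break
    return x_start, x_stop

# Fig. 2 example on H_{0:1} from Fig. 1, with B_start = 10 and B_stop = 100:
H01 = [0, 9, 6, 15, 44, 20, 45, 52, 19, 27, 13, 5, 1]
print(bounds_to_indexes(H01, 10, 100))  # (10, 7)
```

This matches the walk described around Fig. 2: bin 10 starts with rank 7 and bin 7 ends with rank 117, so the keys of rank 10 to 100 lie in bins 10 down to 7.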
As in [7], a convenient feature of the histograms we use to represent the
key log probabilities is that they lead to simple bounds on the “enumeration
error” that is due to their rounding, hence on the additional workload needed
to compensate for this error. Namely, if one wants to be sure to enumerate all the keys of which the rank is between the bounds, then he should add Ns/2 to xstart and subtract it from xstop.¹
Figure 2 illustrates the computation of these indexes using the same example
as in Fig. 1. In this case, the user wants to find the bins where the keys are
ranked between 10 and 100. By summing up the number of keys contained in
the bins of H0:1 from the right to the left, we find that the bin indexed 10 starts
with the rank 7 and the bin indexed 7 ends with the rank 117. Since the bin
indexes 11 and 6 are out of the bounds (10 and 100), we know that the dark
grey bins approximately contain the keys we want to enumerate, up to rounding
errors. Furthermore, by adding the light grey bins, we are sure that all the keys
¹ The authors of [7] had a slightly worse bound of Ns instead of Ns/2. Indeed, they rounded the sum of all the subkeys' log probabilities, instead of summing the rounded subkeys' log probabilities.
66 R. Poussier et al.
Fig. 2. Computation of the indexes’ bounds for Bmin = 10 and Bmax = 100.
Given the histogram of the key log probabilities and the indexes of the bounds
between which we want to enumerate, the enumeration simply consists in performing a backtracking over all the bins between xstart and xstop. More precisely, during this phase we recover the bins of the initial histograms (i.e. before convolution) that we used to build a bin of the convoluted histogram H0:Ns−1. For
a given bin b with index x of H0:Ns −1 corresponding to a certain log probability,
we have to run through all the non-empty bins b0 , ...bNs −1 of indexes x0 , ...xNs −1
of H0 , ..., HNs −1 such that x0 + ... + xNs −1 = x. Each bi will then contain at
least one and at most 2^(m·a) subkey(s) that we must enumerate. This leads to a key factorization, which is a table containing Ns subkey lists, such that each of these lists contains up to 2^(m·a) subkeys associated with the bin bi of the histogram
Hi . Any combination of Ns subkeys, each one being picked in a different list,
results in a master key having the same rounded probability. Eventually, each
time a factorization is completed, we call an abstract function process key, which
takes as input the result of this factorization. This function could test the keys
on-the-fly or send them to a third party for testing (this function is essentially
independent of the enumeration process).
Algorithm 3 describes more precisely this bin decomposition process. From
a bin index x0:i of H0:i, we find all the non-empty bins of indexes xi of Hi such that the corresponding bin of index x0:i − xi of H0:i−1 is non-empty as well.
All the bins bi following this property will lead to valid subkeys for ki that we
add to the key factorization using the function get(Hi , bi ). This is done for all
convolution results from the last histogram H0:Ns −1 to the first H0:1 , which then
leads to a full key factorization.
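The decomposition can be sketched as follows (a hypothetical Python rendering of Algorithm 3; here convs[i−2] plays the role of H0:i−1):

```python
def decompose(x, hists, convs, ns):
    """Algorithm 3 (sketch): yield every tuple (x_0, ..., x_{Ns-1}) of
    non-empty bin indexes with x_0 + ... + x_{Ns-1} = x, pruning branches
    for which the bin of index x_{0:i} - x_i in H_{0:i-1} is empty.
    hists = [H_0, ..., H_{Ns-1}]; convs[i-2] holds H_{0:i-1}."""
    def rec(i, x_rest, suffix):
        if i == 1:  # base case: split x_rest between H_0 and H_1 directly
            for x1, c1 in enumerate(hists[1]):
                x0 = x_rest - x1
                if c1 and 0 <= x0 < len(hists[0]) and hists[0][x0]:
                    yield (x0, x1) + suffix
            return
        h_prev = convs[i - 2]  # H_{0:i-1}
        for xi, ci in enumerate(hists[i]):
            rest = x_rest - xi
            if ci and 0 <= rest < len(h_prev) and h_prev[rest]:
                yield from rec(i - 1, rest, (xi,) + suffix)
    yield from rec(ns - 1, x, ())

# Fig. 1 example: bins of H_0 and H_1 whose indexes sum to 6 (H_{0:1}[6] = 45).
H0 = [0, 3, 2, 1, 7, 2, 1]
H1 = [3, 0, 4, 5, 0, 3, 1]
print(sorted(decompose(6, [H0, H1], [], 2)))  # [(1, 5), (3, 3), (4, 2), (6, 0)]
```

Each returned index tuple yields one entry of the key factorization; in the paper's notation, the subkeys stored in bin bi of Hi are then fetched with get(Hi, bi).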
– All the inputs of the rank estimation algorithm (with real key being optional).
– bound start: the starting bound of the enumeration. If this is e.g. set to 2^10, the enumeration will start from the closest bin of H0:Ns−1 such that at most 2^10 keys are contained in the next bins.
– bound stop: the ending bound of the enumeration. If this is e.g. set to 2^32, the enumeration will stop at the closest bin of H0:Ns−1 such that at least 2^32 keys are contained in the next bins.
– test key: this is a boolean value. If set to 1, the enumeration algorithm will
test the keys on-the-fly using an AES implementation, by recombining them
from the factorizations (and stop when the key is found); if set to 0, it will
keep the keys factorized, and the user should implement himself the way he
wants to test the keys in the process key function.
– texts: a 4 × Ns matrix containing two plaintexts and their associated cipher-
texts. These two plaintexts/ciphertexts are used to test on-the-fly if the cor-
rect key is found. This parameter does not have to be initialized if test key is
set to 0.
– to bound: this is a boolean value. If set to 1, the enumeration algorithm will remove (resp. add) Ns/2 to index max (resp. index min) as described in the previous section, to ensure that we enumerate all the keys between bound start and bound stop.
– to real key: additional parameter for comparisons with previous works, which can take four values in [0, 3]. If set to 0, this parameter is ignored. If set to 1, 2, or 3, it allows the user to measure the timing of enumerating up to the real key in different settings, ignoring the values of bound start and test key and enumerating up to the bin that contains the real key. It then requires real key to be initialized. If set to 1, the keys will neither be recombined nor tested. If set to 2, the keys are recombined but not tested with AES (it simply tests if the key is equal to the real one provided by the user). If set to 3, the keys are recombined and tested with the AES. If the real key rank is bigger than bound stop, the enumeration is aborted.
Algorithm Outputs.
– Rank estimation information: returns the rank of the real key according to its rounded log probabilities and the min and max bounds on the
actual rank of the real key. Also returns the time needed for rank estimation
(including the preprocessing time).
– Enumeration information: if the key has been found, returns the rank of
the real key according to its rounded log probabilities and the min and max
bounds on the actual rank of the real key. Also returns the time needed for
preprocessing and the time needed for enumeration.
Examples. Together with our code, we provide different examples of key enu-
meration which are written in a file main example.cpp and listed in Table 2. The
first example (first line in the table) enumerates all the keys of rounded rank
between 2^10 and 2^40 (taking the rounding bounds into account) and tests them
using a reference AES-128 software implementation. The second example enumerates all the keys of rounded rank between 2^0 and 2^40 without testing them.
A user would then have to define the way he wants to implement the process key
function (e.g. by sending the factorized lists to a powerful third party for testing).
The last three examples enumerate all the keys up to the real one if its rounded rank is lower than 2^32. For the third one, the recorded timing will correspond
to the enumeration time with factorization. For the fourth one, the recorded
timing will correspond to the enumeration time including the recombination of
the factorized lists. For the last one, the recorded timing will correspond to the
enumeration time with key testing (with our reference AES-128 implementation)
and thus with recombination.
Table 2. Parameters of the key enumeration examples.

to real key | real key | test key | to bound | texts | bound start | bound stop
0           | optional | 1        | 1        | given | 2^10        | 2^40       (1)
0           | optional | 0        | 0        | −     | 2^0         | 2^40       (2)
1           | needed   | −        | −        | −     | −           | 2^32       (3)
2           | needed   | −        | −        | −     | −           | 2^32       (4)
3           | needed   | −        | −        | −     | −           | 2^32       (5)
5 Performance Evaluations
In this section we evaluate the performance of our enumeration algorithm and
discuss its pros and cons compared to previous proposals. For this purpose, we
consider a setting of simulated leakages for an AES-128 implementation, which
has been previously used for comparison of other enumeration algorithms [4,8,
10]. Namely, we target the output of an AES S-box, leading to 16 leakages of the
form li = HW(S(xi ⊕ ki)) + N for i ∈ [0, 15], with HW the Hamming weight leakage
function and N a random noise following a Gaussian distribution. We stress that
the main experimental criterion influencing the complexity of an enumeration is the rank of the key (which we can control thanks to the noise variance). So other
experimental settings would not lead to significantly different conclusions with respect to the performance of the enumeration algorithm.
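This simulated setting is easy to reproduce. A minimal sketch (using a hypothetical placeholder permutation instead of the actual AES S-box, and Python rather than the released C++ code):

```python
import random

random.seed(42)
SBOX = list(range(256))          # placeholder: a real attack uses the AES S-box
random.shuffle(SBOX)

def hw(v):
    """Hamming weight of a byte."""
    return bin(v).count("1")

def simulate_leakages(key, plaintexts, sigma):
    """l_i = HW(S(x_i XOR k_i)) + N for each of the 16 S-box outputs, with
    N Gaussian of standard deviation sigma (which controls the key rank)."""
    return [[hw(SBOX[x ^ k]) + random.gauss(0, sigma)
             for x, k in zip(pt, key)]
            for pt in plaintexts]

key = [random.randrange(256) for _ in range(16)]
pts = [[random.randrange(256) for _ in range(16)] for _ in range(50)]
leaks = simulate_leakages(key, pts, sigma=1.0)
```

From such leakages, a template or Gaussian-matching step would then produce the 16 lists of subkey log probabilities LPi fed to Algorithm 1; a larger sigma pushes the correct key deeper into the ranking.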
Besides, the two main parameters of our algorithm are the number of bins
and the amount of merging. Intuitively, a smaller number of bins leads to a
faster execution time at the cost of an increased quantization error, and merging
accelerates the enumeration at the cost of larger memory requirements and preprocessing time. All the following experiments were performed with 256, 2048
and 65536 bins, and for an amount of merging of 1, 2 and 3. These values were
chosen to allow comparisons with the results of [8]. That is, 256 (resp. 2048 and
65536) bins is similar to choosing a precision of 8 (resp. 11 and 16) bits for
their algorithm. We limited the amount of merging to 3 because the memory
requirements of this preprocessing then become too large for our AES-128 case study (a merging of 4 would require storing 4 × 2^32 × 8 bytes for the lists of log probabilities in double precision).
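The quoted memory figure is easy to check with a small helper (the handling of the leftover subkeys when a does not divide 16 is our assumption):

```python
def merge_memory_bytes(n_subkeys=16, m_bits=8, a=1, bytes_per_entry=8):
    """Memory for the merged lists of log probabilities in double precision:
    floor(Ns/a) lists of 2^(m*a) entries each, plus one list for any
    leftover subkeys (our assumption when a does not divide Ns)."""
    full, rem = divmod(n_subkeys, a)
    entries = full * 2 ** (m_bits * a)
    if rem:
        entries += 2 ** (m_bits * rem)
    return entries * bytes_per_entry

for a in (1, 2, 3, 4):
    print(a, merge_memory_bytes(a=a))
# a merging of 4 gives 4 * 2**32 * 8 = 137438953472 bytes (128 GiB)
```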
One convenient feature of our algorithm is its ability to easily compute the quantization bounds related to the mapping from floating-point values to integers. Since accuracy
is usually the main concern when enumerating keys, we start our evaluations
by analyzing the impact of the number of bins on these quantization bounds.
For this purpose, we first recall that these quantization errors are related to the
rounding, which was the key idea to improve the performance and parallelism of
recent works on enumeration. Hence, our goal is to find the level of quantization
errors that are acceptable from the enumeration accuracy point-of-view.
Figure 4 illustrates this impact for a precision of 256, 2048 and 65536 bins.
Since the impact of merging is minor for such experiments, we only report the
results with a merge1 preprocessing. The Y-coordinate represents the number of
keys one has to enumerate in order to guarantee an enumeration up to an exact
key rank given by the X-coordinate. Optimal enumeration is shown in black
(for which X = Y ) and corresponds to the application of the algorithm in [10].
The red, blue and green curves respectively represent the maximum, average and
minimum results we found based on a sampling of 1000 enumeration experiments.
These experiments lead to two interesting observations. First, a lower precision
(e.g. 256 bins) leads to larger enumeration overheads for small key ranks, but
these overheads generally vanish as the key ranks increase. Second, increasing the
number of bins rapidly makes the enumeration (rounding) error low enough (e.g. less than one bit), which is typically observed for the 2048- and 65536-bin cases, especially for representative ranks (e.g. beyond 2^32) where the enumeration cost
becomes significant. This is in line with the observations made with histogram-
based rank estimation [7].
Note that other algorithms such as [4,8] lead to similar accuracies with similar
parameters (e.g. our 2048-bin case roughly corresponds to their 11-bit precision
case). Besides, finding bounds on the rounding error should be feasible for [8]
too, though probably more involved than with histograms, for which such bounds
come for free.
[Figure 4 plots: number of keys to enumerate (log2) versus key rank (log2); curves: max bound, mean bound, min bound, optimal (x = y).]
Fig. 4. Enumeration overheads due to rounding errors with merge1 (i.e. no merging).
Upper left: 256 bins. Upper right: 2048 bins. Bottom: 65536 bins. (Color figure online)
5.2 Factorization
Another important feature of our method is its intrinsic ability to output factorized keys instead of a single key at a time. Studying why and how this factorization evolves with our main parameters is important for two reasons. Firstly,
it allows a better understanding of how our main parameters affect the performance of histogram-based enumeration, since a better factorization always
reduces its amount of processing. Secondly, the number of keys per factorization
may be important for the key testing phase, e.g. in case one wants to distribute
the lists of key candidates to multiple (hardware) devices and therefore minimize
the bandwidth of this distributed part of the computations. This second point
will be discussed in Sect. 6.
Intuitively, increasing the amount of merging or decreasing the number of
bins essentially creates more collisions in the initial histograms, which increases the size of the factorized keys and thus accelerates the enumeration process.
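This collision effect can be seen on a toy sketch (hypothetical uniform log probabilities for a single 8-bit subkey; Python, while the released code is C++):

```python
import random

random.seed(0)
log_probs = [random.uniform(-8.0, 0.0) for _ in range(256)]  # one 8-bit subkey

def hist_lin(vals, n_bins, lo=-8.0, hi=0.0):
    """Linearly bin the 256 subkey log probabilities into n_bins bins."""
    h = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in vals:
        h[min(int((v - lo) / width), n_bins - 1)] += 1
    return h

for n_bins in (16, 256, 2048):
    h = hist_lin(log_probs, n_bins)
    print(n_bins, max(h))  # fewer bins -> more subkeys colliding in one bin
```

Merging works in the same direction: with a = 2, the 2^16 merged log probabilities fall into the same number of bins, so collisions grow further.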
Interestingly, increasing the merging does not decrease the accuracy (by contrast with decreasing the number of bins). Hence, this type of preprocessing should (almost) always be privileged, up to the memory limits of the device on which the enumeration algorithm is running.
To confirm this intuition, Fig. 5 illustrates an evaluation of the factorization
results for 256 (left) and 2048 (right) bins, and merging values from 1 to 3.
The top figures represent the number of keys per factorization (Y-coordinate).
The bottom figures represent the memory cost of the corresponding lists in
bytes (Y-coordinate). The dashed curves represent the average value (over 1000
experiments) and the plain curves represent the maximum that occurred over our
1000 experiments. As we can see, using 256 bins leads to a lot of collisions and
the merging value always amplifies the number of collisions. This increases the
number of keys per factorization along with the memory size of the corresponding
lists. The memory cost is anyway bounded by Ns × 2^(m·a), and the number of keys per factorization by 2^n = 2^(Ns·m·a) (this extreme case would occur if all the subkeys had the same rounded probability and thus fell within the same bin of every histogram Hi). We did not plot the results for 65536 bins since few collisions appear (and thus few factorizations).
Note that the algorithm in [8] has a similar behavior as it stores the keys
having the same probabilities within a tree. So although the open source implementation accompanying this previous work recombines the keys, it could also convert
this tree representation into a factorized representation that is more convenient
for distributed key testing with limited bandwidth.
[Figure 5 plots: number of keys per factorization and memory cost in bytes versus key rank (log2); curves: mean and maximum for the different merging levels.]
Fig. 5. Key factorization for different levels of merging and number of bins. Left:
number of keys per factorization (top) and memory cost of the associated list in bytes
(bottom) for 256 bins. Right: same plots for 2048 bins.
Hence, we lose more by iterating over all the non-empty bins than what we gain from the collisions. Additional results for 2048 bins and other merging values
are given in Appendix A.
We next discuss a number of additional issues related to these performance results.
[Figure 6 plots: time in seconds (log) versus key rank (log2); curves: Veyrat et al., our samples, our mean.]
Fig. 6. Execution time for 256 and 65536 bins with factorized lists. The blue stars are
the samples for our algorithm, the red curve is the corresponding mean, and the black
curve is for the optimal enumeration algorithm in [10]. Upper left: 256 bins / merge1 .
Upper right: 256 bins / merge2 . Bottom left: 65536 bins / merge1 . Bottom right: 65536
bins / merge2 . (Color figure online)
6 Application Scenarios
In this section we finally discuss the impact of our findings for an adversary
willing to exploit enumeration in concrete application scenarios, which complements the similar discussion that can be found in [4]. Without loss of generality we focus on the case of the AES. We separate this discussion into the cases of adversaries having either a small or a big computing infrastructure to mount an attack.
[Figure 7 plots: time in seconds (log) versus key depth (log2); curves: Martin et al., Veyrat et al., ours recombined merge1, ours factorized merge1.]
Fig. 7. Execution time for the java implementation of Veyrat et al., Martin et al. and
ours with 8 bits of precision (left) and 11 bits of precision (right). (Color figure online)
In the first case we assume a local attacker with only one “standard” computer. Roughly, his computing power will be bounded by 2^40. In that case, he
will simply use all his available cores to launch the key enumeration with key
testing on-the-fly. Since it is likely that it will take more time to compute an
AES encryption than to output a key candidate, this adversary will prefer a higher precision over a higher number of collisions. In that respect, and depending on the AES implementation's throughput, using 2048 bins could be a good
tradeoff. Indeed, as the adversary’s computing power is bounded and as the AES
computation is the costly part, he should minimize the bounds overhead as seen
in Sect. 5.1. Since the merging value has no impact on the accuracy, this value
should always be maximized (ensuring we do not fall in a case where it slows
down the enumeration process as shown in Sect. 5.3).
By contrast, the strategy will be quite different if we assume the adversary
is an organization having access to a big computing infrastructure. For example, let us assume that this organization has powerful computer(s) to launch the
key enumeration along with many hardware AES implementations with limited
memory. The adversary's computing power is now bounded by a much higher capability (e.g. 2^64). As we saw in Sect. 5.1, the gap between the optimal enumeration and the efficient one (using fewer bins) vanishes as we consider deeper key
ranks. In that case, the attacker should maximize the enumeration throughput
and minimize the bandwidth requirement (per single key), which he can achieve
by decreasing the number of bins and increasing the merging value as much as
possible (e.g. 256 bins with merge3 ). All the key factorizations would then be
sent to the hardware devices for efficient key testing. This could be done easily
since a factorized key can be seen as a small effort distributor as in [9,12].
7 Related Work
A recent work from David et al. available on ePrint [6] allows one to enumerate
keys from real probabilities without the memory issue of the original optimal
algorithm from [10]. This gain comes at the cost of a loss of optimality which is
different from the one introduced by the rounded log-probabilities.
8 Conclusion
This paper provides a simple key enumeration algorithm based on histograms
along with an open source code implementing both the new enumeration method
and the rank estimation algorithm from FSE 2015. In addition to its simplicity,
this construction allows a sound understanding of the parameters influencing
the performance of enumeration based on rounded probabilities. Additional
convenient features include the easy computation of bounds for the rounding
errors, and easy-to-balance parallelization. Our experiments also illustrate how
to tune the enumeration for big distributed computing efforts with hardware
co-processors and limited bandwidth. We believe the combination of efficient key enumeration and rank estimation algorithms is a tool of choice to help evaluators understand the actual security level of concrete devices, and the actual capabilities of computationally enhanced adversaries.
Figure 8 shows timing results for different numbers of bins and amounts of merging. The two figures on the top are the results for 256 (left) and 65536 (right) bins with merge3, which are missing from Fig. 6. As for the 65536-bin case, we saw in Fig. 6 that the merging can be detrimental (e.g. using merge1 was better than using merge2) when not enough collisions occur. However, we see that we still benefit from using merge3 in that case. The three other figures show the results of experiments with 2048 bins and a merge1 preprocessing (middle left), merge2 preprocessing (middle right) and merge3 preprocessing (bottom).
[Figure 8 plots: time in seconds (log) versus key rank (log2).]
Fig. 8. Additional execution times with factorized lists. Upper left: 256 bins / merge3 .
Upper right: 65536 bins / merge3 . Middle left: 2048 bins / merge1 . Middle right: 2048
bins / merge2 . Bottom: 2048 bins / merge3 .
References
1. http://perso.uclouvain.be/fstandae/PUBLIS/172.zip
2. http://www.shoup.net/ntl/
3. Bernstein, D.J., Lange, T., van Vredendaal, C.: Tighter, faster, simpler side-channel
security evaluations beyond computing power. IACR Cryptol. ePrint Arch. 2015,
221 (2015)
4. Bogdanov, A., Kizhvatov, I., Manzoor, K., Tischhauser, E., Witteman, M.: Fast
and memory-efficient key recovery in side-channel attacks. IACR Cryptol. ePrint
Arch. 2015, 795 (2015)
5. Clavier, C., Danger, J.-L., Duc, G., Elaabid, M.A., Gérard, B., Guilley, S., Heuser,
A., Kasper, M., Li, Y., Lomné, V., Nakatsu, D., Ohta, K., Sakiyama, K., Sauvage,
L., Schindler, W., Stöttinger, M., Veyrat-Charvillon, N., Walle, M., Wurcker, A.:
Practical improvements of side-channel attacks on AES: feedback from the 2nd
DPA contest. J. Cryptograph. Eng. 4(4), 259–274 (2014)
6. David, L., Wool, A.: A bounded-space near-optimal key enumeration algorithm for
multi-dimensional side-channel attacks. IACR Cryptol. ePrint Arch. 2015, 1236
(2015)
7. Glowacz, C., Grosso, V., Poussier, R., Schüth, J., Standaert, F.-X.: Simpler and
more efficient rank estimation for side-channel security assessment. In: Leander,
G. (ed.) FSE 2015. LNCS, vol. 9054, pp. 117–129. Springer, Heidelberg (2015)
8. Martin, D.P., O’Connell, J.F., Oswald, E., Stam, M.: Counting keys in parallel
after a side channel attack. In: Iwata, T., et al. (eds.) ASIACRYPT 2015. LNCS,
vol. 9453, pp. 313–337. Springer, Heidelberg (2015)
9. Poussier, R., Grosso, V., Standaert, F.-X.: Comparing approaches to rank esti-
mation for side-channel security evaluations. In: Homma, N. (ed.) CARDIS
2015. LNCS, vol. 9514, pp. 125–142. Springer, Heidelberg (2016). doi:10.1007/
978-3-319-31271-2 8
10. Veyrat-Charvillon, N., Gérard, B., Renauld, M., Standaert, F.-X.: An optimal key
enumeration algorithm and its application to side-channel attacks. In: Knudsen,
L.R., Wu, H. (eds.) SAC 2012. LNCS, vol. 7707, pp. 390–406. Springer, Heidelberg
(2013)
11. Veyrat-Charvillon, N., Gérard, B., Standaert, F.-X.: Security evaluations beyond
computing power. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013.
LNCS, vol. 7881, pp. 126–141. Springer, Heidelberg (2013)
12. Ye, X., Eisenbarth, T., Martin, W.: Bounded, yet sufficient? How to determine
whether limited side channel information enables key recovery. In: Joye, M.,
Moradi, A. (eds.) CARDIS 2014. LNCS, vol. 8968, pp. 215–232. Springer, Hei-
delberg (2015)
Automotive Security
Physical Layer Group Key Agreement
for Automotive Controller Area Networks
1 Introduction
connectivity, there is a need to secure the internal network from external attackers. Current automobile manufacturers utilize traditional network security principles at the periphery (firewalls, access control) to secure CAN access.
However, as demonstrated recently in [28], these techniques may not offer suffi-
cient protection. Further, such methods do not address the fundamental lack of
security in CAN messages.
Generally speaking, the attacks demonstrated thus far may be roughly
divided into two stages. First, the attackers compromise an ECU with a remote
interface and the ability to inject arbitrary messages on the CAN bus. Second, the attackers communicate with a critical ECU over the CAN bus and influence
its behavior. The second stage is enabled by the broadcast nature of the internal
network and the lack of authentication. Typically, any operation in the second
stage requires knowledge of the internal bus protocol and message structure,
which has been simplified by the lack of encryption on the network.
It is clear that any security solution for CAN should include fundamental
protections such as source authentication and packet level encryption. Several
researchers, e.g. [8,14,26], have proposed methods to include these primitives in
the current CAN architecture. One of the fundamental requirements to enable
these primitives is the existence of cryptographic keys shared between the com-
municating ECUs. However, it is challenging to pre-install group keys during
production of the ECU or securely manage the keys over the long lifetime of a
vehicle. Thus, we require an efficient key generation and exchange protocol that
can be executed during the operation of the car to agree on secret keys.
To ensure minimal disturbance to critical operations on the CAN bus, the
key exchange protocol must be bandwidth efficient. Further, it must incur a low
computational overhead to accommodate a variety of ECU capabilities. Since
CAN messages are multicast, it is necessary for the protocol to support the
generation and update of group keys. In this paper, we propose such a protocol
by utilizing the physical properties of the CAN bus.
CAN Security: In this work, we utilize the physical properties of the CAN
bus for the exchange of keys. To the best of our knowledge, the first to utilize such properties for key agreement were Müller and Lothspeich in [23]. Their work forms
the basis for our constructions and it will be reviewed in detail in Sect. 3. Security
for CAN networks, particularly authentication and integrity of messaging, has
been considered previously in [8,11,14,26,27]. However, this line of work assumes
that a shared key already exists.
(Group) Key Agreement: Distribution of group keys for both authenticated
and unauthenticated scenarios has been explored in the literature for well over three decades. Several schemes have been proposed based on varying assumptions of
adversarial behavior and initial setup. One of the earliest results in this direction,
Diffie-Hellman (DH) key exchange [7], uses the hardness of computing discrete-
log over prime order groups to generate keys between a pair of nodes. Steiner
et al. in [25] proposed an extension of DH to groups that uses a mixture of
point-to-point messages and broadcast messages. This was modified by the authors of [16–18,25], who utilize a tree-based structure to improve communication efficiency and support efficient addition/deletion of nodes. The authors of [29] reduce
communication and storage overhead by performing these group operations over
elliptic curves.
Several methods have also been proposed to generate authenticated group
keys, either by extension of the two-party protocols to groups or by using ideas
based on secret-sharing, e.g. [2,4,12,15]. These schemes have several desirable
properties such as provable security, perfect forward secrecy (PFS) and key inde-
pendence. Most schemes, e.g. [2,4,15,25,29], involve expensive group operations
over prime fields, and thus are not suitable for computationally constrained
devices on the CAN bus. Other protocols, e.g. [12], fail to provide security against
an adversary that can compromise the pre-shared secrets. This property is desir-
able for automotive networks, where some nodes may be easily compromised due
to open accessibility or lack of protections. Our protocol provides these security
properties. Our main differentiation from these schemes lies in the utilization of the physical properties of the CAN bus as a substitute for these expensive operations.
1.3 Organization
2 Preliminaries
2.1 Notation
We adhere to the following notation for the paper. We denote a random n-bit value x sampled uniformly from the set {0, 1}^n, consisting of all possible binary strings of n bits, as x ← {0, 1}^n. We denote by x := y the assignment of the
value y to x.
For a binary string x ∈ {0, 1}*, |x| represents the length of the string and x̄ represents its complement. For an index set L ⊆ {1, . . . , |x|}, x(L)
refers to the substring with indices in L. If L consists of a single element, x(L)
simply refers to the Lth bit. Given two strings x, y ∈ {0, 1}∗ , x || y denotes the
concatenation of the strings.
We denote by I(X ∧ Y ) the mutual information between random variables X and Y, where I(X ∧ Y ) = H(X) − H(X|Y ) and H(X) is the entropy of X.
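As a concrete illustration of these information-theoretic quantities (a toy Python sketch over small finite distributions, unrelated to the protocol itself):

```python
import math

def entropy(p):
    """Shannon entropy H(X) in bits of a distribution given as {x: P[X=x]}."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def mutual_information(joint):
    """I(X; Y) = H(X) - H(X|Y), computed from a joint distribution
    given as {(x, y): P[X=x, Y=y]}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    h_x = entropy(px)
    # H(X|Y) = sum over y of P[Y=y] * H(X | Y=y)
    h_x_given_y = 0.0
    for y, pyv in py.items():
        cond = {x: joint.get((x, y), 0) / pyv for x in px}
        h_x_given_y += pyv * entropy(cond)
    return h_x - h_x_given_y

# Y = X shares one full bit; two independent fair bits share none.
copied = {(0, 0): 0.5, (1, 1): 0.5}
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(copied), mutual_information(indep))  # 1.0 0.0
```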
Several adversarial models have been proposed for key exchange protocols in
literature, e.g. CK model [5], or BR model [3]. It is typically assumed that the
adversary can record all messages transmitted on the bus, modify them, or insert
its own messages.
Here, we consider two adversarial scenarios. For the schemes in Sect. 4, we
restrict the adversarial behavior to passive observations. This model, though
unrealistic for typical networks, can be sufficient to capture all attacks on the CAN bus. This is due to the inherent robustness that the CAN bus provides to active
adversaries. Detailed analysis for this is presented in Sect. 6.1.
For schemes in Sect. 5, we consider a powerful adversarial scenario, wherein
the adversary has complete control over the protocol execution. There, we argue that our scheme provides cryptographic guarantees against such adversaries.
Due to the dependence of our scheme on physical properties of the bus, an
adversary with a high resolution oscilloscope may be able to obtain the keys
by probing the bus. Further, since our scheme does not have a practical implementation yet, it has not been analyzed for timing or power side-channels. We
consider such attacks outside the scope of this paper.
90 S. Jain and J. Guajardo
r ← {0, 1}^n
RetVal := r
Proof. The proof follows from the proof of Theorem 1, by applying the computational definition of conditional entropy. It will be included in the extended version of this paper.
In this section we introduce two new group key agreement protocols without
authentication. Though these can be viewed as a special case of the authenticated protocol, they warrant separate treatment due to their different complexity and
security properties.
The protocols presented here require a linear (in the size of the group) number of interactions for initial key establishment. Intuitively, the broadcast nature of the CAN bus makes pairwise PnS interactions between successive nodes sufficient for global key agreement. Once two nodes execute the PnS protocol,
they may be viewed as a single logical entity for any further transmissions by one of these nodes, based on the PnS output. Thus, each successive interaction increases
the size of the logical entity by one, until the whole group is created.
For the remainder of this paper, we assume that the group consists of M
nodes, {nodeN1, . . . , nodeNM}. For simplicity, we assume the communication sequence to be based on the lexicographic order, i.e. nodeN1 -nodeN2 -. . . -
nodeNM . We assume that the protocol initiation is triggered by the gateway
node with information about the group members and parameters. The ECUs
can determine their communication priority in a distributed manner based on
the group configuration.
if s ≠ ∅, RetVal := s
otherwise r ← {0, 1}^n, RetVal := r
tN1,2 := PnS(1^l, nodeN1, nodeN2, fN1(1^l, ∅), fN2(1^l, ∅)).
Each node maintains the temporary value of the PnS as tN1 = tN2 := tN1,2
respectively.
3. The next pair of nodes (nodeN2 , nodeN3 ) executes PnS with the target length
l = |tN2 | to obtain the private results (or update the private results) as
tN2,3 := PnS(1l , nodeN2 , nodeN3 , fN2 (1l , tN2 ), fN3 (1l , ∅)).
4. All nodes prior to the currently active nodes update their private strings as the
output of the PnS result. In this case, as nodeN1 is the only node preceding
(nodeN2 , nodeN3 ), it updates its private string tN1 as the output of the PnS
protocol between nodeN2 and nodeN3 . Thus tN1 = tN2 = tN3 .
5. The protocol is repeated from Step (3) for each successive pair of nodes (nodeN3, nodeN4), . . . , (nodeNM−2, nodeNM−1), (nodeNM−1, nodeNM).
6. All nodes update the shared keys as sNi = sNi || tNi , 1 ≤ i ≤ M .
7. If |sNM| < n, the protocol is repeated from Step (2) using l = (n − |sNM|) · 2^{M−1}.
For each repetition of this step between successive pairs of nodes, all nodes
prior to the active pair can derive the result of the protocol. Thus, once nodeNM
completes execution, all nodes share a common string. Though the implicit
backward-sharing of keys is a desirable property, the overall communication efficiency of the protocol is low. To see this, observe that each successive PnS interaction discards roughly half of the bits, so the initial string length must grow exponentially in the number of nodes.
Physical Layer Group Key Agreement 95
Node Arrival and Departure: At the end of the protocol, each node knows
all random bits selected during the protocol. Thus, the departure of any node
requires re-execution of the complete protocol. For node arrival, it may appear
that the new node can simply be appended to the end of the chain. However, the
execution of PnS with the new node would leak several bits (half on average).
Thus, the whole protocol needs to be re-executed to compensate for the lost bits.
Thus, both addition and deletion operations incur exponential communication cost, i.e. O(n · 2^{M−1}). However, the new key maintains the property of key
independence.
Note that an alternative, more efficient protocol using PnS with information
theoretic security guarantees could be envisaged if we do not require the protocol
to be contributory, i.e. each node contributes to the randomness of the key. A
selected leader can simply engage in pairwise PnS with all other nodes and use
the derived keys as one-time pads to distribute a secret value. It is easy to
see that the computational complexity for key generation, node departure and
arrival for such a scheme would be linear, i.e. O(n · M ). However all protocols
presented here are ‘contributory’ protocols.
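This leader-based alternative can be sketched in a few lines; the pairwise PnS outputs are modelled here as pre-shared pads, an assumption made purely for illustration:

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

n_bytes = 16                                   # n = 128-bit group secret
group_secret = secrets.token_bytes(n_bytes)    # chosen by the leader

# Pads standing in for the leader's pairwise PnS result with each node:
pads = {i: secrets.token_bytes(n_bytes) for i in range(1, 5)}

# The leader one-time-pads the secret for every node...
broadcast = {i: xor(group_secret, pad) for i, pad in pads.items()}
# ...and each node removes its own pad to recover it:
recovered = {i: xor(broadcast[i], pads[i]) for i in pads}
assert all(v == group_secret for v in recovered.values())
```

Because only the leader's randomness enters the key, the scheme is linear in cost but no longer contributory.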
The scheme presented in Sect. 4.1 provides ideal security guarantees at the cost
of efficiency. However, security against computationally bounded adversaries is
sufficient for practical systems. This relaxation enables the utilization of efficient
topologies for key agreement.
For key generation, the nodes are organized in a binary tree structure, e.g. as
shown in Fig. 2. The physical nodes (ECUs) are assigned to the leaf nodes of the
tree. The virtual nodes correspond to logical entities that can be emulated by
any physical leaf node in the subtree rooted at that node. For the algorithms in
this paper, we assume that the physical messages triggered by the virtual node
are sent by the leaf node in the subtree with the highest priority (leftmost node
of the tree in our model). The message flow for the key generation scheme is
detailed in Protocol 4.
Node Departure: A node in the network has knowledge of all the random
values generated and exchanged along the path, denoted as Pdr , from the node
to the root. Thus deletion of a node involves updating all the values known to
the node and re-execution of PnS with the updated values. For example in Fig. 2,
if nodeN4 departs the network, it is sufficient to update the random values at nodeN3 and the virtual nodes V^1_{3,4}, V^2_{1,2}, and the root.
We assume that the departing node broadcasts its identity to the group.
Thus the nodes along Pdr and their siblings flag their values for updating. The
update progresses upwards from the affected leaf node. If a node lies directly
along Pdr , it uses the new PnS result from the child node for all future protocol
execution. All other nodes simply execute the PnS protocol with updated index
values (in f (·)).
At the end of the protocol, the value of the final PnS interaction is used as
the group secret shared by all nodes. The statistical independence of the output
of g(·, ·) for different inputs ensures that the new key is independent of the prior
shared sequence and unknown to the departing node. Further, it can be observed
that the computational complexity of this stage is simply O(n · log M ).
1. Each leaf node initializes the private string tNi ← {0, 1}n , 1 ≤ i ≤ M .
2. The process starts at the leaf nodes. Each pair of siblings execute the complete
PnS protocol with target string length n, and the result is assigned to the private
string of the parent as
t_{V^1_{i,i+1}} := PnS(1^n, nodeNi, nodeNi+1, fNi(1^n, tNi), fNi+1(1^n, tNi+1)),
where i = 1, 3, . . .. Note: Here, we execute the complete PnS protocol. Thus the output is of length n, i.e. |t_{V^1_{i,i+1}}| = n.
3. Step (2) is repeated at the next level of the hierarchy, i.e. the first level of virtual nodes V^1_{i,i+1} here, to generate the private strings for their parents.
4. The process of Step (3) continues till the virtual root node is reached. The
private string of the root node troot is the shared secret key between all nodes.
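The level-by-level aggregation of the steps above can be sketched as follows. The physical PnS run is abstracted by a SHA-256 combination of both contributions; this stand-in is our assumption, not the paper's construction:

```python
import hashlib
import secrets

def pairwise_pns(seed_a: bytes, seed_b: bytes) -> bytes:
    # Abstraction of a complete PnS run producing an n-bit string that
    # depends on both parties' private inputs.
    return hashlib.sha256(seed_a + seed_b).digest()

def tree_agree(leaf_seeds):
    """Siblings run PnS, the result becomes the parent's private
    string; repeat level by level until the root, whose string t_root
    is the group key."""
    level = list(leaf_seeds)
    while len(level) > 1:
        nxt = [pairwise_pns(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an unpaired node is promoted as-is
            nxt.append(level[-1])
        level = nxt
    return level[0]                   # t_root

seeds = [secrets.token_bytes(16) for _ in range(4)]   # t_Ni, 1 <= i <= 4
group_key = tree_agree(seeds)
```

With M leaves the tree has O(log M) levels, matching the O(n · log M) update costs stated for node arrival and departure.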
Node Arrival: Similar to the node departure scenario, a node arrival requires
creation of a path to the root and executing the PnS protocol with siblings of
the nodes along the new path.
For simplicity, we assume that the new node is temporarily assigned the
priority equivalent to a recently departed node or the lowest priority among
existing nodes in the group. This minimizes the changes to the tree structure
and the re-computations required to add a node. In cases where this is not
possible, we may add a node in the ‘pre-assigned’ order and modify the tree
hierarchy accordingly.
Consider the example in Fig. 2, where nodeNM +1 joins the network. This
requires updating the random values at nodeNM and the virtual nodes V^1_{M,M+1}, V^2_{3,4}, and the root. This may be performed in a manner identical to the departure scenario. Thus, it can be observed that the new key will be independent of the old key and the computational complexity is simply O(n · log M).
1. The GW selects a random sequence of n bits, i.e. tgw ← {0, 1}^n, and broadcasts it to
all group members.
2. Each leaf node of the tree initializes the private string tNi = g(Ki , tgw ), 1 ≤ i ≤
M . Here, Ki is the key shared between nodeNi and the GW.
3. The process starts at the leaf nodes. Each pair of siblings execute the PnS
protocol with target string length n, and the result is assigned to the private
string of the parent as
t_{V^1_{i,i+1}} := PnS(1^n, nodeNi, nodeNi+1, fNi(1^n, tNi), fNi+1(1^n, tNi+1)),
where i = 1, 3, . . .. Note that we execute the complete PnS protocol here so that the output is of length n, i.e. |t_{V^1_{i,i+1}}| = n.
4. Step (3) is repeated at the next level of the hierarchy, i.e. the first level of virtual nodes V^1_{i,i+1} here, to generate the private strings for the parents of the virtual nodes.
5. The process of Step (4) continues till the virtual root node is reached. The
private string of the root node troot is the shared secret key between all nodes.
6. The gateway monitors the broadcast messages and verifies the correctness. It
transmits an error message if the verification fails at any stage.
Whenever two nodes engage in the PnS protocol, the first node uses a function
of the random value from the previous stage, while the second node uses a fresh
random value concatenated with some authentication credentials. The value used
by the first node ensures that all nodes prior to it can re-create the PnS execution
and learn its outputs. The value of the second node will be authenticated by the
passively monitoring GW, before it is included in the chain.
To ensure security against compromise of Ki and still ensure verifiability, it is
required that the second node use some fresh randomness, unknown to everyone
else and the key Ki. It should be observed that successful authentication of the messages of the second node requires the PnS protocol to be internally executed at least twice. In the first round, the fresh random value is extracted by the GW
and in the second round it is authenticated. We argue that this will always be
the case as the probability that the PnS protocol is executed only once is 2−n ,
i.e. when all bits of both the parties are complements of each other. Thus the
authentication process does not add communication overhead. Similar to the
previous schemes, the initial key generation has linear complexity, i.e. O(n · M ).
1. The GW begins the protocol by acting as the first link in the PnS chain. The
GW chooses tGW ← {0, 1}n and nodeN1 chooses tN1 ← {0, 1}n to execute the
PnS protocol as
tGW,N1 = PnS(1n , GW, nodeN1 , fGW (1n , tGW , 0, 1), fN1 (1n , KN1 , tN1 , 2)).
2. Next, nodeN2 chooses a random value tN2 ← {0, 1}n . nodeN1 performs PnS
with nodeN2 as
tN1 ,N2 = PnS(1n , nodeN1 , nodeN2 , fN1 (1n , tGW,N1 , 0, 1), fN2 (1n , KN2 , tN2 , 2)).
3. Step (2) is repeated between successive pairs of nodes till the final node is
reached. Denote by tNM −1 ,NM , the result of the final PnS operation. This is the
group key shared by all nodes.
4. The gateway monitors the broadcast messages and verifies the correctness. It
transmits an error message if the verification fails at any stage.
6 Discussion
6.1 Security Properties
Though Sect. 5 describes schemes that are robust against arbitrary active adver-
saries, we argue that such an adversarial model is too restrictive for the automo-
tive scenarios. Operations of our protocol and the architecture of the CAN bus
restrict the actions of the adversary in our system. We argue that an active adver-
sary cannot successfully perform any operation, except eavesdropping, without
detection. Consider the following:
1. Modification of a packet - The properties of the CAN bus allow only one
type of modification to the messages transmitted by the nodes. An adversary
can flip a recessive bit ‘1’ to a dominant bit ‘0’ by transmitting a voltage,
however not vice-versa. It can be verified that this simply results in a mis-
matched key at both parties. This can easily be detected by any key verifica-
tion method.
2. Inserting messages for active nodes - An active node, executing a pair-
wise session of the protocol, only accepts outputs on the bus that result
from superposition of its own signals with that of the partner. Thus consider
an adversary that attempts to compromise a session between nodeN1 and
nodeN2 by inserting a ‘specific’ message for nodeN2 . However, this requires
that the adversary initiate a transmission from nodeN2 . Assume that the
message transmitted by the adversary is madv , and that by nodeN2 is mN2 .
Thus the message recorded by nodeN2 is the logical AND of these messages,
i.e. madv ∧mN2 . However, as the adversary has no control over mN2 , it cannot
insert a ‘specific’ packet. It can however choose and force bits to be 0. This
can be detected by key verification.
3. Inserting messages for passive nodes - In the group protocols, nodes that
have engaged in one pairwise session may update their local parameters based
on the output of the future sessions. An adversary may falsely emulate such
sessions. However, it can be demonstrated that the probability of 'successfully' inserting an n-bit packet, i.e. a packet that is accepted as a valid input by the passive node, is less than (3/4)^n.
Theorem 3. Let the adversary activate the protocol of a passive node by inserting an arbitrary pair of strings b1, b2, where |b1| = |b2| = n, marked with the session identifier of the currently active nodes. The passive nodes detect the adversary with a probability greater than 1 − (3/4)^n.
Proof. Consider the scenario where nodeN2 and nodeN3 are actively engaging
in PnS and nodeN1 is the passive observer. Let tN1,2 be the string at nodeN1
as a result of its interaction with nodeN2 . As described in Protocol 1, nodeN2
uses that string for interaction with nodeN3. Thus nodeN1 can simply verify the bus output and identify 'unexpected' behavior of the adversary as follows. Consider the set of indices L where tN1,2 is 0, i.e. L = {i ≤ |tN1,2| : tN1,2(i) = 0}.
The output on the bus as a result of the first PnS operation, corresponding
to indices in the set L, should be 0. This results simply from the AND operation
of the bus. Any deviation from this results in an error by nodeN1. Thus, for the message by the adversary to be accepted, b1(L) should be 0, i.e. the adversary should be able to estimate the positions of all 0s in the string tN1,2. Thus we obtain
Pr({b1, b2} accepted) = Σ_{k=0}^{n} Pr(Adv covers all 0 positions | |L| = k) · Pr(|L| = k)
= Σ_{k=0}^{n} (1/2)^k · C(n, k) (1/2)^k (1/2)^{n−k} = (3/4)^n.
4. Impersonation - The broadcast nature of the CAN bus ensures that any
transmitted message is delivered to all the nodes. Thus any spoofed or
replayed message by the adversary can be detected by the victim node and
an error flag can be raised. We assume that such detection can occur due to
the session IDs described earlier.
It is clear that Properties 1, 2, 3 are guaranteed by any PnS based key agreement
scheme for the CAN bus. A cryptographic method to guarantee Property (4) is
by utilizing the trust relation established with the gateway. An alternate way is to
increase ECU robustness and include a mechanism to identify spoofed messages
in the individual ECUs. For such cases, schemes that are secure against a passive
eavesdropper would also be secure against an active adversary. Thus the efficient
tree-based structure of Sect. 4 can be utilized to provide security against active
adversaries.
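The dominant-bit behavior underlying Properties (1) and (2) fits in a two-line model: the bus level is the logical AND of all simultaneous transmissions, so an adversary can force a recessive 1 to a dominant 0 but never the reverse. The frame values below are hypothetical:

```python
# Wired-AND model of the CAN bus with two simultaneous transmitters.
m_n2  = 0b10110101   # victim's transmission (hypothetical 8-bit value)
m_adv = 0b11010011   # adversary transmitting at the same time

bus = m_n2 & m_adv   # what every node, including node_N2, records

# Every 0-bit of the victim is still 0 on the bus; only 1s can flip:
assert bus | m_n2 == m_n2
assert bus == 0b10010001
```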
6.2 Performance
One of the main benefits of our approach is its computational advantage over the modular multiplications required for group schemes based on DH or ECDH. The variants of our protocols allow a tradeoff between complexity and bandwidth. Further, our schemes are based on pseudorandom functions,
which can be practically implemented either via the SHA family of hash functions
or a block cipher such as AES. Both these primitives are better suited for resource
constrained devices, compared to modular multiplication.
To understand the performance of our scheme, consider the scenario where the M nodes wish to generate an n-bit key. Clearly, Protocol 3 requires
no cryptographic primitives, but has a high bandwidth overhead. Each round of
PnS using n bit inputs requires transmission of 2n bits on the bus (normal and
the complement). Further, scenarios that use the cryptographic primitives use 2
invocations of the function for each round. We summarize the overhead and some
properties of the protocols in Table 1. The authors of [10] evaluate the performance of various cryptographic primitives on several automotive microcontrollers, namely the S12X, a low-end 16-bit automotive microcontroller from Freescale, and the TriCore chip, a high-end 32-bit microcontroller from the AUDO family of Infineon. The S12X family operates at 40 MHz while the TriCore chips can operate
up to 180 MHz. For generating a key of length n = 128, we may utilize the SHA-256 hash function in place of the PRF. It can be seen that the PRF adds very little overhead: 3.145 ms and 0.045 ms per invocation, respectively, for our target input lengths.
Table 1. Overhead and properties of the protocols.

Property                      Protocol 3         Protocol 4          Protocol 5         Protocol 6
                              Simple unauth.     Tree-based unauth.  Tree-based auth.   Linear auth.
Avg. no. of bits Tx on bus    4n(2^{M−1} − 1)    4n(M − 1)           4n(M − 1)          4nM
Avg. no. of PRF invocations   0                  4(M − 1)            5M − 4             4M − 2
Node addition                 O(n · 2^M)         O(n log M)          O(n log M)         O(n)
Node deletion                 O(n · 2^M)         O(n log M)          O(n log M)         O(nM)
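Plugging illustrative numbers into the Table 1 formulas (M = 8 ECUs and n = 128 bits, values of our own choosing) makes the gap between the chain and the tree variants concrete:

```python
# Average number of bits transmitted on the bus, per Table 1.
M, n = 8, 128
protocols = {
    "Simple unauth. (3)":     4 * n * (2 ** (M - 1) - 1),
    "Tree-based unauth. (4)": 4 * n * (M - 1),
    "Tree-based auth. (5)":   4 * n * (M - 1),
    "Linear auth. (6)":       4 * n * M,
}
for name, bits in protocols.items():
    print(f"{name}: {bits} bits on the bus")
# The unauthenticated chain needs 65,024 bits; the tree variants 3,584.
```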
6.3 Conclusion
References
1. Cryptographic key length recommendations. http://www.keylength.com. Accessed
09 Feb 2016
2. Ateniese, G., Steiner, M., Tsudik, G.: Authenticated group key agreement and
friends. In: Proceedings of Conference on Computer and Communications Security,
pp. 17–26. ACM, New York (1998)
3. Bellare, M., Rogaway, P.: Entity authentication and key distribution. In: Stinson,
D.R. (ed.) CRYPTO 1993. LNCS, vol. 773, pp. 232–249. Springer, Heidelberg
(1994)
4. Bresson, E., Chevassut, O., Pointcheval, D.: Provably secure authenticated group
Diffie-Hellman key exchange. ACM Trans. Inf. Syst. Secur. 10(3), July 2007
5. Canetti, R., Krawczyk, H.: Analysis of key-exchange protocols and their use for
building secure channels. In: Pfitzmann, B. (ed.) EUROCRYPT 2001. LNCS, vol.
2045, pp. 453–474. Springer, Heidelberg (2001)
6. Checkoway, S., McCoy, D., Kantor, B., Anderson, D., Shacham, H., Savage, S.,
Koscher, K., Czeskis, A., Roesner, F., Kohno, T.: Comprehensive experimental
analyses of automotive attack surfaces. In: Proceedings of the USENIX Security
Symposium, August 2011
7. Diffie, W., Hellman, M.: New directions in cryptography. IEEE Trans. Inf. Theor.
22(6), 644–654 (1976)
8. Glas, B., Guajardo, J., Hacioglu, H., Ihle, M., Wehefritz, K., Yavuz, A.: Signal-
based automotive communication security and its interplay with safety require-
ments. In: Embedded Security in Cars (ESCAR), Europe, November 2012
9. Goldreich, O., Goldwasser, S., Micali, S.: How to construct random functions. J.
ACM 33(4), 792–807 (1986)
10. Groza, B., Murvay, S.: Efficient protocols for secure broadcast in controller area
networks. IEEE Trans. Ind. Inf. 9(4), 2034–2042 (2013)
11. Groza, B., Murvay, S., van Herrewege, A., Verbauwhede, I.: LiBrA-CAN: a light-
weight broadcast authentication protocol for controller area networks. In: Pieprzyk,
J., Sadeghi, A.-R., Manulis, M. (eds.) CANS 2012. LNCS, vol. 7712, pp. 185–200.
Springer, Heidelberg (2012)
12. Harn, L., Lin, C.: Authenticated group key transfer protocol based on secret shar-
ing. IEEE Trans. Comput. 59(6), 842–846 (2010)
13. Hastad, J., Impagliazzo, R., Levin, L.A., Luby, M.: A pseudorandom generator
from any one-way function. SIAM J. Comput. 28(4), 1364–1396 (1999)
14. Herrewege, A.V., Verbauwhede, I.: CANAuth - a simple, backward compatible
broadcast authentication protocol for CAN bus. In: ECRYPT Workshop on Light-
weight Cryptography 2011, Louvain-la-Neuve, BE, pp. 229–235 (2011)
15. Katz, J., Yung, M.: Scalable protocols for authenticated group key exchange.
In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp. 110–125. Springer,
Heidelberg (2003)
16. Kim, Y., Perrig, A., Tsudik, G.: Group key agreement efficient in communication.
IEEE Trans. Comput. 53(7), 905–921 (2004)
17. Kim, Y., Perrig, A., Tsudik, G.: Communication-efficient group key agreement.
In: Proceedings of the Annual Working Conference on Information Security, pp.
229–244 (2001)
18. Kim, Y., Perrig, A., Tsudik, G.: Tree-based group key agreement. ACM Trans. Inf.
Syst. Secur. 7(1), 60–96 (2004)
19. Koscher, K., Czeskis, A., Roesner, F., Patel, S., Kohno, T., Checkoway, S., McCoy,
D., Kantor, B., Anderson, D., Shacham, H., Savage, S.: Experimental security
analysis of a modern automobile. In: Proceedings of the Symposium on Security
and Privacy, pp. 447–462, May 2010
20. Law, L., Menezes, A., Qu, M., Solinas, J., Vanstone, S.: An efficient protocol for
authenticated key agreement. Des. Codes Crypt. 28(2), 119–134 (2003)
21. Maurer, U.M.: Information-theoretically secure secret-key agreement by NOT
authenticated public discussion. In: Fumy, W. (ed.) EUROCRYPT 1997. LNCS,
vol. 1233, pp. 209–225. Springer, Heidelberg (1997)
22. Miller, C., Valasek, C.: A survey of remote automotive attack surfaces. Technical report, IOActive Inc., online whitepaper. Accessed 09 Feb 2016
23. Müller, A., Lothspeich, T.: Plug-and-secure communication for CAN. CAN
Newsletter, pp. 10–14, December 2015
24. Rouf, I., Miller, R.D., Mustafa, H.A., Taylor, T., Oh, S., Xu, W., Gruteser, M.,
Trappe, W., Seskar, I.: Security and privacy vulnerabilities of in-car wireless net-
works: a tire pressure monitoring system case study. In: Proceedings of the USENIX
Security Symposium, pp. 323–338, August 2010
25. Steiner, M., Tsudik, G., Waidner, M.: Key agreement in dynamic peer groups.
IEEE Trans. Parallel Distrib. Syst. 11(8), 769–780 (2000)
26. Szilagyi, C., Koopman, P.: Low cost multicast authentication via validity voting
in time-triggered embedded control networks. In: Proceedings of the Workshop on
Embedded Systems Security. ACM, New York (2010)
27. Szilagyi, C., Koopman, P.: Flexible multicast authentication for time-triggered
embedded control network applications. In: Proceedings of the International Con-
ference on Dependable Systems and Networks, pp. 165–174. IEEE, June 2009
28. Valasek, C., Miller, C.: Remote exploitation of an unaltered passenger vehicle. Technical report, IOActive Inc., online whitepaper. Accessed 09 Feb 2016
29. Wang, Y., Ramamurthy, B., Zou, X.: The performance of elliptic curve based group
Diffie-Hellman protocols for secure group communication over ad hoc networks. In:
Proceedings of the International Conference on Communications, vol. 5, pp. 2243–
2248 (2006)
– vatiCAN –
Vetted, Authenticated CAN Bus
1 Introduction
In the highly competitive field of automobile manufacturing, only those have survived who have mastered the art of extreme cost saving by establishing a well-coordinated concert of manufacturers, suppliers, and assemblers. It is this fragile chain that now turns out to be too static when it comes to the cross-sectional changes a radically new, secure architecture would require.
Even though security experts agree that an overhauled, security-focused
architecture is much-needed [2,10,16,17], carmakers simply cannot easily change
established designs. Arguably, two major obstacles are (1) the industry-wide
“never touch a running system” attitude, which originates in legislative burdens
and safety concerns, and (2) the overwhelming complexity of regulations in dif-
ferent jurisdictions of the world, which have fostered the outsourcing to highly
specialized suppliers. This effect is even more amplified due to the tendency of
acquisition rather than in-house innovation. As a result, desired functionalities
are put out to tender and the hardware and software is instead developed by a
long chain of suppliers. For example, Porsche claims to have the lowest manu-
facturing depth in the automotive industry with more than 80 % of production
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 106–124, 2016.
DOI: 10.1007/978-3-662-53140-2 6
cost spent on suppliers' parts, while the remaining 20 % is spent on engine production, assembly, quality control, and the sale of their vehicles [3].
This is in contrast with the needed extensive architectural changes to imple-
ment at least some level of security. The lack of automotive security engineering
principles as opposed to the desktop computer world is not surprising. The most
widely used automotive communication protocol CAN 1 was designed to run in
isolation stowed away behind panels. Faulty hardware or damaged wires were the
only likely threat to such an isolated system. A deliberate manipulation could
only happen with physical access to the inside of the car. While these design prin-
ciples were absolutely adequate for safety requirements back then, modern cars have meanwhile reached an almost incomprehensible complexity and moreover violate the ancient isolation assumptions due to their promiscuous connectivity, such as Bluetooth audio, 3G Internet, WiFi, wireless sensors, RDS2, and TMC3.
It is not only potentially possible but it has been practically shown that vul-
nerabilities in these wireless connections exist [2]. An attacker can then write
arbitrary messages on the CAN bus, which connects the car’s computers, the so-
called Electronic Control Units (ECUs). While the culprit is indeed a vulnerable
ECU that can be compromised, the exploited fact is that the CAN topology is
a bus. This broadcast topology allows any connected device, including a com-
promised ECU, to send arbitrary control messages. The receivers have no way
of verifying the authenticity of the sender or the control data.
2 Background
To address the need to connect different sensors, actuators and their controllers
with each other so that they can make informed decisions, BOSCH developed a
new communication bus in 1983 [7,13]. For example, the widespread traction
control system (TCS4 ) could use CAN to connect the necessary sensors (wheel
rotation) and actuators (brakes). The TCS monitors the wheel spin on each of
the four wheels and intentionally brakes individual wheels to get traction back
(see Fig. 1).
Fig. 1. Bus Topology on the example of the Traction Control System (TCS): Sensors
(wheel RPM) are read as input while actuators (brakes) work as output.
[Figure: CAN frame layout - Priority | Length | Payload | CRC | ACK; all fields except the Payload are metadata.]
4 Also known as ESP - Electronic Stability Program.
Sender               Priority/ID   Length   Data
Airbag               0x050         4        E0 F0 00 FF
Brake (Front Left)   0x1A0         8        10 F0 01 00 00 30 00 E1

Fig. 3. Example CAN messages from the airbag and brake captured on a 2005 Volkswagen Passat B6.
[Fig. 4: Exemplary CAN network - a central gateway interconnects the instrument cluster CAN (500 kBit/s), the diagnostics CAN (OBD, 500 kBit/s), and the infotainment CAN (100 kBit/s).]
The most prominent reasons for having more than one CAN bus are (1) Clear
separation for safety reasons, (2) fault-tolerance in case one bus fails, (3) cost
reduction due to lower speed CAN buses where high-speed CAN buses are not
needed. An exemplary CAN bus network and its interconnectedness is depicted
in Fig. 4.
3 Design
The steadily increasing number of components connected to the CAN bus introduces a high likelihood that any such component may be compromised [4]. Unfortunately, such a compromise may have severe security (and therefore also safety) implications for the automotive network. Of all possible threats, message spoofing remains one of the largest unsolved issues in current CAN bus designs. In the worst case, a compromised component may inject fake CAN messages, e.g., messages that make the parking assistant turn the steering wheel.
In the current CAN design, there is no protection against these threats. First
of all, CAN has no scheme to verify the authenticity of messages, i.e., neither the sender information nor the actual message payload. In principle, an attacker that controls any component on the CAN bus can thus
(a) spoof the identity of any other component (e.g., to escalate privileges), or
(b) send arbitrary payload (e.g., to perform malicious actions).
We assume an attacker who does not have physical access to the car but who can fully compromise one (or a few) wireless ECUs, which usually use several IDs ID_k, k ∈ {0, . . . , 2^11}, to send on the CAN bus. The attacker's goal is to impersonate another ECU with ID_x, where k ≠ x. After the compromise of the ECU with ID_k, the attacker has full flexibility in sending arbitrary messages to the CAN bus, i.e., she can fake sender identities and choose any message payloads.
The attacker is assumed not to compromise the ECU for which she intends
to fake the packets—otherwise the attacker would already be using the genuine
sender and can likely extract any cryptographic key material from the compro-
mised devices and thus fake the identity regardless of any cryptographic scheme.
Instead, we protect the identities of critical devices that might be impersonated
(and not compromised) by an attacker.
In addition, we consider an attacker that can passively monitor the CAN
bus. She can observe and record all messages that have been broadcasted on the
CAN bus. This way, the attacker can also learn about components’ identities.
In this section, we describe the individual features of vatiCAN that address the
challenges C1 through C5.
Key Distribution (C4). According to the HMAC construction, the used cryp-
tographic key is either padded to the length of the hash function’s input block
size, or it is hashed if it is longer than the block size. To avoid an additional
hashing operation, we recommend setting the length of the cryptographic key
exactly to the length of the input block size of the hash function: 128 bits.
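The HMAC key-handling rule referenced above can be reproduced with Python's hmac module. We use SHA-256 as a stand-in hash, whose input block size is 64 bytes; the concrete hash function and block size used in an ECU deployment may differ:

```python
import hashlib
import hmac

key = b"\x11" * 16          # example 128-bit key (hypothetical value)
msg = b"RPM=3000"

# A key shorter than the hash's block size is zero-padded by HMAC, so
# padding it manually yields the identical tag:
tag_short  = hmac.new(key, msg, hashlib.sha256).digest()
tag_padded = hmac.new(key + b"\x00" * (64 - len(key)), msg,
                      hashlib.sha256).digest()
assert tag_short == tag_padded

# A key longer than the block size is hashed first -- the extra
# operation the recommendation above avoids:
long_key = b"\x22" * 80
assert hmac.new(long_key, msg, hashlib.sha256).digest() == \
       hmac.new(hashlib.sha256(long_key).digest(), msg,
                hashlib.sha256).digest()
```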
We also chose not to use one global key, but individual keys per ECU. Of
course, it is also possible to group ECUs that share the same key. This saves
precious flash memory at the expense of reduced security. Typically, ECUs form
logical clusters, e.g., all four wheel rotation sensor ECUs, and all four brake ECUs
form a logical cluster of traction control. vatiCAN leverages this and supports
assigning sets of IDs the same cryptographic key, bonding them to a group.
The most critical aspect is the key provisioning: the process of getting keys
into ECUs in the first place. The cryptographic key for each ECU or group needs
to be provisioned to each ECU that is part of that logical cluster. Generally, two
possibilities exist:
(a) During Assembly. The keys could be generated randomly during produc-
tion and automatically be injected into the flash memory of the correspond-
ing ECUs. However, this makes replacing ECUs after a fault or accident
more involved, as either keys have to be extracted from other ECUs or new
keys need to be generated and distributed to all clusters that the faulty
ECU communicates with.
(b) Key Agreement. Alternatively, keys can also be agreed upon using Diffie-
Hellman key exchange every time the car is turned on. However, this option
has several disadvantages: ECUs that switch on on-demand have to re-run
the key agreement. Moreover, man-in-the-middle attacks are possible without certificates stored in the ECUs, which is not practical. And lastly, multi-party key exchange is non-trivial for an embedded microcontroller.
This is why we chose option (a) to provision the keys during production of an
ECU. In case an ECU needs to be replaced, all other ECUs need to be updated
with the new key. Luckily, software updates through the on-board diagnostics
(OBD) port are commonplace and supported by most ECUs. This allows for
re-programming of keys without physically removing the ECUs from the car.
To protect against malicious key updates by compromised ECUs, the key provi-
sioning could be protected using asymmetric cryptography. For example, signed
updates are a viable option, despite the fact that they are relatively slow.
4 Implementation
We implemented a proof-of-concept of vatiCAN on the popular Atmel AVR
microcontrollers and used off-the-shelf automotive components, such as an
116 S. Nürnberger and C. Rossow
instrument cluster, that act as legacy devices. Our implementation is also avail-
able as download in the form of a library for the popular Arduino development
environment (see Appendix A).
Fig. 6. Bench HIL setup with original instrument cluster ECU and re-engineered ETC
and PCM ECUs.
The corresponding speed that the speedometer dial shows is being sent by the
AS and calculated from the engine RPM and currently selected (simulated) gear.
The vatiCAN library abstracts CAN bus access to sending and receiving mes-
sages, while received messages incorporate the notion of being authenticated or
not. The application using vatiCAN registers known sender IDs for authenti-
cated messages and two callbacks. One callback for receiving messages (legacy
and authenticated) and one for errors (authentication mismatch, timeouts).
For this purpose, the vatiCAN library keeps a list of authenticated sender
IDs and can thus perform a look-up based on the sender ID for every received
CAN frame. vatiCAN then knows whether to expect an additional authentication
CAN frame from that sender. All CAN frames are delivered immediately to the
application using the provided callbacks. However, CAN bus frames originating
from senders not in the list of authenticated senders are flagged insecure,
while frames originating from senders that should authenticate their messages
are flagged as not yet authenticated. If authentication messages are expected,
the HMAC calculation is started in the background. The application code using
vatiCAN can then decide whether to prepare or pre-compute intermediate steps
until the authentication message has arrived and been verified. When the authen-
tication message arrives, vatiCAN automatically compares the computed HMAC to
the received authentication message and either invokes the message reception
callback or the error callback.
Fig. 8. CAN bus frame reception, message processing and the application.
5 Performance Evaluation
takes about 47,600 clock cycles, or 2.95 ms at the clock speed of 16 MHz used.
The total time the reception of a message is deferred due to the calculation of
the HMAC and the comparison with the received authentication CAN frame is 3.3 ms.
This means that the look-up of whether the sender ID is in the list of secure
senders, plus the comparison of the calculated and received HMACs, accounts for
0.35 ms. Note that the application is notified immediately after reception of
the payload and can start precomputations. This means that both the sender and
the receiver can compute the HMAC of the payload in parallel.
Figure 9 illustrates the parallel computation. HMAC_S is the sender's com-
putation of the HMAC including the currently valid nonce, while HMAC_R is the
receiver's computation over the received message Msg. The HMAC computations
take place simultaneously on the sender's and receiver's side, as the receiver
starts computing the HMAC as soon as the plain-text message Msg arrives.
The receiver then compares HMAC_R against HMAC_S to check if they match.
This parallel computation is a major benefit of HMACs compared to asymmetric
cryptographic message signatures, for which the receiver has to wait for the
signature before any validation can begin.
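The sender- and receiver-side tag computations can be sketched as follows. vatiCAN uses a Keccak-based HMAC; this sketch substitutes Python's SHA3-256 as a readily available stand-in, and the frame packing (ID, payload, nonce layout) is an illustrative assumption, not the exact vatiCAN encoding.

```python
import hashlib
import hmac
import struct

def vatican_tag(key: bytes, can_id: int, payload: bytes, nonce: int) -> bytes:
    """Compute a 64-bit authentication tag over (ID, payload, nonce).

    vatiCAN's HMAC is Keccak-based; SHA3-256 and the packing below are
    illustrative stand-ins."""
    msg = struct.pack(">I", can_id) + payload + struct.pack(">I", nonce)
    return hmac.new(key, msg, hashlib.sha3_256).digest()[:8]

key = b"shared-group-key"
payload = b"\x12\x34\x00\x00\x00\x00\x00\x00"   # 8-byte CAN payload

# Sender computes HMAC_S before sending the authentication frame; the
# receiver computes HMAC_R in parallel as soon as Msg and the current
# nonce are available, then compares the two.
hmac_s = vatican_tag(key, 0x280, payload, nonce=42)
hmac_r = vatican_tag(key, 0x280, payload, nonce=42)
assert hmac.compare_digest(hmac_s, hmac_r)   # authentic frame accepted

# A replayed frame carries a stale nonce and fails verification.
stale = vatican_tag(key, 0x280, payload, nonce=41)
assert not hmac.compare_digest(stale, hmac_r)
```

Because the receiver needs only Msg and the shared nonce counter, nothing forces it to wait for the sender's tag before starting its own computation, which is the parallelism shown in Figure 9.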
Next, we measure the round-trip time for legacy vs. vatiCAN-secured CAN
frames. In case of legacy frames, one microcontroller broadcasts an 8-byte CAN
frame and another microcontroller receives the message and immediately broad-
casts another message. The time measured is the interval between sending the
first message and receiving the response. For plain, unauthenticated 8-byte
CAN frames, this ping-pong interval is 1.08 ms, with 2 messages exchanged in
total. For vatiCAN, 2 messages have to be sent and 2 messages have to be
received, and both microcontrollers must calculate 2 HMACs (one for sending,
one for verification). The total time between sending the secure message and
receiving the secure response is 4.5 ms.
Note that the ATmega 8-bit microcontrollers used here represent the lower
bound for an automotive performance evaluation; the common V850 32-bit micro-
controllers offer roughly 2.6× the performance.
be transmitted. Hence, per second a total of 33,840 + 560 · 47 = 60,160 bits
are transmitted. We tested the maximum possible bandwidth under realistic
conditions by flooding the bus with 8-byte CAN frames. Counting whole CAN
frames (payload + header bits), we achieved a throughput of 448 kBit/s. Thus,
the measured 60.2 kBit/s corresponds to 13.4 % bus utilization. With 3 out of
13 senders protected by vatiCAN, 110 of the 560 messages per second are
protected. This accounts for an additional 110 · (47 + 64) bits = 12,210 bits.
Thus, the total bus utilization increases to 72.4 kBit/s (16.2 %).
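The bus-utilization arithmetic above can be restated as a quick sanity check (all numbers taken directly from the text):

```python
# Baseline traffic: 33,840 payload bits plus 47 overhead bits for each
# of the 560 frames per second (values from the measurement above).
base_bits = 33_840 + 560 * 47
print(base_bits)                                    # 60160 bit/s

# Measured usable throughput when flooding the bus with 8-byte frames.
capacity = 448_000                                  # bit/s
print(round(100 * base_bits / capacity, 1))         # 13.4 (% utilization)

# vatiCAN adds one authentication frame (47 header bits + 64 HMAC bits)
# for each of the 110 protected messages per second.
extra_bits = 110 * (47 + 64)
print(extra_bits)                                   # 12210 bit/s
print(round(100 * (base_bits + extra_bits) / capacity, 1))   # 16.2
```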
The total vatiCAN library size is 2152 bytes of AVR instructions, of which
678 bytes are attributed to the Keccak implementation and the remaining
1474 bytes to the surrounding vatiCAN message verification, HMAC, and inter-
rupt logic. In addition, vatiCAN has to store an additional 32-bit word for the
sender's nonce (4 bytes) per sender ID. Even in the unlikely case that an ECU
expects 100 different vatiCAN sender IDs, this would result in a mere
100 × 4 = 400 bytes.
6 Security Evaluation
fast enough. Even though a dual-core 32-bit ARM at 1 GHz (e.g., the infotainment
system) would be about 100× faster, brute-forcing would still take 24 h for a
nonce update interval of 50 ms.
It should be considered that an attacker might successfully compromise an
ECU that stores a key used for vatiCAN. If keys are grouped and shared across
multiple ECUs, the attacker can use such a key to generate valid HMACs for any
sender to which the group key belongs.
7 Related Work
TESLA [14] supports immediate disclosure of the key: each packet incorporates
a hash of the succeeding packet to build a chain. This is clearly unsuited for
the highly dynamic but predictable CAN bus traffic.
Finally, the AUTOSAR standard [6] also supports an HMAC-based message
authentication scheme. In contrast to vatiCAN, AUTOSAR is not backward com-
patible, as it appends an HMAC to higher-level communication units (PDUs).
Moreover, AUTOSAR does not support spoofing prevention.
9 Conclusion
The adoption of new technology in the automotive sector is a cautious and slow
process. It is therefore important to change only a few parts, while the estab-
lished and reliable majority of components can be re-used. vatiCAN is therefore
designed to be backward-compatible, allowing tried and trusted components to
rely on the same CAN messages without modification. Those parts for which a
manufacturer decides to enhance security can easily be protected by means of a
software upgrade that uses vatiCAN in place of the existing CAN bus interface
library. Our vatiCAN implementation delivers real-time protection to ensure
that a compromised ECU cannot be leveraged to forge potentially life-threatening
messages. The induced latency of 3.3 ms for authenticated messages is fast
enough for most situations and demonstrates the practicality and feasibility of
the approach. However, for highly timing-critical functions, such as brakes, a
millisecond delay might be unacceptable.
While the presented results should encourage automakers to implement what
is currently possible given the dated CAN bus architecture, they also show the
need for a novel design to achieve stronger security claims and better
performance.
Acknowledgments. This work was supported by the German Ministry for Education
and Research (BMBF) through funding for the Center for IT-Security, Privacy and
Accountability (CISPA).
A Availability
Our vatiCAN implementation is available as a free software download published
under the LGPL v2. We provide a library for the popular Arduino development
environment for Atmel's AVR microcontrollers. Its source code is publicly
available at http://automotive-security.net/securecan
References
1. Balasch, J., et al.: Compact implementation and performance evaluation of hash
functions in ATtiny devices. In: Mangard, S. (ed.) CARDIS 2012. LNCS, vol. 7771,
pp. 158–172. Springer, Heidelberg (2013)
2. Checkoway, S., McCoy, D., Kantor, B., Anderson, D., Shacham, H., Savage, S.,
Koscher, K., Czeskis, A., Roesner, F., Kohno, T., et al.: Comprehensive experi-
mental analyses of automotive attack surfaces. In: USENIX Security Symposium
(2011)
3. Dr. Ing. h.c. F. Porsche Aktiengesellschaft: Annual report 2004/2005.
http://www.porsche.com/filestore.aspx/default.pdf?pool=uk&type=download&
id=annualreport-200405&lang=none&filetype=default
4. Ebert, C., Jones, C.: Embedded software: facts, figures, and future. Computer 4,
42–52 (2009)
5. Hanselmann, H.: Hardware-in-the-loop simulation as a standard approach for devel-
opment, customization, and production test of ECUs. Technical report, SAE Tech-
nical Paper (1993)
1 Introduction
Outsourced fabrication of integrated circuits (ICs) enables IC design companies
to access advanced semiconductor technology at a low cost. Although it is cost-
effective, the outsourced design faces various security threats since the offshore
foundry might not be trustworthy. Without close monitoring and direct control,
outsourced designs are vulnerable to attacks such as Intellectual Property
(IP) piracy [10] and counterfeiting [3]. A malicious foundry can reverse-engineer
a GDSII layout file to obtain its gate-level netlist and claim ownership of the
hardware IP design, or it can overbuild the IC and sell illegal copies on the mar-
ket. These security threats (also known as supply chain attacks) pose a significant
economic risk to most commercial IC design companies.
Logic locking is a technique that is proposed to thwart the aforementioned
supply chain attacks. The basic idea is to insert additional key-controlled logic
gates (key-gates), key-inputs and an on-chip memory into an IC design to hide
its original functionality, as shown in Fig. 1. The key-inputs are connected to the
on-chip memory and the locked IC preserves the correct functionality only when
a correct key is set to the on-chip memory. To prevent the untrusted foundry
from probing internal signals of a running chip, a tamper-proof chip protection
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 127–146, 2016.
DOI: 10.1007/978-3-662-53140-2_7
128 Y. Xie and A. Srivastava
Fig. 1. Logic locking techniques: (a) Overview; (b) An original netlist; (c) XOR/XNOR
based logic locking; (d) MUX based logic locking; (e) LUT based logic locking.
shall be implemented. Recent years have seen various logic locking techniques
based on different key-gate types and key-gate insertion algorithms. Accord-
ing to the key-gate types, they can be classified into three major categories:
XOR/XNOR based logic locking [8,9,11], MUX based logic locking [7,9,13]
and Look-Up-Table (LUT) based logic locking [1,5,6], as shown in Fig. 1 (b-e).
Among these, XOR/XNOR based logic locking has received the most attention,
mainly due to its simple structure and low performance overhead. Various
XOR/XNOR based logic locking algorithms have been proposed to identify the
optimal locations for inserting the key-gates, such as fault-analysis based inser-
tion [9] and interference-analysis based insertion [8]. The security objective of
these logic locking techniques is to increase the output corruptibility (i.e., pro-
duce more incorrect outputs for more input patterns) given an incorrect key, and
to prevent effective key-learning attacks.
The security of logic locking is threatened if the correct key values of the
key-gates are accessible to a malicious foundry or can be learned within a
practical time. To learn the correct key, Subramanyan et al. [12] proposed
a satisfiability checking based attack (SAT attack) algorithm that can effectively
break most logic locking techniques proposed in [1,2,8,9,11] within a few hours,
even for a reasonably large number of keys. The insight of the SAT attack is to
infer the correct key using a small number of carefully selected input patterns
and their correct outputs observed from an activated functional chip (which can
be obtained from the open market). This set of correct input/output pairs
together ensures that only the correct key is consistent with these observations.
The process of finding such input/output pairs is formalized iteratively as a
sequence of SAT formulas that can be solved by state-of-the-art SAT solvers. In
each iteration, the SAT formulation rules out a set of wrong key combinations
until it reaches a point where all wrong keys have been removed. The SAT attack
is powerful as it guarantees that, upon termination, it always reveals the correct
key. This guarantee cannot be achieved by other attacks on logic locking such as
the EPIC attack [7]. Hence, in this paper we focus on the SAT attack on logic
locking.
Mitigating SAT Attack on Logic Locking 129
Fig. 2. SAT attack mitigation techniques: (a) Adding an AES circuit to increase the
time for solving a SAT formula [14]; (b) Adding our proposed Anti-SAT circuit block
to increase the number of SAT attack iterations.
Since the SAT attack needs to iteratively solve a set of circuit-based SAT
formulas to reveal the correct key, its efficiency is determined by two aspects:
(a) the execution time for solving a SAT formula in one iteration and (b) the
number of iterations required to reveal the correct key. The first aspect depends
on whether a locked circuit is easily solvable by a SAT solver (i.e., finding a
satisfiable assignment for the SAT formula based on this circuit). Based on this
idea, Yasin et al. [14] proposed adding an AES circuit (with a fixed AES key) to
enhance a locked circuit's resistance to the SAT attack. The insight underlying
this proposal is shown in Fig. 2(a). A portion of the key-inputs is first connected
to the AES inputs, and the outputs of the AES are the actual key-inputs into
the locked circuit. As the AES circuit is hard for a SAT solver to solve, the
SAT attack will fail to find a satisfiable assignment for the SAT attack formula
within a practical time limit. Although this approach can effectively increase the
SAT attack execution time, the AES circuit incurs a significant performance
overhead since a standard AES circuit implementation requires a large number
of gates [4]. This makes the approach in [14] impractical.
In this paper, we propose a relatively lightweight circuit block (referred to as
Anti-SAT block) that can be embedded into a design to efficiently mitigate the
SAT attack. The basic structure of our Anti-SAT block is shown in Fig. 2(b).
While a portion of keys (key-inputs A) is connected to the original circuit to
obfuscate its functionality, another portion of keys (key-inputs B) is connected
to the Anti-SAT block to thwart the key-learning of SAT attack. The Anti-SAT
block is designed in a way that the total number of SAT attack iterations (and
thus the total execution time) to reveal the correct key in the Anti-SAT block is
an exponential function of the key-size in the Anti-SAT block. Therefore, it can
be integrated into a design to enhance its resistance to the SAT attack. The
contributions of this paper are summarized as follows:
The SAT attack model [12] assumes that the attacker is an untrusted foundry
whose objective is to obtain the correct key of a locked circuit. The malicious
foundry has access to the following two components:
The key idea of the SAT attack is to reveal the correct key using a small number
of carefully selected inputs and their correct outputs observed from an activated
functional chip. These special input/output pairs are referred to as distinguishing
input/output (I/O) pairs. Each distinguishing I/O pair can identify a subset of
wrong key combinations and all together they guarantee that only the correct
key can be consistent with these correct I/O pairs. This implies that a key
that correctly matches the inputs to the outputs for all the distinguishing I/O
pairs must be the correct key. The crux of the SAT attack is to find this set of
distinguishing I/O pairs by solving a sequence of SAT formulas.
Definition 1 (Wrong key combination). Consider the logic function $Y = f_l(X, K)$
and its CNF SAT formula $C(X, K, Y)$. Let $(X, Y) = (X_i, Y_i)$, where
$(X_i, Y_i)$ is a correct I/O pair. The set of key combinations $WK_i$ which
result in an incorrect output of the logic circuit (i.e., $Y_i \neq f_l(X_i, K)$,
$\forall K \in WK_i$) is called the set of wrong key combinations identified by
the I/O pair $(X_i, Y_i)$. In terms of the SAT formula, this can be represented
as $C(X_i, K, Y_i) = False$, $\forall K \in WK_i$.
$$G := \bigwedge_{i=1}^{\lambda} C(X_{d_i}, K, Y_{d_i}) \qquad (1)$$
where $(X_{d_i}, Y_{d_i})$ is the distinguishing I/O pair from the i-th iteration
and $\lambda$ is the total number of iterations. Basically, it finds a key $K$
which satisfies the correct functionality for all the identified distinguishing
I/O pairs. This must be the correct key since no other distinguishing I/O pairs
exist (see Definition 2).
Take the XOR/XNOR based locked circuit in Fig. 1(c) as an example. In the first
iteration, the I/O pair $(X_{d_1}, Y_{d_1}) = (00, 10)$ is a distinguishing I/O
pair because it can rule out the wrong key combinations $K = (01), (10)$, and
$(11)$, as these key combinations result in the incorrect outputs
$(y_1 y_2) = (11), (00)$ and $(01)$, respectively. Since this single I/O
observation has already ruled out all incorrect key combinations, we have
revealed the correct key $K = (00)$. In general, a small number of correct I/O
pairs (compared to all possible I/O pairs) is usually enough to infer the
correct key [12]. As a result, the SAT attack is efficient because it only
requires a small number of iterations to find these distinguishing
I/O pairs.
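The key-ruling-out step can be illustrated with a toy model of the circuit in Fig. 1(c). As an illustration (not the paper's exact netlist), assume the two XOR key-gates simply XOR the key bits onto the two outputs, i.e., $f_l(x, k) = f(x) \oplus k$ with $f(00) = 10$; filtering all four keys against the single observed pair $(00, 10)$ then leaves only the correct key:

```python
from itertools import product

def f_orig(x):
    """Hypothetical original 2-input, 2-output function with f(00) = 10."""
    table = {(0, 0): (1, 0), (0, 1): (0, 1), (1, 0): (0, 1), (1, 1): (1, 1)}
    return table[x]

def f_locked(x, k):
    """Toy locked circuit: two XOR key-gates flip the two outputs."""
    y = f_orig(x)
    return (y[0] ^ k[0], y[1] ^ k[1])

# Distinguishing I/O pair observed from an activated chip: (00, 10).
x_d, y_d = (0, 0), (1, 0)
consistent = [k for k in product((0, 1), repeat=2) if f_locked(x_d, k) == y_d]
print(consistent)   # [(0, 0)]: keys 01, 10, 11 produce 11, 00, 01 and are ruled out
```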
$$F_i := C(X, K_1, Y_1) \wedge C(X, K_2, Y_2) \wedge (Y_1 \neq Y_2) \wedge \Big(\bigwedge_{j=1}^{i-1} C(X_{d_j}, K_1, Y_{d_j})\Big) \wedge \Big(\bigwedge_{j=1}^{i-1} C(X_{d_j}, K_2, Y_{d_j})\Big) \qquad (2)$$
where $C(X, K, Y)$ is the SAT formula (CNF form) for a locked circuit and
$(X_{d_{1...i-1}}, Y_{d_{1...i-1}})$ are the distinguishing I/O pairs found in the pre-
vious $i - 1$ iterations. If satisfiable, an assignment for the variables $X$, $K_1$, $K_2$,
$Y_1$, $Y_2$ will be generated. The first line of formula (2) evaluates the circuit
functionality for a specific $X = X_{d_i}$ at two different key values $K_1$ and $K_2$
such that the outputs are different (see $Y_1 \neq Y_2$). This guarantees that the
input $X = X_{d_i}$ is capable of identifying two keys $K_1$, $K_2$ which produce
different outputs; hence at least one of the two keys must be wrong. This in
itself is not enough to call $X = X_{d_i}$ a distinguishing input, because a previous
iteration may have found another input assignment that could have differentiated
between $K_1$ and $K_2$. According to Definition 2, a distinguishing input in the
i-th iteration must find "unique" wrong key combinations that have not been
identified by the previous $i - 1$ distinguishing I/O pairs. This condition is
checked by the SAT clauses in the second line, where $X_{d_j}$ is the distinguish-
ing input identified in the previous j-th iteration and $Y_{d_j}$ is the corresponding
correct output. This correct output is known from the activated functional chip
obtained from the open market. The clauses in the second line guarantee that
the keys $K_1$ and $K_2$, which result in "different" outputs in line 1 of the
formula, produce the "correct" outputs for all previous distinguishing I/O pairs.
Hence, in this iteration we identify at least one incorrect key combination which
previous iterations could not. Therefore, by Definition 2, the input $X_{d_i}$
(obtained from the SAT solver) and the corresponding "correct" output
$Y_{d_i} = eval(X_{d_i})$ (obtained from the activated chip) represent the i-th
distinguishing I/O pair.
The SAT attack algorithm is shown in Algorithm 1. Basically, it starts by
solving line one of formula (2), and as the iterations progress it adds the
clauses comprising line two of formula (2). It stops when the resulting SAT
formula is unsatisfiable, indicating that no further distinguishing I/O pairs
exist. The correct key is then obtained by finding a key value which satisfies
the correct I/O behavior of all the distinguishing I/O pairs. This algorithm is
guaranteed to find the correct key; please refer to [12] for further theoretical
details.
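Algorithm 1 itself is driven by a SAT solver, but its control flow can be emulated on a toy circuit by exhaustive search: in each iteration, look for an input X and two keys K1, K2 that agree on all previously found distinguishing I/O pairs yet disagree on X; stop when no such triple exists. The sketch below only illustrates the loop structure (brute force replaces the SAT solver, and the locked circuit is hypothetical):

```python
from itertools import product

def locked(x, k):
    """Hypothetical 2-input locked circuit: an AND gate whose inputs pass
    through two XOR key-gates (illustration only)."""
    return (x[0] ^ k[0]) & (x[1] ^ k[1])

TRUE_KEY = (0, 1)
oracle = lambda x: locked(x, TRUE_KEY)   # stands in for the activated chip

keys = list(product((0, 1), repeat=2))
inputs = list(product((0, 1), repeat=2))
dio = []                                 # distinguishing I/O pairs so far

while True:
    found = None
    for x, k1, k2 in product(inputs, keys, keys):
        # Line 2 of formula (2): both keys reproduce all previous pairs.
        consistent = all(locked(xd, k1) == yd and locked(xd, k2) == yd
                         for xd, yd in dio)
        # Line 1 of formula (2): the two keys disagree on input x.
        if consistent and locked(x, k1) != locked(x, k2):
            found = x
            break
    if found is None:                    # "UNSAT": no further pairs exist
        break
    dio.append((found, oracle(found)))   # query the correct output

# Final step of Algorithm 1: any key consistent with all pairs is correct.
candidates = [k for k in keys if all(locked(xd, k) == yd for xd, yd in dio)]
print(candidates)                        # [(0, 1)], the true key
```

Each pass through the outer loop mirrors one SAT attack iteration; the break with `found is None` corresponds to the unsatisfiable formula that terminates the attack.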
$$T = \sum_{i=1}^{\lambda} t_i \qquad (3)$$
where $\lambda$ is the total number of SAT attack iterations and $t_i$ is the SAT
solving time for the i-th iteration. Consequently, the SAT attack can be
mitigated if $t_i$ is large and/or $\lambda$ is large.
The SAT solving time $t_i$ depends on benchmark characteristics as well as on
the efficiency of the SAT solver used. To increase $t_i$, Yasin et al. [14]
proposed adding an AES circuit to protect the locked circuit, as shown in
Fig. 2(a). As the AES circuit is hard for a SAT solver to solve, the SAT attack
will fail to find a satisfiable assignment for the SAT attack formula. Although
this approach is effective, the AES circuit leads to a large performance overhead
since a standard AES circuit implementation requires a large number of gates [4].
Increasing the number of iterations $\lambda$ is another approach to mitigating
the SAT attack. $\lambda$ depends on the key-size and the key locations in the
locked circuit. However, simply increasing the key-size or trying different key
locations may not effectively thwart the SAT attack. As shown in the SAT attack
results [12], even with a large number of keys (50 % area overhead), for six
previously proposed key-gate insertion algorithms [1,2,8,9,11], 86 % of the
benchmarks on average can still be unlocked within 10 h.
Fig. 3. Anti-SAT block configuration: (a) An Anti-SAT block that always outputs 0
if key values are correct; (b) An Anti-SAT block that always outputs 1 if key values
are correct. (c) Integrating the Anti-SAT block into a circuit.
be 1 for some inputs (XOR gate behaves as an inverter) and thus can produce
a fault in the original circuit.
In the subsequent sections, we provide details on constructing the Anti-SAT
block (i.e., the functionality of g) and its impact on the SAT attack complexity.
We provide a rigorous mathematical analysis which gives a provable lower bound
on the number of SAT attack iterations. For some constructions of g, this lower
bound is exponential in the number of keys, thereby making the SAT attack
complexity very high. In the remainder of this paper, we take Fig. 3(a) as the
configuration in our analysis and experiments (without loss of generality).
Now we describe how the Anti-SAT block can be constructed. Note that this
construction is not necessarily unique, and other constructions may also be
feasible. Consider the circuit illustrated in Fig. 4(a). Here a set of key-gates
(XORs) is inserted at the inputs of two logic blocks, so $B1 = g(X \oplus K_{l1})$
and $B2 = \overline{g}(X \oplus K_{l2})$, where $|K_{l1}| = |K_{l2}| = n$. Hence
the key-size is $2n$. The outputs $B1$ and $B2$ are fed into an AND gate to
produce an output $Y$. As a result, we have
$Y = g(X \oplus K_{l1}) \wedge \overline{g}(X \oplus K_{l2})$.
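This construction can be sketched in a few lines, using the n-input AND / NAND instantiation of g and its complement that the paper later calls the baseline Anti-SAT block (an illustrative sketch, not the paper's gate-level implementation):

```python
from itertools import product

def g(bits):                 # n-input AND: output-one count p = 1
    return int(all(bits))

def g_bar(bits):             # complementary function: n-input NAND
    return 1 - g(bits)

def anti_sat(x, kl1, kl2):
    """Y = g(X xor Kl1) AND g_bar(X xor Kl2), as in Fig. 4(a)."""
    b1 = g([xi ^ ki for xi, ki in zip(x, kl1)])
    b2 = g_bar([xi ^ ki for xi, ki in zip(x, kl2)])
    return b1 & b2

n = 4
# Correct keys: bit i of Kl1 equals bit i of Kl2, so both blocks see the
# same argument L and g(L) AND g_bar(L) is identically 0.
k = (1, 0, 1, 1)
assert all(anti_sat(x, k, k) == 0 for x in product((0, 1), repeat=n))

# A wrong key pair leaks a 1 for some input, which the XOR gate at the
# Anti-SAT output turns into a fault in the original circuit.
wrong = ((1, 0, 1, 1), (0, 0, 1, 1))
outputs = [anti_sat(x, *wrong) for x in product((0, 1), repeat=n)]
print(sum(outputs))          # 1: exactly one input exposes this wrong key (p = 1)
```

With p = 1, each wrong key is betrayed by only one of the 2^n inputs, which is exactly why each distinguishing input can rule out so few wrong keys and the iteration count explodes.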
Note that here we use only XOR gates as key-gates for ease of explanation.
The key-gates used in Fig. 4(a) could be either XOR or XNOR gates based on a
user-defined key. Similar to conventional XOR/XNOR based logic locking [9], if
a correct key-bit is 0, the key-gate can be an XOR or an XNOR + inverter. If
the key-bit is 1, the key-gate can be an XNOR or an XOR + inverter. The usage
of inverters removes the association between key-gate types and key-values
(e.g., the correct key into an XOR gate can now be either 0 or 1). Moreover, as
discussed in [9], the synthesis tools can "bubble push" the inverters to their
fan-out gates, so an attacker cannot easily identify which inverters are part of
the key-gates. Besides, the XOR/XNOR gates can be synthesized using other
gate types. Combined with the obfuscation techniques that will be discussed in
Sect. 4.4, the attacker cannot obtain the correct key-values based on the types
of gates connected to the key-inputs.
Fig. 4. Anti-SAT block construction: (a) basic Anti-SAT block construction and (b)
one possible construction to ensure a large number of SAT attack iterations.
Since the Anti-SAT block has $2n$ keys, the total number of wrong key com-
binations is $2^{2n} - c$, assuming there exist $c$ correct key combinations.
Because a correct key input (for Fig. 4(a)) occurs when the i-th key bit from
$K_{l1}$ and the i-th key bit from $K_{l2}$ have the same value, the number of
correct key combinations is $c = 2^n$. Here we analyze the complexity of a SAT
attack on the Anti-SAT block construction of Fig. 4(a) (assuming this is the
circuit being attacked to decode the $2n$ key bits).
Terminology. Given a Boolean function $g(L)$ with $n$ inputs, assume there
exist $p$ input vectors that make $g$ equal to one (denote $p$ as the output-one
count, $1 \le p \le 2^n - 1$). We can then classify the input vectors $L$ into
two groups $L^T$ and $L^F$, where one group makes $g = 1$ and the other makes
$g = 0$. The function $g$ and its complementary function $\overline{g}$ are used
to construct the Anti-SAT block as shown in Fig. 4(a).
Proof: As described in Sect. 2, the SAT attack algorithm iteratively finds
a distinguishing I/O pair $(X_{d_i}, Y_{d_i})$ to identify wrong key combinations
in the Anti-SAT block until all wrong key combinations are identified. In the
i-th iteration, the corresponding distinguishing I/O pair can identify a subset
of wrong key combinations, denoted as $WK_i$. Notice that for any input
combination (including the distinguishing inputs $X_{d_i}$), the correct output
(when provided the correct key) is 0. Therefore, a wrong key combination
$K = (K_{l1}, K_{l2}) \in WK_i$ which was identified by $(X_{d_i}, Y_{d_i})$
must produce the Anti-SAT block output 1. This condition is described below.
$$p \cdot (2^n - p) \ge UK_i \qquad (6)$$

$$\lambda \cdot p \cdot (2^n - p) \ge \sum_{i=1}^{\lambda} UK_i \qquad (7)$$

Since $\sum_{i=1}^{\lambda} UK_i$ is the total number of incorrect key combinations,
its value must be $2^{2n} - c$. The inequality above can be rewritten as follows:

$$\lambda \ge \lambda_l = \frac{2^{2n} - c}{p(2^n - p)} \qquad (8)$$
Here $\lambda_l$ is the lower bound on $\lambda$. As noted for Fig. 4(a), a
correct key occurs when the i-th bit of $K_{l1}$ and the i-th bit of $K_{l2}$
have the same value; hence $c = 2^n$. When $p \to 1$ or $p \to 2^n - 1$, the
lower bound becomes:

$$\lambda_l = \frac{2^{2n} - 2^n}{p(2^n - p)} \to \frac{2^{2n} - 2^n}{1 \times (2^n - 1)} = 2^n \qquad (9)$$
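As a numeric check of Eq. (8) under these assumptions ($c = 2^n$, $p = 1$):

```python
def lambda_lower(n, p, c=None):
    """Lower bound (8) on the number of SAT attack iterations."""
    if c is None:
        c = 2 ** n           # correct-key count of the Fig. 4(a) construction
    return (2 ** (2 * n) - c) / (p * (2 ** n - p))

for n in (8, 12, 16):
    print(n, lambda_lower(n, 1))   # 256.0, 4096.0, 65536.0, i.e. 2^n
```

The bound grows exponentially in n, consistent with the iteration counts reported in Table 2.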
Table 1. Impact of output-one count p on the security level of the n = 16-bit baseline
Anti-SAT block. Timeout is 10 h.
Table 2. Impact of input-size n on the security level of the baseline Anti-SAT block
(output-one count p = 1). Timeout is 10 h.

n             8         10       12        14        16
# Iteration   255       1023     4095      16383     -
Time (s)      1.14569   20.024   324.727   4498.03   timeout
combined with our Anti-SAT block designs for achieving foolproof logic locking.
Moreover, the key-gates inserted into the original circuit make the Anti-SAT
block less distinguishable from the original circuit. Without these key-gates
in the original circuit, an attacker would have less difficulty locating the
Anti-SAT block, since inspecting the key-inputs would reveal that all of them
feed the Anti-SAT block.
In this section, we evaluate the security level of our proposed Anti-SAT blocks.
The security level is evaluated by the number of SAT attack iterations as well as
the execution time to infer the correct key. The SAT attack tools and benchmarks
used are from [12]. The CPU time limit is set to 10 h, as in [12]. The experiments
were run on an Intel Core i5-2400 CPU with 16 GB RAM.
We first evaluate the security level of the Anti-SAT block with respect to
different design parameters: (a) the input-size n and output-one count p of the
function g, and (b) the Anti-SAT block location. The n-bit baseline Anti-SAT
(BA) block is constructed using an n-input AND gate and an n-input NAND
gate (output-one count p = 1) as $g$ and $\overline{g}$ to ensure a large number
of iterations. Note, however, that this is not the only possible choice for $g$
and $\overline{g}$: as shown in Sect. 4.2, any other function g with sufficiently
large n and sufficiently small (or large) p also guarantees a large number of
iterations. The key-gates (XOR/XNOR) are inserted at the inputs of $g$ and
$\overline{g}$ with key-size $|K_{l1}| = |K_{l2}| = n$. The obfuscation
techniques proposed in Sect. 4.4 are not applied here; they will be evaluated in
Sect. 5.3 when the Anti-SAT block is integrated into a circuit.
Table 3. Impact of the Anti-SAT block location on the security level of baseline n-bit
Anti-SAT blocks (n = 8, 12, 16) inserted into the c1355 circuit. Timeout is 10 h. The
random case is averaged over 5 trials.

|K_l1| = |K_l2| = n            8        12        16
Random   Avg. # Iteration      151      1748      11461
         Avg. Time (s)         1.4296   162.529   10272.4
Secure   # Iteration           255      4095      -
         Time (s)              3.452    759.924   timeout
Input-size n and output-one count p. As shown in Eq. (8), the lower bound
$\lambda_l$ on the number of SAT attack iterations to unlock the Anti-SAT block
is related to the input-size n and the output-one count p of the function g.
If n is fixed, $\lambda_l$ is maximized when $p \to 1$ or $p \to 2^n - 1$. To
evaluate the impact of p, we replace some 2-input AND gates with 2-input OR
gates in an n = 16-bit baseline Anti-SAT block to gradually increase p. Table 1
illustrates the impact of p on the security level of the 16-bit Anti-SAT block.
For $p = 1$ and $p = 2^{16} - 1 = 65535$, the SAT attack algorithm fails to
unlock the Anti-SAT block within 10 h. This is because it requires a large
number of iterations ($\approx 2^{16}$) to rule out all the incorrect key
combinations. As $p \to 2^{16}/2$ (the worst case), the SAT attack begins to
succeed with fewer and fewer iterations and less execution time. This result
validates that when p is very small or very large for a fixed n, the number of
iterations $\lambda$ will be large and the SAT attack will fail within a
practical time limit.
Moreover, as described in Sect. 4.2, $\lambda_l$ is an exponential function of n
when p is very low ($p \to 1$) or very high ($p \to 2^n - 1$). Table 2 shows the
exponential relationship between $\lambda$ and n when p = 1 for five baseline
Anti-SAT blocks (n = 8, 10, 12, 14, 16). It can be seen that as n increases, the
measured SAT iterations and execution time grow exponentially. Moreover, the
number of iterations validates that the lower bound $\lambda_l$ is tight when
p = 1, as discussed in Sect. 4.2.
Anti-SAT Block Location. As noted in Sect. 4.3, the Anti-SAT block location
may impact its security in terms of SAT attack iterations and execution time.
We compare two approaches for integrating the Anti-SAT block with the original
circuit, namely secure integration and random integration. For the secure
integration, the n inputs X of the Anti-SAT block are connected to n PIs of the
original circuit. The output Y is connected to a wire which is randomly selected
from the wires with the top 30 % observability. The randomness of the location
of Y helps hide the output of the Anti-SAT block. For the random integration,
the inputs X are connected to random wires of the original circuit, and the
output Y is connected to a random wire. In both cases, the wire for Y has a
later topological order than the wires for X to prevent combinational loops.
Table 3 compares the two integration approaches when three baseline Anti-SAT
blocks of different sizes (n = 8, 12, 16) are integrated into the
c1355 circuit from ISCAS85. It can be seen that for all three Anti-SAT blocks,
secure integration is more secure than random integration, as the former requires
more iterations (∼2×) and more execution time (∼3×) for the SAT attack algorithm
to reveal the key. Therefore, in the following experiments, we adopt secure
integration as the way to integrate the Anti-SAT block into a circuit.
We evaluate the security level of the Anti-SAT block when it is applied to 6
circuits of different sizes from the ISCAS85 and MCNC benchmark suites. The
benchmark information is shown in Table 4. We compare the following three logic
locking configurations:
– TOC13: The original circuit is locked using TOC13 logic locking algorithm [9]
which inserts XOR/XNOR gates into the circuit to obfuscate its functionality.
Figure 5 shows that TOC13 is effective in increasing the output corruptibility
(in terms of the Hamming distance (HD) between the output of an original
circuit and a locked circuit given a random key). Also, it can be seen that 5 %
overhead (ratio between # key-gates and # original gates) is roughly enough
to approach 50 % HD for all benchmarks.
– TOC13(5 %) + n-bit BA: In this configuration, the original circuit is locked
with TOC13 at 5 % overhead. In addition, we integrate an n-bit baseline Anti-
SAT (BA) block into the locked circuit using the secure integration, i.e., the n
inputs of the Anti-SAT block are connected to n PIs of the original circuit, and
the output of the Anti-SAT block is connected to a wire in the original circuit
which is randomly selected from the wires with the top 30 % observability.
For an n-bit BA, the key-size is $k_{BA} = 2n$ because $2n$ keys are inserted
at the inputs of $g$ and $\overline{g}$.
– TOC13(5 %) + n-bit OA: The obfuscation techniques proposed in Sect. 4.4 are
applied to make the baseline Anti-SAT block less distinguishable from the
locked circuit. In our experiment, we insert n MUXes to increase the
inter-connectivity between an n-bit baseline Anti-SAT block and the locked
circuit. We also insert an additional n XOR/XNOR gates at random internal
wires of the logic blocks g and ḡ to obfuscate their functionality and prevent
the detection of complementary signal pairs. Thus, the key-size of an n-bit
obfuscated Anti-SAT (OA) block is kOA = 4n.
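The baseline construction can be condensed into a toy model. The sketch below is our simplification, assuming g is the n-input AND function, so the block computes Y = g(X ⊕ K1) ∧ ¬g(X ⊕ K2); the function and variable names are illustrative, not from the paper:

```python
# Toy model of an n-bit baseline Anti-SAT block. Assumption: g is the
# n-input AND, so Y = g(X xor K1) AND NOT g(X xor K2). Integers encode
# n-bit vectors; names are illustrative, not from the paper's implementation.

def anti_sat(x, k1, k2, n):
    all_ones = (1 << n) - 1
    g1 = (x ^ k1) == all_ones          # g(X xor K1): AND of all n inputs
    g2 = (x ^ k2) == all_ones          # g(X xor K2)
    return int(g1 and not g2)          # Y = g1 AND (complement of g2)

n = 4
# With a correct key (here: K1 == K2), Y is constantly 0, so the locked
# circuit behaves correctly for every input X.
assert all(anti_sat(x, 0b1010, 0b1010, n) == 0 for x in range(2 ** n))

# A wrong key pair corrupts Y for exactly one input pattern, which is why
# each SAT-attack iteration can rule out only a small fraction of keys.
errors = [x for x in range(2 ** n) if anti_sat(x, 0b0000, 0b1111, n) == 1]
assert errors == [0b1111]
```

Since each wrong key pair with K1 ≠ K2 errs on a single input pattern, 2n key bits produce on the order of 2^n distinguishing inputs for the attack to work through.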
We compare the security level of the three configurations when the same number
of key bits is used in each. We investigate how sensitive the SAT attack
complexity is to an increase in key-size. For TOC13, the additional keys are
inserted into the original circuit. For TOC13(5 %) + n-bit BA/OA, the additional
keys are used in the Anti-SAT block, so increasing the key-size also increases
the Anti-SAT block size (in terms of input-size n), because we construct the
BA and OA with kBA = 2n and kOA = 4n, respectively. In this experiment, we
integrate baseline Anti-SAT blocks of input-size nBA = 8, 10, 12, 14, 16, 18, 20
142 Y. Xie and A. Srivastava
Fig. 6. SAT attack results on six benchmarks with three logic locking configurations:
TOC13 only, TOC13(5 %) + BA, and TOC13(5 %) + OA. The timeout is 10 h (3.6 × 10^4 s).
The dashed line in the top figure (execution time) is the curve-fitting result where
the SAT attack timed out beyond a certain key-size.
The SAT attack results for the three configurations with respect to increasing
key-size are shown in Fig. 6. For each benchmark, the top plot shows the SAT
attack execution time and the bottom plot shows the number of SAT attack
iterations, both in log scale. It can be seen that for TOC13, increasing the
key-size does not effectively increase the SAT attack complexity: all benchmarks
locked with TOC13 can be unlocked using at most 48 iterations and 8.48 s. On
the other hand, when the Anti-SAT blocks are integrated, the SAT attack
complexity increases exponentially with the key-size of the Anti-SAT block.
This holds for both the baseline and the obfuscated Anti-SAT block. The results
also show that, for the same key-size, the growth rate for the n-bit OA is
slower than for the n-bit BA. This is because for the OA a portion of the key
bits is used to obfuscate the design, so the resulting OA is half the size of
the BA (in terms of n), as described earlier. Finally, for all benchmarks, the
SAT attack fails to unlock the circuits within 10 h when a 14-bit BA (kBA = 28)
or a 10-bit OA (kOA = 40) is inserted.
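The exponential trend can be reproduced at toy scale. The following sketch is our simplification, not the SAT-solver-based attack of [12]: it enumerates candidate key pairs explicitly (feasible only for tiny n) and counts how many distinguishing-input iterations are needed to prune all wrong keys of a baseline Anti-SAT block with g = AND:

```python
# Toy oracle-guided attack on the baseline Anti-SAT block (g = n-input AND).
# This brute-force enumeration stands in for the SAT-based attack of [12];
# it is only meant to exhibit the 2^n iteration count, not to scale.
from itertools import product

def y(x, k1, k2, n):
    top = (1 << n) - 1
    return int(((x ^ k1) == top) and ((x ^ k2) != top))

def attack_iterations(n):
    correct = lambda x: 0                      # the correct key makes Y constantly 0
    candidates = set(product(range(2 ** n), repeat=2))
    iterations = 0
    for x in range(2 ** n):
        outs = {y(x, k1, k2, n) for k1, k2 in candidates}
        if len(outs) > 1:                      # x is a distinguishing input (DIP)
            iterations += 1
            candidates = {k for k in candidates
                          if y(x, *k, n) == correct(x)}
    return iterations, candidates

for n in (2, 3, 4):
    iters, remaining = attack_iterations(n)
    assert iters == 2 ** n                        # iterations grow exponentially in n
    assert all(k1 == k2 for k1, k2 in remaining)  # survivors are functionally correct
```

Each distinguishing input eliminates only the key pairs sharing one K1 value, so 2^n iterations are unavoidable in this model, matching the exponential curves in Fig. 6.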
different types of gates, it is difficult for an attacker to obtain the exact size
of the Anti-SAT block. Therefore, we perform the partitioning algorithm
assuming a certain area estimation error for the Anti-SAT block and analyze its
impact on the attack results.
As shown in Table 5, when the area estimation error is 0 % and no MUXes are
inserted, the Anti-SAT block can be isolated in four circuits (c1355, c1908,
dalu, and i8); in these circuits, the percentage of Anti-SAT gates isolated by
the partitioning algorithm is almost 100 %. However, as the area estimation
error increases, the partitioning fails to isolate the Anti-SAT block.
Moreover, when n = 14 MUXes are inserted to increase the inter-connectivity
between the Anti-SAT block and the locked circuit, the percentage of the
Anti-SAT block that is isolated is almost 0 % for all four assumed area
estimation errors. This is because the number of interconnections between the
Anti-SAT block and the locked circuit is increased, so separating them would
result in a large cut-size, which the partitioning algorithm avoids.
Fig. 7. SAT attack execution time (in log scale) and area overhead for the des circuit
integrated with an n-bit obfuscated Anti-SAT block. The original circuit is locked with
TOC13 (5 % overhead). The blue dashed line is the fitting curve for CPU time.
proposed in [14], which inserts an AES circuit to defend against the SAT attack,
our proposed technique has much lower overhead.
6 Conclusion
In this paper, we present a circuit block called Anti-SAT to mitigate the SAT
attack on logic locking. We show that the number of iterations required by the
SAT attack to reveal the correct key is an exponential function of the key-size
of the Anti-SAT block. The Anti-SAT block is integrated into a locked circuit
to increase its resistance against the SAT attack. Compared to adding a large
hard-SAT circuit (e.g., AES), our proposed Anti-SAT block has much smaller
overhead, which makes it a cost-effective technique to mitigate the SAT attack.
Acknowledgments. This work was supported by NSF under Grant No. 1223233 and
AFOSR under Grant FA9550-14-1-0351.
References
1. Baumgarten, A., Tyagi, A., Zambreno, J.: Preventing IC piracy using reconfig-
urable logic barriers. IEEE Des. Test Comput. 27(1), 66–75 (2010)
2. Dupuis, S., Ba, P.S., Di Natale, G., Flottes, M.L., Rouzeyre, B.: A novel hardware
logic encryption technique for thwarting illegal overproduction and hardware
trojans. In: 2014 IEEE 20th International On-Line Testing Symposium (IOLTS),
pp. 49–54. IEEE (2014)
3. Guin, U., Huang, K., DiMase, D., Carulli, J.M., Tehranipoor, M., Makris, Y.:
Counterfeit integrated circuits: a rising threat in the global semiconductor supply
chain. Proc. IEEE 102(8), 1207–1228 (2014)
4. Helion Technology: High performance AES (Rijndael) cores for ASIC (2015).
http://www.heliontech.com/downloads/aes asic helioncore.pdf
5. Khaleghi, S., Da Zhao, K., Rao, W.: IC piracy prevention via design withhold-
ing and entanglement. In: 2015 20th Asia and South Pacific Design Automation
Conference (ASP-DAC), pp. 821–826. IEEE (2015)
6. Liu, B., Wang, B.: Embedded reconfigurable logic for ASIC design obfuscation
against supply chain attacks. In: Proceedings of the Conference on Design, Automa-
tion and Test in Europe, p. 243. European Design and Automation Association
(2014)
7. Plaza, S.M., Markov, I.L.: Solving the third-shift problem in IC piracy with test-
aware logic locking. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 34(6),
961–971 (2015)
8. Rajendran, J., Pino, Y., Sinanoglu, O., Karri, R.: Security analysis of logic obfus-
cation. In: Proceedings of the 49th Annual Design Automation Conference, pp.
83–89. ACM (2012)
9. Rajendran, J., Zhang, H., Zhang, C., Rose, G.S., Pino, Y., Sinanoglu, O., Karri,
R.: Fault analysis-based logic encryption. IEEE Trans. Comput. 64(2), 410–424
(2015)
10. Rostami, M., Koushanfar, F., Karri, R.: A primer on hardware security: models,
methods, and metrics. Proc. IEEE 102(8), 1283–1295 (2014)
11. Roy, J.A., Koushanfar, F., Markov, I.L.: EPIC: ending piracy of integrated circuits.
In: Proceedings of the Conference on Design, Automation and Test in Europe, pp.
1069–1074. ACM (2008)
12. Subramanyan, P., Ray, S., Malik, S.: Evaluating the security of logic encryption
algorithms. In: 2015 IEEE International Symposium on Hardware Oriented Secu-
rity and Trust (HOST), pp. 137–143. IEEE (2015)
13. Wendt, J.B., Potkonjak, M.: Hardware obfuscation using PUF-based logic. In:
Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided
Design, pp. 270–277. IEEE Press (2014)
14. Yasin, M., Rajendran, J., Sinanoglu, O., Karri, R.: On improving the security of
logic locking. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. PP(99), 1
(2015)
No Place to Hide: Contactless Probing
of Secret Data on FPGAs
1 Introduction
the transmission of the bitstream (even in an encrypted format) can expose the
design [23,30,31,46]. Furthermore, volatile Battery Backed RAMs (BBRAMs)
and eFuses, which can be used to store the secret key for decryption of
the bitstream, are unreliable and vulnerable to scanning electron microscopy
(SEM) [46].
FPGA vendors continuously add more advanced countermeasures to their
devices to mitigate physical attacks. While DPA vulnerabilities
of the decryption cores can be solved by DPA-resistant IP cores and asymmetric
authentication schemes, Physically Unclonable Functions (PUFs) can mitigate
the insecurity of eFuses and BBRAMs [46]. Moreover, different physical sensors
inside the FPGAs can monitor the environmental changes to detect glitching
and fault injection attacks. However, a proper physical protection against semi-
and fully-invasive attacks from the IC backside is still missing on these modern
platforms.
There are good reasons for FPGA vendors to be less concerned about the
security of the IC backside. First, the latest generations of SRAM-based FPGAs
are manufactured with 20 nm technology and the next generation of FPGAs
will be built with 16 and 14 nm technologies [13,19]. Yet, it has already been
demonstrated that, even for larger FPGA technologies such as 45 nm and 60 nm,
conventional semi-invasive attacks from the IC backside, such as Laser Fault
Injection (LFI) [39] and Photonic Emission Analysis (PEM) [41], are onerous
tasks. Therefore, such attacks cannot be scaled down efficiently along with the
trend of shrinking transistor technologies. Second, FPGA vendors believe that
integration of new storage solutions, such as PUFs, raises the security level of
key storage against backside attacks [7,25,35], as no key is stored permanently
on the chip that could be read out by the adversary.
Our Contribution. In this work we introduce a novel semi-invasive attack
against FPGAs using a known failure analysis technique, called Laser Voltage
Probing (LVP) [24]. We demonstrate how the attacker can use LVP and its
derivatives to locate circuitry of interest, such as registers and ring oscillators
(ROs), by knowing or estimating the frequency of different operations. Estimation
of the aforementioned frequency characteristics can be achieved either by having
knowledge of the implementation or by performing power analysis in the frequency domain.
Moreover, we explain how LVP enables us to probe different volatile and
on-die-only signals and data streams on the chip without any physical contact
to the wires or transistors. In addition, with the help of LVP one can
characterize high-frequency signals, such as the outputs of ROs, which are used
in RO PUFs and True Random Number Generators (TRNGs). For our practical
evaluation, we consider a PUF in key generation mode inside an FPGA used to
decrypt the bitstream. The PoC implementation was realized on an FPGA
manufactured in a 60 nm process technology. Due to the lack of proper
protection, we were able to perform our analysis from the IC backside. This
work presents the first results evaluating the potential of LVP for possible
future attacks on small technologies, where conventional backside semi-invasive
attacks, such as PEM and LFI, would require much more effort.
Fig. 1. (a) Bitstream encryption and decryption using a red key [46]. (b) Bitstream
encryption and decryption using a black key, PUF key and red key [35].
2 Background
FPGA vendors, the implementation details differ. In this work, we explain the
red key wrapping technique using soft PUFs and soft decryptors, which is used
by Xilinx SoCs [35]. The main idea is to generate a “black key” (i.e., an encrypted
key, which in itself is useless to an attacker) and to regenerate the secret red key
from it on the fly during configuration. The black key can then be stored safely
in an insecure NVM, and the red key will only exist as volatile, internal-only data.
The preparations for this technique are as follows. In the trusted field a boot
loader containing the red key and a soft PUF IP is transferred into the volatile
configuration SRAM of the FPGA. After the boot loader is loaded, the PUF is
configured on the programmable logic of the device and its responses are used
in conjunction with the red key to generate the black key [35], see Fig. 1(b).
The black key generated in this way can only be converted back to the red
key with the correct, chip-specific, internal-only PUF response (i.e., PUF key).
In the untrusted field an encrypted first stage boot loader with the black key,
the same soft PUF IP and a DPA-resistant decryption IP core is loaded into the
device. The chip-specific PUF response is then used to unwrap the black key and
generate the red key on the fly. Finally, the encrypted configuration bitstream is
transferred to the device and will be decrypted by the red key inside the FPGA.
In this way the decryption IP core can be updated against future side-channel
analysis threats. Furthermore, the soft PUF in conjunction with the black key
provides volatile, internal-only and updatable key storage, and therefore, the red
key is in memory only during the configuration of the device.
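The wrapping scheme described above can be condensed into a sketch. This is an illustration only, under the assumption that wrapping is a plain bitwise XOR with the PUF response; the actual Xilinx scheme may differ in detail, and all names are ours:

```python
# Illustrative red/black key wrapping with an 8-bit PUF response.
# Assumption: wrapping is a bitwise XOR; function names are ours.

def wrap(red_key, puf_response):
    """Trusted field: derive the black key that is stored in insecure NVM."""
    return red_key ^ puf_response

def unwrap(black_key, puf_response):
    """Untrusted field: regenerate the volatile red key on the fly."""
    return black_key ^ puf_response

red, puf = 0b10110010, 0b01101001
black = wrap(red, puf)
assert unwrap(black, puf) == red          # the correct chip recovers the red key
assert unwrap(black, puf ^ 0b1) != red    # any other PUF response fails
```

The point of the scheme is visible in the last line: without the chip-specific PUF response, the black key stored in NVM is useless.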
Several techniques have been introduced into failure analysis to allow contactless
probing of Devices Under Test (DUTs). One category of such techniques uses
optical beams and is therefore referred to as contactless optical probing. These
techniques allow failure analysis engineers to probe electrical signals through the
silicon backside and to also create 2D activity maps of active circuitry. Turnkey
solutions for optical probing are readily available from different manufactur-
ers, among them Hamamatsu Photonics, Checkpoint Technologies, DCG Sys-
tems (now part of FEI) and Semicaps. In the literature optical probing can be
referred to as Laser Voltage Probing (LVP), Electro Optical Probing (EOP),
Laser Timing Module (LTM) or Laser Time Probe (LTP). Acquisition of 2D
activity maps is similarly referred to as Laser Voltage Imaging (LVI), Electro
Optical Frequency Mapping (EOFM) or Signal Mapping Image (SMI). In this
paper we choose to refer to waveform probing as Laser Voltage Probing (LVP)
and to acquisition of 2D activity maps as Laser Voltage Imaging (LVI). Both
techniques together will be referred to as LVx.
The actual technical realisation of LVx varies depending on the manufacturer;
however, the basic principles remain the same. For optical probing as used in
LVP, a laser beam is focussed through the silicon backside, traverses the
active device area, is reflected off, for instance, metal structures and leaves the
device again through the silicon backside, see Fig. 2. The returning beam is then
fed to an optical detector to measure its intensity. Usually near-infrared (NIR)
wavelengths are used to prevent the absorption of the light by the silicon. Inside
the active area the electrical parameters of the device, such as electric fields
and currents, lead to changes in the absorption coefficient and refractive index.
Because of this, the optical beam intensity is altered either directly through
absorption or, in some cases, indirectly through interference effects caused by the
changed refractive index. Empirical studies have shown that a linear approximation
is often sufficient to describe the relationship between the voltage at the
electrical node and the reflected light signal. Therefore, the detector signal
waveform recreates the electrical waveform from inside the device. This allows
optical probing of electrical waveforms by simply pointing the laser beam at the
electrical node of interest. However, since the light modulation is very small
(on the order of 100 ppm), the detector signal usually needs to be averaged while
the device is running in a triggered loop to achieve a decent signal-to-noise
ratio. As this is just a rough sketch of the principles of optical probing,
readers interested in more details are referred to the literature.
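The need for averaging can be illustrated numerically. The sketch below uses our own illustrative numbers, not measured values: a 100 ppm square-wave modulation buried in Gaussian detector noise, where averaging N triggered acquisitions shrinks the noise by roughly a factor of sqrt(N):

```python
# Illustrative averaging of triggered acquisitions: a 100 ppm modulation
# (as quoted above) buried in detector noise 10x larger. All parameters
# are assumptions chosen for the demonstration.
import random

random.seed(0)
SAMPLES, TRACES = 16, 4000
MOD = 1e-4                 # 100 ppm modulation depth
NOISE = 1e-3               # per-sample detector noise (standard deviation)

signal = [MOD if (t // 4) % 2 else -MOD for t in range(SAMPLES)]  # square wave

# Average many traces acquired while the device runs in a triggered loop.
avg = [0.0] * SAMPLES
for _ in range(TRACES):
    for t in range(SAMPLES):
        avg[t] += (signal[t] + random.gauss(0.0, NOISE)) / TRACES

# Residual noise is NOISE / sqrt(TRACES), about 1.6e-5, well below the
# 1e-4 modulation, so the sign of each averaged sample recovers the waveform.
recovered = [m > 0 for m in avg]
assert recovered == [s > 0 for s in signal]
```

A single trace here has a signal-to-noise ratio of 0.1 and is unreadable; four thousand averages lift the modulation about six standard deviations above the residual noise.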
152 H. Lohrke et al.
3 Attack Scenario
We propose two LVP-based attacks against FPGAs during configuration. In the
first attack scenario we demonstrate how the adversary can probe the red, black
and PUF key using Laser Voltage Imaging (LVI). This allows the attacker to
extract the red key, and therefore, enables her to decrypt the encrypted bit-
stream offline, which can lead to reverse engineering or cloning of the design. In
the second attack, we will show how the attacker can characterize an RO PUF
based on a combination of LVI, Laser Voltage Probing (LVP) and power analy-
sis. Characterization of the individual oscillators of the RO PUF enables the
attacker to model the PUF, and therefore, to clone its functionality. Knowing
the approximate location of the key registers and the PUF components on the
chip is the main assumption of our proposed attacks.
Fig. 3. (a) Parallel generation of the red key, (b) serial generation of the red key.
The principle of key generation inside an FPGA has been discussed in Sect. 2.1.
All three key values can be either shifted serially through a shift register or
they can be loaded into the registers in parallel based on the implementation,
see Fig. 3. We will first discuss the case, where the register values are loaded
and processed in parallel. In this case the attacker can utilize LVI directly to
extract all three values. As discussed in Sect. 2.3, LVI reveals nodes switching
with a certain frequency, or more precisely, having certain frequency components.
Therefore, to locate registers of interest, the attacker has to know a frequency or
frequency component, which reveals the registers and is ideally data-dependent.
Thus, she will need to take a look at the switching frequencies during red key
generation. It is evident that after power-on all registers are first initialized to
their default value by the reset circuitry. Following that, all black key registers
are loaded in parallel and the PUF circuit is started. As soon as the PUF has
finished generating its output, its values are also loaded onto the corresponding
registers simultaneously. In a final step, the red key, which is now available at the
XOR output, can be loaded onto all red key registers. Consequently, we can see
that all register blocks of interest (black key, PUF key, red key) receive data
exactly once per power-on. This can be exploited to generate suitable frequency
components by placing the device in a reset loop. In such a scenario, the first
harmonic of the waveforms on these registers will be the reset frequency, as they
change their states once per reset. If we now take a detailed look at the data
dependency of these waveforms, we notice that there is a fundamental difference
between registers carrying a zero bit and registers carrying a one bit. In Fig. 4
the waveforms of two registers receiving a one and a zero bit as well as the
reset signal RST are depicted. For the register receiving a one bit (REGA) it is
evident that the register starts at logic low level and then changes its state, as
soon as the time needed for the preceding calculations (TCALC ) has elapsed. As
soon as the reset input goes high, the register is reset and afterwards the power-
on cycle is restarted once reset goes low again. Since we can expect TCALC to
be constant for consecutive power-ons, we can see that REGA’s period will be
TRST and we can expect its first harmonic to be at 1/TRST. For register REGB,
Fig. 4. Waveforms of the reset signal (RST ) and two registers, receiving a one (REGA )
and a zero (REGB ) bit.
carrying a zero, the case is much simpler. REGB will not change its value at all,
and therefore will not have any harmonics at the reset frequency. Thus the
attacker can expect the registers carrying a one to modulate the reflected light
with a first harmonic of 1/TRST. Registers carrying a zero are expected not to
modulate the reflected light at all. The interaction will be the same for the
black key, PUF key and red key register blocks. Although TCALC will change for
each register block, the first harmonic will still be at 1/TRST for all of them.
Therefore, to extract the register values the attacker can perform an LVI
measurement on the register block of interest while setting the spectrum
analyzer filter frequency to the reset loop frequency. If the LVI measurement
is then grayscale encoded, registers carrying a one are expected to show up
white, while registers carrying a zero will remain black.
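The black/white distinction can be checked with a small numeric sketch. The model is ours: one reset period TRST sampled at N points, with REGA going high after TCALC and REGB staying low; the component at the reset frequency is computed by a direct DFT:

```python
# Toy model of the reset-loop waveforms: REGA carries a one bit, REGB a
# zero bit. Sample counts are illustrative assumptions, not measured values.
import cmath

N = 100                              # samples per reset period TRST
T_CALC = 30                          # samples until the register is loaded
RESET_AT = 90                        # samples until reset asserts

reg_a = [1 if T_CALC <= t < RESET_AT else 0 for t in range(N)]  # receives a one
reg_b = [0] * N                                                 # receives a zero

def first_harmonic(wave):
    # DFT coefficient at the reset-loop frequency 1/TRST
    return sum(w * cmath.exp(-2j * cmath.pi * t / N)
               for t, w in enumerate(wave))

assert abs(first_harmonic(reg_a)) > 1.0    # shows up bright in the LVI image
assert abs(first_harmonic(reg_b)) < 1e-9   # stays black
```

Shifting T_CALC changes only the phase of the coefficient, not its magnitude, which mirrors the observation that the first harmonic stays at 1/TRST for every register block.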
For the case of the serial implementation the situation is slightly different.
Here the data will be processed bit by bit, and the individual registers in the
relevant register blocks will be connected together to form one shift register
for each block. The data bits will then be shifted out of the black key and PUF
key shift registers, passed through the XOR and shifted into the red key shift
register. As a result, each individual register will show a different waveform
depending on its position in the shift register and the actual data values. The
waveforms of the individual registers will still have the reset frequency as
their first harmonic; however, detecting the bit values cannot be broken down
to a simple black/white distinction as in the parallel case. Nevertheless, the
attacker will still detect the registers of interest in an LVI image, although
with varying signal strength. Since she is able to determine the precise
register locations this way, she can then move on to directly probe the
waveforms of individual registers using Laser Voltage Probing (LVP). This might
be a tedious task, depending on the number of bits; however, she should be able
to find the first register of each shift register this way. As soon as the
first register of the red key shift register is found, the attacker can extract
the key from its waveform, as the complete key gets shifted through this
register during calculation.
Therefore, using just LVI or a combination of LVI and LVP the attacker
should be able to extract the key data regardless of the chosen implementation.
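The serial attack path can be sanity-checked with a toy model (8-bit illustrative key values, not from the paper); the point is that the shift-in register of the red key chain sees every key bit pass through:

```python
# Toy serial unwrapping: black key and PUF key bits are shifted out,
# XORed, and shifted into the red key register chain. Key values below
# are illustrative assumptions.
black = [1, 0, 1, 1, 0, 0, 1, 0]
puf   = [0, 1, 1, 0, 1, 0, 0, 1]

red_chain = [0] * 8       # red key shift register
lvp_trace = []            # waveform an attacker probes on the shift-in register

for b, p in zip(black, puf):
    bit = b ^ p                         # serial XOR unwrapping
    red_chain = [bit] + red_chain[:-1]  # shift into the red key chain
    lvp_trace.append(red_chain[0])      # the shift-in register carries each new bit

# The probed waveform of that single register reveals the full red key.
assert lvp_trace == [b ^ p for b, p in zip(black, puf)]
```

This is why locating just one register, the shift-in register, suffices in the serial case: its LVP waveform is the complete key stream.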
4 Setup
4.1 Device Under Test
The samples used for our experiments were Altera Cyclone IV FPGAs with part
number EP4CE6E22C8N, manufactured in a 60 nm process [8]. In this device all
Logic Elements (LEs) contain a 4-input Lookup Table (LUT) and a dedicated
register. The device contains 6272 LEs, organized in Logic Array Blocks (LABs)
of 16 LEs each. We chose the 144-pin TQFP package in order to simplify the
sample preparation. The first step of preparation was the removal of the
exposed ground pad on the backside of the package. The samples were then
thinned with an Ultratec ASAP-1 polishing machine to a remaining silicon
thickness of 25 µm. However, this step is not strictly necessary: modern ICs
only have to be depackaged and are sufficiently thin as-is for NIR analysis,
merely leading to a lower signal level if used directly. In the second step,
the prepared samples were inversely soldered to a custom PCB. Bond wires
originally leading to the exposed ground pad were then reconnected using
silver conductive paint. A JTAG connection was used for configuring the FPGA
after power-on.
[Fig. 5: ring oscillators with enable (en) inputs driving counters whose values are subtracted for comparison.]
For our Proof-of-Concept we have implemented an RO PUF and a red key
calculation (see Sect. 2.1). To keep the design simple, we have connected the
outputs of the ROs directly to individual counters, see Fig. 5. Each RO in our
design has been realized with 21 inverters. All components of the ROs and the
counters have been placed manually inside the FPGA using the Altera Quartus
II integrated development environment. The LEs in every RO were placed as
close as possible, directly next to each other. We have emulated the rebooting
and configuration of the FPGA by adding a reset signal to our implementation.
The black key and PUF key in our design are 8 bits long. As discussed in
Sect. 3, unwrapping the black key can be carried out either in a parallel or in
a serial way. Hence, for the first scenario, we have implemented the red key
generation by XORing all bits of the black key with the PUF key in parallel,
see Fig. 3. For the second scenario, we have realized two shift registers for
the black key and the PUF key, whose values are shifted serially to an XOR
gate, and the result is shifted into the red key registers.
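The counter-based RO comparison of Fig. 5 reduces to the following sketch. Frequencies and the gate time are illustrative assumptions; the response bit is simply the sign of the count difference:

```python
# Toy RO PUF bit: two ring oscillators run for a fixed gate time while
# counters accumulate their edges; comparing the counts yields one response
# bit. All numbers are illustrative, not measured values.

def ro_puf_bit(freq_a_mhz, freq_b_mhz, gate_time_us=100.0):
    count_a = int(freq_a_mhz * gate_time_us)   # edges counted for RO A
    count_b = int(freq_b_mhz * gate_time_us)   # edges counted for RO B
    return int(count_a > count_b)              # the "subtract ?" comparison

# An attacker who measures both RO frequencies via LVP can predict the bit.
assert ro_puf_bit(127.61, 127.35) == 1
assert ro_puf_bit(127.35, 127.61) == 0
```

This is the model the attacker needs to clone: once all pairwise frequency orderings are known, every response bit follows.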
The core of our optical setup (Fig. 6(a)) is a Hamamatsu “PHEMOS-1000” laser
scanning microscope. The PHEMOS is equipped with an optical probing and
frequency mapping option. This option consists of a highly stable laser light
source (Hamamatsu C12993), a Laser Voltage Probing and Laser Voltage Imag-
ing preamplifier (Hamamatsu C12323), an Agilent “Acqiris” digitizer card and
an Advantest U3851 spectrum analyzer. The laser light source emits radiation at
1319 nm which is input into the optical path, deflected by galvanometric mirrors
and then focussed through an objective lens into the backside of the DUT. The
reflected light from the DUT is passed on to a detector and the detector signal
is fed into the preamplifier. The signal leaving the preamplifier can then either
be routed to the spectrum analyzer for LVI or to the digitizer card for acquisi-
tion of LVP waveforms. For all measurements shown in this paper a Hamamatsu
50x/0.76NA lens with silicon thickness correction was used. The approximate
laser power with this lens on the DUT is 50 mW for 100 % laser power. Addi-
tionally 5x and 20x objective lenses were used for navigation. The whole optical
setup is controlled by a PC running the PHEMOS control software.
Our electrical setup (Fig. 6(b)) is as follows: two power supplies are
connected to the DUT. The first one (Agilent E3645A) provides VCCINT = 1.2 V
(internal logic); the second one (Power Designs Inc. 2005) supplies VCCIO = 2.5 V
(I/O) and VCCA = 2.5 V (PLL and analog). All voltages were within recommended
levels [8]. A Rigol DG4162 two-channel function generator produces
clock and reset signals which are fed into the DUT. The clock and reset signals
as well as an auxiliary DUT output are also connected to a LeCroy WaveMaster
8620A oscilloscope for testing and control purposes. The reset signal is
furthermore fed into the Laser Voltage Probing (LVP) trigger input. To be able
to conduct basic power analysis in the frequency domain, a Software Defined
Radio (SDR) is AC-coupled to the VCCINT power rail. The SDR is an inexpensive
USB dongle which uses a Realtek RTL2832U chipset and a Rafael Micro
R820T tuner. For controlling the SDR, free and open source software is used.
“Gqrx” [2] is used for measurements with a spectral bandwidth below 2.4 MHz,
and the Python script “RTLSDR Scanner” [1] for higher bandwidths.
5 Results
5.1 Key Extraction
Fig. 7. LVI images of the parallel implementation. (a) All three register blocks taking
part in the red key calculation. (b) Detail view of the individual register blocks. Dashed
lines denote the LE boundaries. Each LE is approx. 6 µm in height.
respective keys. To analyze the data content of the registers, a higher resolution
is helpful. The measurement has thus been repeated on each register block while
applying a scanner zoom. The resulting LVI images can be seen in Fig. 7(b) and
the expected behaviour discussed in Sect. 3.1 is observed. As expected, registers
carrying a zero do not contribute to the LVI signal, while registers carrying a
one clearly do. We can see that there are slight differences in the
appearance of the nodes from measurement to measurement, which are probably
due to focus drift. Nevertheless, we can observe that the attacker is easily able
to extract the relevant values of the black key, PUF key and red key directly
from these LVI images. For the serial implementation we used the same basic
measurement setup. However, the reset signal and LVI frequency were modified
to be 1 MHz, as the serial implementation needs more clock cycles to execute.
The reset duty cycle was set to 58 % as a makeshift trigger delay, causing only
full bits to show up in the result before reset assertion. The laser power was
increased to 15 % and the pixel dwell time decreased to 1 ms. Following that, an
LVI image of the red key register block was taken, which is shown in Fig. 8. It
is evident that there is no simple black/white data dependency, as discussed in
Sect. 3.1. Still, we can see a difference in signal strength for the registers, with
the ones at the top giving less signal than the ones at the bottom. To get a rough
idea of which points could be promising for Laser Voltage Probing (LVP) we used
a fast Fourier transform calculator to analyze the amplitude of the first harmonic
component for different expected waveforms. We observed that for our case of
one to eight bits shifted with a comparatively large reset “dead time” following,
the waveforms with more bit shifts gave us a stronger first harmonic component.
Our conclusion was therefore that the lower half area was the most promising
to probe. Direct probing of the lower-half registers was successful and revealed
the lowest register to be the “shift-in” register. However, it was noticed that
Fig. 8. LVI image of the red key register block and probed waveforms for the serial
implementation. Reset assertion is marked by a dashed vertical line.
waveforms with a better signal to noise ratio could be acquired on the locations
right of the actual register area. We assume that these locations are associated
with routing and therefore the signal has already been buffered before reaching
them. Furthermore, these locations are more isolated signal-wise which also leads
to a better signal waveform. Hence, the final measurements were carried out on
these locations for the shift-in register and two other registers further down the
signal path. The resulting waveforms can be seen in Fig. 8. It is obvious that the
red key can be extracted from the lowest LVP waveform of the shift-in register by
an attacker. We acquired further waveforms while setting the integration number
down to 100,000 loops, which is the current limit in the PHEMOS software, and
were still able to distinguish the bit states easily. Therefore, we expect this
approach to work with even fewer loops, as soon as the limit is removed
from the software.
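The first-harmonic argument used above to pick probing locations can be reproduced numerically. The model is our simplification: a register that is active for w samples out of a long reset period and idle during the dead time, where more shifted bits mean a longer active window:

```python
# First-harmonic magnitude of a waveform that is active for `active_width`
# samples of a mostly-idle reset period. The pulse model and the period
# length are our assumptions, chosen to mimic a large reset "dead time".
import cmath

PERIOD = 64  # samples per reset loop

def h1(active_width):
    wave = [1] * active_width + [0] * (PERIOD - active_width)
    return abs(sum(w * cmath.exp(-2j * cmath.pi * t / PERIOD)
                   for t, w in enumerate(wave)))

# Registers seeing more bit shifts are active longer and thus give a
# stronger first harmonic, making the lower half of the chain more
# promising to probe, consistent with the observation above.
widths = [2, 4, 6, 8]
mags = [h1(w) for w in widths]
assert mags == sorted(mags)   # monotonically increasing for small widths
```

For active windows much shorter than half the period, the first-harmonic magnitude grows roughly linearly with the window length, which is why the registers with more bit shifts stand out in the LVI image.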
5.2 RO Characterization
For characterisation of the ring oscillators (ROs) we used the approach discussed
in Sect. 3.2. In this section we demonstrate the frequency measurement for one of the
ROs. We first used the Software Defined Radio (SDR) to get a rough estimation for
the LVI frequency by taking a look at the superposition of all RO frequencies in the
spectral domain on the power rail. By slight adjustments of this estimate we were
then able to create LVI overview images of the LEs forming the different ROs, one of
which is depicted in Fig. 9(a). The parameters used for this LVI measurement were:
127.3539 MHz spectrum analyzer filter frequency, 60 % laser power, and 0.33 ms
pixel dwell time. The ROs showed much more short term frequency fluctuations
than the previously used conventional clock sources. Therefore, the LVI filter band-
width had to be set to 100 kHz to account for the more widespread RO spectrum.
After being able to identify the nodes of interest inside the LEs in this way, the beam
was held stationary on one of them and the preamplified light detector signal was
fed into the spectrum analyzer. The spectrum analyzer was then configured to show
Fig. 9. (a) LVI image of 8 LEs of an RO, each approx. 6 µm in height. Dashed lines
denote the LE boundaries. Each LE shows multiple potential probing locations. (b)
LVP spectrum of the same RO.
the spectrum of this signal, which was modulated by the RO waveform present at
the electrical node. For this measurement the laser power was set slightly higher,
to 73 %, the spectrum analyzer frequency span to 1 MHz, resolution bandwidth to
30 kHz and video bandwidth to 10 Hz. The resulting spectrum in Fig. 9(b) shows
the RO frequency approximately 10 dBm above the noise floor. Thus, the attacker
is able to determine the current RO frequency precisely using only contactless opti-
cal probing methods. It should be noted that the resolution bandwidth mentioned
before is not the resolution to be expected for the frequency measurement. As the
attacker will only be interested in the average frequency of the RO, she is free to use
multiple frequency sweeps to get a smooth spectrum and determine its peak value.
The frequency of this peak value will then deliver the average frequency with a pre-
cision only depending on the number of averaged sweeps. By analyzing the average
frequency acquired this way it can be seen that the RO frequency was shifted by
approximately 0.15 % when the laser power was increased from 60 % to 73 %. As
long as the individual ROs are probed in the same way with the same laser power,
this should not lead to problems for the attacker. Since the important question for
the attacker is just which RO is faster, characterizing the RO PUF will still be suc-
cessful if she takes care to probe all ROs in the same way, generating the same shift.
Nevertheless, we will discuss this aspect in detail in Sect. 6.
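The sweep-averaging step described above can be illustrated with a toy Python sketch (ours, not the tooling used in the experiments; the span, peak height, and noise levels are made-up stand-ins for the reported setup):

```python
import math
import random

def average_sweeps(sweeps):
    """Average repeated spectrum-analyzer sweeps bin by bin."""
    n = len(sweeps)
    return [sum(s[i] for s in sweeps) / n for i in range(len(sweeps[0]))]

def peak_frequency(freqs_hz, spectrum_dbm):
    """Frequency of the strongest bin in the averaged spectrum."""
    return max(zip(spectrum_dbm, freqs_hz))[1]

rng = random.Random(0)
true_f = 127.3539e6                                  # assumed RO frequency
freqs = [126.85e6 + i * 1e3 for i in range(1001)]    # 1 MHz span, 1 kHz bins
# ~10 dB peak over a -60 dBm noise floor, smeared over the RO spectrum
clean = [-60 + 10 * math.exp(-((f - true_f) / 30e3) ** 2) for f in freqs]
sweeps = [[s + rng.gauss(0, 3) for s in clean] for _ in range(200)]

estimate = peak_frequency(freqs, average_sweeps(sweeps))
assert abs(estimate - true_f) < 20e3     # averaging pins down the peak
```

With more averaged sweeps, the residual noise on each bin shrinks roughly with the square root of the sweep count, which is the effect the averaging strategy above exploits.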
6 Discussion
6.1 Locating the Registers and IP Cores on the Chip
As mentioned in Sect. 3, knowing the approximate location of the key registers
and PUF IP core is the main assumption of our proposed attacks. Different
scenarios can be considered to understand how realistic this assumption is.
As discussed in Sect. 2.1, the soft PUF IP cores, black key and their place-
ments are transmitted in the first stage boot loader. If the first stage boot loader
No Place to Hide: Contactless Probing of Secret Data on FPGAs 161
or Boot0 is not encrypted, the attacker can intercept the boot loader on the
board and gain knowledge about the configuration of the PUF and the red and
black key registers. For instance, the Microsemi Root of Trust solution [26] per-
mits either the transfer of unencrypted or encrypted first stage boot loaders
to the target SRAM-based FPGA. If the boot loader is encrypted, it will be
decrypted by the hard dedicated AES core inside the target FPGA. While in
the unencrypted case the boot loader can be easily intercepted, for the encrypted
case DPA vulnerabilities of dedicated AES cores might be used to extract the
encryption key and decrypt the boot loader [23,30–32]. However, in the case of
asymmetric authentication as used by Xilinx SoCs, it is much harder for the
attacker to expose the boot loader configuration [32]. Because of the authenti-
cation, the attacker cannot launch a DPA attack against the hard AES core and
therefore might not be able to decrypt the first stage boot loader.
If the first stage boot loader cannot be intercepted, the attacker has to have
access to the used IP cores prior to the attack. Though difficult, it is conceivable
that the adversary can get access to the IP cores via an insider or by posing as
a potential customer to IP core suppliers. Having the IP cores, the attacker can
synthesize the PUF on an identical FPGA model and analyze the design either
in the IDE (if no obfuscation is used) or by looking at the generated bitstream
to find the circuitry of interest.
If the attacker cannot get access to the IP cores, the attack will be more
difficult due to the unknown location of the circuitry of interest. In this case, if
the utilized soft PUF is an RO PUF, one could launch the attack proposed in
Sect. 3.2 to find the ROs and the counters connected to them on the chip. The
location of the RO PUF can then be a reference point to localize other parts
of the design inside the FPGA. Furthermore, one can estimate the operational
frequency of different registers to apply LVI and localize the related registers
individually on the chip. After a successful localization of the key registers, the
attacker can extract data from them by LVP/LVI based on the implementation
(See Sect. 5.1). In the case of a parallel implementation, if the key registers are
naively implemented in the right order (i.e., from LSB to MSB), the attacker
can easily extract the key by using LVI. Otherwise, if the keys are latched in an
obfuscated way, the attacker can only read the state of the permuted registers
and might not find the right order of the registers to assemble the key. For a
serial implementation, if the order of the registers is obfuscated, the attacker can
probe all registers to find the one through which the whole key is shifted.
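A toy model (our illustration, not part of the attack tooling) of the parallel case: with a naive LSB-to-MSB placement, the probed register states spell out the key directly, while an obfuscated placement only yields a permutation of the bits whose order the attacker still has to guess. Key and permutation below are made up.

```python
import math

def probe_registers(key_bits, placement):
    """Register states in physical (probing) order; placement[i] is the
    physical slot holding key bit i."""
    states = [0] * len(key_bits)
    for i, slot in enumerate(placement):
        states[slot] = key_bits[i]
    return states

key = [1, 0, 1, 1, 0, 0, 1, 0]            # made-up 8-bit "red key"
naive = list(range(8))                    # bit i stored in slot i
obfuscated = [3, 6, 0, 5, 1, 7, 2, 4]     # made-up placement permutation

assert probe_registers(key, naive) == key                    # key read off directly
assert sorted(probe_registers(key, obfuscated)) == sorted(key)
print(math.factorial(8))                  # up to 8! = 40320 orderings to try
```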
The proposed attacks on the key registers can in principle also be applied if a
hard PUF and a hard AES are in use. In this case, the attacker has to reverse-
engineer the ASIC configuration circuit of the FPGA to locate the circuitry of
interest. Although the search space for the region of interest might be reduced,
the attacker has to probe and reverse-engineer more compact and dense ASIC
circuits in comparison to FPGA logic cells, which might be challenging.
The process technologies of FPGAs and programmable SoCs that support partial
reconfiguration for soft PUF implementations are 60 nm or smaller. Since our
LVI and LVP experiments have been carried out on an
FPGA with 60 nm technology, the question of the applicability of the same tech-
nique on smaller technologies might be raised. The real size of the transistors
is normally 7 to 8 times larger than the nominal technology node [18]. Besides,
the size of the LEs and the routing (intra and inter LEs) of FPGAs is much
larger than the size of the transistors, see Fig. 7. Hence, the optical resolution
requirements for data extraction are much less severe than for probing individual
transistors. Based on our measurements, the LE height in an Altera Cyclone IV
is about 6 µm. The theoretically expected resolution of our laser spot is
approximately 1 µm². Thus, optical probing should still be possible on an LE approx.
six times smaller. It is worth mentioning that for LVP and LVI typical FPGAs
are an advantageous target, as multiple transistors close together will carry the
same waveform in an LE.
There are also solutions for increasing the optical resolution of LVP and LVI
techniques. For instance, one can use solid immersion lenses (SILs) to get 2
to 3 times better resolution, which already enables single transistor probing at
14 nm [18]. Moreover, lasers with shorter wavelengths (e.g., in the visible light
spectrum) can be used to further increase the resolution [10,12]. However, in
the latter case, the substrate of the chip has to be thinned to 10 µm or less to
prevent the absorption of the photons.
Meanwhile, it is still interesting to understand why other backside semi-
invasive attacks, such as PEM or LFI, have limited efficiency on small technolo-
gies in comparison to LVP and LVI. In the case of PEM, the photon emission
rate is proportional to the core voltage of the chip. However, the core voltage
of technologies smaller than 60 nm is too low [41] and the attacker therefore
has to integrate over a large number of iterations to capture enough photons for
analysis. LFI attacks on the other hand target mostly single memory cells, which
requires the system used for the attack to be able to resolve single transistors
on the chip.
It has been shown that mechanical stress from depackaging and substrate thinning
has negligible effects on the absolute and relative frequencies of ring oscillators
(ROs) [11]. In another
experiment, it has been shown that removing most of the bulk silicon, down to
the bottom of the n-wells, does not alter the delays of the inverter chains [38].
Additionally, without affecting the challenge-response behavior of the PUFs,
different successful semi-invasive attacks have been reported on silicon intrin-
sic PUF instances in the literature [20,29,33,43,44]. On the other hand, PUF
developers do their best to mitigate the noisy response of the PUF by different
error correction techniques [22,28]. Therefore, if few CRPs are changed by the
physical tampering, they will be corrected by such error correction techniques.
Based on these results, depackaging the chip and thinning the substrate do
not destroy the target PUF.
Although passive semi-invasive attacks do not affect the behavior of the PUF,
the laser beam in our proposed attack can change the temperature of the tran-
sistors. Temperature variations have transient and reversible effects on the delay
and frequency of the inverter chains in arbiter PUFs and RO PUFs. In our
experiments, a shift of frequency has been observed while performing LVI and
LVP on the ROs. However, the attacker is still able to precisely characterize and
measure the frequencies of the ROs by performing LVI and LVP, if she takes
care to probe all ring oscillators under the same conditions. If the attacker is not
able to fulfill this requirement, she might also probe the registers of the counters
which are connected to the RO output. Assuming the counters or other circuitry
connected to the RO PUFs are located far enough away, she will be able to
mount her attack without influencing the ROs. Finally, she might take
measurements of one individual RO frequency for different laser powers and extrapolate
from that to the frequency for zero laser power. Therefore, a precise physical
characterization of the RO PUF is certainly feasible.
6.4 Countermeasures
Silicon light sensors have been proposed to detect the photons of the laser beam.
However, in our experiments we have used a laser beam which has a longer wave-
length than the silicon band gap. Hence, no electron-hole pairs will be generated
by the laser photons. A silicon photo sensor is therefore unlikely to trigger.
A potential algorithmic countermeasure can be randomization of the reset
states of the registers for the parallel implementation. As a result, the simple
black/white data distinction (see Sect. 3.1) would be severely impeded, as there
now would be switching activity during the reset loop on all registers. For the
serial case, a randomization of the relation of the outer reset signal to the internal
reset signal would destroy the needed trigger relationship and make waveform
probing on the registers impossible. Another simple countermeasure includes the
obfuscation of the key registers by randomizing their order, see Sect. 6.1.
Finally, the ROs in a ring oscillator network with virtually equal frequencies
can be placed in different areas of the FPGA. Using LVP will then slightly shift
the frequencies of ROs which are in or close to the probed area. Hence, the
frequency deviation of these ROs in comparison to the mean frequency of all
ROs can be used to raise an alarm. Similarly, delay-based PUFs might be useful
as sensors, if their elements are placed in different regions of the chip.
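A sketch of this sensing idea (our interpretation; the threshold and frequencies are made-up, with the threshold chosen below the ~0.15 % laser-induced shift reported in Sect. 5.2):

```python
def ro_alarm(freqs_hz, rel_threshold=0.001):
    """Raise an alarm if any RO deviates from the network's mean frequency
    by more than rel_threshold (made-up value, below the ~0.15 % shift)."""
    mean = sum(freqs_hz) / len(freqs_hz)
    return any(abs(f - mean) / mean > rel_threshold for f in freqs_hz)

quiet = [127.35e6, 127.36e6, 127.34e6, 127.35e6]   # unprobed RO network
probed = list(quiet)
probed[1] *= 1.0015                                # ~0.15 % shift on one RO

assert ro_alarm(quiet) is False
assert ro_alarm(probed) is True
```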
7 Conclusion
In this paper, we have proposed novel semi-invasive attacks from the IC backside
using LVP and LVI techniques. We have demonstrated that these techniques
can be potentially used against modern FPGAs and programmable SoCs during
configuration. Based on these considerations, it becomes apparent that replacing
the eFuses or BBRAMs with controlled PUFs does not raise the security level of
key storage as high as one would expect in the first place. Even recent controlled
stateless PUF constructions [22] are vulnerable to contactless probing. Moreover,
while the size of the transistors is shrinking, novel inexpensive failure analysis
techniques are developed to debug and probe nanoscale manufactured circuits in
a semi-invasive and contactless way. It is worth mentioning that much less time is
required for optical contactless probing of different signals than for conventional
techniques, such as FIB microprobing [21]. Using our approach the amount of
time needed to probe multiple nodes is on the order of minutes while for FIB
microprobing it will be on the order of days. Furthermore, it is obvious that
our attack technique has the potential to directly probe the bitstream after on-
chip decryption, circumventing all security measures in place. However, there
are several requirements for probing such a large amount of data and finding
a suitable probing location in the much smaller and denser ASIC area, which
might not be fulfilled by a standard LVP setup. Nevertheless, we strongly believe
that future generations of FPGAs remain vulnerable to contactless probing, if
proper protections or countermeasures for the IC backside are not implemented.
References
1. Ear to Ear Oak. http://eartoearoak.com/software/rtlsdr-scanner/. Accessed 6
June 2016
2. Gqrx SDR. http://gqrx.dk. Accessed 6 June 2016
3. Helion Technology Limited. http://www.heliontech.com. Accessed 6 June 2016
4. Intrinsic-ID Inc. https://www.intrinsic-id.com. Accessed 6 June 2016
5. Lewis Innovative Technology Inc. http://lewisinnovative.com. Accessed 6 June
2016
6. Verayo Inc. http://www.verayo.com. Accessed 6 June 2016
7. White Paper: Overview of Data Security Using Microsemi FPGAs and SoC
FPGAs. Microsemi Corporation, Aliso Viejo, CA (2013)
26. Luis, W., Richard Newell, G., Alexander, K.: Differential power analysis counter-
measures for the configuration of SRAM FPGAs. In: IEEE Military Communica-
tions Conference, MILCOM 2015, pp. 1276–1283. IEEE (2015)
27. Maes, R.: Physically Unclonable Functions: Constructions, Properties and Appli-
cations. Springer, Heidelberg (2013)
28. Maes, R., van der Leest, V., van der Sluis, E., Willems, F.: Secure key generation
from biased PUFs. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015. LNCS, vol.
9293, pp. 517–534. Springer, Heidelberg (2015)
29. Merli, D., Schuster, D., Stumpf, F., Sigl, G.: Semi-invasive EM attack on FPGA
RO PUFs and countermeasures. In: Proceedings of the Workshop on Embedded
Systems Security, p. 2. ACM (2011)
30. Moradi, A., Barenghi, A., Kasper, T., Paar, C.: On the vulnerability of FPGA
bitstream encryption against power analysis attacks: extracting keys from Xilinx
Virtex-II FPGAs. In: Proceedings of the 18th ACM Conference on Computer and
Communications Security, pp. 111–124. ACM (2011)
31. Moradi, A., Oswald, D., Paar, C., Swierczynski, P.: Side-channel attacks on the
bitstream encryption mechanism of Altera Stratix II: facilitating black-box analysis
using software reverse-engineering. In: Proceedings of the ACM/SIGDA Interna-
tional Symposium on Field Programmable Gate Arrays, pp. 91–100. ACM (2013)
32. Moradi, A., Schneider, T.: Improved Side-Channel Analysis Attacks on Xilinx Bit-
stream Encryption of 5, 6, and 7 Series, COSADE 2016, Graz, Austria, 14 April
2016
33. Nedospasov, D., Seifert, J.P., Helfmeier, C., Boit, C.: Invasive PUF analysis. In:
2013 Workshop on Fault Diagnosis and Tolerance in Cryptography (FDTC), pp.
30–38. IEEE (2013)
34. Pappu, R., Recht, B., Taylor, J., Gershenfeld, N.: Physical one-way functions.
Science 297(5589), 2026–2030 (2002)
35. Peterson, E.: White Paper WP468: Leveraging Asymmetric Authentication to
Enhance Security-Critical Applications Using Zynq-7000 All Programmable SoCs.
Xilinx, Inc., San Jose (2015)
36. Ravikanth, P.S.: Physical one-way functions. Ph.D. thesis, Massachusetts Institute
of Technology (2001)
37. Rührmair, U., Sehnke, F., Sölter, J., Dror, G., Devadas, S., Schmidhuber, J.:
Modeling attacks on physical unclonable functions. In: Proceedings of the 17th
ACM Conference on Computer and Communications Security, pp. 237–249 (2010)
38. Schlangen, R., Leihkauf, R., Kerst, U., Lundquist, T., Egger, P., Boit, C.: Phys-
ical analysis, trimming and editing of nanoscale IC function with backside FIB
processing. Microelectron. Reliab. 49(9), 1158–1164 (2009)
39. Selmke, B., Brummer, S., Heyszl, J., Sigl, G.: Precise laser fault injections into
FPGA BRAMs in 90 nm and 45 nm feature size. In: 14th Smart Card Research
and Advanced Application Conference - CARDIS 2015 (2015)
40. Simpson, E., Schaumont, P.: Offline hardware/software authentication for reconfig-
urable platforms. In: Goubin, L., Matsui, M. (eds.) CHES 2006. LNCS, vol. 4249,
pp. 311–323. Springer, Heidelberg (2006)
41. Tajik, S., Dietz, E., Frohmann, S., Dittrich, H., Nedospasov, D., Helfmeier, C.,
Seifert, J.P., Boit, C., Hübers, H.W.: Photonic side-channel analysis of arbiter
PUFs. J. Cryptol. 1–22 (2016). doi:10.1007/s00145-016-9228-6
42. Tajik, S., Dietz, E., Frohmann, S., Seifert, J.-P., Nedospasov, D., Helfmeier, C.,
Boit, C., Dittrich, H.: Physical characterization of arbiter PUFs. In: Batina, L.,
Robshaw, M. (eds.) CHES 2014. LNCS, vol. 8731, pp. 493–509. Springer, Heidel-
berg (2014)
43. Tajik, S., Ganji, F., Seifert, J.P., Lohrke, H., Boit, C.: Laser fault attack on phys-
ically unclonable functions. In: 2015 Workshop on Fault Diagnosis and Tolerance
in Cryptography (FDTC), IEEE (2015)
44. Tajik, S., Nedospasov, D., Helfmeier, C., Seifert, J.P., Boit, C.: Emission analysis of
hardware implementations. In: 2014 17th Euromicro Conference on Digital System
Design (DSD), pp. 528–534. IEEE (2014)
45. Trimberger, S.M.: Copy protection without non-volatile memory. US Patent
8,416,950 (2013)
46. Trimberger, S.M., Moore, J.J.: FPGA security: motivations, features, and applica-
tions. Proc. IEEE 102(8), 1248–1265 (2014)
47. Tuyls, P., Schrijen, G.-J., Škorić, B., van Geloven, J., Verhaegh, N., Wolters, R.:
Read-proof hardware from protective coatings. In: Goubin, L., Matsui, M. (eds.)
CHES 2006. LNCS, vol. 4249, pp. 369–383. Springer, Heidelberg (2006)
Side Channel Countermeasures I
Strong 8-bit Sboxes with Efficient Masking
in Hardware
1 Introduction
Block ciphers are among the most important cryptographic primitives. Although
they usually follow ad-hoc design principles, their security with respect to known
attacks is generally well-understood. However, this is not the case for the security
of their implementations. The security of an implementation is often challenged
by physical threats such as side-channel analysis or fault-injection attacks. In
many cases, those attacks render the mathematical security meaningless. Hence,
it is essential that a cipher implementation incorporates appropriate counter-
measures against physical attacks. Usually, those countermeasures are developed
retroactively for a given, fully specified block cipher. A more promising approach
is including the possibility of adding efficient countermeasures into the design
from the very start.
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 171–193, 2016.
DOI: 10.1007/978-3-662-53140-2_9
172 E. Boss et al.
For software implementations, this has been done. Indeed, a few ciphers
have been proposed that aim to address the issue of protection against phys-
ical attacks by facilitating a masked Sbox by design. The first example is
certainly NOEKEON [18]; other examples include Zorro [20], PICARO [33] and the
LS-design family of block ciphers [21].
For hardware implementations, the situation is significantly different. Here,
simple masking is less effective due to several side-effects, most notably glitches
(see [27]). As an alternative to simple masking, a preferred hardware counter-
measure against side-channel attacks is the so-called threshold implementation
(TI) [32], as used for the cipher FIDES [6]. TI is a masking variant that splits
any secret data into several shares, using a simple secret-sharing scheme. Those
shares are then grouped in non-complete subsets to be separately processed by
individual subfunctions. All subfunctions jointly correspond to the target func-
tion (i.e., the block cipher). Since none of the subfunctions depends on all shares
of the secret data at any time, it is intuitive to see that it is impossible to recon-
struct the secret by first-order side-channel observations. We provide a more
detailed description of the functionality of threshold implementations in Sect. 2.
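As a minimal sketch of these ideas (ours: the textbook 3-share sharing of a single AND gate, not a construction from this paper), each output share below is computed without one of the input-share indices, yet the three shares recombine to x·y:

```python
import random
from itertools import product

def share(bit, rng):
    """Split a bit into 3 shares that XOR back to the bit."""
    s1, s2 = rng.randint(0, 1), rng.randint(0, 1)
    return (s1, s2, bit ^ s1 ^ s2)

def ti_and(xs, ys):
    """3-share TI of z = x AND y; subfunction i never touches x_i or y_i
    (non-completeness), so no subfunction sees all shares of a secret."""
    x1, x2, x3 = xs
    y1, y2, y3 = ys
    z1 = (x2 & y2) ^ (x2 & y3) ^ (x3 & y2)   # independent of share 1
    z2 = (x3 & y3) ^ (x1 & y3) ^ (x3 & y1)   # independent of share 2
    z3 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1)   # independent of share 3
    return (z1, z2, z3)

rng = random.Random(1)
for x, y in product((0, 1), repeat=2):
    z = ti_and(share(x, rng), share(y, rng))
    assert z[0] ^ z[1] ^ z[2] == x & y       # correctness of the sharing
```

This particular sharing is correct and non-complete; as discussed later for the known Sboxes, it is, however, not uniform on its own.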
Unfortunately, it is not trivial to apply the TI concept to a given block
cipher. The success of this process strongly depends on the complexity of the
cipher’s round function and its internal components. While the linear aspects of
any cipher are typically easy to convert to TI, this is not generally true for the
non-linear Sbox. For 4-bit Sboxes, it is possible to identify a corresponding TI
representation by exhaustive search [10]. However, for larger Sboxes, in particular
8-bit Sboxes, the situation is very different. In this case, the search space is far
too large to allow an exhaustive search. In fact, 8-bit Sboxes are far from being
fully understood, from both a cryptographic and an implementation perspective.
With respect to cryptographic strength against differential and linear attacks,
the AES Sbox (and its variants) can be seen as holding the current world record.
We do not know of any Sbox with better properties, but those might well exist.
Unfortunately, despite considerable effort, no TI representation is known for the
AES Sbox that does not require any additional external randomness [7,9,31].
Our Contribution. In this paper we approach this problem of identifying crypto-
graphically strong 8-bit Sboxes that provide a straightforward TI representation.
More precisely, our goal is to give examples of Sboxes that come close to the
cryptanalytic resistance of the AES Sbox. Also, the straight application of the
TI concept to an Sbox should still lead to minimal resource and area costs. This
enables an efficient and low-cost implementation in hardware as well as bit-sliced
software.
In our work we systematically investigate 8-bit Sboxes that are constructed
based on what can be seen as a mini-cipher. Concretely, we construct Sboxes
based on either a Feistel-network (operating with two 4-bit branches and a 4-bit
Sbox as the round function), a substitution permutation network or the MISTY
network. This general approach has already been used and studied extensively.
Examples of Sboxes constructed like this are used for example in the ciphers
Strong 8-bit Sboxes with Efficient Masking in Hardware 173
Crypton [25,26], ICEBERG [40], Fantomas [21], Robin [21] and Khazad [3].
A more theoretical study was most recently presented by Canteaut et al. in [16].
Our idea extends the previous work by combining those constructions aiming
at achieving strong cryptographic criteria with small Sboxes that are easy to
share and intrinsically support the TI concept. As a result of our investigation,
we present a set of different 8-bit Sboxes. These Sboxes are either (a) superior
to the known constructions from a cryptographic perspective but can still be
implemented with moderate resource requirements or (b) outperform all known
constructions in terms of efficiency in the application of the TI concept to the
Sbox, while still maintaining a comparable level of cryptographic strength with
respect to other known Sboxes. All our findings are detailed in Table 1.
Outline. This work is structured as follows. Preliminaries on well-known strate-
gies to construct Sboxes as well as the TI concept are given in Sect. 2. We discuss
the applicability of TI on known 8-bit Sboxes in Sect. 3. The details and results
of the search process are given in Sects. 4 and 5, respectively. We conclude with
Sect. 6.
2 Preliminaries
2.1 Cryptanalytic Properties for Sboxes
In this subsection we recall the tools used for evaluating the strength of Sboxes
with respect to linear, differential and algebraic properties. For this purpose, we
consider an n-bit Sbox S as a vector of Boolean functions S = (f_0, ..., f_{n-1})
with f_i : F_2^n → F_2. We denote the cardinality of a set A by #A and the dot
product of two elements a, b ∈ F_2^n by ⟨a, b⟩ = Σ_{i=0}^{n-1} a_i b_i.
The Walsh transform W_S(a, b) = Σ_{x ∈ F_2^n} (−1)^{⟨b, S(x)⟩ ⊕ ⟨a, x⟩} can be
used to evaluate the correlation of a linear approximation (a, b) ≠ (0, 0).
More precisely,

P(⟨b, S(x)⟩ = ⟨a, x⟩) = 1/2 + W_S(a, b) / 2^{n+1}.

The larger the absolute value of W_S(a, b), the better the approximation by the
linear function ⟨a, x⟩ (or the affine function ⟨a, x⟩ + 1, in case W_S(a, b) < 0).
This motivates the following well-known definition.
This motivates the following well known definition.
174 E. Boss et al.
The smaller Lin(S), the stronger the Sbox is against linear cryptanalysis.
It is known that for any function S from F_2^n to F_2^n it holds that
Lin(S) ≥ 2^{(n+1)/2} [17]. Functions that reach this bound are called Almost
Bent (AB) functions. However, in the case n > 4 and n even, we do not know the
minimal value of the linearity that can be reached. In particular, for n = 8 the
lowest known linearity is achieved by the AES Sbox with Lin(S) = 32.
The smaller Diff(S), the stronger the Sbox regarding differential cryptanalysis.
It is known that for Sboxes S that have the same number of input and output
bits it holds that Diff(S) ≥ 2. Functions that reach that bound are called Almost
Perfect Nonlinear (APN). While APN functions are known for any number n of
input bits, APN permutations are known only for odd n and for n = 6. In
particular, for n = 8 the best known value is Diff(S) = 4, achieved, e.g., by the AES Sbox.
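Both quantities are easy to compute exhaustively for small Sboxes. The sketch below (ours) evaluates W_S, Lin(S), and Diff(S) for a 4-bit table, here the Class 13 table 086D5F7C4E2391BA quoted later in this paper for Robin:

```python
def walsh(S, n, a, b):
    """W_S(a, b) = sum over x of (-1)^(<b, S(x)> XOR <a, x>)."""
    dot = lambda u, v: bin(u & v).count("1") & 1
    return sum((-1) ** (dot(b, S[x]) ^ dot(a, x)) for x in range(2 ** n))

def lin(S, n):
    """Lin(S): largest |W_S(a, b)| over all masks with b != 0."""
    return max(abs(walsh(S, n, a, b))
               for a in range(2 ** n) for b in range(1, 2 ** n))

def diff(S, n):
    """Diff(S): largest difference-distribution-table entry for a != 0."""
    return max(sum((S[x] ^ S[x ^ a]) == b for x in range(2 ** n))
               for a in range(1, 2 ** n) for b in range(2 ** n))

S4 = [int(c, 16) for c in "086D5F7C4E2391BA"]
print(lin(S4, 4), diff(S4, 4))
```

For the identity map one gets Lin = Diff = 2^n, the worst case; strong Sboxes push both values down toward the bounds above.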
where x^u = Π_{i=0}^{n−1} x_i^{u_i}, with the convention 0^0 = 1. Now, the
algebraic degree can be defined as follows.
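The algebraic degree is likewise easy to compute for small Sboxes via the Möbius transform of each coordinate function. The sketch below (ours) checks that the Class 13 table quoted later for Robin is indeed cubic, consistent with its stated membership in class C223:

```python
def anf(truth_table):
    """Mobius transform: ANF coefficients a_u of a Boolean function."""
    a = list(truth_table)
    n = len(a).bit_length() - 1
    for i in range(n):
        for u in range(len(a)):
            if u >> i & 1:
                a[u] ^= a[u ^ (1 << i)]
    return a

def degree(S, n):
    """Algebraic degree: heaviest monomial over all coordinate functions."""
    deg = 0
    for i in range(n):
        coeffs = anf([(S[x] >> i) & 1 for x in range(2 ** n)])
        deg = max([deg] + [bin(u).count("1") for u, c in enumerate(coeffs) if c])
    return deg

S4 = [int(c, 16) for c in "086D5F7C4E2391BA"]   # Class 13 table
assert degree(list(range(16)), 4) == 1          # the identity is linear
assert degree(S4, 4) == 3                       # cubic, as stated for C223
```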
Affine Equivalence. An important tool in our search for good Sboxes is the
notion of affine equivalence. We say that two functions f and g are affine
equivalent if there exist two affine permutations A1 and A2 such that f = A1 ◦ g ◦ A2.
The importance of this definition is given by the well-known fact that both the
linearity and the differential uniformity are invariant under affine equivalence.
That is, two functions that are affine equivalent have the same linear and differ-
ential criteria.
Apart from the AES Sbox, which is basically the inversion in the finite field
F28 , hardly any primary construction for useful, cryptographically strong, 8-bit
Sboxes is known.
However, several secondary constructions have been applied successfully.
Here, the idea is to build larger Sboxes from smaller Sboxes. For block ciphers
this principle was first introduced in MISTY [29].
Later, this approach was modified and extended. In particular, it was used
by several lightweight ciphers to construct Sboxes with different optimization
criteria, e.g., smaller memory requirements, more efficient implementation, invo-
lution, and easier software-level masking.
There are basically three known constructions, all of which can be seen as
mini-block ciphers: Feistel networks, the MISTY construction and SP-networks.
Figure 1 shows how these constructions build larger Sboxes from smaller Sboxes.
Note that the MISTY construction is a special case of the SPN. Indeed, the
MISTY construction is equivalent to an SPN when F1 = Id and the matrix
A = (1 1; 1 0).
For a small number of rounds, we can systematically analyze the crypto-
graphic properties of those constructions (see [16] for the most recent results).
However, for a larger number of rounds, a theoretical understanding becomes
increasingly more difficult in most cases.
Table 1 shows the different characteristics of 8-bit Sboxes known in the liter-
ature that are built from smaller Sboxes. We excluded the PICARO Sbox [33]
Fig. 1. The Feistel, MISTY, and SPN constructions building larger Sboxes from smaller ones.
from the list, since it is not a bijection. Furthermore, Zorro is also excluded since
the exact specifications of its structure are not publicly known. We refer often
to this table as it summarizes all our findings and achievements.
Correctness. The masked Sbox should provide the output in a shared form
(y^1, ..., y^m) with Σ_{i=1}^{m} y^i = y = S(x) and m ≥ n.
Uniformity. The security of most masking schemes relies on the uniform distri-
bution of the masks. Since in this work we consider only the cases with n = m
and bijective Sboxes, we can define the uniformity as follows. The masked Sbox
with n × k input bits and n × k output bits should form a bijection. Otherwise,
the output of the masked Sbox (which is not uniform) will appear at the input of
the next masked non-linear functions (e.g., the Sbox at the next cipher round),
and lead to first-order leakage.
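For tiny functions, the uniformity property can be checked by brute force. The sketch below (ours) applies such a check to the textbook 3-share sharing of an AND gate and sees it fail, in line with the later remark that an AND gate cannot be uniformly shared with 3 shares:

```python
from collections import Counter
from itertools import product

def ti_and(xs, ys):
    """Classic 3-share TI of AND (correct and non-complete)."""
    x1, x2, x3 = xs
    y1, y2, y3 = ys
    return ((x2 & y2) ^ (x2 & y3) ^ (x3 & y2),
            (x3 & y3) ^ (x1 & y3) ^ (x3 & y1),
            (x1 & y1) ^ (x1 & y2) ^ (x2 & y1))

def sharings(bit):
    """All 3-share vectors whose XOR equals the given bit."""
    return [s for s in product((0, 1), repeat=3) if s[0] ^ s[1] ^ s[2] == bit]

def is_uniform(x, y):
    """For fixed unshared (x, y): do all valid output sharings of x & y
    occur equally often over uniformly drawn input sharings?"""
    out = Counter(ti_and(xs, ys) for xs in sharings(x) for ys in sharings(y))
    return len(out) == len(sharings(x & y)) and len(set(out.values())) == 1

# the sharing is correct but not uniform:
assert is_uniform(0, 0) is False
```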
Indeed, the challenge is the realization of the masked Sboxes with high alge-
braic degree. If t > 2, we can apply the same trick used in [32,34], i.e., by
decomposing the Sbox into quadratic bijections. In other words, if we can write
S = G ◦ F, where both G and F are bijections with t = 2, we are able to
implement the first-order TI of F and G with the minimum number of shares n = 3.
Such a construction needs registers between the masked F and G to isolate the
corresponding glitches.
After the decomposition, fulfilling all the TI requirements except uniformity
is straightforward. As a solution, the authors of [10] proposed to find affine
functions A1 and A2 in such a way that F = A2 ◦ Q ◦ A1. If we are able to
represent a uniform sharing of the quadratic function Q, applying A1 on all
input shares, and A2 on all output shares gives us a uniform sharing of F .
Fig. 2. Raw (a), interleaved (b), and iterative (c) arrangements of a decomposed Sbox F = F4 ◦ F3 ◦ F2 ◦ F1.
Later, in [9], three more efficient variants of the AES TI Sbox were introduced.
The authors applied several tricks, e.g., increasing the number of shares to 4 or
5 and reducing them back to 3 in order to relax the fresh-randomness requirements.
Details of all different designs are listed in Table 1. In short, the most efficient
design (called nimble) forms a 3-stage pipeline, where 92 extra registers and 32
fresh random bits are required.
¹ In the following we denote functions by a hexadecimal string in which the first
letter denotes the first element of the look-up table implementing the function.
² Alternatively, one can apply the technique presented in [24].
The function a · b (AND gate) cannot be uniformly shared with 3 shares, but
a · b + c (AND+XOR) can be uniform if a, b, and c are uniformly shared.
Therefore, a 4-share version of TI S0 (resp. S1 ) can be realized in 5 stages.
Robin is constructed based on a 3-round Feistel network, similar to Crypton V0.5,
but a single 4-bit bijection S4 plays the role of all functions P1, P2, and P3. Although
the swap of the nibbles in the last Feistel round is omitted, the Robin Sbox is
the only known 8-bit Sbox which can be implemented in an iterative fashion.
S4 : 086D5F7C4E2391BA has been taken from [41], known as the Class 13 Sbox. S4
is affine equivalent to the cubic class C223 and, as stated above, can be uniformly
shared with 3 shares in 2 stages. As one of the smallest solutions we considered
A3 ◦ Q294 ◦ A2 ◦ Q294 ◦ A1 with A1 : AE268C04BF379D15, A2 : C480A2E6D591B3F7,
A3 : 20A8B93164ECFD75. Therefore, with no extra fresh randomness we can
realize uniform sharing of the Robin Sbox with 3 shares in 6 stages.
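This decomposition is just a chain of look-up-table compositions. Since Q294's table is not quoted here, the sketch below (ours) uses a made-up stand-in Q and only demonstrates the composition mechanics with the quoted A1, A2, A3 tables:

```python
def compose(*tables):
    """Right-to-left composition: compose(A3, Q, A2, Q, A1) applies A1 first."""
    def apply(x):
        for t in reversed(tables):
            x = t[x]
        return x
    return [apply(x) for x in range(len(tables[0]))]

def tab(hexstring):
    return [int(c, 16) for c in hexstring]

A1 = tab("AE268C04BF379D15")        # tables quoted in the text
A2 = tab("C480A2E6D591B3F7")
A3 = tab("20A8B93164ECFD75")
Q = tab("0123456789ABCEDF")         # made-up placeholder, NOT the real Q294

S = compose(A3, Q, A2, Q, A1)
assert sorted(S) == list(range(16))  # composing bijections yields a bijection
```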
In order to implement this construction, we have four different options. A
block diagram of the design is shown in Fig. 3(b) (the gray-filled registers are
required only for the pipelined variants).
– Iterative, w/o pipeline: one Sbox per 6 clock cycles.
– Iterative, pipelined: two Sboxes per 6 clock cycles.
– Raw, w/o pipeline: one Sbox per 6 clock cycles.
– Raw, pipelined: six Sboxes per 6 clock cycles, each with a latency of 6 clock cycles.
Note that extra control logic (such as multiplexers) is required for all iterative
designs which is excluded from Fig. 3(b) and Table 1 for the sake of clarity.
Fig. 3. Threshold implementation of the Robin and Fantomas Sboxes; each signal
represents 3 shares, and the gray registers are for the pipelined variants.
maximum degree of two. Otherwise, additional shares may increase the area or
randomness requirements for the whole circuit. In [11], six main quadratic per-
mutation classes are identified which are listed in Table 2. All existing quadratic
4-bit permutations are affine equivalent to one of those six. However, it should be
noted that permutations of class Q4300 cannot be easily shared with three shares
without decomposition or additional randomness. Therefore, we mainly focus on
the other classes from our search. Note that we include the identity function A40
in the case of the SPN construction. Since the identity function does not require
any area, round functions based on a combination of identity and one quadratic
4-bit permutation can result in very lightweight designs.
One important difference to all previous constructions listed in Table 1 is
that we consider a higher number of iterations for our constructions. This is
motivated by two observations. First, it may allow improving upon the crypto-
graphic criteria, and second, it may be beneficial to use a simpler round
function, in particular one that can be implemented in one stage, more often,
rather than a more complicated round function with a smaller number of iterations.
As can be seen in Table 1 this approach of increasing the number of iterations is
quite successful in many cases.
Next we describe in detail the search for good Sboxes for each of the three
constructions we considered.
4.1 Feistel-Construction
As a first construction, we examine round functions using a Feistel-network sim-
ilar to Fig. 1(a). By the basic approach described below, we were able to exhaus-
tively investigate all possible constructions based on any 4-bit to 4-bit function
for any number of iterations between 1 and 5. This can be seen as an extension
(in the case of n = 4 and for identical round functions) to the results given in
[16] where up to 3 rounds have been studied.
However, such an exhaustive search is not possible in a naive way. As there
are 2^64 4-bit functions and checking the cryptographic criteria of an n-bit Sbox
requires roughly 2^(2n) basic operations, a naive approach would need more than
2^80 operations.
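This counting argument can be sanity-checked with plain arithmetic (nothing here is paper-specific):

```python
# A 4-bit-to-4-bit function is an arbitrary map from 16 inputs to 16
# values, and checking an n-bit Sbox costs roughly 2^(2n) operations.
n_functions = 16 ** 16            # = 2^64 candidate round functions
cost_per_sbox = 2 ** (2 * 8)      # = 2^16 operations per 8-bit Sbox

assert n_functions == 2 ** 64
assert n_functions * cost_per_sbox == 2 ** 80
```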
Fortunately, this task can be accelerated by exploiting the distinct structure
of Feistel-networks while still covering the entire search space.
We recall the definition of a Feistel round for the function F : F_2^n → F_2^n:
affine equivalence [14]. There are 4713 equivalence classes up to extended affine
equivalence. Now, with the results given in the full version of the paper [12], it
is enough to consider all functions of the form A1 ◦ F + C, where A1 is an affine
permutation and C is any linear mapping on 4 bits. As Feistel^n_{A1◦F◦A2+C} is affine
equivalent to the function Feistel^n_{A2◦A1◦F◦A2◦A2^(−1)+C◦A2^(−1)} = Feistel^n_{A2◦A1◦F+C},
this will exhaust all possibilities up to affine equivalence. Doing so, we reduce
the search space to:
As this is still a large search space, we employed GPUs to tackle this task.
Sboxes. This is clearly not feasible. Therefore, we decided to restrict the number
of possibilities for each of the two functions. In particular, we only consider the
representative of each class as presented in [11], without affine equivalents. This
reduces the search space to
P1(L||R) ⊕ C1 = P2(R||L) ⊕ C2, ∀ R, L ∈ F_2^n.
Thus, the search can be sped up since BitPerm^1_{F1,F2,C1,P1} is the same as
BitPerm^1_{F2,F1,C2,P2}. Therefore, we only need to check
[Figure 4 consists of two plots, (a) differential uniformity and (b) linearity, each
plotted against the number of iterations (2 to 10).]
Fig. 4. The smallest achievable differential uniformity and linearity for each number
of iterations for round functions with F16-linear layers and F1 = A40, with one curve
per choice of F2 ∈ {Q44, Q412, Q4293, Q4294, Q4299}.
5 Results
We completed the search for the three aforementioned types of round functions
with up to ten iterations.
The search for Feistel-networks for all 4713 classes takes around two weeks on
a machine with four NVIDIA K80s for a specific set of parameters. In particular,
the performance depends on the bounds defined by cryptographic properties
(differential uniformity) as well as the iteration count of the network. Note that,
with respect to cryptographic criteria, our search shows that for iterations ≤ 5
no 8-bit balanced Feistel with identical round functions can have a linearity below
56 and a differential uniformity below 8.
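The two criteria driving the search can be sketched in a few lines. This is a straightforward textbook implementation of differential uniformity (the maximum DDT entry) and linearity (the maximum absolute Walsh-transform value); the function names are ours, and the demonstration uses the identity permutation rather than an Sbox from the paper:

```python
# Differential uniformity: the largest entry of the difference
# distribution table over nonzero input differences.
def diff_uniformity(sbox):
    n = len(sbox)
    best = 0
    for a in range(1, n):
        counts = [0] * n
        for x in range(n):
            counts[sbox[x] ^ sbox[x ^ a]] += 1
        best = max(best, max(counts))
    return best

def parity(v):
    return bin(v).count("1") & 1

# Linearity: the largest absolute correlation sum over all input masks a
# and nonzero output masks b.
def linearity(sbox):
    n = len(sbox)
    best = 0
    for a in range(n):
        for b in range(1, n):
            corr = sum(1 - 2 * (parity(a & x) ^ parity(b & sbox[x]))
                       for x in range(n))
            best = max(best, abs(corr))
    return best

# The identity permutation is the worst case on 4 bits: every input
# difference maps to itself (uniformity 16) and it is linear (linearity 16).
ident = list(range(16))
assert diff_uniformity(ident) == 16
assert linearity(ident) == 16
```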
Furthermore, the search for SPNs with bit permutations (resp. with F16 -
linear layer) required around 48 h (resp. 54 h) on one Intel Xeon CPU with 12
cores. It was possible to detect some very basic relations between the security,
number of iterations and area of the Sbox. Figure 4 shows the smallest differential
uniformity and linearity values which can be achieved for a specific number of
iterations using a round function based on the F16 -linear layer with constant
addition. As expected, the more iterations are applied, the higher the resistance
against linear and differential cryptanalysis that can be achieved. The size of each
of the considered quadratic permutations is given in Table 2. Bigger functions
like Q4293 and Q4299 achieve good cryptographic properties with fewer iterations
than smaller functions like Q44 . For the other combinations of (F1 , F2 ) and types
of round functions the graphs behave similarly. Depending on the remaining
layers of the cipher and the targeted use case, a designer needs to find a good
balance between the parameters. In the following, we present a few selected
Sboxes optimized for different types of applications.
In our evaluation, we only consider Sboxes with differential uniformity at
most 16 and linearity of at most 64. These are the worst properties among the
constructed 8-bit Sboxes listed in Table 1. From the cryptographic standpoint,
our Sboxes should not be inferior to these functions. We identified the
following strong Sboxes that cover the most important scenarios.
– SB1 : This Sbox possesses a very small round function. In a serial design the
round function is usually implemented only once to save area.
– SB2 : This Sbox is selected to enable an efficient implementation in a round-
based design. For this not only the size of the round function is important but
also the number of iterations. Additional iterations require additional instan-
tiations of the round function with a dedicated register stage. Furthermore,
this Sbox requires the least number of iterations and can be implemented with
a very low number of AND gates. Thus, it is also suited to masked software
implementations.
– SB3 : This Sbox has very good cryptographic properties and requires one less
iteration than SB4 .
– SB4 : This Sbox has very good cryptographic properties.
– SB5 : This Sbox is similar to SB1 which has a small round function. However,
it trades area for better cryptographic properties.
Strong 8-bit Sboxes with Efficient Masking in Hardware 189
– SB6 : This Sbox is similar to SB2, which is optimized for raw implementations.
However, it trades area for better cryptographic properties.
5.2 Comparison
Table 1 gives an overview of our results and we summarize the most important
observations in the following. The first observation is that our proposed designs
do not require fresh mask bits to achieve uniformity. This is an improvement over
all TI types of the AES Sbox and some other Sboxes from Table 1. They need
up to 64 bits of randomness for one full Sbox. Given that modern ciphers usually
include multiple rounds with many Sboxes, this can add up to a significant
amount of randomness which needs to be generated.
Furthermore, all of our proposed Sboxes can be implemented iteratively. This
comes with the advantage that even the more complex designs, e.g., SB4 and
SB5 , can be realized with very few gates depending on the design architecture.
Of all the other Sboxes in Table 1, this is only possible for Robin, and its round
function requires more area than any of our proposed Sboxes.
In particular, SB1 and SB2 require the least area in their respective target
architectures (i.e., iterative and raw) out of all considered 8-bit Sboxes. The
difference for the iterative architecture is especially large: SB1 needs roughly
one sixth of the area of the Robin Sbox.
SB2 requires the least number of stages. Additionally, it requires only 12
AND gates for the whole Sbox which is very close to the best number, i.e., 11 for
Fantomas. This is an advantage for masked bit-sliced implementations making
SB2 suitable for software and hardware designs. A more detailed discussion of
this aspect is given in the full version of the paper [12].
As expected, we did not find any Sbox with better cryptographic properties
than the AES Sbox. However, SB3 and SB4 can still provide better resistance
against cryptanalysis attacks than most of the other considered Sboxes. This
comes at the cost of an increased area for the raw implementations. Nevertheless,
the required area is still smaller than any AES TI and their round function is
still smaller than Robin for iterative designs.
As depicted in Fig. 4, a trade-off between resources and cryptographic prop-
erties is possible. If SB1 and SB2 do not provide the desired level of security
and SB3 and SB4 are too large, SB5 and SB6 might be the best solution. Their
cryptographic properties are still better than or equal to those of the competitors,
while their area is significantly smaller than that of SB3 and SB4 . For the sake of
completeness,
we included the area requirement of the unprotected implementation as well as
the latency of different designs in Table 1.
In this work we identified a set of six 8-bit Sboxes with highly useful proper-
ties using a systematic search over a range of composite Sbox constructions. Our
findings include 8-bit Sboxes that provide comparable or even higher resistance
against linear and differential cryptanalysis with respect to other 8-bit Sboxes,
but intrinsically support the TI concept without any external randomness. At
the same time our selected Sboxes come with a range of useful implementa-
tion properties, such as a highly efficient serialization option, or a very low area
requirement. Future work comprises extended criteria for the Sbox composition,
including diffusion layers beyond permutations.
References
1. Banik, S., Bogdanov, A., Isobe, T., Shibutani, K., Hiwatari, H., Akishita, T.,
Regazzoni, F.: Midori: a block cipher for low energy. In: Iwata, T., et al. (eds.)
ASIACRYPT 2015. LNCS, vol. 9453, pp. 411–436. Springer, Heidelberg (2015).
doi:10.1007/978-3-662-48800-3_17
2. Barkan, E., Biham, E.: In how many ways can you write Rijndael? In: Zheng,
Y. (ed.) ASIACRYPT 2002. LNCS, vol. 2501, pp. 160–175. Springer, Heidelberg
(2002)
3. Barreto, P.S.L.M., Rijmen, V.: The Khazad legacy-level block cipher. Primitive
Submitted to NESSIE, 97 (2000)
4. Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi, A., Peyrin, T., Sasaki, Y.,
Sasdrich, P., Sim, S.M.: The SKINNY family of block ciphers and its low-latency
variant MANTIS. In: CRYPTO 2016. LNCS. Springer, Berlin (2016, to appear)
5. Biham, E., Shamir, A.: Differential cryptanalysis of DES-like cryptosystems. In:
Menezes, A., Vanstone, S.A. (eds.) CRYPTO 1990. LNCS, vol. 537, pp. 2–21.
Springer, Heidelberg (1991)
6. Bilgin, B., Bogdanov, A., Knežević, M., Mendel, F., Wang, Q.: Fides: light-
weight authenticated cipher with side-channel resistance for constrained hardware.
In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 142–158.
Springer, Heidelberg (2013)
7. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: A more effi-
cient AES threshold implementation. In: Pointcheval, D., Vergnaud, D. (eds.)
AFRICACRYPT. LNCS, vol. 8469, pp. 267–284. Springer, Heidelberg (2014)
8. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Higher-order threshold
implementations. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014, Part II. LNCS,
vol. 8874, pp. 326–343. Springer, Heidelberg (2014)
9. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Trade-offs for threshold
implementations illustrated on AES. IEEE Trans. CAD Integr. Circ. Syst. 34(7),
1188–1200 (2015)
10. Bilgin, B., Nikova, S., Nikov, V., Rijmen, V., Stütz, G.: Threshold implementations
of all 3 × 3 and 4 × 4 S-boxes. In: Prouff, E., Schaumont, P. (eds.) CHES 2012.
LNCS, vol. 7428, pp. 76–91. Springer, Heidelberg (2012)
11. Bilgin, B., Nikova, S., Nikov, V., Rijmen, V., Tokareva, N., Vitkup, V.: Threshold
implementations of small S-boxes. Cryptogr. Commun. 7(1), 3–33 (2015)
12. Boss, E., Grosso, V., Güneysu, T., Leander, G., Moradi, A., Schneider, T.: Strong
8-bit Sboxes with efficient masking in hardware. Cryptology ePrint Archive, Report
2016/647 (2016). http://eprint.iacr.org/2016/647
13. Boyar, J., Peralta, R.: A new combinational logic minimization technique with
applications to cryptology. In: Festa, P. (ed.) SEA 2010. LNCS, vol. 6049, pp.
178–189. Springer, Heidelberg (2010)
14. Brinkmann, M.: EA classification of all 4 bit functions. Personal Communication
(2008)
15. Canright, D.: A very compact S-box for AES. In: Rao, J.R., Sunar, B. (eds.) CHES
2005. LNCS, vol. 3659, pp. 441–455. Springer, Heidelberg (2005)
16. Canteaut, A., Duval, S., Leurent, G.: Construction of lightweight S-boxes
using Feistel and MISTY structures. In: Dunkelman, O., et al. (eds.) SAC
2015. LNCS, vol. 9566, pp. 373–393. Springer, Heidelberg (2016).
doi:10.1007/978-3-319-31301-6_22
17. Chabaud, F., Vaudenay, S.: Links between differential and linear cryptanalysis. In:
De Santis, A. (ed.) EUROCRYPT 1994. LNCS, vol. 950, pp. 356–365. Springer,
Heidelberg (1995)
18. Daemen, J., Peeters, M., Van Assche, G., Rijmen, V.: Nessie proposal: NOEKEON.
In: 1st Open NESSIE Workshop, pp. 213–230 (2000)
19. Daemen, J., Rijmen, V.: The Design of Rijndael: AES - The Advanced Encryption
Standard. Information Security and Cryptography. Springer, Berlin (2002)
20. Gérard, B., Grosso, V., Naya-Plasencia, M., Standaert, F.-X.: Block ciphers that
are easier to mask: how far can we go? In: Bertoni, G., Coron, J.-S. (eds.) CHES
2013. LNCS, vol. 8086, pp. 383–399. Springer, Heidelberg (2013)
21. Grosso, V., Leurent, G., Standaert, F.-X., Varıcı, K.: LS-designs: bitslice encryption
for efficient masked software implementations. In: Cid, C., Rechberger, C. (eds.)
FSE 2014. LNCS, vol. 8540, pp. 18–37. Springer, Heidelberg (2015)
22. Grosso, V., Leurent, G., Standaert, F.-X., Varici, K., Journault, A., Durvaux, F.,
Gaspar, L., Kerckhof, S.: SCREAM side-channel resistant authenticated encryption
with masking - Version 3. Submission to CAESAR Competition of Authenticated
Ciphers. https://competitions.cr.yp.to/round2/screamv3.pdf
23. Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.: The LED block cipher. In:
Preneel, B., Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 326–341. Springer,
Heidelberg (2011)
24. Kutzner, S., Nguyen, P.H., Poschmann, A.: Enabling 3-share threshold implemen-
tations for all 4-bit S-boxes. In: Lee, H.-S., Han, D.-G. (eds.) ICISC 2013. LNCS,
vol. 8565, pp. 91–108. Springer, Heidelberg (2014)
25. Lim, C.H.: CRYPTON: a new 128-bit block cipher - specification and analysis.
NIST AES Proposal (1998)
26. Lim, C.H.: A revised version of CRYPTON - CRYPTON V1.0. In: Knudsen, L.R.
(ed.) FSE 1999. LNCS, vol. 1636, pp. 31–45. Springer, Heidelberg (1999)
27. Mangard, S., Pramstaller, N., Oswald, E.: Successfully attacking masked AES hard-
ware implementations. In: Rao, J.R., Sunar, B. (eds.) CHES 2005. LNCS, vol. 3659,
pp. 157–171. Springer, Heidelberg (2005)
28. Matsui, M.: Linear cryptanalysis method for DES cipher. In: Helleseth, T. (ed.)
EUROCRYPT 1993. LNCS, vol. 765, pp. 386–397. Springer, Heidelberg (1994)
29. Matsui, M.: New block encryption algorithm MISTY. In: Biham, E. (ed.) FSE
1997. LNCS, vol. 1267, pp. 54–68. Springer, Heidelberg (1997)
30. Moradi, A., Mischke, O., Eisenbarth, T.: Correlation-enhanced power analysis col-
lision attack. In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol.
6225, pp. 125–139. Springer, Heidelberg (2010)
31. Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: a
very compact and a threshold implementation of AES. In: Paterson, K.G. (ed.)
EUROCRYPT 2011. LNCS, vol. 6632, pp. 69–88. Springer, Heidelberg (2011)
32. Nikova, S., Rijmen, V., Schläffer, M.: Secure hardware implementation of nonlinear
functions in the presence of glitches. J. Cryptol. 24(2), 292–321 (2011)
33. Piret, G., Roche, T., Carlet, C.: PICARO – a block cipher allowing efficient higher-
order side-channel resistance. In: Bao, F., Samarati, P., Zhou, J. (eds.) ACNS 2012.
LNCS, vol. 7341, pp. 311–328. Springer, Heidelberg (2012)
34. Poschmann, A., Moradi, A., Khoo, K., Lim, C.-W., Wang, H., Ling, S.: Side-
channel resistant crypto for less than 2,300 GE. J. Cryptol. 24(2), 322–345 (2011)
35. Poschmann, A.Y.: Lightweight cryptography: cryptographic engineering for a per-
vasive world. Ph.D. thesis, Ruhr University Bochum (2009)
36. Raddum, H.: More dual Rijndaels. In: Dobbertin, H., Rijmen, V., Sowa, A. (eds.)
AES 2005. LNCS, vol. 3373, pp. 142–147. Springer, Heidelberg (2005)
37. Rijmen, V., Barreto, P.S.L.M.: The WHIRLPOOL hash function. World-Wide Web
document, p. 72 (2001)
38. Shahverdi, A., Taha, M., Eisenbarth, T.: Silent simon: a threshold implementation
under 100 slices. In: HOST 2015, pp. 1–6. IEEE (2015)
39. Shirai, T., Shibutani, K., Akishita, T., Moriai, S., Iwata, T.: The 128-bit blockci-
pher CLEFIA (extended abstract). In: Biryukov, A. (ed.) FSE 2007. LNCS, vol.
4593, pp. 181–195. Springer, Heidelberg (2007)
40. Standaert, F.-X., Piret, G., Rouvroy, G., Quisquater, J.-J., Legat, J.-D.: ICEBERG:
an involutional cipher efficient for block encryption in reconfigurable hardware.
In: Roy, B., Meier, W. (eds.) FSE 2004. LNCS, vol. 3017, pp. 279–299. Springer,
Heidelberg (2004)
41. Ullrich, M., De Cannière, C., Indesteege, S., Küçük, Ö., Mouha, N., Preneel, B.:
Finding optimal bitsliced implementations of 4 × 4-bit S-boxes. In: Symmetric Key
Encryption Workshop, p. 20 (2011)
42. Virtual Silicon Inc.: 0.18 µm VIP Standard Cell Library Tape Out Ready, Part
Number: UMCL18G212T3, Process: UMC Logic 0.18 µm Generic II Technology:
0.18 µm, July 2004
Masking AES with d + 1 Shares in Hardware
1 Introduction
When cryptography is naively deployed in embedded devices, secrets can leak
through side-channel information such as instantaneous power consumption,
electromagnetic emanations or timing of the device. Ever since attacks based
on side-channels were discovered and investigated [3,17,18], several studies have
been performed to counter the exploitation of these vulnerabilities.
A popular way to strengthen cryptographic implementations against such
physical cryptographic attacks is masking [10]. It randomizes the internal compu-
tation and hence detaches the side-channel information from the secret-dependent
intermediate values. Masking is both provably secure [10,23] and practical. Masking
has been shown to increase the difficulty of mounting side-channel attacks on
a wide range of cryptographic algorithms.
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 194–212, 2016.
DOI: 10.1007/978-3-662-53140-2_10
There exist plenty of masked AES implementations; hence we limit our intro-
duction to TIs. The first TI of AES presented in [20] requires 11.1 kGE. Later,
the hardware footprint of TI-AES was reduced to 8.1 kGE in a sequence of pub-
lications [4,6]. All these first-order TIs use functions with at least three input
shares, with the exception of the smallest TI-AES, which uses two shares for
linear operations. A second-order TI of the AES S-box using six input shares
is presented in [14] and is shown to require 7.8 kGE. We emphasize that in all
these TIs, the number of input shares of the nonlinear operations is chosen to
be s_in ≥ td + 1.
1.2 Contribution
2 Preliminaries
2.1 Notation
We use small and bold letters to describe elements of GF(2^n) and their sharings
respectively. We assume that any possibly sensitive variable a ∈ GF(2^n) is split
into s shares (a1, . . . , as) = a, where ai ∈ GF(2^n), in the initialization phase
of the cryptographic algorithm. A possible way of performing this initialization,
which we inherit, is as follows: the shares a1, . . . , as−1 are selected randomly
from a uniform distribution and as is calculated such that a = a1 ⊕ · · · ⊕ as.
We refer to the jth bit of a as aj unless a ∈ GF(2). We use the same notation
to share a function f into s shares f = (f1, . . . , fs).
The number of input and output shares of f are denoted by sin and sout
respectively. We refer to field multiplication, addition and concatenation as ⊗,
⊕ and respectively.
||
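The share-initialization just described can be sketched as follows (the helper names `share` and `unshare` are ours, and Python's `secrets` stands in for a proper hardware RNG):

```python
import secrets

def share(a, s, nbits=8):
    """Split a into s shares: s-1 uniformly random shares plus a last
    share computed so that the XOR of all shares equals a."""
    shares = [secrets.randbits(nbits) for _ in range(s - 1)]
    last = a
    for x in shares:
        last ^= x
    return shares + [last]

def unshare(shares):
    """Recombine a Boolean sharing by XORing all shares."""
    out = 0
    for x in shares:
        out ^= x
    return out

assert unshare(share(0x2A, 3)) == 0x2A
```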
Nonlinear layer N . This layer is composed of all the linear and nonlinear terms
(ai bj for the AND-gate example) of the shared function, and is hence responsi-
ble for the correctness of the sharing. A requirement is that this layer must
see uniformly shared inputs.
Linear layer L. This layer inherits non-completeness, the essence of TI. It ensures
that no more than d shares of a variable are used within each group of terms
to be XORed. If the number of input shares is limited to d + 1, the non-
completeness implies the use of only one share per unmasked value in each
group. We refer to [26] for more details.
Refreshing layer R. The multivariate security of a dth-order masking scheme
depends on the proper insertion of additional randomness to break depen-
dency between intermediates potentially appearing in different clock cycles.
One way of remasking is using sout bits of randomness for sout shares at the
end of L in a circular manner. The restriction of this layer can be relaxed
when first-order or univariate security is satisfactory.
Synchronization layer S. In a circuit with non-ideal gates, this layer ensures that
non-completeness is satisfied in between nonlinear operations. It is depicted
with a bold line in Fig. 1 and is typically implemented as a set of registers
in hardware. The lack of this layer causes leakage in subsequent nonlinear
operations.
Compression layer C. This layer is used to reduce the number of shares synchro-
nized in S. It is especially required when the number of shares after S is
different from the number of input shares of N .
For further clarification, we also describe the concept of uniformity, the dif-
ference between using d + 1 shares or more and the limitations brought by using
d + 1 shares in the rest of this section.
[Fig. 1: sharings of an AND gate. The nonlinear layer N computes the terms ai bj,
the linear layer L compresses them non-completely, the refreshing layer R adds fresh
masks R1, . . . , R10, and the compression layer C reduces the result to the output
shares c1, . . . , c5.]
Number of Input Shares. Using td+1 input shares originates from the rule-of-
thumb “combinations of up to d component functions fi should be independent
of at least one input share”. However, this is an overly strict requirement to
fulfill non-completeness. One can construct a sharing such that combinations of
up to d component functions are independent of at least one input share of each
variable, without imposing any condition on the index i. The resulting sharing
f is clearly secure since no combination of up to d component functions reveals
all shares of a variable.
In this paper, we benefit from this observation and use d + 1 shares. This
incurs a significantly smaller area footprint, as will be shown later on. It is
however not obvious at first sight whether a construction with d + 1 shares is
necessarily smaller. As a matter of fact, there are many factors that work in
the opposite direction, i.e. the number of component functions fk is increased,
and there is a need for additional circuitry for the refreshing and compression
of the output shares. On the other hand, the shares fk are significantly smaller,
since they depend on fewer input bits. A classic result from Shannon [28] states
that almost all Boolean functions of d input bits require a circuit of size
Θ(2^d/d). One can assume that the size of the component functions fk follows
this exponential dependency with respect to the number of input shares. Thus, it may
pay off to have more component functions fk and additional circuitry to obtain
a smaller overall sharing.
The following example illustrates the problem: assume a first-order sharing of an AND gate with shared
inputs (a1 , a2 ) and (b1 , b2 ) which necessarily calculates the terms a1 b2 and a2 b1 .
If these sharings are dependent, for the sake of the example say a = b, the term
a1 b2 = a1 a2 obviously leaks a. This example clearly breaks the joint uniformity
rule for a and b. Note that this does not necessarily imply the requirement of
unshared values to be independent.
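To make the cross terms a1 b2 and a2 b1 concrete, here is a sketch of a first-order AND with d + 1 = 2 shares per input in the style of such sharings. This is an illustrative software model, not the paper's hardware circuit: the synchronization register between refreshing and compression has no software counterpart.

```python
def shared_and(a1, a2, b1, b2, r):
    """First-order AND on 2-share inputs. The inner-domain products
    a1&b1 and a2&b2 stay unmasked; the cross-domain products a1&b2 and
    a2&b1 are refreshed with the same fresh bit r before compression."""
    t11 = a1 & b1
    t12 = (a1 & b2) ^ r
    t21 = (a2 & b1) ^ r
    t22 = a2 & b2
    return t11 ^ t12, t21 ^ t22   # compression back to two output shares

# Exhaustive correctness check over all unshared values, sharings and
# fresh bits: the output shares always recombine to a AND b.
for a in (0, 1):
    for b in (0, 1):
        for a1 in (0, 1):
            for b1 in (0, 1):
                for r in (0, 1):
                    c1, c2 = shared_and(a1, a ^ a1, b1, b ^ b1, r)
                    assert c1 ^ c2 == a & b
```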
As in all previous TIs of the AES S-box [4,6,14,20], our masked implementa-
tion is based on Canright’s Very Compact S-box [9]. This allows for a fairer
comparison of the area reduction that comes from our masking strategy.
Figure 2 depicts the unmasked S-box with the specific subfield decomposi-
tions we adopt. Although it is possible to reduce the number of pipeline stages
of one S-box by merging Stage 3 and Stage 4 into an inversion in GF(2^4) [4,6],
we choose to rely on multiplications alone, since the number of component func-
tions equals (d + 1)^t, i.e. we can achieve a lower area and a reduced randomness
consumption by using multiplications (t = 2) instead of inversions (t = 3). We
now go over the masked design in a stage by stage manner, where the stages are
separated by pipeline registers. The complete masked S-box is depicted in Fig. 3.
First Stage. The first operation occurring in the decomposed S-box performs
a change of basis through a linear map. Its masking requires instantiating this
linear map once for each share i. This mapping is implemented in combinational
logic and it maps the 8-bit input (a_i^1, . . . , a_i^8) to the 8-bit output (y_i^1, . . . , y_i^8) for
each share i as follows:
200 T. De Cnudde et al.
y_i^1 = a_i^8 ⊕ a_i^7 ⊕ a_i^6 ⊕ a_i^3 ⊕ a_i^2 ⊕ a_i^1        y_i^5 = a_i^8 ⊕ a_i^5 ⊕ a_i^4 ⊕ a_i^2 ⊕ a_i^1
y_i^2 = a_i^7 ⊕ a_i^6 ⊕ a_i^5 ⊕ a_i^1                        y_i^6 = a_i^1
y_i^3 = a_i^7 ⊕ a_i^6 ⊕ a_i^2 ⊕ a_i^1                        y_i^7 = a_i^7 ⊕ a_i^6 ⊕ a_i^1
y_i^4 = a_i^8 ⊕ a_i^7 ⊕ a_i^6 ⊕ a_i^1                        y_i^8 = a_i^7 ⊕ a_i^4 ⊕ a_i^3 ⊕ a_i^2 ⊕ a_i^1
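Since the map is F2-linear, masking it amounts to applying it to each share independently. A sketch transcribing the equations above (the bit numbering a^1 . . . a^8 is taken verbatim from the text, so treat the exact bit-to-position ordering as an assumption):

```python
# Bit k of the byte `a` stands for bit a^(k+1); row j lists the input
# bits XORed into output bit y^j, as in the equations above.
ROWS = {
    1: (8, 7, 6, 3, 2, 1),
    2: (7, 6, 5, 1),
    3: (7, 6, 2, 1),
    4: (8, 7, 6, 1),
    5: (8, 5, 4, 2, 1),
    6: (1,),
    7: (7, 6, 1),
    8: (7, 4, 3, 2, 1),
}

def parity(v):
    return bin(v).count("1") & 1

def linmap(a):
    """Apply the 8-bit change-of-basis to one share."""
    y = 0
    for j, bits in ROWS.items():
        mask = sum(1 << (k - 1) for k in bits)
        y |= parity(a & mask) << (j - 1)
    return y

def linmap_shared(shares):
    """Masking a linear map: apply it to every share independently."""
    return [linmap(a_i) for a_i in shares]

# F2-linearity means the sharewise map recombines correctly:
x1, x2, x3 = 0x3A, 0xC5, 0x7E                  # arbitrary shares
y1, y2, y3 = linmap_shared([x1, x2, x3])
assert y1 ^ y2 ^ y3 == linmap(x1 ^ x2 ^ x3)
```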
Note that synchronizing the output values of the first stage with registers
is required for security. For simplicity, we explain what can go wrong in the
absence of these registers for the first-order case, but the same can be expressed
for any order d. Let us consider the y^2 and y^6 bits of the output of the linear map.
The shares corresponding to those bits are then given by (y_1^2, y_2^2) and (y_1^6, y_2^6)
respectively. These two bits will go through the AND gates of the subsequent
GF(2^4) multiplier, which leads to the following term being computed at one
point:
y_1^2 · y_2^6 = (a_1^7 + a_1^6 + a_1^5 + a_1^1) · a_2^1
If there is no register between the linear map and the GF(2^4) multiplier, the
above expression is realized by combinational logic, which deals with a_1^1 and a_2^1
in a nonlinear way and causes leakage on a^1 = (a_1^1, a_2^1). Note that the problem
mentioned above does not happen in TIs with s_in = td + 1 shares, since the con-
servative non-completeness condition makes sure that each component function
is independent of at least one share (for d = 1). Hence, linear functions before
and after nonlinear component functions can be used without synchronization.
No remasking is required after this stage since the computed function is linear.
Second Stage. We consider the parallel application of nonlinear multiplication
and affine Square Scaling (Sq. Sc.) as one single function d = b⊗c⊕SqSc(b⊕c).
For the second order, the resulting equations are given by:
d1 = b1 ⊗ c1 ⊕ SqSc(b1 ⊕ c1)
d2 = b1 ⊗ c2
d3 = b1 ⊗ c3
d4 = b2 ⊗ c1
d5 = b2 ⊗ c2 ⊕ SqSc(b2 ⊕ c2)
d6 = b2 ⊗ c3
d7 = b3 ⊗ c1
d8 = b3 ⊗ c2
d9 = b3 ⊗ c3 ⊕ SqSc(b3 ⊕ c3)
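The sharing structure above can be modelled in software. Assumptions: GF(2^4) arithmetic here uses the polynomial basis modulo x^4 + x + 1 and a placeholder scaling constant ν, whereas the actual design works in Canright's tower-field representation, so only the index pattern of the nine component functions, not the field arithmetic, matches the paper:

```python
# GF(2^4) multiplication in the polynomial basis modulo x^4 + x + 1
# (an assumption -- Canright's construction uses a different basis).
def gf16_mul(x, y):
    r = 0
    for _ in range(4):
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x & 0x10:
            x ^= 0x13          # reduce modulo x^4 + x + 1
    return r

NU = 0x8  # placeholder square-scaling constant (design-specific)

def sqsc(x):
    """Square scaling x -> nu * x^2; F2-linear since squaring is."""
    return gf16_mul(NU, gf16_mul(x, x))

def stage2(b, c):
    """The nine component functions d1..d9 for 3-share inputs b, c:
    one cross product b_i (x) c_j per component, with SqSc folded into
    the diagonal terms d1, d5 and d9 only (preserving non-completeness:
    each component sees one share per unmasked variable)."""
    d = [gf16_mul(b[i], c[j]) for i in range(3) for j in range(3)]
    for i in (0, 4, 8):        # indices of d1, d5, d9
        k = i // 4
        d[i] ^= sqsc(b[k] ^ c[k])
    return d

# Correctness: the XOR of all nine outputs equals b (x) c + SqSc(b + c)
# on the unshared values.
b, c = [3, 5, 6], [9, 2, 4]
total = 0
for v in stage2(b, c):
    total ^= v
bb, cc = b[0] ^ b[1] ^ b[2], c[0] ^ c[1] ^ c[2]
assert total == gf16_mul(bb, cc) ^ sqsc(bb ^ cc)
```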
[Fig. 3: pipeline of the masked S-box. After the linear map, the register stages are
24, 60, 54, 60 and 72 bits wide, each with remasking registers R1, . . . , R9; the
GF(2^2) inversion sits in the middle stage.]
It is important to add the affine contribution from the Square Scaling to the mul-
tiplier output in such a way that the non-completeness property is not broken,
which leaves only one possibility for the construction. In previous works [4,6,20],
these two functions are treated separately, leading to more outputs at this stage.
By approaching the operations in the second stage in parallel, we obtain two
advantages. Firstly, we omit the extra registers for storing the outputs of both
sub-functions separately. Secondly, less randomness is required to achieve uni-
formity for the inputs of the next stage.
Before the new values are clocked in the register, we need to perform a mask
refreshing. This serves two purposes for higher-order TI. Firstly, it is required
to make the next stage’s inputs uniform and secondly, we require new masks
for the next stage’s inputs to provide multivariate security. The mask refreshing
uses a ring structure and has the advantage that the sum of fresh masks does
not need to be saved in an extra register. In addition, we use an equal number
of shares and fresh masks, which leads to a randomness consumption of 36 bits
for this stage. After the mask refreshing, a compression is applied to reduce the
number of output shares back to d + 1.
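The circular remasking can be sketched as follows (the exact wiring in the design may differ; this only shows why a ring of s_out fresh masks preserves the unshared value):

```python
def ring_refresh(shares, masks):
    """Remask with a ring of fresh masks: share k absorbs r_k and
    r_{k+1 mod s}, so every mask is added exactly twice and the XOR of
    all shares is unchanged. No extra register is needed to hold the
    sum of the fresh masks."""
    s = len(shares)
    return [shares[k] ^ masks[k] ^ masks[(k + 1) % s] for k in range(s)]

y = ring_refresh([1, 2, 3], [7, 8, 9])
assert y[0] ^ y[1] ^ y[2] == 1 ^ 2 ^ 3
```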
Third Stage. This stage is similar to the second stage. Here, the received nib-
bles are split in 2-bit couples for further operation. The Scaling operation (Sc)
replaces the similar affine Square Scaling and is executed alongside the multipli-
cation in GF(22 ). By combining both operations, we can share the total function
by taking again the non-completeness into account. Since a nonlinear multipli-
cation is performed on the 2-bit shares, remasking is required on its 9 outputs,
consuming a total of 18 bits of randomness.
Fourth Stage. The fourth stage is composed of an inversion and two parallel
multiplications in GF(2^2). The inversion in GF(2^2) is linear, is implemented
by swapping the bits using wires, and comes at no additional cost. The outputs of
the multiplications are concatenated, denoted by || in Fig. 3, to form 4-bit values
in GF(2^4). The concatenated 4-bit values of the 9 outputs of the multipliers are
remasked with a total of 36 fresh random bits.
Fifth Stage. Stage 5 is similar to Stage 4. The difference between the two stages
lies in the absence of the inversion operation and the multiplications being
performed in GF(2^4) instead of GF(2^2). The concatenation of its outputs results
in byte values, which are remasked with 72 fresh random bits.
Sixth Stage. In the final stage of the S-box, the inverse linear map is performed.
By using a register between Stage 5 and Stage 6, we can remask the shares and
perform a compression before the inverse linear map is applied, resulting in
only three instead of nine instances of the inverse linear map. As with the linear
map, no uniform sharing of its inputs is required for security. However, in the full
AES, this output will at some point reappear at the input of the S-box, where
it undergoes nonlinear operations again. This is why we insert the remasking.
Note that this register and the register right after the linear map can be merged
with the AES state registers.
Masking AES with d + 1 Shares in Hardware 203
Parallel Operations. The parallel linear and nonlinear operations from Stage 2
and 3 are altered in the following way:
d1 = b1 ⊗ c1 ⊕ SqSc(b1 ⊕ c1 )
d2 = b1 ⊗ c2
d3 = b2 ⊗ c1
d4 = b2 ⊗ c2 ⊕ SqSc(b2 ⊕ c2 )
Again, the i-th output of the SqSc and Sc operations is combined with the output bi ⊗ ci of the multiplier in order to preserve non-completeness. While this structure is similar to our second-order design, we consider this parallel operation an optimization compared to other first-order TIs [4,6,20].
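The correctness of this sharing can be checked by recombining the output shares: since SqSc is GF(2)-linear, d1 ⊕ d2 ⊕ d3 ⊕ d4 equals b ⊗ c ⊕ SqSc(b ⊕ c) for b = b1 ⊕ b2 and c = c1 ⊕ c2. Below is a Python sketch; the GF(2⁴) representation (modulo x⁴ + x + 1) and the scaling constant N are illustrative assumptions, not the paper's basis:

```python
# Sketch of the parallel shared multiply-and-square-scale of Stage 2/3.
# mul4, sqsc and the constant N are illustrative stand-ins.

def mul4(x: int, y: int) -> int:
    """Multiply in GF(2^4), here reduced modulo x^4 + x + 1 (illustrative)."""
    r = 0
    for i in range(4):            # carry-less multiplication
        if (y >> i) & 1:
            r ^= x << i
    for i in range(7, 3, -1):     # reduce degrees 7..4
        if (r >> i) & 1:
            r ^= 0b10011 << (i - 4)
    return r

def sqsc(x: int) -> int:
    """Square-and-scale; GF(2)-linear since squaring and scaling are."""
    N = 0b1000  # illustrative scaling constant
    return mul4(mul4(x, x), N)

def shared_mul(b1, b2, c1, c2):
    """The four output shares from the paper's equations."""
    d1 = mul4(b1, c1) ^ sqsc(b1 ^ c1)
    d2 = mul4(b1, c2)
    d3 = mul4(b2, c1)
    d4 = mul4(b2, c2) ^ sqsc(b2 ^ c2)
    return d1, d2, d3, d4
```

Note that d2, for instance, depends only on the first share of b and the second share of c, which is exactly the non-completeness property being preserved.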
Modified Refreshing. The ring structure of the refreshing in the general, higher-
order case can be substituted with a less costly structure for first-order security.
This structure of the refreshing is shown in Fig. 4. This modification lowers the
randomness requirements from 4 to 3 units of randomness.
Preliminary Tests. A preliminary evaluation was carried out with the tool from [25] in a simulated environment, allowing us to refine our design. We then proceeded with the side-channel evaluation based on actual measurements.
Fig. 4. Modified mask refreshing for first-order security: shares S1, S2 and S3 are refreshed with fresh random units R1, R2 and R3, while S4 is refreshed with R1 ⊕ R2 ⊕ R3 instead of a fourth fresh unit.
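A behavioral sketch of this first-order refreshing (with each randomness unit modeled as a byte, an assumption for illustration) shows that three fresh units suffice while the XOR-sum of the shares, i.e., the masked secret, is preserved:

```python
import secrets

def refresh_first_order(shares):
    """Refresh 4 shares with only 3 fresh random units: S1..S3 each get a
    fresh unit, S4 gets the XOR of all three, so the masks cancel out and
    the XOR-sum of the shares is unchanged."""
    r1, r2, r3 = (secrets.randbits(8) for _ in range(3))
    s1, s2, s3, s4 = shares
    return (s1 ^ r1, s2 ^ r2, s3 ^ r3, s4 ^ r1 ^ r2 ^ r3)
```

The general ring refreshing would consume one unit per share (4 in total); this variant saves one unit per refresh.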
Low Noise. We did our best to keep the measurement noise as low as possible. The platform itself is very low noise (DPA on an unprotected AES succeeds with a few tens of traces). We clock our designs at 3 MHz with a very stable clock and sample at 1 GS/s with a Tektronix DPO 7254C oscilloscope. The measurements cover 1.5 rounds of AES.
Synthesis. We used standard design flow tools (Xilinx ISE) to synthesize our designs. We selected the KEEP HIERARCHY option during synthesis to prevent optimizations across module boundaries that would destroy our (security-critical) design partitioning.
4.2 Methodology
We use leakage detection tests [11–13,15,27] to test for any power leakage of our
masked implementations. The fix class of the leakage detection is chosen as the
zero plaintext in all our evaluations.
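The core of such a fixed-vs-random leakage detection test is Welch's t-statistic, computed per sample point between the fix-class and random-class traces; |t| > 4.5 at any point is taken as evidence of leakage. A minimal sketch:

```python
import math

def welch_t(x, y):
    """Welch's t-statistic between two groups of leakage samples
    (fix class vs. random class) at one sample point."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

THRESHOLD = 4.5  # |t| beyond this at any sample point flags leakage
```

In practice the statistic is evaluated at every time sample, and for higher-order tests on preprocessed (centered and multiplied) traces as in [27].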
Fig. 5. Power traces: (a) first-order implementation, PRNG inactive; (b) second-order implementation, PRNG inactive; (c) first-order implementation, PRNG active; (d) second-order implementation, PRNG active.
We first evaluate the first-order secure masked AES. Figures 5a and 5c show example power traces when the PRNG is inactive and active, respectively, for the first-order implementation. It is clear that the interleaved PRNG does not overlap with AES. We now apply the leakage detection test.
PRNG Off. Fig. 6a shows the result of the t-test on the implementation without
randomness. First- (top) and second-order (bottom) results clearly show leaks
that surpass the confidence threshold of ±4.5. Thus, as expected, this setting is
not secure.
PRNG On. When we turn on the random number generator, our design shows no
first-order leakage with up to 100 million traces. The t-test statistic for the first
and second orders are plotted in Fig. 6b. In agreement with the security claim of
the design, the first-order trace does not show leakage. The second-order does.
206 T. De Cnudde et al.
Fig. 6. First- (top) and second-order (bottom) leakage detection test results for the
first-order implementation
Fig. 7. First- (top), second- (middle) and third-order (bottom) leakage detection test
results for the second-order implementation
This is expected since the design does not provide second-order security (note
that sensitive variables are split among two shares).
PRNG On. When the masking is turned on by enabling the PRNG, we expect
our design to show no leakage in the first and second order.
The results of the t-test with 100 million traces are shown in Fig. 7b. As
expected, we only observe leakage in the third order. The t-values of the first-
and second-order tests never exceed the confidence threshold of ±4.5.
PRNG Off. The lower left corner of Fig. 8 shows the absolute t-values for the bivariate analysis of the unmasked implementation. As expected, leakages of considerable magnitude (t-values exceeding 100) are present, and we conclude that the measurement setup for the bivariate analysis is sound.
PRNG On. When the PRNG is switched on, the outcome of the test is different.
The absolute value of the resulting bivariate leakage when the masks are switched
on with 100 million traces is depicted in the upper right corner of Fig. 8. No
excursions of the t-values beyond ±4.5 occur and thus the test is passed.
One might ask if 100 million traces are enough. To gain some confidence that this (arbitrary) number is indeed enough, we refer back to the third-order tests of Fig. 7b. We can see that third-order leakage is detectable, and thus we can assert that bivariate second-order attacks are not the cheapest strategy for an adversary. Therefore, the masking is deemed effective.
5 Implementation Cost
Table 1 lists the area costs of the individual components of our designs. Table 2
gives the full implementation costs of our designs and of related TIs. The area
estimations are obtained with Synopsys 2010.03 and the NanGate 45 nm Open
Cell Library [1].
Area. Both the first- and the second-order masked AES cores are the smallest available to date. Moving from first-order to second-order security requires an increase of 50 % in GE for linear functions and an increase of around 100 % for nonlinear functions. The larger increase for nonlinear functions stems from the quadratic growth of the number of output shares as a function of the number of input shares, resulting in more registers per stage.
                           Area [GEs]
                           Compile   Compile ultra
  First-order TI
    S-box                    1977      1872
    AES key & State array    4472      4238
    AES control               232       230
    Total AES                6681      6340
  Second-order TI
    S-box                    3796      3662
    AES key & State array    6287      6258
    AES control               366       356
    Total AES               10449     10276
Speed. The number of clock cycles for an AES encryption is equal for our first- and second-order implementations. All previous first-order TIs have a faster encryption because they have fewer pipeline stages in the S-box.
Table 3. Implementation cost per pipeline stage as a function of the order d > 1
6 Conclusion
In this paper, two new hardware implementations of AES secure against differential power analysis attacks were described. Both implementations use the theoretical minimum number of shares in the linear and nonlinear operations by following the conditions from Reparaz et al. [26]. The security of both designs was validated by leakage detection tests in lab conditions.
In summary, our first-order implementation of AES requires 6340 GE, 54 bits of randomness per S-box and a total of 276 clock cycles. In comparison to the previously smallest TI of AES by Bilgin et al. [6], an area reduction of 15% is obtained. The number of clock cycles for an encryption is increased by 11% and the required randomness by 68%. The presented second-order implementation of AES requires 10276 GE, 162 bits of randomness per S-box and 276 clock cycles. Compared to the second-order TI-AES of [14], we obtain a 53% reduction in area at the cost of a 28% increase in required randomness. The number of clock cycles for an encryption stays the same.
While these implementations are the smallest published for AES to date, the required randomness is substantially increased. Investigating ways of reducing the randomness is essential for lightweight applications. In future work, paths leading to minimizing this cost will be researched. A second direction for future work is to compare the security, in terms of the number of traces required for a successful key recovery, between our implementations and the AES in [6]. This can lead to better insights into the trade-off between security and implementation costs for TIs with sin = d + 1 and sin = td + 1 shares.
Acknowledgments. The authors would like to thank the anonymous reviewers for
providing constructive and valuable comments. This work was supported in part by NIST with the research grant 60NANB15D346, in part by the Research Council KU Leuven (OT/13/071 and GOA/11/007) and in part by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 644052 HECTOR. Begül Bilgin is a Postdoctoral Fellow of the Fund for Scientific Research - Flanders (FWO). Oscar Reparaz is funded by a PhD fellowship of the Fund for Scientific Research - Flanders (FWO). Thomas De Cnudde is funded by a research grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen).
References
1. NanGate Open Cell Library. http://www.nangate.com/
2. Research Center for Information Security, National Institute of Advanced Industrial Science and Technology, Side-channel Attack Standard Evaluation Board SASEBO-G Specification. http://satoh.cs.uec.ac.jp/SASEBO/en/board/sasebo-g.html
3. Agrawal, D., Archambeault, B., Rao, J.R., Rohatgi, P.: The EM side-channel(s).
In: Kaliski, B.S., Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp.
29–45. Springer, Heidelberg (2002). http://dx.doi.org/10.1007/3-540-36400-5 4
4. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: A more effi-
cient AES threshold implementation. In: Pointcheval, D., Vergnaud, D. (eds.)
AFRICACRYPT 2014. LNCS, vol. 8469, pp. 267–284. Springer, Heidelberg (2014).
http://dx.doi.org/10.1007/978-3-319-06734-6 17
5. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Higher-
order threshold implementations. In: Sarkar, P., Iwata, T. (eds.) ASI-
ACRYPT 2014. LNCS, vol. 8874, pp. 326–343. Springer, Heidelberg (2014).
http://dx.doi.org/10.1007/978-3-662-45608-8 18
6. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Trade-offs for threshold
implementations illustrated on AES. IEEE Trans. CAD Integr. Circ. Syst. 34(7),
1188–1200 (2015). http://dx.doi.org/10.1109/TCAD.2015.2419623
7. Bilgin, B., Daemen, J., Nikov, V., Nikova, S., Rijmen, V., Van Assche, G.: Effi-
cient and first-order DPA resistant implementations of Keccak. In: Francillon,
A., Rohatgi, P. (eds.) CARDIS 2013. LNCS, vol. 8419, pp. 187–199. Springer,
Heidelberg (2014). http://dx.doi.org/10.1007/978-3-319-08302-5 13
8. Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B., Knezevic, M., Knudsen, L.R.,
Leander, G., Nikov, V., Paar, C., Rechberger, C., Rombouts, P., Thomsen, S.S.,
Yalçin, T.: PRINCE - a low-latency block cipher for pervasive computing appli-
cations (full version). IACR Cryptology ePrint Archive 2012/529 (2012). http://
eprint.iacr.org/2012/529
9. Canright, D.: A very compact S-box for AES. In: Rao, J.R., Sunar, B.
(eds.) CHES 2005. LNCS, vol. 3659, pp. 441–455. Springer, Heidelberg (2005).
http://dx.doi.org/10.1007/11545262 32
10. Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches
to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999.
LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999). http://dx.doi.org/
10.1007/3-540-48405-1 26
11. Cooper, J., DeMulder, E., Goodwill, G., Jaffe, J., Kenworthy, G., Rohatgi, P.:
Test vector leakage assessment (TVLA) methodology in practice. In: International
Cryptographic Module Conference (2013). http://icmc-2013.org/wp/wp-content/
uploads/2013/09/goodwillkenworthtestvector.pdf
12. Coron, J.-S., Kocher, P.C., Naccache, D.: Statistics and secret leakage. In: Frankel,
Y. (ed.) FC 2000. LNCS, vol. 1962, pp. 157–173. Springer, Heidelberg (2001).
http://dx.doi.org/10.1007/3-540-45472-1 12
13. Coron, J., Naccache, D., Kocher, P.C.: Statistics and secret leak-
age. ACM Trans. Embed. Comput. Syst. 3(3), 492–508 (2004).
http://doi.acm.org/10.1145/1015047.1015050
14. De Cnudde, T., Bilgin, B., Reparaz, O., Nikov, V., Nikova, S.: Higher-order thresh-
old implementation of the AES S-box. In: Homma, N., et al. (eds.) CARDIS
2015. LNCS, vol. 9514, pp. 259–272. Springer, Heidelberg (2016). doi:10.1007/
978-3-319-31271-2 16
15. Goodwill, G., Jun, B., Jaffe, J., Rohatgi, P.: A testing methodology for side-
channel resistance validation. In: NIST Non-Invasive Attack Testing Workshop
(2011). http://csrc.nist.gov/news events/non-invasive-attack-testing-workshop/
papers/08 Goodwill.pdf
16. Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware against prob-
ing attacks. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp. 463–481.
Springer, Heidelberg (2003). http://dx.doi.org/10.1007/978-3-540-45146-4 27
17. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp.
104–113. Springer, Heidelberg (1996). http://dx.doi.org/10.1007/3-540-68697-5 9
18. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M.
(ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999).
http://dx.doi.org/10.1007/3-540-48405-1 25
19. Mangard, S., Pramstaller, N., Oswald, E.: Successfully attacking masked AES hard-
ware implementations. In: Rao, J.R., Sunar, B. (eds.) CHES 2005. LNCS, vol. 3659,
pp. 157–171. Springer, Heidelberg (2005). http://dx.doi.org/10.1007/11545262 12
20. Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: a
very compact and a threshold implementation of AES. In: Paterson, K.G. (ed.)
EUROCRYPT 2011. LNCS, vol. 6632, pp. 69–88. Springer, Heidelberg (2011).
http://dx.doi.org/10.1007/978-3-642-20465-4 6
21. Nikova, S., Rijmen, V., Schläffer, M.: Secure hardware implementation of non-
linear functions in the presence of glitches. J. Cryptol. 24(2), 292–321 (2011).
http://dx.doi.org/10.1007/s00145-010-9085-7
22. Poschmann, A., Moradi, A., Khoo, K., Lim, C., Wang, H., Ling, S.: Side-channel
resistant crypto for less than 2, 300 GE. J. Cryptol. 24(2), 322–345 (2011).
http://dx.doi.org/10.1007/s00145-010-9086-6
23. Prouff, E., Rivain, M.: Masking against side-channel attacks: a for-
mal security proof. In: Johansson, T., Nguyen, P.Q. (eds.) EURO-
CRYPT 2013. LNCS, vol. 7881, pp. 142–159. Springer, Heidelberg (2013).
http://dx.doi.org/10.1007/978-3-642-38348-9 9
24. Prouff, E., Roche, T.: Higher-order glitches free implementation of the AES
using secure multi-party computation protocols. In: Preneel, B., Takagi, T.
(eds.) CHES 2011. LNCS, vol. 6917, pp. 63–78. Springer, Heidelberg (2011).
http://dx.doi.org/10.1007/978-3-642-23951-9 5
25. Reparaz, O.: Detecting flawed masking schemes with leakage detection tests. In:
Peyrin, T. (ed.) FSE 2016. LNCS, vol. 9813, pp. xx–yy. Springer, Heidelberg (2016)
26. Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., Verbauwhede, I.: Consolidat-
ing masking schemes. In: Gennaro, R., Robshaw, M. (eds.) CRYPTO 2015.
LNCS, vol. 9215, pp. 764–783. Springer, Heidelberg (2015). http://dx.doi.org/
10.1007/978-3-662-47989-6 37
27. Schneider, T., Moradi, A.: Leakage assessment methodology. In: Güneysu, T.,
Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 495–513. Springer,
Heidelberg (2015). http://dx.doi.org/10.1007/978-3-662-48324-4 25
28. Shannon, C.: The synthesis of two-terminal switching circuits. Bell Syst. Tech. J.
28(1), 59–98 (1949)
New Directions
Differential Computation Analysis:
Hiding Your White-Box Designs is Not Enough
1 Introduction
The widespread use of mobile “smart” devices enables users to access a large vari-
ety of ubiquitous services. This makes such platforms a valuable target (cf. [48]
for a survey on security for mobile devices). There are a number of techniques
to protect the cryptographic keys residing on these mobile platforms. The solu-
tions range from unprotected software implementations on the lower range of the
security spectrum, to tamper-resistant hardware implementations on the other
end. A popular approach which attempts to hide a cryptographic key inside a
software program is known as a white-box implementation.
Ch. Hubain and Ph. Teuwen—This work was performed while the second and fourth
author were an intern and employee in the Innovation Center Crypto & Security at
NXP Semiconductors, respectively.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 215–236, 2016.
DOI: 10.1007/978-3-662-53140-2 11
216 J.W. Bos et al.
The white-box attack model allows the adversary to take full control over the
cryptographic implementation and the execution environment. It is not surpris-
ing that, given such powerful capabilities of the adversary, the authors of the orig-
inal white-box paper [16] conjectured that no long-term defense against attacks
on white-box implementations exists. This conjecture should be understood in
the context of code-obfuscation, since hiding the cryptographic key inside an
implementation is a form of code obfuscation. It is known that obfuscation of any program is impossible [3]; however, it is unknown whether this result applies to a specific subset of white-box functionalities. Moreover, this should be understood in the light of recent developments where techniques using multilinear maps are used for obfuscation and may provide meaningful security guarantees (cf. [2,10,22]). To defend oneself in this security model in the medium to long run, one has to use the advantages of a software-only solution. The idea is to use the concept of software aging [27]: this forces, at a regular interval, updates to the white-box implementation. The hope is that, when this interval is small enough, the adversary has insufficient computational time to extract the secret key from the white-box implementation. This approach only makes sense if the sensitive data is of short-term interest, e.g. the DRM-protected broadcast of a football match. However, the practical challenges of enforcing these updates on devices with irregular internet access should be noted.
DCA: Hiding Your White-Box Designs is Not Enough 219
External Encodings. Besides their primary goal of hiding the key, white-box implementations can also be used to provide additional functionality, such as putting a fingerprint on a cryptographic key to enable traitor tracing or hardening software against tampering [42]. There are, however, other security concerns
besides the extraction of the cryptographic secret key from the white-box imple-
mentation. If one is able to extract (or copy) the entire white-box implementation
to another device then one has copied the functionality of this white-box imple-
mentation as well, since the secret key is embedded in this program. Such an
attack is known as code lifting. A possible solution to this problem is to use external encodings [16]. When one assumes that the cryptographic functionality Ek is part of a larger ecosystem, one could implement E′k = G ◦ Ek ◦ F⁻¹ instead. The input (F) and output (G) encodings are randomly chosen bijections such that the extraction of E′k does not allow the adversary to compute Ek directly. The ecosystem which makes use of E′k must ensure that the input and output encodings are canceled. In practice, depending on the application, input or output encodings need to be performed locally by the program calling E′k. E.g., in DRM applications, the server may take care of the input encoding remotely, but the client needs to revert the output encoding to finalize the content decryption.
In this paper, we can mount successful attacks on implementations which apply at most a single remotely handled external encoding. When both the input is received with an external encoding applied to it remotely and the output is computed with another encoding applied to it (which is removed remotely), then the implementation is not a white-box implementation of a standard algorithm (like AES or DES) but of a modified algorithm (like G ◦ AES ◦ F⁻¹ or G ◦ DES ◦ F⁻¹).
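The way an ecosystem cancels external encodings can be illustrated with a toy sketch; the 8-bit "cipher" E_k, the seeds and the permutation sizes below are hypothetical stand-ins chosen purely for illustration:

```python
import random

def random_bijection(n, seed):
    """A random byte permutation and its inverse (toy encoding F or G)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    inv = [0] * n
    for i, p in enumerate(perm):
        inv[p] = i
    return perm, inv

KEY = 0x5A          # hypothetical secret key
def E_k(x):         # toy 8-bit "cipher" standing in for the real Ek
    return x ^ KEY

F, F_inv = random_bijection(256, seed=1)   # input encoding
G, G_inv = random_bijection(256, seed=2)   # output encoding

def E_k_prime(x_encoded):
    """The white-box exposes E'_k = G o E_k o F^-1; lifting this code
    alone does not let an attacker evaluate E_k without F and G."""
    return G[E_k(F_inv[x_encoded])]

def ecosystem_call(plaintext):
    """The surrounding ecosystem applies F and removes G, so the
    encodings cancel and the plain functionality E_k is recovered."""
    return G_inv[E_k_prime(F[plaintext])]
```

The composition shows why code lifting of E′k alone yields an encoded, non-standard function rather than the standard cipher.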
in [35]. However, in 2007, two differential cryptanalytic attacks [6] were presented which can extract the secret key from this type of white-box [23,59]. The latter approach has a time complexity of only 2¹⁴.
The same technique can be applied on other traces which contain other types
of side-channel information such as, for instance, the electromagnetic radiations
of the device. Although we focus on DPA in this paper, it should be noted
that there exist more advanced and powerful attacks. This includes, among
others, higher order attacks [41], correlation power analyses [11] and template
attacks [15].
The following steps outline the process how to obtain software traces and
mount a DPA attack on these software traces.
First Step. Trace a single execution of the white-box binary with an arbitrary plaintext and record all accessed addresses and data over time. Although the tracer is able to follow execution everywhere, including external and system libraries, we reduce the scope to the main executable, or to a companion library if the cryptographic operations happen to be handled there. A common computer security technique, often deployed by default on modern operating systems, is Address Space Layout Randomization (ASLR), which randomly arranges the address space positions of the executable, its data, its heap, its stack and other elements such as libraries. In order to make acquisitions completely reproducible, we simply disable ASLR, as the white-box model puts us in control of the execution environment. In case ASLR cannot be disabled, realigning the obtained traces would be a mere annoyance.
Second Step. Next, we visualize the trace to understand where the block cipher
is being used and, by counting the number of repetitive patterns, determine
which (standardized) cryptographic primitive is implemented: e.g., a 10-round
AES-128, a 14-round AES-256, or a 16-round DES. To visualize a trace, we
decided to represent it graphically similarly to the approach presented in [43].
Figure 1 illustrates this approach: the virtual address space is represented on
the x-axis, where typically, on many modern platforms, one encounters the text
segment (containing the instructions), the data segment, the uninitialized data
(BSS) segment, the heap, and finally the stack, respectively. The virtual address
space is extremely sparse so we display only bands of memory where there is
something to show. The y-axis is a temporal axis going from top to bottom. Black represents addresses of instructions being executed, green represents addresses of memory locations being read, and red those being written. From Fig. 1 one deduces that the code (in black) has been unrolled into one huge basic block, a lot of memory is read from different tables (in green), and the stack is comparatively so small that the read and write accesses (in green and red) are barely noticeable on the far right without zooming in.
Third Step. Once we have determined which algorithm we target, we keep ASLR disabled and record multiple traces with random plaintexts, optionally using some criteria, e.g., the instruction address range in which to record activity. This is especially useful for large binaries performing other types of operations we are not interested in (e.g., when the white-box implementation is embedded in a larger framework). If the white-box operations themselves take a lot of time, we can limit the scope of the acquisition to recording the activity around just the first or last round, depending on whether we mount an attack from the input or the output of the cipher. Focusing on the first or last round is typical in DPA-like attacks since it limits the portion of the key being attacked to one single byte at a time,
as explained in Sect. 3. In the example given in Fig. 1, the read access patterns make it trivial to identify the DES rounds, and looking at the corresponding instructions (in black) helps define a suitable instruction address range. While we record all memory-related information in the initial trace (first step), we only record a single type of information (optionally for a limited address range) in this step. Typical examples include recordings of bytes being read from memory, bytes written to the stack, or the least significant byte of memory addresses being accessed.
This generic approach gives us the best trade-off to mount the attack as fast
as possible and minimize the storage of the software traces. If storage is not a
concern, one can directly jump to the third step and record traces of the full
execution, which is perfectly acceptable for executables without much overhead,
as it will become apparent in several examples in Sect. 5. This naive approach
can even lead to the creation of a fully automated acquisition and key recovery
setup.
Fig. 2. (a) A sampled hardware power trace as used for DPA; (b) a serialized software execution trace as used for DCA.
Therefore, in such scenarios, the most elementary leakage model is the Hamming weight of the bytes being transferred between CPU and memory. However, in our software setup, we know the exact 8-bit value and, to exploit it fully, we attack each bit individually rather than their sum (as in the Hamming weight model). Hence, the serialization step we perform (converting the observed values into vectors of ones and zeros) is as if, in the hardware model, each corresponding bus line leaked individually, one after the other.
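Concretely, this serialization can be sketched as follows (the MSB-first ordering is an arbitrary choice made for illustration):

```python
def serialize_bits(byte_trace, msb_first=True):
    """Turn a software trace of byte values into a trace of individual
    bits, as if each bus line leaked one after the other."""
    bits = []
    for b in byte_trace:
        order = range(7, -1, -1) if msb_first else range(8)
        bits.extend((b >> i) & 1 for i in order)
    return bits
```

Each byte of the software trace thus contributes eight noise-free samples, one per hypothetical bus line.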
When performing a DPA attack, a power trace typically consists of sampled
analog measures. In our software setting we are working with perfect leakages
(i.e., no measurement noise) of the individual bits that can take only two possible
values: 0 or 1. Hence, our software tracing can be seen from a hardware perspective as if we were probing each individual line with a needle, something that would require heavy sample preparation such as chip decapping and Focused Ion Beam (FIB) milling and patching operations to dig through the metal layers in order to reach the bus lines without affecting the chip functionality. This is much more powerful and invasive than external side-channel acquisition.
When using software traces there is another important difference with tradi-
tional power traces along the time axis. In a physical side-channel trace, analog
values are sampled at a fixed rate, often unrelated to the internal clock of the
device under attack, and the time axis represents time linearly. With software execution traces we record information only when it is relevant, e.g. every time a byte is written on the stack if that is the property we are recording; moreover, bits are serialized as if they were written sequentially. One may observe that, given this serialization and sampling on demand, our time axis does not represent an actual time scale. However, a DPA attack does not require a proper
time axis. It only requires that when two traces are compared, corresponding
events that occurred at the same point in the program execution are compared
against each other. Figure 2a and b illustrate those differences between traces
obtained for usage with DPA and DCA, respectively.
Fifth Step. Once the software execution traces have been acquired and shaped, we can use regular DPA tools to extract the key. We show in the next section what the outcome of DPA tools looks like, besides the recovery of the key.
Optional Step. If required, one can identify the exact points in the execution
where useful information leaks. With the help of known-key correlation analysis
one can locate the exact “faulty” instruction and the corresponding source code
line, if available. This can be useful as support for the white-box designer.
To conclude this section, here is a summary of the prerequisites of our differential computation analysis, in contrast to the prerequisites of previous white-box attacks detailed in Sect. 2.2: (1) the ability to run the binary several times (a few dozen to a few thousand) in a controlled environment, and (2) knowledge of the plaintexts (before their encoding, if any), or of the ciphertexts (after their decoding, if any).
Fig. 3. (a) Visualization of a software execution trace of the binary Wyseur white-box
challenge showing the entire accessed address range. (b) A zoom on the stack address
space from the software trace shown in (a). The 16 rounds of the DES algorithm are
clearly visible. (Color figure online)
the stack (on the far right) then the 16 rounds of DES can be clearly distin-
guished. This zoomed view is outlined in Fig. 3b where the y-axis is unaltered
(from Fig. 3a) but the address-range (the x-axis) is rescaled to show only the
read and write accesses to the stack.
Due to the loops in the program flow, we cannot just limit the tracer to a
specific memory range of instructions and target a specific round. As a trace
over the full execution takes a fraction of a second, we traced the entire program
without applying any filter. The traces are easily exploited with DCA: e.g., if
we trace the bytes written to the stack over the full execution and we compute
a DPA over this entire trace without trying to limit the scope to the first round,
the key is completely recovered with as few as 65 traces when using the output
of the first round as intermediate value.
The execution of the entire attack, from the download of the binary challenge
to full key recovery, including obtaining and analyzing the traces, took less than
an hour as its simple textual interface makes it very easy to hook it to an attack
framework. Extracting keys from different white-box implementations based on
this design now only takes a matter of seconds when automating the entire
process as outlined in Sect. 4.
As part of the Hack.lu 2009 conference, which aims to bridge ethics and secu-
rity in computer science, Jean-Baptiste Bédrune released a challenge [4] which
consisted of a crackme.exe file: an executable for the Microsoft Windows plat-
form. When launched, it opens a GUI prompting for an input, redirects it to
a white-box and compares the output with an internal reference. It was solved
independently by Eloi Vanderbéken [58], who reverted the functionality of the
white-box implementation from encryption to decryption, and by “SysK” [54]
who managed to extract the secret key from the implementation.
Our plugins for the DBI tools have not been ported to the Windows operating
system and currently only run on GNU/Linux and Android. In order to use our
tools directly we decided to trace the binary with our Valgrind variant and
Wine [1], an open source compatibility layer to run Windows applications under
GNU/Linux. Due to the configuration of this challenge we had full control on
the input to the white-box.
Visualizing the traces using our software framework clearly shows ten repetitive patterns on the left interleaved with nine others on the right. This indicates (with high probability) an AES encryption or decryption with a 128-bit key; the last round is shorter, as it omits the MixColumns operation as per the AES specification. We captured a few dozen traces of the entire execution, without trying to limit ourselves to the first round. Due to the overhead caused by running the GUI inside Wine, the acquisition ran slower than usual: obtaining a
single trace took three seconds. Again, we applied our DCA technique on traces
which recorded bytes written to the stack. The secret key could be completely
recovered with only 16 traces when using the output of the first round SubBytes
as intermediate value of an AES-128 encryption. As “SysK” pointed out in [54],
this challenge was designed to be solvable in a couple of days and consequently
did not implement any internal encoding, which means that the intermediate
states can be observed directly. Therefore in our DCA the correlation between
the internal states and the traced values get the highest possible value, which
explains the low number of traces required to mount a successful attack.
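As an illustration, the difference-of-means DCA can be sketched in a few lines of Python. The trace model below is a toy of ours, not the actual tooling: a simulated "software trace" contains the bits written to the stack in round one (the plaintext byte and the unencoded SubBytes output), and the hypothetical key byte 0x42 is recovered because, without internal encodings, the correct guess splits the traces perfectly.

```python
# Minimal difference-of-means DCA sketch. The key byte 0x42 and the trace
# layout are illustrative; only the AES S-box is the real one.

AES_SBOX = [
    0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5, 0x30, 0x01, 0x67, 0x2B, 0xFE, 0xD7, 0xAB, 0x76,
    0xCA, 0x82, 0xC9, 0x7D, 0xFA, 0x59, 0x47, 0xF0, 0xAD, 0xD4, 0xA2, 0xAF, 0x9C, 0xA4, 0x72, 0xC0,
    0xB7, 0xFD, 0x93, 0x26, 0x36, 0x3F, 0xF7, 0xCC, 0x34, 0xA5, 0xE5, 0xF1, 0x71, 0xD8, 0x31, 0x15,
    0x04, 0xC7, 0x23, 0xC3, 0x18, 0x96, 0x05, 0x9A, 0x07, 0x12, 0x80, 0xE2, 0xEB, 0x27, 0xB2, 0x75,
    0x09, 0x83, 0x2C, 0x1A, 0x1B, 0x6E, 0x5A, 0xA0, 0x52, 0x3B, 0xD6, 0xB3, 0x29, 0xE3, 0x2F, 0x84,
    0x53, 0xD1, 0x00, 0xED, 0x20, 0xFC, 0xB1, 0x5B, 0x6A, 0xCB, 0xBE, 0x39, 0x4A, 0x4C, 0x58, 0xCF,
    0xD0, 0xEF, 0xAA, 0xFB, 0x43, 0x4D, 0x33, 0x85, 0x45, 0xF9, 0x02, 0x7F, 0x50, 0x3C, 0x9F, 0xA8,
    0x51, 0xA3, 0x40, 0x8F, 0x92, 0x9D, 0x38, 0xF5, 0xBC, 0xB6, 0xDA, 0x21, 0x10, 0xFF, 0xF3, 0xD2,
    0xCD, 0x0C, 0x13, 0xEC, 0x5F, 0x97, 0x44, 0x17, 0xC4, 0xA7, 0x7E, 0x3D, 0x64, 0x5D, 0x19, 0x73,
    0x60, 0x81, 0x4F, 0xDC, 0x22, 0x2A, 0x90, 0x88, 0x46, 0xEE, 0xB8, 0x14, 0xDE, 0x5E, 0x0B, 0xDB,
    0xE0, 0x32, 0x3A, 0x0A, 0x49, 0x06, 0x24, 0x5C, 0xC2, 0xD3, 0xAC, 0x62, 0x91, 0x95, 0xE4, 0x79,
    0xE7, 0xC8, 0x37, 0x6D, 0x8D, 0xD5, 0x4E, 0xA9, 0x6C, 0x56, 0xF4, 0xEA, 0x65, 0x7A, 0xAE, 0x08,
    0xBA, 0x78, 0x25, 0x2E, 0x1C, 0xA6, 0xB4, 0xC6, 0xE8, 0xDD, 0x74, 0x1F, 0x4B, 0xBD, 0x8B, 0x8A,
    0x70, 0x3E, 0xB5, 0x66, 0x48, 0x03, 0xF6, 0x0E, 0x61, 0x35, 0x57, 0xB9, 0x86, 0xC1, 0x1D, 0x9E,
    0xE1, 0xF8, 0x98, 0x11, 0x69, 0xD9, 0x8E, 0x94, 0x9B, 0x1E, 0x87, 0xE9, 0xCE, 0x55, 0x28, 0xDF,
    0x8C, 0xA1, 0x89, 0x0D, 0xBF, 0xE6, 0x42, 0x68, 0x41, 0x99, 0x2D, 0x0F, 0xB0, 0x54, 0xBB, 0x16,
]

KEY = 0x42  # hypothetical secret key byte to be recovered

def software_trace(pt):
    """One simulated execution trace: bits written to the stack in round one
    (the plaintext byte followed by the unencoded SubBytes output)."""
    v = AES_SBOX[pt ^ KEY]
    return [(pt >> i) & 1 for i in range(8)] + [(v >> i) & 1 for i in range(8)]

def dca_scores(traces, target_bit=0):
    """Score every key guess by the largest difference of means over all
    trace samples, splitting the traces on the predicted SubBytes output bit."""
    n_samples = len(traces[0][1])
    scores = {}
    for guess in range(256):
        sums = [[0.0, 0.0] for _ in range(n_samples)]  # per-sample group sums
        counts = [0, 0]
        for pt, samples in traces:
            sel = (AES_SBOX[pt ^ guess] >> target_bit) & 1
            counts[sel] += 1
            for i, s in enumerate(samples):
                sums[i][sel] += s
        scores[guess] = max(abs(s1 / counts[1] - s0 / counts[0])
                            for s0, s1 in sums)
    return scores

scores = dca_scores([(pt, software_trace(pt)) for pt in range(256)])
recovered = max(scores, key=scores.get)
```

For the correct guess the selection bit equals one of the recorded samples exactly, so its difference of means is 1.0, mirroring the "highest possible value" observed above.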
Fig. 4. Visualization of the stack reads and writes in the software execution trace
portion limited to the core of the Karroumi WB-AES.
The white-box makes use of a separate Bits class to handle its variables, so we added
hooks to record all new instances of that particular class. This was sufficient.
Again, as for the Hack.lu 2009 WB-AES challenge (see Sect. 5.2), 16 traces were
enough to recover the key of this WB-DES when using the output of the first
round as the intermediate value. This approach works with such a low number of
traces because the intermediate states are not encoded.
Table 1. DCA ranking for a Karroumi white-box implementation when targeting the
output of the SubBytes step in the first round based on the least significant address
byte on memory reads.
The best results were obtained when tracing the lowest byte of the memory
addresses used in read accesses (excluding the stack). Initially we followed the same
approach as before: we targeted the output of SubBytes in the first round.
But, in contrast to the other challenges considered in this work, this was not
enough to immediately recover the entire key. For some of the tracked bits of
the intermediate value we observed a significant correlation peak: an
indication that the first key candidate is very probably the correct one. Table 1
shows the ranking of the right key-byte value amongst the guesses after 2000
traces, when sorted according to the difference of means (see Sect. 3). If a key
byte is ranked at position 1, this means it was properly recovered by the attack.
In total, for the first challenge we constructed, 15 out of 16 key bytes were ranked
at position 1 for at least one of the target bits; one key byte (key byte 6 in
the table) did not show any strong candidate. However, recovering this single
missing key byte by brute force is trivial.
It is interesting to observe in Table 1 that when a target bit of a given key byte
does not leak (i.e., is not ranked first), it is very often the worst candidate (ranked
at position 256) rather than at a random position. This observation, which
still holds for larger numbers of traces, can also be used to recover the
key. To give an idea of what can be achieved with an automated attack
against new instantiations of this white-box implementation with other keys,
we provide some figures: the acquisition of 2000 traces takes about 800 s on a
regular laptop (dual-core i7-4600U CPU at 2.10 GHz). This results in 3328 kbit
(416 kB) of traces when limited to the execution of the first round. Running
the attack takes less than 60 s. Attacking the second challenge, with external
encodings, gave similar results. This was expected since, from the adversary's
perspective, there is no difference between applying external encodings and
omitting them: in both cases we have knowledge of the original plaintexts before any
encoding is applied.
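The extremity observation above suggests a simple automated heuristic, sketched below as our own illustration (the synthetic per-bit rankings and the key byte 0x2A are hypothetical): score every candidate by its distance from the middle of each ranking, so that a candidate which is consistently first or last stands out.

```python
# Hypothetical ranking-extremity heuristic based on the observation above:
# the correct key byte tends to rank either first or 256th for each target
# bit, so we reward distance from the middle of the ranking.

def extremity_score(rankings):
    """rankings: one list per target bit, each a permutation of 0..255
    ordered from best to worst candidate. Returns the best overall guess."""
    scores = {g: 0.0 for g in range(256)}
    for ranking in rankings:
        for pos, guess in enumerate(ranking):
            scores[guess] += abs(pos - 127.5)
    return max(scores, key=scores.get)

# Synthetic example: key byte 0x2A ranks first for even target bits and
# last for odd ones, while the other candidates drift through the middle.
others = [g for g in range(256) if g != 0x2A]
rankings = []
for bit in range(8):
    rotated = others[bit * 31:] + others[:bit * 31]
    rankings.append(([0x2A] + rotated) if bit % 2 == 0 else (rotated + [0x2A]))
```

Only the extreme positions 1 and 256 contribute the maximal per-ranking score, so a byte that always sits at one end dominates candidates that merely hover near an end.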
DCA: Hiding Your White-Box Designs is Not Enough 231
In April 2013, a challenge designed by Eloi Vanderbéken was published for the
occasion of the NoSuchCon 2013 conference7. The challenge consisted of a Windows
binary embedding a white-box AES implementation. It was of the “keygen-me”
type, which means one has to provide a name and the corresponding serial to
succeed. Internally, the serial is encrypted by the white-box and compared to the
MD5 hash of the provided name.
The challenge was completed by a number of participants (cf. [38,53]) but
without ever recovering the key. It illustrates one more issue designers of white-
box implementations have to deal with in practice: one can convert an encryption
routine into a decryption routine without actually extracting the key.
For a change, the design is not derived from Chow et al. [16]. However, the white-box
was designed with external encodings which were not part of the binary.
Hence, the user input is considered to be encoded with an unknown scheme, and
the encoded output is directly compared to a reference. These conditions, without
any knowledge of the relationship between the real AES plaintexts or ciphertexts
and the effective inputs and outputs of the white-box, make it infeasible to
apply a meaningful DPA attack, since for such an attack we need to construct
guesses for the intermediate values. Note that, as discussed in Sect. 2, this
white-box implementation is no longer compliant with AES but computes
some variant E'_k = G ∘ E_k ∘ F⁻¹. Nevertheless, we did manage to recover the
key and the encodings from this white-box implementation with a new algebraic
attack, as described in [56]. This was achieved after a painful de-obfuscation of
the binary (almost completely performed by previous write-ups [38,53]), a step
needed to fulfill the prerequisites for such attacks as described in Sect. 2.2.
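The effect of external encodings on the variant E'_k = G ∘ E_k ∘ F⁻¹ can be illustrated with toy stand-ins (the byte-level "cipher" E and the random bijections below are ours, not the challenge's): the white-box output is still a bijection of the encoded input, yet without F and G an attacker cannot relate observable values to E_k's intermediates.

```python
import random

random.seed(2016)

def random_bijection():
    """A secret byte-level encoding: a random permutation and its inverse."""
    perm = list(range(256))
    random.shuffle(perm)
    inv = [0] * 256
    for i, p in enumerate(perm):
        inv[p] = i
    return perm, inv

F, F_inv = random_bijection()   # external input encoding (secret)
G, G_inv = random_bijection()   # external output encoding (secret)
KEY = 0x5A                      # hypothetical key byte

def E(k, x):
    """Toy keyed bijection standing in for AES on one byte (167 is odd, so
    multiplication mod 256 is invertible)."""
    return (((x ^ k) * 167) + 13) % 256

def whitebox(encoded_input):
    """What the binary computes: E'_k = G ∘ E_k ∘ F⁻¹."""
    return G[E(KEY, F_inv[encoded_input])]
```

Only with knowledge of the encodings does the relationship to E_k reappear, via G⁻¹[whitebox(F[x])] = E_k(x) for every x, and that is exactly the information the attacker lacks here.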
The same white-box is found among the CHES 2015 challenges8, in a Game Boy
ROM, and the same algebraic attack was used successfully, as explained in [55],
once the tables had been extracted.
7 See http://www.nosuchcon.org/2013/.
8 https://ches15challenge.com/static/CHES15Challenge.zip, preserved at https://archive.org/details/CHES15Challenge
References
1. Amstadt, B., Johnson, M.K.: Wine. Linux J. 1994(4) (1994). http://dl.acm.org/
citation.cfm?id=324681.324684, ISSN: 1075-3583
2. Barak, B., Garg, S., Kalai, Y.T., Paneth, O., Sahai, A.: Protecting obfuscation
against algebraic attacks. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT 2014.
LNCS, vol. 8441, pp. 221–238. Springer, Heidelberg (2014)
3. Barak, B., Goldreich, O., Impagliazzo, R., Rudich, S., Sahai, A., Vadhan, S.P.,
Yang, K.: On the (im)possibility of obfuscating programs. In: Kilian, J. (ed.)
CRYPTO 2001. LNCS, vol. 2139, pp. 1–18. Springer, Heidelberg (2001)
4. Bédrune, J.-B.: Hack.lu 2009 reverse challenge 1 (2009). http://2009.hack.lu/index.
php/ReverseChallenge
5. Bhatkar, S., DuVarney, D.C., Sekar, R.: Address obfuscation: an efficient approach
to combat a broad range of memory error exploits. In: Proceedings of the 12th
USENIX Security Symposium. USENIX Association (2003)
6. Biham, E., Shamir, A.: Differential cryptanalysis of Snefru, Khafre, REDOC-II,
LOKI and Lucifer. In: Feigenbaum, J. (ed.) CRYPTO 1991. LNCS, vol. 576, pp.
156–171. Springer, Heidelberg (1992)
7. Billet, O., Gilbert, H.: A traceable block cipher. In: Laih, C.-S. (ed.) ASIACRYPT
2003. LNCS, vol. 2894, pp. 331–346. Springer, Heidelberg (2003)
8. Billet, O., Gilbert, H., Ech-Chatbi, C.: Cryptanalysis of a white box AES imple-
mentation. In: Handschuh, H., Hasan, M.A. (eds.) SAC 2004. LNCS, vol. 3357, pp.
227–240. Springer, Heidelberg (2004)
9. Biryukov, A., De Cannière, C., Braeken, A., Preneel, B.: A toolbox for cryptanalysis:
linear and affine equivalence algorithms. In: Biham, E. (ed.) EUROCRYPT 2003.
LNCS, vol. 2656, pp. 33–50. Springer, Heidelberg (2003)
10. Brakerski, Z., Rothblum, G.N.: Virtual black-box obfuscation for all circuits via
generic graded encoding. In: Lindell, Y. (ed.) TCC 2014. LNCS, vol. 8349, pp.
1–25. Springer, Heidelberg (2014)
11. Brier, E., Clavier, C., Olivier, F.: Correlation power analysis with a leakage model.
In: Joye, M., Quisquater, J.-J. (eds.) CHES 2004. LNCS, vol. 3156, pp. 16–29.
Springer, Heidelberg (2004)
12. Bringer, J., Chabanne, H., Dottax, E.: White box cryptography: another attempt.
Cryptology ePrint Archive, Report 2006/468 (2006). http://eprint.iacr.org/2006/
468
13. Certicom: The Certicom ECC challenge. https://www.certicom.com/index.php/
the-certicom-ecc-challenge
14. Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches to coun-
teract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999. LNCS, vol.
1666, pp. 398–412. Springer, Heidelberg (1999)
15. Chari, S., Rao, J.R., Rohatgi, P.: Template attacks. In: Kaliski Jr., B.S., Koç, Ç.K.,
Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 13–28. Springer, Heidelberg
(2003)
16. Chow, S., Eisen, P.A., Johnson, H., van Oorschot, P.C.: White-box cryptography
and an AES implementation. In: Nyberg, K., Heys, H.M. (eds.) SAC 2002. LNCS,
vol. 2595, pp. 250–270. Springer, Heidelberg (2003)
17. Chow, S., Eisen, P., Johnson, H., van Oorschot, P.C.: A white-box DES imple-
mentation for DRM applications. In: Feigenbaum, J. (ed.) DRM 2002. LNCS, vol.
2696, pp. 1–15. Springer, Heidelberg (2003)
18. de Mulder, Y.: White-box cryptography: analysis of white-box AES implementa-
tions. Ph.D. thesis, KU Leuven (2014)
19. Delerablée, C., Lepoint, T., Paillier, P., Rivain, M.: White-box security notions for
symmetric encryption schemes. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC
2013. LNCS, vol. 8282, pp. 247–264. Springer, Heidelberg (2014)
20. EMC Corporation: The RSA factoring challenge. http://www.emc.com/emc-plus/
rsa-labs/historical/the-rsa-factoring-challenge.htm
21. Falcón, F., Riva, N.: Dynamic binary instrumentation frameworks: I know you’re
there spying on me. In: REcon (2012). http://recon.cx/2012/schedule/events/216.
en.html
22. Garg, S., Gentry, C., Halevi, S., Raykova, M., Sahai, A., Waters, B.: Candidate
indistinguishability obfuscation and functional encryption for all circuits. In: 54th
Annual IEEE Symposium on Foundations of Computer Science, FOCS, pp. 40–49.
IEEE Computer Society (2013)
23. Goubin, L., Masereel, J.-M., Quisquater, M.: Cryptanalysis of white box DES
implementations. In: Adams, C., Miri, A., Wiener, M. (eds.) SAC 2007. LNCS,
vol. 4876, pp. 278–295. Springer, Heidelberg (2007)
24. Goubin, L., Patarin, J.: DES and differential power analysis. In: Koç, Ç.K., Paar,
C. (eds.) CHES 1999. LNCS, vol. 1717, pp. 158–172. Springer, Heidelberg (1999)
25. Huang, Y., Ho, F.S., Tsai, H., Kao, H.M.: A control flow obfuscation method
to discourage malicious tampering of software codes. In: Lin, F., Lee, D., Lin,
B.P., Shieh, S., Jajodia, S. (eds.) Proceedings of the 2006 ACM Symposium on
Information, Computer and Communications Security, ASIACCS 2006, p. 362.
ACM (2006)
26. Jacob, M., Boneh, D., Felten, E.W.: Attacking an obfuscated cipher by injecting
faults. In: Feigenbaum, J. (ed.) DRM 2002. LNCS, vol. 2696, pp. 16–31. Springer,
Heidelberg (2003)
27. Jakobsson, M., Reiter, M.K.: Discouraging software piracy using software aging.
In: Sander, T. (ed.) DRM 2001. LNCS, vol. 2320, pp. 1–12. Springer, Heidelberg
(2002)
28. Karroumi, M.: Protecting white-box AES with dual ciphers. In: Rhee, K.-
H., Nyang, D.H. (eds.) ICISC 2010. LNCS, vol. 6829, pp. 278–291. Springer,
Heidelberg (2011)
29. Kirsch, J.: Towards transparent dynamic binary instrumentation using virtual
machine introspection. In: REcon (2015). https://recon.cx/2015/schedule/events/
20.html
30. Klinec, D.: White-box attack resistant cryptography. Master’s thesis, Masaryk Uni-
versity, Brno, Czech Republic (2013). https://is.muni.cz/th/325219/fi_m/
31. Kocher, P., Jaffe, J., Jun, B., Rohatgi, P.: Introduction to differential power analy-
sis. J. Cryptogr. Eng. 1(1), 5–27 (2011)
32. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.)
CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999)
33. Lepoint, T., Rivain, M., De Mulder, Y., Roelse, P., Preneel, B.: Two attacks on a
white-box AES implementation. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC
2013. LNCS, vol. 8282, pp. 265–286. Springer, Heidelberg (2014)
34. Li, X., Li, K.: Defeating the transparency features of dynamic binary instrumenta-
tion. In: BlackHat US (2014). https://www.blackhat.com/docs/us-14/materials/
us-14-Li-Defeating-The-Transparency-Feature-Of-DBI.pdf
35. Link, H.E., Neumann, W.D.: Clarifying obfuscation: improving the security of
white-box DES. In: International Symposium on Information Technology: Cod-
ing and Computing (ITCC 2005), pp. 679–684. IEEE Computer Society (2005)
36. Linn, C., Debray, S.K.: Obfuscation of executable code to improve resistance to
static disassembly. In: Jajodia, S., Atluri, V., Jaeger, T. (eds.) Proceedings of the
10th ACM Conference on Computer and Communications Security, CCS 2003, pp.
290–299. ACM (2003)
37. Luk, C., Cohn, R.S., Muth, R., Patil, H., Klauser, A., Lowney, P.G., Wallace, S.,
Reddi, V.J., Hazelwood, K.M.: Pin: building customized program analysis tools
with dynamic instrumentation. In: Sarkar, V., Hall, M.W. (eds.) Proceedings of the
ACM SIGPLAN 2005 Conference on Programming Language Design and Imple-
mentation, pp. 190–200. ACM (2005)
38. Maillet, A.: Nosuchcon 2013 challenge - write up and methodology (2013). http://
kutioo.blogspot.be/2013/05/nosuchcon-2013-challenge-write-up-and.html
39. Mangard, S., Oswald, E., Standaert, F.: One for all - all for one: unifying standard
differential power analysis attacks. IET Inf. Secur. 5(2), 100–110 (2011)
40. Marceau, F., Perigaud, F., Tillequin, A.: Challenge SSTIC 2012 (2012). http://
communaute.sstic.org/ChallengeSSTIC2012
41. Messerges, T.S.: Using second-order power analysis to attack DPA resistant soft-
ware. In: Paar, C., Koç, Ç.K. (eds.) CHES 2000. LNCS, vol. 1965, pp. 238–251.
Springer, Heidelberg (2000)
42. Michiels, W.: Opportunities in white-box cryptography. IEEE Secur. Priv. 8(1),
64–67 (2010)
43. Mougey, C., Gabriel, F.: Désobfuscation de DRM par attaques auxiliaires. In:
Symposium sur la sécurité des technologies de l’information et des communica-
tions (2014). http://www.sstic.org/2014/presentation/dsobfuscation_de_drm_par_attaques_auxiliaires
44. De Mulder, Y., Roelse, P., Preneel, B.: Cryptanalysis of the Xiao–Lai white-box
AES implementation. In: Knudsen, L.R., Wu, H. (eds.) SAC 2012. LNCS, vol.
7707, pp. 34–49. Springer, Heidelberg (2013)
45. De Mulder, Y., Wyseur, B., Preneel, B.: Cryptanalysis of a perturbated white-box
AES implementation. In: Gong, G., Gupta, K.C. (eds.) INDOCRYPT 2010. LNCS,
vol. 6498, pp. 292–310. Springer, Heidelberg (2010)
46. Nethercote, N., Seward, J.: Valgrind: a framework for heavyweight dynamic binary
instrumentation. In: Ferrante, J., McKinley, K.S. (eds.) Proceedings of the ACM
SIGPLAN 2007 Conference on Programming Language Design and Implementa-
tion, pp. 89–100. ACM (2007)
47. Nikova, S., Rechberger, C., Rijmen, V.: Threshold implementations against side-
channel attacks and glitches. In: Ning, P., Qing, S., Li, N. (eds.) ICICS 2006. LNCS,
vol. 4307, pp. 529–545. Springer, Heidelberg (2006)
48. Polla, M.L., Martinelli, F., Sgandurra, D.: A survey on security for mobile devices.
IEEE Commun. Surv. Tutor. 15(1), 446–471 (2013)
49. Sanfelix, E., de Haas, J., Mune, C.: Unboxing the white-box: practical attacks
against obfuscated ciphers. In: BlackHat Europe 2015 (2015). https://www.
blackhat.com/eu-15/briefings.html
50. Sasdrich, P., Moradi, A., Güneysu, T.: White-box cryptography in the gray box -
a hardware implementation and its side channels. In: FSE 2016, LNCS. Springer,
Heidelberg (2016, to appear)
51. Schramm, K., Paar, C.: Higher order masking of the AES. In: Pointcheval, D. (ed.)
CT-RSA 2006. LNCS, vol. 3860, pp. 208–225. Springer, Heidelberg (2006)
52. Scrinzi, F.: Behavioral analysis of obfuscated code. Master’s thesis, University
of Twente, Twente, Netherlands (2015). http://essay.utwente.nl/67522/1/Scrinzi_MA_SCS.pdf
53. Souchet, A.: AES whitebox unboxing: No such problem (2013). http://0vercl0k.
tuxfamily.org/bl0g/?p=253
54. SysK: Practical cracking of white-box implementations. Phrack 68: 14. http://
www.phrack.org/issues/68/8.html
55. Teuwen, P.: CHES2015 writeup (2015). http://wiki.yobi.be/wiki/CHES2015_Writeup#Challenge_4
56. Teuwen, P.: NSC writeups (2015). http://wiki.yobi.be/wiki/NSC_Writeups
57. Tolhuizen, L.: Improved cryptanalysis of an AES implementation. In: Proceed-
ings of the 33rd WIC Symposium on Information Theory. Werkgemeenschap voor
Inform.-en Communicatietheorie (2012)
58. Vanderbéken, E.: Hacklu reverse challenge write-up (2009). http://baboon.rce.free.
fr/index.php?post/2009/11/20/HackLu-Reverse-Challenge
59. Wyseur, B., Michiels, W., Gorissen, P., Preneel, B.: Cryptanalysis of white-box
DES implementations with arbitrary external encodings. In: Adams, C., Miri, A.,
Wiener, M. (eds.) SAC 2007. LNCS, vol. 4876, pp. 264–277. Springer, Heidelberg
(2007)
60. Xiao, Y., Lai, X.: A secure implementation of white-box AES. In: 2nd International
Conference on Computer Science and its Applications 2009, CSA 2009, pp. 1–6
(2009)
61. Zhou, Y., Chow, S.: System and method of hiding cryptographic private keys. 15
December 2009. US Patent 7,634,091
Antikernel: A Decentralized Secure
Hardware-Software Operating System
Architecture
Abstract. The “kernel” model has been part of operating system archi-
tecture for decades, but upon closer inspection it clearly violates the
principle of least required privilege. The kernel is a single entity that
provides many services (memory management, interfacing to drivers,
context switching, IPC) that have no real relation to each other, and it has
the ability to observe or tamper with all state of the system. This work
presents Antikernel, a novel operating system architecture consisting of
both hardware and software components and designed to be fundamen-
tally more secure than the state of the art. To make formal verification
easier, and improve parallelism, the Antikernel system is highly modu-
lar and consists of many independent hardware state machines (one or
more of which may be a general-purpose CPU running application or sys-
tems software) connected by a packet-switched network-on-chip (NoC).
We create and verify an FPGA-based prototype of the system.
1 Introduction
The Antikernel architecture is intended to be more, yet less, than simply a “ker-
nel in hardware”. By breaking up functionality and decentralizing as much as
possible we aim to create a platform that allows applications to pick and choose
the OS features they wish to use, thus reducing their attack surface dramati-
cally compared to a conventional OS (and potentially experiencing significant
performance gains, as in an exokernel).1
Antikernel is a decentralized architecture with no system calls; all OS func-
tionality is accessed through message passing directly to the relevant service.
To create a process, the user sends a message to the CPU core he wishes to
run it on. To allocate memory, he sends a message to the RAM controller.
1 This paper is based on author 1’s doctoral dissertation research [1].
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 237–256, 2016.
DOI: 10.1007/978-3-662-53140-2_12
238 A. Zonenberg and B. Yener
Each of these nodes is self-contained and manages its own state internally
(although nodes are free to, and many will, request services from other nodes).
There is no “all-powerful” software; all functionality normally implemented
by a kernel is handled by unprivileged software or hardware. Even the hard-
ware is limited in capability; for example the flash controller has no access to
RAM owned by the CPU. By formally verifying the isolation and interprocess
communication, we can achieve a level of security which exceeds even that of a
conventional separation kernel: even arbitrary code execution on a CPU grants
no privileges beyond those normally available to userspace software. Escalation
to “ring 0” or “kernel mode” is made impossible due to the complete lack of
such privileges; unprivileged userspace runs directly on “bare metal”.
Thus the Antikernel architecture unifies two previously orthogonal fields,
hardware accelerators and operating system (OS) security, to create a new
OS architecture which can enforce OS security policy at a much lower level than
previously possible. In contrast to the classical OS model, our system blurs or
eliminates many of the typical boundaries between software, hardware, kernels,
and drivers. Most uniquely, there is no single piece of software or hardware in our
architecture which corresponds to the kernel in a classical OS. The operating sys-
tem is instead an emergent entity arising out of the collective behavior of a series
of distinct hardware modules connected via message passing, which together pro-
vide all of the services normally provided by a kernel and drivers. Each hardware
device includes state machines which implement low-level resource management
and security for that particular device, and provides an API via message passing
directly to userspace. Applications software may either access this API directly
(as in an exokernel [2]) or through server software providing additional abstrac-
tions (as in a microkernel).
By decentralizing to this extent, and creating natural chokepoints for
dataflow between functional subsystems (as in a separation kernel [3,4]), we
significantly reduce the portion of the system which is potentially compromised
in the event of a vulnerability in any one part, and render API-hooking rootkits
impossible (since there is no syscall table to tamper with). In order to avoid
difficult-to-analyze side channels between multiple modules accessing shared
memory, we require that all communication between modules take place via
message passing (as in a multikernel [5]). This modular structure allows piece-
wise formal verification of the system since the dataflow between all components
is constrained to a single well-defined interface.
Unlike virtualization-based separation platforms (such as Qubes [6]), our
architecture does not require massive processing and memory overhead for each
security domain, and is thus well suited to running many security domains on
an embedded system with limited resources. Our architecture also scales to a
large number of mutually untrusting security domains, unlike platforms such as
ARM TrustZone [7] which provide one-way protection of a single domain.
We have tested the feasibility of the architecture by creating a proof-of-
concept implementation targeting a Xilinx FPGA, and report experimental results
Antikernel: A Decentralized Secure Hardware-Software OS Architecture 239
including formal correctness proofs for several key components. The prototype is
open source [8] to encourage verification of our results and further research.
2 Related Work
There are many examples in the literature of operating system components being
moved into hardware2; however, the majority of these systems are focused on
performance and do not touch on the security implications of their designs at all.
Fundamentally, any hard-wired OS component has an intrinsic local security
benefit over an equivalent software version: it is physically impossible for
software to tamper with it. This brings an unfortunate corollary: it cannot
be patched if a design error, possibly with security implications, is discovered.
Extremely careful testing and validation of both the design and the implementation
is thus required. Furthermore, hardware OSes may not provide any global
benefits to security: if the hardware component does not perform adequate
validation or authentication on commands passed to it from software, compromised
or malicious software can simply coerce the hardware into doing its bidding.
Next we briefly review some of the related work in this domain.
While the TTNoC provides complete and deterministic isolation between hosts
(i.e., no traffic sent by any other host can ever impact the ability of another to
communicate, and thus there are no timing or resource-exhaustion side channels),
it suffers from the lack of burst capabilities and does not scale well to systems
involving a large number of hosts (in a system with N nodes, each one can use
only 1/N of the available bandwidth).
[15] describes a “zero-kernel operating system” or ZKOS. The general guiding
principles of “no all-powerful component”, “hardware-software codesign”, and
“safe design” are very similar to our work, as well as the conclusion that privi-
lege rings are an archaic and far too coarse-grained concept. The main difference
is that their system relies on “streams” (point-to-point one-way communication
links) and “gates” (similar to a syscall vector, allowing one security domain to call
into another) for IPC, and does not support arbitrary point-to-point
communication. Furthermore, while threading and message passing are implemented in
hardware, the ZKOS architecture appears to be primarily software based with
minimal hardware support and does not support hardware processes/drivers.
Finally, BiiN [16] was the result of a joint Intel-Siemens project to develop
a fault-tolerant computer, which could be configured in several fault-tolerant
modes including paired lock-step CPUs. A capability-based security system is
used to control access to particular objects in memory or disk. The system archi-
tecture advocates heavy compartmentalization with each program divided up as
much as possible, and using protected memory between compartments (although
the goal was reliability against hardware faults through means such as error
correcting codes and lock-stepped CPUs, not security against tampering). No
mention of formal verification could be found in any published documentation.
3 The choice of a quadtree was made purely for convenience of prototyping. Other
implementations of the Antikernel architecture could use an octree or a 2D grid, add
direct sibling-to-sibling links to reduce load on the root, or use more esoteric
topologies depending on system requirements.
4 For the remainder of this paper, NoC routing addresses are written in IPv6-style
hexadecimal CIDR notation. For example, the subnet consisting of all possible
addresses is denoted 0000/0, 8002/16 is a single host, etc. The architecture can
be scaled to larger address sizes in the future if needed; however, it is unlikely that
more than 65536 unique IP cores will be present in any SoC in the near future, and
smaller addresses require fewer FPGA resources.
subnets, followed by routers for /12 subnets, and so on. Routers are instantiated
as needed to cover active subnets only; if there are only four nodes in the system
the network will consist of a single top-level router with four children rather than
an eight-level tree. Nodes may also be allocated a subnet larger than a /16 if
they require multiple addresses: perhaps a CPU with support for four hardware
threads, with each thread as its own security domain, would use a /14 sized
subnet so that the remainder of the system can distinguish between the threads.
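The prefix-based forwarding this implies can be sketched as follows. This is our simplification, assuming the quadtree consumes two address bits per level (footnote 3) over 16-bit addresses; the real routers are RTL, not software.

```python
# Sketch of next-hop selection in the quadtree NoC (our simplification):
# a router owning a /router_len subnet forwards packets outside its subnet
# to its parent; otherwise the next two address bits select one of four
# child ports. router_len is even and < 16 (leaves are hosts, not routers).

def next_hop(router_prefix, router_len, dest):
    """Return 'parent' or a child port number 0-3 for a 16-bit dest address."""
    mask = (0xFFFF << (16 - router_len)) & 0xFFFF
    if (dest & mask) != router_prefix:
        return "parent"                 # destination is outside our subnet
    shift = 16 - router_len - 2         # the next two bits pick the child
    return (dest >> shift) & 0x3
```

For example, the root (0000/0) forwards a packet for host 8002/16 toward child 2 (its top address bits are 10), a router for 8000/2 then selects child 0, and so on down the tree.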
“The network” is actually two parallel networks specialized for different pur-
poses, as shown in Fig. 1. The RPC network transports fixed-size datagrams con-
sisting of one header word and three data words, and is optimized for low-latency
control-plane traffic. The DMA network transports variable size datagrams, and
is optimized for high-throughput data-plane traffic. Each node uses the same
address on both networks to ensure consistency, although individual nodes are
free to only use one network and disable their associated port on the other (for
example, node “n8002/16”). Entire routers for one network or the other may
be optimized out by the code generator if they have no children (for example,
there is no RPC router for the subnet 8004/14 as all nodes in that subnet are
DMA-only).
5 Links could potentially scale to 64, 128, or larger multiples of 32 bits if higher
bandwidth is needed; however, our prototype does not implement this.
6 There is no layer-2 header field to distinguish RPC and DMA traffic; since the
networks are physically distinct, the protocol can be trivially determined from context.
Alternate implementations of the architecture could potentially merge both protocols
into a single network with an additional header to specify the protocol.
Function Calls. The second major kind of RPC transaction is a function call:
a request by the master that the slave take some action, followed by a result
from the slave. This result may be either a success/fail return value indicating
that the remote procedure call completed, or a retry request indicating that the
slave is too busy to accept new requests and that the call should be repeated
later.
The call field of a function call packet is set to a slave-dependent value
describing one of 256 functions the master wishes the slave to perform. The
meaning of the data fields is dependent on the slave’s application-layer protocol.
A return packet (including a retry) must have the same call value as the
incoming function call request to allow matching of requests to responses. The
meaning of the data fields is dependent on the slave’s application-layer protocol.
Although not implemented by any current slaves, the RPC call protocol
allows out-of-order (OoO) transaction processing (handling multiple requests in
the most efficient order, rather than that in which they were received).
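A fixed-size RPC datagram and the call-matching rule can be sketched as below. The field layout (16-bit source and destination addresses in the header word, the 8-bit call value in the top byte of the first data word) is our assumption for illustration; the text fixes only the four-word format and the requirement that a return echoes the request's call value.

```python
import struct

# Sketch of the fixed-size RPC datagram: one 32-bit header word plus three
# 32-bit data words. The exact bit layout here is assumed, not specified.

def make_rpc(src, dst, call, d0=0, d1=0, d2=0):
    header = ((src & 0xFFFF) << 16) | (dst & 0xFFFF)
    word1 = ((call & 0xFF) << 24) | (d0 & 0x00FFFFFF)  # call in top byte (assumed)
    return struct.pack(">IIII", header, word1, d1, d2)

def call_of(packet):
    return struct.unpack(">IIII", packet)[1] >> 24

def is_response_to(request, response):
    """A return (or retry) packet must carry the same call value as the
    request, so responses can be matched even when handled out of order."""
    return call_of(request) == call_of(response)
```

The echoed call value is what would let an out-of-order slave pair each response with its pending request.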
Flow Control/Routing. The RPC protocol will function over links with arbitrary
latency (and thus register stages may be added at any point on a long link
to improve timing); however, a round-trip delay of more than one packet time
will reduce throughput, since the transmitter must block until an ACK arrives
from the next-hop router before it can send the next packet. We plan to solve
this issue with credit-based flow control in a future revision.
7 Note that the term “interrupt” was chosen because these messages convey roughly
the same information that IRQs do in classical computer architecture. While the slave
node is free to interrupt its processing and act on the incoming message immediately,
it may also choose to buffer the incoming message and handle it later.
The RPC router is a full crossbar which allows any of the five ports to send
to any other, with multiple packets in flight simultaneously. Each exit queue
maintains a round-robin counter which increments mod 5 each time a packet is
sent. In the event that two ports wish to send out the same port simultaneously,
the port identified by the counter is given the highest priority; otherwise the
lowest-numbered source port wishing to send wins. This ensures baseline quality of
service (each port is guaranteed 20 % of the available bandwidth) while still
permitting bursting (a port can use up to 100 % of available bandwidth if all
others are idle).
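The arbitration rule just described can be captured in a few lines. This is a behavioral model of ours, not the RTL:

```python
# Behavioral model of the exit-queue arbiter described above: the port
# identified by the round-robin counter gets top priority, otherwise the
# lowest-numbered requesting port wins; the counter advances mod 5 each
# time a packet is sent.

class ExitArbiter:
    def __init__(self, ports=5):
        self.ports = ports
        self.counter = 0

    def grant(self, requests):
        """requests: one boolean per source port. Returns the winning port
        (or None); the counter advances only when a packet is sent."""
        if not any(requests):
            return None
        if requests[self.counter]:
            winner = self.counter
        else:
            winner = min(p for p, wants in enumerate(requests) if wants)
        self.counter = (self.counter + 1) % self.ports
        return winner
```

Because the counter visits every port, each port is eventually given top priority (the guaranteed 20 % share), while a lone requester wins every cycle (bursting to 100 %).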
Flow Control and Routing. The current DMA flow control scheme expects
a fixed single-cycle latency between routers with lock-step acknowledgement8.
The DMA router uses the same arbiter and crossbar modules as the RPC router,
although the buffers are somewhat larger.
4 Memory Management
One of the most critical services an operating system must provide is allowing
applications to allocate, free, and manipulate RAM. In the minimalistic environ-
ment of an exokernel there is no need for an OS to provide sub-page allocation
granularity, so we require nodes to allocate full pages of memory and manage
sub-page regions (such as for C’s malloc() function) in a userspace heap. If a
block larger than a page is required, the node must allocate multiple single pages
and map them sequentially to its internal address space.
Antikernel’s memory management enforces a “one page, one owner” model.
Shared memory is intentionally not supported, however data may be transferred
from one node to another in a zero-copy fashion by changing ownership of the
page(s) containing the data to the new user.
The Antikernel memory management API is extremely simple, in keeping
with the exokernel design philosophy. It consists of four RPC calls for manipu-
lating pages (“get free page count”, “allocate page”, “free page”, and “change
ownership of page”) as well as DMA reads and writes. A “write complete” RPC
interrupt is provided to allow nodes to implement memory fencing semantics
before chown()ing a page.
The data structures required to implement this API are extremely simple,
and thus easy to formally verify: a FIFO queue of free pages and an array
mapping page IDs to owner IDs. When the memory subsystem initializes, the
FIFO is filled with the IDs of all pages not used for the internal metadata and
the ownership array records all pages as owned by the memory manager.
Requesting the free page count simply returns the size of the free list FIFO.
Allocating a page fails if the free list is empty. If not, the first page address on
the FIFO is popped and returned to the caller; the ownership records are also
updated to record the caller as the new owner of the page. Freeing a page is
essentially the allocation procedure run in reverse. After checking that the caller
is the owner of the page, it is zeroized to prevent any data leakage between nodes,
then pushed onto the free list and the ownership records updated to record the
memory manager as the new owner of the page. Changing page ownership does
not touch the free list at all; after verifying that the caller is the owner of the
page the ownership records are simply updated with the new owner.
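The allocator just described can be sketched as a toy software model. The sizes and names below are illustrative (the real design is RTL, not C), but the logic mirrors the text: a FIFO free list, an ownership array, zeroization on free, and ownership checks on every call:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_PAGES   8
#define PAGE_WORDS  16
#define MEM_MANAGER 0                     /* owner ID of the memory manager */

static uint32_t pages[NUM_PAGES][PAGE_WORDS];
static uint16_t owner[NUM_PAGES];         /* page ID -> owner node ID */
static int fifo[NUM_PAGES], fifo_head, fifo_count;

static void fifo_push(int id) { fifo[(fifo_head + fifo_count++) % NUM_PAGES] = id; }
static int  fifo_pop(void)
{
    int id = fifo[fifo_head];
    fifo_head = (fifo_head + 1) % NUM_PAGES;
    fifo_count--;
    return id;
}

void mm_init(void)                        /* all pages start owned by the manager */
{
    fifo_head = fifo_count = 0;
    for (int i = 0; i < NUM_PAGES; i++) { owner[i] = MEM_MANAGER; fifo_push(i); }
}

int get_free_page_count(void) { return fifo_count; } /* size of the free list */

int alloc_page(uint16_t caller)
{
    if (fifo_count == 0) return -1;       /* free list empty: allocation fails */
    int id = fifo_pop();
    owner[id] = caller;                   /* record the caller as new owner */
    return id;
}

int free_page(int id, uint16_t caller)
{
    if (owner[id] != caller) return -1;   /* only the owner may free */
    memset(pages[id], 0, sizeof pages[id]); /* zeroize: no inter-node leakage */
    owner[id] = MEM_MANAGER;
    fifo_push(id);
    return 0;
}

int chown_page(int id, uint16_t caller, uint16_t new_owner)
{
    if (owner[id] != caller) return -1;   /* ownership check; free list untouched */
    owner[id] = new_owner;                /* zero-copy transfer of the page */
    return 0;
}
```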
DMA reads and writes perform ownership checks and, if successful, return or
update the contents of the requested range. The current memory controller API
requires that all DMA transactions be aligned to 128-bit (4 word) boundaries,
be a multiple of 4 words in size, and not cross page boundaries.
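A validity check for these three constraints could look as follows; the page size is an assumption (this excerpt does not state it), and the function is our illustration rather than the controller's actual interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WORD_BYTES 4
#define PAGE_BYTES 2048   /* assumed page size; not stated in this excerpt */

/* Checks the stated DMA constraints: start aligned to a 128-bit (4-word)
 * boundary, length a nonzero multiple of 4 words, no page-boundary crossing. */
bool dma_range_ok(uint32_t addr, uint32_t len_words)
{
    if (addr % (4 * WORD_BYTES) != 0)         return false;
    if (len_words == 0 || len_words % 4 != 0) return false;
    uint32_t last = addr + len_words * WORD_BYTES - 1;
    return (addr / PAGE_BYTES) == (last / PAGE_BYTES);
}
```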
8
We plan to extend this in the future in order to support variable latency for long-
range cross-chip links, as was done for RPC.
CPU on its management address the free list is popped, the bitmap is updated
to reflect that this thread ID is allocated, and the thread ID is now available
for use (but is not yet scheduled). A simple hardware state machine loads the
statically linked ELF executable at the provided physical address, initializes the
thread, and requests that the scheduler append it to the run queue.
During execution, the CPU reads the current thread from the linked list and
schedules it for execution if possible, then goes on to the next thread in the
linked list the following cycle. If the thread is already in the pipeline (which may
be true if fewer than 8 threads are currently runnable) then it waits for one cycle
and tries again. If the thread is not in the run queue at all (which may be true if
the thread was just canceled, or if no threads are currently runnable), then the
CPU goes to the next thread and tries again the next cycle.
To delete a thread, it is removed from the linked list and pushed into the
free list, and the bitmaps are updated to reflect its state as free. The linked-list
pointers for the deleted thread are not changed; this ensures that if the CPU
is about to execute the thread being deleted it will correctly read the “next”
pointer and continue to a runnable thread the next clock cycle. (There are no
use-after-free problems possible due to the multi-cycle latency of the allocate
and free routines; by the time the freshly deleted thread can be reallocated the
CPU is guaranteed to have continued to a runnable thread.)
The architecture allows for a thread to very quickly remove itself from the
run queue without terminating (although the thread management API does not
currently provide a means for doing this). This will allow threads blocking on I/O
or an L1 cache miss to be placed in a “sleep” state from which they can quickly
awaken, but which does not waste CPU time.
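The deletion scheme can be modeled on a circular singly linked list. This is an illustrative sketch (array-based, with names of our choosing), showing the key property that the deleted node's "next" pointer stays valid:

```c
#include <assert.h>

#define MAX_THREADS 8

/* Model of the run-queue deletion described above: the thread is unlinked,
 * but its own next pointer is deliberately left intact, so a CPU already
 * holding a pointer to the deleted thread still reaches a runnable thread
 * on the following cycle. */
int next_tid[MAX_THREADS];

void unlink_thread(int tid)
{
    int p = next_tid[tid];
    while (next_tid[p] != tid)        /* walk the circle to the predecessor */
        p = next_tid[p];
    next_tid[p] = next_tid[tid];      /* predecessor now skips tid */
    /* next_tid[tid] is NOT cleared */
}
```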
5.3 L1 Cache
The L1 cache for SARATOGA is split into independent I- and D-side banks,
and is fully parameterizable for levels of associativity, words per line, and lines
per thread. The default configuration is 2-way set associative and 16 lines of 8
32-bit words, for a total size of 1 KB instruction and 1 KB data cache per thread
(plus tag bits) and 32 KB overall. The cache is virtually addressed and there is
no coherency between the I- and D-side caches.
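As a sanity check on the quoted figures, a quick computation reproduces the 1 KB per-thread, per-side size, assuming the "16 lines of 8 32-bit words" are per way (an assumption on our part that makes the stated numbers work out for a 2-way cache):

```c
#include <assert.h>

/* Default SARATOGA L1 geometry quoted in the text, data bits only (no tags). */
enum { WAYS = 2, LINES_PER_WAY = 16, WORDS_PER_LINE = 8, WORD_BYTES = 4 };

int l1_side_bytes_per_thread(void)
{
    return WAYS * LINES_PER_WAY * WORDS_PER_LINE * WORD_BYTES;
}
```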
The current cache is quite small per thread, which is likely to lead to a high
miss rate, but this is somewhat made up for by the ability of multithreading to
hide latency: if all 32 threads are active, a 31-cycle miss latency can be tolerated
with a penalty of only one skipped instruction. We have not yet implemented
performance counters for measuring cache performance; after this is added there
are likely to be numerous optimizations to the cache structure.
5.4 MMU
In order to send an RPC message, the high half of the a0 register is loaded with
the “send” opcode; the low half of a0, as well as a1, a2, and a3, store the RPC
message. This is identical to the standard C calling convention for MIPS, which
makes implementation of the syscall() library function trivial. (The high half
of a0 is used as the opcode since this would normally be the source address of
the packet, but this is added by hardware). A syscall instruction then actually
sends the message.
Receiving an RPC message is essentially the same process in reverse. The
high half of a0 is loaded with the “receive” opcode and a syscall instruction
is executed. When a message is ready, it is written to v1, v0, k0, and k1. This
places the success/fail code and the first half-word of the return data in v0,
typically used for integer results in the MIPS C calling convention.
Since a new application starting up on a SARATOGA core does not nec-
essarily know the management address of its host CPU, we provide a means of
discovering it through the syscall instruction. At any time, an application may
perform a syscall with the high half of a0 set to “get management address”
to set the v0 register to the current CPU’s management address. All other CPU
management operations are accessed via RPCs to the management address.
248 A. Zonenberg and B. Yener
To create a new process, a node sends a “create process” call to the CPU’s OoB
management port, specifying the physical address of the executable to run. The
management system begins by allocating a new thread context, returning failure
if all are currently in use.
If a thread ID was successfully obtained, the ELF loader then issues a DMA
read for sizeof(Elf32_Ehdr) bytes to the supplied physical address, expecting
to find a well-formed ELF executable header. If the header is invalid (wrong
magic numbers, incorrect version, or not a big-endian MIPS executable file) an
error is returned.
If the header is well formed, the loader then looks at the e_entry field to find
the address of the program’s entry point. This is fed into a FIFO of data to be
processed by the signature engine.12 It is important to hash headers, as well as
the contents of all executable pages, in order to ensure that a signed application
cannot be modified to start at a different address within the code, potentially
performing undesired actions.
The loader then checks the e_phoff field to find the address of the program
header table, which stores the addresses of all segments in the program’s memory
image. It loops over the program header table and checks the p_type field for
each entry. If the type is PT_LOAD (meaning the segment is part of the loadable
memory image) then the loader reads the contents of the segment, feeds
them into the hashing engine, and stores the virtual and physical addresses in
a buffer for future mapping. If the type is 0x70000005 (an unused value in the
processor-defined region of the ELF program header type specification) then the
segment is read into a buffer holding the expected signature. After all loadable
segments have been hashed, the signature is compared to the expected value. If
they do not match, an error is returned and the allocated thread context is freed.
If the signature is valid, the list of address mappings is then fed to the
MMU. Note that the ELF loader is the only part of the processor which has
permission to set the PAGE_EXECUTE permission on a memory page; permis-
sions for pages mapped by software through the OoB interface are ANDed with
PAGE_READ_WRITE before being applied. This means that it is impossible by design
for any unsigned code to ever execute as long as the physical memory backing
the executable cannot be modified externally (for example, by modifying the
contents of an external flash chip while the program is executing). With appro-
priate choices of access controls for on-chip memory, and use of encryption to
prevent tampering with off-chip memory, this risk can be mitigated. After the
initial memory mappings are created the program counter for the newly created
thread is set to the entry point address from the ELF header and the thread is
added to the run queue.
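The header checks described above can be sketched on a hand-rolled subset of the ELF32 header. The constants follow the ELF specification; the struct and function names are ours, and fields are assumed to be already in host byte order for simplicity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal subset of the ELF32 header needed for the validation sketch. */
typedef struct {
    uint8_t  e_ident[16];
    uint16_t e_type, e_machine;
    uint32_t e_version, e_entry, e_phoff;
} Elf32_EhdrMin;

enum { ELFCLASS32 = 1, ELFDATA2MSB = 2, EM_MIPS = 8 };

bool elf_header_ok(const Elf32_EhdrMin *h)
{
    if (h->e_ident[0] != 0x7f || h->e_ident[1] != 'E' ||
        h->e_ident[2] != 'L'  || h->e_ident[3] != 'F')
        return false;                               /* wrong magic numbers */
    if (h->e_ident[4] != ELFCLASS32)  return false; /* must be 32-bit */
    if (h->e_ident[5] != ELFDATA2MSB) return false; /* must be big-endian */
    if (h->e_machine  != EM_MIPS)     return false; /* must target MIPS */
    return true;
}
```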
12. We used HMAC-SHA256 in the prototype due to FPGA capacity limitations, as
well as difficulty finding a suitable open source public key signature core. An actual
ASIC implementation would presumably use RSA or ECC signatures.
6 Security Analysis
6.1 Threat Model
Antikernel’s primary goal is to enforce compartmentalization between user-
space processes, and between user-space and the operating system. The focus is
on damage control, rather than preventing initial penetration. The attacker is
assumed to be remote, so physical attacks are not considered. Existing anti-
tamper techniques can, of course, be used along with the Antikernel architecture to
produce a system with some degree of robustness against physical tampering;
but it is important to note that no physical security is perfect and an attacker
with unrestricted physical access to the system is likely to be able to penetrate
any security given sufficient time and budget.
Antikernel is designed to ensure that the following are not possible once
an attacker has gained unprivileged code execution within the context of a user-
space application or service: (i) download a backdoor payload and configure it
to run after system restart, (ii) modify executable code in memory or persistent
storage, (iii) intercept, spoof, or modify system calls or IPC of another process,
(iv) read or write private state of another process, or (v) gain access to handles
belonging to another process by any means.
We consider an abstract RTL-level model of the system with ideal digital
signals in which it is not possible for the state of one register or input pin to
observe or modify the state of another except if they are connected through
combinatorial logic in the RTL netlist.13
6.3 Assumptions
14. While full verification of the entire implementation is of course desirable, and a
goal we are working toward, it would require many man-years of additional effort.
Additionally, several components of the design are still being optimized and
improved, making a correctness proof of the current code a waste of time.
15. MiniSAT by default, although different solvers can be configured at run time.
16. Since the FPGA microarchitecture is undocumented, equivalence checking on the
actual FPGA bitstream would not be possible without extensive reverse engineering
of the silicon. While an interesting problem, and one that researchers including
author 1 are actively working on [20], it is beyond the scope of this paper.
6.4 Networks
All four combinations of RPC and DMA transceivers (node or router at each
end) for a layer-2 link were formally verified using yosys.17
Each test case instantiates one transmitter and one receiver of the appro-
priate types, as well as testbench code. yosys is then run on each testbench to
synthesize to RTLIL intermediate representation, followed by invoking the SAT
solver to prove the assertions in the testbenches. If the solver declares that all
assertions pass, the proof is considered to hold.
While the testbenches are all slightly different due to the differences in inter-
face between router/client transceivers and RPC/DMA network protocols, their
basic operation is the same. When the test starts, all outputs are in the idle
state and remain so in the absence of external stimuli. When a transmit is
requested, the test logic stores the signals at the transceiver’s inputs and asserts
that the same data exits the receiver a fixed time later. The test also verifies
that attempts to transmit while the receiver is busy block until the receiver is
free (thus preventing dropped packets) and that the transceiver fully resets to
its original state after sending a packet.
It is also necessary to prove that packets are correctly forwarded to the
desired layer-3 destination by routers. We can map the quadtree directly to
routing addresses by allocating two bits of the address to each level of the tree.
Each router simply checks if the high bits match its subnet, forwards out the
downstream port identified by the next two bits if so, and otherwise forwards
out the upstream port. It is easy to see by inspection that this algorithm will
always lead to a correct tree traversal.
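The per-hop decision can be written down directly. This is a software sketch of the routing rule described above (the address width and function signature are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_BITS 16   /* address width is illustrative */

/* A router owns the subtree whose addresses share its 'prefix_bits' high
 * bits with 'subnet' (two bits per quadtree level). Returns the downstream
 * child port 0..3 selected by the next two address bits, or -1 for the
 * upstream port. */
int route(uint16_t dest, uint16_t subnet, int prefix_bits)
{
    uint16_t mask = (uint16_t)(0xFFFFFFFFu << (ADDR_BITS - prefix_bits));
    if ((dest & mask) != (subnet & mask))
        return -1;                              /* outside our subtree: go up */
    int shift = ADDR_BITS - prefix_bits - 2;    /* isolate the next two bits */
    return (dest >> shift) & 0x3;
}
```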
17. It is important to note that due to the large maximum packet size (512 words) it
was not possible to run the DMA network proofs to a steady state; thus the proof
is not complete. The current proof is artificially limited to examining state for the
first 64 cycles and shows that no assertions are violated during this time. Running
the solver on each proof takes about ten minutes on a single CPU core and uses
between three and ten gigabytes of RAM; given a sufficiently large amount of CPU
time and RAM there is no reason why the proof cannot be extended until a steady
state is reached.
Since correct routing at the hop level combined with a valid quadtree topology
implies correct routing at the network level, and the previous proofs show that
link-layer forwarding is correct, the proof for correct end-to-end forwarding thus
reduces to showing that the router correctly implements the routing algorithm,
which is shown by another of our proofs (for the RPC network only).18
is trusted implicitly (and an attacker has no way to modify these addresses short
of an invasive silicon attack). Names being registered by a random NoC node
at run time, however, are not inherently trusted. In order to prevent malicious
name registrations, the name server requires a cryptographic signature to be
presented and validated before the name can be registered.
The overall goal of this research was to determine whether moving operating
system functionality into hardware is a practical means for improving operating
system security. We define a high-level architecture, Antikernel, for an operating
system which freely mixes hardware and software components as equal peers
connected by a packet-switched network. The architecture takes the ideal of
“least required privilege” to the extreme by having each node in the network
be a fully encapsulated system which manages its own security policy, and only
allows access to its internal state through a well-defined API.
The architecture draws inspiration from numerous existing operating system
architectures, such as the microkernel (minimal privileged functionality with
most services in userspace), the exokernel (drivers as very thin wrappers around
hardware providing nothing but security and sharing), and the separation kernel
(enforcing strong isolation between processes except through a defined interface).
Additionally, the modular structure of an Antikernel system is highly
amenable to piecewise formal verification. If we define security of the entire
system as the condition where all security properties of each node are upheld,
we can then prove security by proving security of the interconnect, as well as
proving that every node’s security policy is internally consistent (in other words,
policy cannot be violated by sending arbitrary messages to the NoC interface or
any external communications interfaces).
We hope that this work will serve to inspire future research at the intersection
of computer architecture and security, and lead to more convergent full-stack
design of critical systems. Blurring the lines between hardware and software
appears to be a promising architectural model and one warranting further study.
By releasing all of our source code we hope to encourage future work building
on our design. We intend to continue actively developing the project.
While the current prototype does show that hardware-based operating sys-
tems are practical and can be highly secure, it is far from usable in real-world
applications. Many features which are necessary in a real-world operating system
could not be implemented due to limited manpower, so effort was focused on the
most critical core features such as memory and process management.
The current prototype relies on the initialization code starting all software
applications in the same order (and thus receiving the same thread ID since
these are allocated in FIFO order every boot). A more stable system for binding
processes to IDs is, of course, desirable.
As of this writing, neither of the memory controller implementations has
been formally verified. No part of the CPU (other than the NoC transceivers) has
been formally verified to date. While SARATOGA’s architecture was designed
to minimize the risk of accidental data leakage between thread contexts, until full
verification is completed we cannot rule out the possibility that such a bug exists.
Eventually we would like to verify that the CPUs themselves correctly imple-
ment the semantics of our reduced MIPS-1 instruction set. If we then compiled
our application code with a formally verified C compiler (such as CompCert C
[21,22]) we could have full equivalency proofs from C down to RTL.19 This could
then be combined with verification of the C source code, resulting in fully verified
correct execution from application software all the way down to RTL.
Finally, our prototype is intended to be a proof of concept for hardware-based
compartmentalization at the OS level. As a result, we do not incorporate any of
the numerous defensive techniques in the literature for guarding against physical
tampering, hardware faults, or software-based exploits targeting userland. Cur-
rently, the implementations of many useful subsystems (such as the networking
stack and filesystem) are missing major features or are entirely absent. Although many
of the core components (such as the NoC) have been formally verified, many
higher-level components and peripherals have received basic functional testing
only and the full system should be considered research-grade. Further work could
explore integrating existing software-based mitigations with Antikernel.
The prototype prioritizes ease of verification and implementation over per-
formance: for example, the SARATOGA CPU uses a simple barrel scheduler
which has poor single-threaded performance, lacks support for out-of-order exe-
cution, and has very unoptimized logic for handling L1 cache misses. Although
these factors combine to cause a significant (order of magnitude) performance
reduction compared to a legacy system running the same ISA, these are due to
implementation choices rather than any inherent limitations of the architecture.
We conjecture that a more optimized Antikernel implementation could match or
even exceed the performance of existing OS/hardware combinations due to the
streamlined, exokernel-esque design.
Additionally, although backward compatibility with existing operating sys-
tems was explicitly not a design goal, we have done a small amount of work on
a POSIX compatibility layer. This is unlikely to ever reach “recompile and run”
compatibility with legacy software due to inherent architecture differences, but
we hope that it will help minimize porting effort.
19. The current CompCert compiler does not support the MIPS instruction set, only
x86, ARM, and PowerPC. We plan to explore adding formally verified MIPS code
generation to this or another verified C compiler in the future.
References
1. Zonenberg, A.D.: Antikernel: a decentralized secure hardware-software operating
system architecture. Ph.D. dissertation, Rensselaer Polytechnic Institute (2015)
2. Engler, D.R., et al.: Exokernel: an operating system architecture for application-
level resource management. SIGOPS Oper. Syst. Rev. 29(5), 251–266 (1995)
3. Rushby, J.M.: Design and verification of secure systems. In: Proceedings of the 8th
ACM Symposium on Operating Systems Principles, pp. 12–21 (1981)
4. Martin, W., White, P., Taylor, F.S., Goldberg, A.: Formal construction of the
mathematically analyzed separation kernel. In: 15th IEEE International Confer-
ence Automated Software Engineering, ASE 2000, pp. 133–141 (2000)
5. Baumann, A., et al.: The multikernel: a new OS architecture for scalable multicore
systems. In: Proceedings of the ACM SIGOPS 22nd Symposium Operating Systems
Principles, New York, NY, USA, pp. 29–44 (2009)
6. Rutkowska, J., Wojtczuk, R.: Qubes OS Architecture, January 2010. http://files.
qubes-os.org/files/doc/arch-spec-0.3.pdf
7. ARM Ltd. TrustZone Technology (2014). http://www.arm.com/products/
processors/technologies/trustzone.php. Accessed 09 Apr 2015
8. Zonenberg, A.: Antikernel source repository, 18 March 2016. http://redmine.
drawersteak.com/projects/achd-soc/repository. Accessed 18 Mar 2016
9. Engel, M., Spinczyk, O.: A radical approach to network-on-chip operating
systems. In: 42nd Hawaii International Conference on System Sciences, HICSS
2009, pp. 1–10, January 2009
10. Nordstrom, S., et al.: Application specific real-time microkernel in hardware. In:
14th IEEE-NPSS Real Time Conference 2005, p. 4, June 2005
11. Hu, W., Ma, J., Wu, B., Ju, L., Chan, T.: Distributed on-chip operating system
for network on chip. In: 2010 IEEE 10th International Conference on Computer and
Information Technology (CIT), pp. 2760–2767, 1 July 2010
12. Park, S., et al.: A hardware operating system kernel for multi-processor systems.
IEICE Electron. Express 5(9), 296–302 (2008)
13. So, H.K.-H., et al.: A unified hardware/software runtime environment for FPGA-
based reconfigurable computers using BORPH. In: Proceedings of the 4th Interna-
tional Conference Hardware/Software Codesign Systems Synthesis CODES+ISSS
2006, pp. 259–264 (2006)
14. Wasicek, V., et al.: A system-on-a-chip platform for mixed-criticality applications.
In: 2010 13th IEEE International Symposium on Object/Component/Service-
Oriented Real-Time Distributed Computing (ISORC), pp. 210–216, May 2010
15. Thomas, A., et al.: Towards a Zero-Kernel Operating System, 10 January
2013. http://www.infsec.cs.uni-saarland.de/hritcu/publications/zkos draft jan10
2013.pdf. Accessed 09 Apr 2015
16. BiiN Corporation. BiiN Systems Overview, Portland, OR, July 1988. http://
bitsavers.informatik.uni-stuttgart.de/pdf/biin/BiiN Systems Overview.pdf.
Accessed 09 Apr 2015
17. Kim, Y., et al.: Flipping bits in memory without accessing them: an experimen-
tal study of DRAM disturbance errors. In: 2014 ACM/IEEE 41st International
Symposium on Computer Architecture (ISCA), pp. 361–372, June 2014
18. Evans, C.: Project Zero: Exploiting the DRAM rowhammer bug to gain ker-
nel privileges, 9 March 2015. http://googleprojectzero.blogspot.com/2015/03/
exploiting-dram-rowhammer-bug-to-gain.html. Accessed 09 Apr 2015
19. Wolf, C.: Yosys open synthesis suite. http://www.clifford.at/yosys/
1 Introduction
Anomalous binary curves, generally referred to as Koblitz curves, are binary
elliptic curves satisfying the Weierstrass equation, E_a : y^2 + xy = x^3 + ax^2 + 1,
with a ∈ {0, 1}. Since their introduction in 1991 by Koblitz [21], these curves have
been extensively studied for their additional structure that allows, in principle, a
performance speedup in the computation of the elliptic curve point multiplication
operation. As of today, the research works dealing with standardized Koblitz
curves in commercial use, such as the binary curves standardized by NIST [23] or
the suite of elliptic curves supported by the TLS protocol [4,9], have exclusively
analyzed the security and performance of curves defined over binary extension
fields F2m , with m a prime number (for recent examples see [1,5,32,36]).
Nevertheless, Koblitz curves defined over F4 were also proposed in [21]. We
find it interesting to explore the cryptographic usage of Koblitz curves defined over
F4 due to their inherent use of quadratic field arithmetic. Indeed, it has been
recently shown [3,25] that quadratic field arithmetic is extraordinarily efficient
when implemented in software. This is because one can take full advantage of the
Single Instruction Multiple Data (SIMD) paradigm, where a vector instruction
performs simultaneously the same operation on a set of input data items.
Quadratic extensions F_{q^2} of a binary finite field F_q can be defined by means
of a monic polynomial h(u) of degree two irreducible over F_q. The field F_{q^2} is
isomorphic to F_q[u]/(h(u)) and its elements can be represented as a_0 + a_1·u,
with a_0, a_1 ∈ F_q. The addition of two elements a, b ∈ F_{q^2} can be performed as
c = (a_0 + b_0) + (a_1 + b_1)u. By choosing h(u) = u^2 + u + 1, the multiplication of
a and b can be computed as d = a_0·b_0 + a_1·b_1 + ((a_0 + a_1)·(b_0 + b_1) + a_0·b_0)u. By
carefully organizing the code associated with these arithmetic operations, one can
greatly exploit the pipelines and the inherent instruction-level parallelism
available in contemporary high-end processors.
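The addition and Karatsuba-style multiplication formulas above can be demonstrated with q = 2, in a bit-sliced style reminiscent of the SIMD approach the text describes. In this toy sketch (names are ours), each of the 64 bit lanes of (c0, c1) holds an independent element a_0 + a_1·u of F_4 = F_2[u]/(u^2 + u + 1); GF(2) addition is XOR and multiplication is AND, so 64 F_4 multiplications run in parallel:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t c0, c1; } f4x64;   /* 64 parallel F4 elements */

f4x64 f4_add(f4x64 a, f4x64 b)
{
    return (f4x64){ a.c0 ^ b.c0, a.c1 ^ b.c1 };   /* lane-wise (a0+b0)+(a1+b1)u */
}

f4x64 f4_mul(f4x64 a, f4x64 b)
{
    uint64_t p00 = a.c0 & b.c0;                   /* a0*b0 */
    uint64_t p11 = a.c1 & b.c1;                   /* a1*b1 */
    uint64_t pk  = (a.c0 ^ a.c1) & (b.c0 ^ b.c1); /* (a0+a1)*(b0+b1) */
    return (f4x64){ p00 ^ p11, p00 ^ pk };        /* d per the formula above */
}
```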
Our Contributions. In this work we designed, for the first time, a 128-bit secure
and timing-attack-resistant scalar multiplication on a Koblitz curve defined over
F4, as proposed by Koblitz in his 1991 seminal paper [21]. We developed all the
algorithms required for performing such a computation. This led us to reconsider
the strategy of using redundant trinomials (also known as almost irreducible
trinomials), which were proposed more than ten years ago in [6,10].
We also report what is perhaps the most comprehensive analysis yet reported
of how to efficiently implement arithmetic operations in binary finite fields and
their quadratic extensions using the vectorized instructions available in high-end
microprocessors. For example, to the best of our knowledge, we report for the
first time a 128-bit AVX implementation of the linear pass technique, which is
useful against side-channel attacks.
The remainder of this paper is organized as follows. In Sect. 2 we formally
introduce the family of Koblitz elliptic curves defined over F4. In Sects. 3 and 4
we give a detailed description of the efficient implementation of the base and
quadratic field arithmetic using vectorized instructions. We present in Sect. 5 the
scalar multiplication algorithms used in this work, and in Sect. 6 the analysis
and discussion of the results obtained with our software library. Finally, we draw
our concluding remarks and discuss future work in Sect. 7.
E_a : y^2 + xy = x^3 + aγx^2 + γ, (1)
where γ ∈ F_{2^2} satisfies γ^2 = γ + 1 and a ∈ {0, 1}. Note that the numbers of
points on the curves E_0(F_4) and E_1(F_4) are #E_0(F_4) = 4 and #E_1(F_4) = 6,
respectively. For cryptographic purposes, one uses Eq. (1) over extension
fields of the form F_q, with q = 4^m and m a prime number. The set of affine
Software Implementation of Koblitz Curves over Quadratic Fields 261
points P = (x, y) ∈ F_q × F_q that satisfy Eq. (1), together with the point at
infinity O, forms an abelian group denoted E_a(F_{4^m}), whose group law is
defined by the point addition operation.
Since for each proper divisor l of k, E(F_{4^l}) is a subgroup of E(F_{4^k}), one has
that #E(F_{4^l}) divides #E(F_{4^k}). Furthermore, by choosing prime extensions m,
it is possible to find E_a(F_{4^m}) of almost-prime order, for instance, E_0(F_{2^{2·163}})
and E_1(F_{2^{2·167}}). In the remainder of this paper, we will show that the afore-
mentioned strategy can be used for the efficient implementation of a 128-bit
secure scalar multiplication on software platforms equipped with native 64-bit
carry-less multipliers, such as those available in contemporary personal desktops.
The Frobenius map τ : E_a(F_q) → E_a(F_q), defined by τ(O) = O and τ(x, y) =
(x^4, y^4), is a curve automorphism satisfying (τ^2 + 4)P = μτ(P) for μ = (−1)^a
and all P ∈ E_a(F_q). By solving the equation τ^2 + 4 = μτ, the Frobenius map
can be seen as the complex number τ = (μ ± √−15)/2.
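The characteristic equation can be verified exactly without floating point by scaling away the half-integers. This is an illustrative check of our own devising: with σ = 2τ = μ + δ and δ^2 = −15, the equation τ^2 − μτ + 4 = 0 is equivalent to σ^2 − 2μσ + 16 = 0, which integer arithmetic on pairs (x, y) representing x + y·δ confirms:

```c
#include <assert.h>

typedef struct { long x, y; } zdelta;   /* x + y*delta, with delta^2 = -15 */

zdelta zd_mul(zdelta a, zdelta b)
{
    /* (a.x + a.y d)(b.x + b.y d) = a.x b.x - 15 a.y b.y + (a.x b.y + a.y b.x) d */
    return (zdelta){ a.x * b.x - 15 * a.y * b.y, a.x * b.y + a.y * b.x };
}

/* Evaluates sigma^2 - 2*mu*sigma + 16 at sigma = mu + delta; should be zero. */
zdelta char_poly_at_sigma(int mu)
{
    zdelta sigma = { mu, 1 };
    zdelta s2 = zd_mul(sigma, sigma);
    return (zdelta){ s2.x - 2 * mu * sigma.x + 16, s2.y - 2 * mu * sigma.y };
}
```

The same representation also confirms the norm N(σ) = σ·σ̄ = μ^2 + 15 = 16 = 4·N(τ), i.e. N(τ) = 4, the value used in the norm computations below.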
Given a Koblitz curve E_a/F_{2^{2m}} with group order #E_a(F_{2^{2m}}) = h · ρ · r, where
h is the order #E_a(F_4), r is the prime order of our subgroup of interest, and ρ
is the order of a subgroup of no cryptographic interest,1 we can express a scalar
k ∈ Z_r as an element of Z[τ] using the now-classical partial reduction introduced
by Solinas [31], with a few modifications. The modified version is based on the
fact that τ^2 = μτ − 4.
Given that the norms are N(τ) = 4, N(τ − 1) = h, N(τ^m − 1) = h · ρ · r, and
N((τ^m − 1)/(τ − 1)) = ρ · r, the subscalars r_0 and r_1 resulting from the partial
modulo function will both be of size approximately √(ρ · r). As a consequence,
the corresponding scalar multiplication will need more iterations than expected,
since it will consider the order ρ of a subgroup which is not of cryptographic
interest.
For that reason, we took the design decision of assuming that the input
scalar of our point multiplication algorithm is already given in the Z[τ] domain.
As a result, a partial reduction of the scalar k is no longer required, and the
number of iterations in the point multiplication will be consistent with the size
of the scalar k. If one needs to retrieve the equivalent value of the scalar k in the
ring Z_r, this can easily be computed with one multiplication and one addition in Z_r.
This strategy is in line with the degree-2 scalar decomposition method within
the GLS curves context as suggested in [12].
1. Usually the order ρ is composite. Also, every prime factor of ρ is smaller than r.
262 T. Oliveira et al.
Output: ρ = Σ_{i=0} v_i τ^{i(w−1)}
1: for i ← 0 to ⌈(m+2)/(w−1)⌉ − 1 do
2:   if w = 2 then
3:     v_i ← ((r_0 − 4·r_1) mod 8) − 4
4:     r_0 ← r_0 − v_i
5:   else
6:     v ← (r_0 + r_1·t_w mod 2^{2w−1}) − 2^{2(w−1)}
7:     if v > 0 then s ← 1 else s ← −1
8:     r_0 ← r_0 − s·β_v, r_1 ← r_1 − s·γ_v, v_i ← s·α_v
9:   end if
10:  for j ← 0 to w − 2 do
11:    t ← r_0, r_0 ← r_1 + (μ·r_0)/4, r_1 ← −t/4
12:  end for
13: end for
14: if r_0 = 0 and r_1 = 1 then
15:   v_i ← r_0 + r_1·τ
16: else
17:   if r_1 = 0 then
18:     v_i ← r_1
19:   else
20:     v_i ← r_0
21:   end if
22: end if
mixed additions and four applications of the Frobenius map for the w = 3 case
and one point doubling, twenty full additions, eleven mixed additions and five
applications of the Frobenius map for the w = 4 case.
5. The symbols ≪ and ≫ stand for left and right bitwise shifts of packed 64-bit
integers.
The overall cost of the modular reduction is ten xors and five bitwise shifts.
At the end of the scalar multiplication, we have to reduce the 192-bit polynomial
to an element of the field F_{2^{149}}. Note that the trinomial g(x) = x^192 + x^19 + 1
factorizes into a 69-term irreducible polynomial f(x) of degree 149.
The final reduction is performed via the mul-and-add reduction which, exper-
imentally, performed more efficiently than the shift-and-add reduction.6 Con-
cisely, the mul-and-add technique consists of a series of steps which include
shift operations (in order to align the bits in the registers), carry-less multipli-
cations, and xor operations for eliminating the extra bits.
The basic mul-and-add step is described in Algorithm 3. Here, besides the
usual notation, we denote the 64-bit carry-less multiplication by the symbol
×_{ij}, where i, j ∈ {L, H}, with L and H representing the lowest and highest
64-bit word packed in a 128-bit register, respectively. For example, to multiply
the lowest 64-bit word of the 128-bit register A by the highest 64-bit word of
the 128-bit register B, we would write T ← A ×_{LH} B.
Algorithm 3 requires four xors, three bitwise shifts and three carry-less multiplications. In our particular case, the difference between the degrees of the two most significant monomials of f(x) is three. Also, note that we need to reduce 43 bits (191–148). As a result, ⌈43/3⌉ = 15 applications of Algorithm 3 are required to conclude this reduction.
⁶ For a more detailed explanation of the shift-and-add and the mul-and-add reduction methods for binary fields, see [5].
Software Implementation of Koblitz Curves over Quadratic Fields 267
Using this setting, there still exists some overhead in the multiplication and squaring arithmetic operations, even though the penalty on the latter operation is almost negligible. On the positive side, the terms of the elements a_0, a_1 do not need to be rearranged, and the modular reduction of these two base field elements can be performed in parallel, as discussed next.
4.2 Multiplication
Given two F_{2^{2·149}} elements a = (a_0 + a_1u) and b = (b_0 + b_1u), with a_0, a_1, b_0, b_1
in F_{2^149}, we perform the multiplication c = a · b as

  c = a · b = (a_0 + a_1u) · (b_0 + b_1u)
            = (a_0b_0 ⊕ a_1b_1) + (a_0b_0 ⊕ (a_0 ⊕ a_1) · (b_0 ⊕ b_1))u,

where each element a_i, b_i ∈ F_{2^149} is composed of three 64-bit words. The analysis
of the cost of the Karatsuba algorithm for different word sizes was presented in [35].
There, it was shown that the most efficient way to multiply three 64-bit word
polynomials s(x) = s_2x^2 + s_1x + s_0 and t(x) = t_2x^2 + t_1x + t_0 as v(x) = s(x) · t(x)
is through the one-level Karatsuba method,
  V_0 = s_0 · t_0,   V_1 = s_1 · t_1,   V_2 = s_2 · t_2,
  V_{0,1} = (s_0 ⊕ s_1) · (t_0 ⊕ t_1),   V_{0,2} = (s_0 ⊕ s_2) · (t_0 ⊕ t_2),   V_{1,2} = (s_1 ⊕ s_2) · (t_1 ⊕ t_2),

  v(x) = V_2·x^4 + (V_{1,2} ⊕ V_1 ⊕ V_2)·x^3 + (V_{0,2} ⊕ V_0 ⊕ V_1 ⊕ V_2)·x^2 + (V_{0,1} ⊕ V_0 ⊕ V_1)·x + V_0,

which costs six multiplications and twelve additions. The Karatsuba algorithm
as used in this work is presented in Algorithm 4.⁷
Algorithm 4 requires six carry-less instructions, six vectorized xors and three
bitwise shift instructions. In order to calculate the total multiplication cost, it
is necessary to include the Karatsuba pre-computation operations at the base
field level (twelve vectorized xors and six byte interleaving instructions) and
at the quadratic field level (six vectorized xors). Also, we must consider the
reorganization of the registers in order to proceed with the modular reduction
(six vectorized xors).
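The one-level Karatsuba formula above can be checked with a short portable sketch; here 32-bit words and a software carry-less multiply stand in for the 64-bit words and the PCLMULQDQ instruction used in the implementation:

```c
#include <stdint.h>

/* Carry-less (polynomial) product of two 32-bit words; a portable
   stand-in for the PCLMULQDQ instruction. */
static uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1) r ^= (uint64_t)a << i;
    return r;
}

/* One-level Karatsuba for three-word polynomials s(x) and t(x):
   six carry-less multiplications and twelve xors, following the
   formula for v(x) given in the text. */
static void kara3(const uint32_t s[3], const uint32_t t[3], uint64_t v[5]) {
    uint64_t v0  = clmul32(s[0], t[0]);
    uint64_t v1  = clmul32(s[1], t[1]);
    uint64_t v2  = clmul32(s[2], t[2]);
    uint64_t v01 = clmul32(s[0] ^ s[1], t[0] ^ t[1]);
    uint64_t v02 = clmul32(s[0] ^ s[2], t[0] ^ t[2]);
    uint64_t v12 = clmul32(s[1] ^ s[2], t[1] ^ t[2]);
    v[4] = v2;
    v[3] = v12 ^ v1 ^ v2;
    v[2] = v02 ^ v0 ^ v1 ^ v2;
    v[1] = v01 ^ v0 ^ v1;
    v[0] = v0;
}
```

Replacing clmul32 with 64-bit carry-less multiplications would give the six-multiplication routine of Algorithm 4.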
4.4 Squaring
Squaring is a very important function in the Koblitz curve point multiplication
algorithm, since it is the building block for computing the τ endomorphism.
In our implementation, we computed the squaring operation through carry-less
multiplication instructions which, experimentally, was a less expensive approach
than the bit-interleaving method (see [15, Sect. 2.3.4]). The pre-processing
phase is straightforward: we just need to rearrange the 32-bit packed words of the
128-bit registers in order to prepare them for the subsequent modular reduction.
The pre- and post-processing phases require three shuffle instructions, three
vectorized xors and three bitwise shifts. The complete function is described in
Algorithm 6. Given 128-bit registers R_i, we depict the SSE 32-bit shuffle
operation as R_0 ← R_1⟨x_3x_2x_1x_0⟩. For instance, if we compute R_0 ← R_1⟨3210⟩, it just
maintains the 32-bit word order of the register R_1; in other words, it just copies
R_1 to R_0. The operation R_0 ← R_1⟨2103⟩ rotates the register R_1 to the left by
32 bits. See [17,18] for more details.
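The SSE 32-bit shuffle used here can be emulated in portable C (a sketch of the shuffle semantics only, not the intrinsic itself; word 0 is taken as the least significant 32-bit word of the 128-bit register):

```c
#include <stdint.h>

/* Emulation of R0 <- R1<x3 x2 x1 x0>: destination word j receives
   source word x_j.  Word 0 is the least significant 32-bit word. */
static void shuffle32(uint32_t dst[4], const uint32_t src[4],
                      int x3, int x2, int x1, int x0) {
    uint32_t t0 = src[x0], t1 = src[x1], t2 = src[x2], t3 = src[x3];
    dst[0] = t0; dst[1] = t1; dst[2] = t2; dst[3] = t3;
}
```

With this convention, ⟨3210⟩ is the identity copy and ⟨2103⟩ rotates the register left by 32 bits, as described above.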
4.5 Inversion
The inversion operation is computed via the Itoh-Tsujii method [19]. Given an
element c ∈ F_{2^m}, we compute c^{−1} = c^{(2^{m−1}−1)·2} through an addition chain,
which in each step computes the terms (c^{2^i−1})^{2^j} · c^{2^j−1} with 0 ≤ j ≤ i ≤ m − 1.
For the case m = 149, the following chain is used:

  1 → 2 → 4 → 8 → 16 → 32 → 33 → 66 → 74 → 148.
This addition chain is optimal and was found through the procedure described
in [7]. Note that although we compute the inversion operation over polynomials
in F2[x] (reduced modulo g(x) = x^192 + x^19 + 1), we still have to perform the
addition chain with m = 149, since we are in fact interested in the embedded
F_{2^149} field element.
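The structure of such a chain can be illustrated in miniature. The sketch below runs Itoh-Tsujii in the toy field F2[x]/(x^5 + x^2 + 1) with the chain 1 → 2 → 4 for m = 5 (the toy modulus and field size are illustrative choices, not the paper's parameters):

```c
#include <stdint.h>

/* Multiplication in the toy field F_{2^5} = F2[x]/(x^5 + x^2 + 1),
   elements stored in the low 5 bits of a byte. */
static uint8_t gf32_mul(uint8_t a, uint8_t b) {
    uint16_t r = 0;
    for (int i = 0; i < 5; i++)
        if ((b >> i) & 1) r ^= (uint16_t)a << i;
    for (int i = 8; i >= 5; i--)           /* reduce mod x^5 + x^2 + 1 */
        if ((r >> i) & 1) r ^= (uint16_t)0x25 << (i - 5);
    return (uint8_t)r;
}

static uint8_t gf32_sq(uint8_t a) { return gf32_mul(a, a); }

/* Itoh-Tsujii: c^-1 = (c^(2^(m-1)-1))^2 via the chain 1 -> 2 -> 4.
   Each step i+j computes (c^(2^i-1))^(2^j) * c^(2^j-1) = c^(2^(i+j)-1). */
static uint8_t gf32_inv(uint8_t c) {
    uint8_t t1 = gf32_mul(gf32_sq(c), c);              /* c^(2^2 - 1) */
    uint8_t t2 = gf32_mul(gf32_sq(gf32_sq(t1)), t1);   /* c^(2^4 - 1) */
    return gf32_sq(t2);                                /* c^(2^5 - 2) */
}
```

For m = 149 the same skeleton simply follows the longer chain above, with multisquarings in place of the repeated gf32_sq calls.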
The downside of this algorithm is that the accumulators carry sensitive information about the digits of the scalar, and the accumulators are both read and written. As a result, it is necessary to apply the linear-pass algorithm to the accumulators Q_i twice per iteration.
6.1 Parameters
r = 0x637845F7F8BFAB325B85412FB54061F148B7F6E79AE11CC843ADE1470F7E4E29,
In Table 2, we present the timings for the base and the quadratic field arithmetic. The multisquaring operation is used to support the Itoh-Tsujii addition chain; therefore, it is implemented only in the field F_{2^149} (actually, on a 192-bit polynomial in F2[x]). In addition, we give timings for reducing a 192-bit polynomial in F2[x] modulo f(x). Finally, all timings of operations in the quadratic field include the subsequent modular reduction.
Applying the techniques presented in [27], we found that our machine has a measurement margin of error of four cycles. This range is not significant for the timings of the point arithmetic or the scalar multiplication. Nevertheless, for inexpensive functions such as multiplication and squaring, it is recommended to take it into account when comparing timings between different compilers.
Table 2. Timings (in clock cycles) for the finite field operations in F22·149 using different
compiler families
Table 3. The ratio between the arithmetic and multiplication in F2149 . The timings
were taken from the code compiled with the clang 3.8 compiler
Table 4. Timings (in clock cycles) for point addition over a Koblitz curve E_1/F_{q^2} using
different compiler families
Table 5. The ratio between the timings of point addition and the field multiplication.
The timings were taken from the code compiled with the clang 3.8 compiler
Here we report the timings for the left-to-right regular w-τNAF τ-and-add scalar multiplication with w = 2, 3, 4. The setting w = 2 is presented in order to analyze how the balance between the pre-computation and the main-iteration costs works in practice. Our main result lies in the setting w = 3. In addition, among the scalar multiplication timings, we show, in Table 6, the costs of the regular recoding and the linear-pass functions.
Table 6. A comparison of the scalar multiplication and its support functions timings
(in clock cycles) between different compiler families
6.4 Comparisons
In Table 7, we compare our implementation with state-of-the-art works. Our 3-τNAF left-to-right τ-and-add point multiplication outperforms the work in [24], which is considered the fastest protected 128-bit-secure Koblitz implementation, by 29.64 %. When compared with prime curves, our work is surpassed by the works in [8] and [2] by 15.29 % and 13.06 %, respectively.
Table 7. Scalar multiplication timings (in clock cycles) on 128-bit secure elliptic curves
7 Conclusion
We have presented a comprehensive study of how to efficiently implement Koblitz elliptic curves defined over quaternary fields F_{4^m}, using vectorized instructions on the Intel Haswell and Skylake microarchitectures.
References
1. Aranha, D.F., Faz-Hernández, A., López, J., Rodríguez-Henríquez, F.: Faster
implementation of scalar multiplication on Koblitz curves. In: Hevia, A., Neven,
G. (eds.) LatinCrypt 2012. LNCS, vol. 7533, pp. 177–193. Springer, Heidelberg
(2012)
2. Bernstein, D.J., Chuengsatiansup, C., Lange, T., Schwabe, P.: Kummer strikes
back: new DH speed records. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014.
LNCS, vol. 8873, pp. 317–337. Springer, Heidelberg (2014)
3. Birkner, P., Longa, P., Sica, F.: Four-dimensional Gallant-Lambert-Vanstone scalar
multiplication. Cryptology ePrint Archive, Report 2011/608 (2011). http://eprint.
iacr.org/
4. Blake-Wilson, S., Bolyard, N., Gupta, V., Hawk, C., Moeller, B.: Elliptic Curve
Cryptography (ECC) cipher suites for Transport Layer Security (TLS). RFC
4492. Internet Engineering Task Force (IETF) (2006). https://tools.ietf.org/html/
rfc4492
5. Bluhm, M., Gueron, S.: Fast software implementation of binary elliptic curve cryp-
tography. J. Cryptogr. Eng. 5(3), 215–226 (2015)
6. Brent, R.P., Zimmermann, P.: Algorithms for finding almost irreducible and almost
primitive trinomials. In: Primes and Misdemeanours: Lectures in Honour of the
Sixtieth Birthday of Hugh Cowie Williams, Fields Institute, p. 212 (2003)
7. Clift, N.M.: Calculating optimal addition chains. Computing 91(3), 265–284 (2011)
8. Costello, C., Longa, P.: FourQ: four-dimensional decompositions on a Q-curve over
the Mersenne prime. In: Iwata, T., et al. (eds.) ASIACRYPT 2015. LNCS, vol. 9452,
pp. 214–235. Springer, Heidelberg (2015). doi:10.1007/978-3-662-48797-6 10
9. Dierks, T., Rescorla, E.: The Transport Layer Security (TLS) Protocol version 1.2.
RFC 5246. Internet Engineering Task Force (IETF) (2008). https://tools.ietf.org/
html/rfc5246
10. Doche, C.: Redundant trinomials for finite fields of characteristic 2. In: Boyd, C.,
González Nieto, J.M. (eds.) ACISP 2005. LNCS, vol. 3574, pp. 122–133. Springer,
Heidelberg (2005)
11. Galbraith, S.D., Gaudry, P.: Recent progress on the elliptic curve discrete logarithm
problem. Des. Codes Cryptogr. 78(1), 51–72 (2016)
12. Galbraith, S.D., Lin, X., Scott, M.: Endomorphisms for faster elliptic curve cryp-
tography on a large class of curves. In: Joux, A. (ed.) EUROCRYPT 2009. LNCS,
vol. 5479, pp. 518–535. Springer, Heidelberg (2009)
13. Gaudry, P., Hess, F., Smart, N.P.: Constructive and destructive facets of Weil
descent on elliptic curves. J. Cryptol. 15, 19–46 (2002)
14. Hankerson, D., Karabina, K., Menezes, A.: Analyzing the Galbraith-Lin-Scott
point multiplication method for elliptic curves over binary fields. IEEE Trans.
Comput. 58(10), 1411–1420 (2009)
15. Hankerson, D., Menezes, A.J., Vanstone, S.: Guide to Elliptic Curve Cryptography.
Springer, Secaucus (2003)
16. Hess, F.: Generalising the GHS attack on the elliptic curve discrete logarithm
problem. LMS J. Comput. Math. 7, 167–192 (2004)
17. Intel Corporation: Intel Intrinsics Guide. https://software.intel.com/sites/
landingpage/IntrinsicsGuide/. Accessed 18 Feb 2016
18. Intel Corporation: Intel 64 and IA-32 Architectures Software Developers Manual
325462–056US (2015)
19. Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in
GF(2^m) using normal bases. Inf. Comput. 78(3), 171–177 (1988)
20. Knudsen, E.W.: Elliptic scalar multiplication using point halving. In: Lam, K.-Y.,
Okamoto, E., Xing, C. (eds.) ASIACRYPT 1999. LNCS, vol. 1716, pp. 135–149.
Springer, Heidelberg (1999)
21. Koblitz, N.: CM-curves with good cryptographic properties. In: Feigenbaum, J.
(ed.) CRYPTO 1991. LNCS, vol. 576, pp. 279–287. Springer, Heidelberg (1992)
22. Menezes, A., Qu, M.: Analysis of the Weil descent attack of Gaudry, Hess and
Smart. In: Naccache, D. (ed.) CT-RSA 2001. LNCS, vol. 2020, pp. 308–318.
Springer, Heidelberg (2001)
23. National Institute of Standards and Technology: Recommended elliptic curves for
federal government use. NIST Special Publication (1999). http://csrc.nist.gov/
csrc/fedstandards.html
24. Oliveira, T., Aranha, D.F., López, J., Rodríguez-Henríquez, F.: Fast point multiplication algorithms for binary elliptic curves with and without precomputation.
In: Joux, A., Youssef, A. (eds.) SAC 2014. LNCS, vol. 8781, pp. 324–344. Springer,
Heidelberg (2014)
25. Oliveira, T., López, J., Aranha, D.F., Rodríguez-Henríquez, F.: Two is the fastest
prime: lambda coordinates for binary elliptic curves. J. Cryptogr. Eng. 4(1), 3–17
(2014)
26. Page, D.: Theoretical use of cache memory as a cryptanalytic side-channel. Cryp-
tology ePrint Archive, Report 2002/169 (2002). http://eprint.iacr.org/
27. Paoloni, G.: How to benchmark code execution times on Intel IA-32 and IA-64
instruction set architectures. Technical report, Intel Corporation (2010)
28. Schroeppel, R.: Cryptographic elliptic curve apparatus and method (2000). US
patent 2002/6490352 B1
29. Scott, M.: Optimal irreducible polynomials for GF(2^m) arithmetic. Cryptology
ePrint Archive, Report 2007/192 (2007). http://eprint.iacr.org/
30. Solinas, J.A.: An improved algorithm for arithmetic on a family of elliptic curves.
In: Kaliski Jr., B.S. (ed.) CRYPTO 1997. LNCS, vol. 1294, pp. 357–371. Springer,
Heidelberg (1997)
31. Solinas, J.A.: Efficient arithmetic on Koblitz curves. Des. Codes Cryptogr. 19(2–3),
195–249 (2000)
32. Taverne, J., Faz-Hernández, A., Aranha, D.F., Rodríguez-Henríquez, F., Hankerson, D., López, J.: Software implementation of binary elliptic curves: impact of
the carry-less multiplier on scalar multiplication. In: Preneel, B., Takagi, T. (eds.)
CHES 2011. LNCS, vol. 6917, pp. 108–123. Springer, Heidelberg (2011)
33. Trost, W.R., Xu, G.: On the optimal pre-computation of window τ -NAF for Koblitz
curves. Cryptology ePrint Archive, Report 2014/664 (2014). http://eprint.iacr.org/
34. Tsunoo, Y., Tsujihara, E., Minematsu, K., Miyauchi, H.: Cryptanalysis of block
ciphers implemented on computers with cache. In: International Symposium on
Information Theory and Its Applications, pp. 803–806. IEEE Information Theory
Society (2002)
35. Weimerskirch, A., Paar, C.: Generalizations of the Karatsuba algorithm for efficient
implementations. Cryptology ePrint Archive, Report 2006/224 (2006). http://
eprint.iacr.org/
36. Wenger, E., Wolfger, P.: Solving the discrete logarithm of a 113-bit Koblitz curve
with an FPGA cluster. In: Joux, A., Youssef, A. (eds.) SAC 2014. LNCS, vol. 8781,
pp. 363–379. Springer, Heidelberg (2014)
QcBits: Constant-Time Small-Key
Code-Based Cryptography
Tung Chou(B)
1 Introduction
In 2012, Misoczki et al. proposed to use QC-MDPC codes for code-based cryp-
tography [3]. The main benefit of using QC-MDPC codes is that they allow
small key sizes, as opposed to using binary Goppa codes as proposed in the orig-
inal McEliece paper [1]. Since then, implementation papers for various platforms
have been published; see [4,5] (for FPGA and AVR), [7,9] (for Cortex-M4), and
[11] (for Haswell, includes results from [4,5,7]).
One problem of QC-MDPC codes is that the most widely used decoding
algorithm, when implemented naively, leaks information about secrets through
timing. Even though decoding is only used for decryption, the same problem can also arise if the key generation and encryption are not constant-time. Unfortunately, the only software implementation paper that addresses the timing-attack issue is [7], which offers constant-time encryption and decryption on a platform without caches for writable memory.
This paper presents QcBits (pronounced “quick-bits”), a fully constant-time
implementation of a QC-MDPC-code-based encryption scheme. QcBits provides
This work was supported by the Netherlands Organisation for Scientific Research (NWO) under grant 639.073.005 and by the Commission of the European Communities through the Horizon 2020 program under project number 645622 PQCRYPTO.
Permanent ID of this document: 172b0e150c3b6be91b0bdaa0870c1e7d. Date:
2016.03.13.
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 280–300, 2016.
DOI: 10.1007/978-3-662-53140-2 14
Table 1. Performance results for QcBits, [7,9], and the vectorized implementation
in [11]. The “key-pair” column shows cycle counts for generating a key pair. The
“encrypt” column shows cycle counts for encryption. The “decrypt” column shows
cycle counts for decryption. For the performance numbers of QcBits, 59-byte plaintexts
are used to follow the eBACS [16] convention. For [9], 32-byte plaintexts are used.
Cycle counts labeled with * mean that the implementation of the operation is not
constant-time on the platform, which means that the worst-case performance can be
much worse (especially for decryption). Note that all the results are for 2^80 security
(r = 4801, w = 90, t = 84; see Sect. 2.1).
The implementation “ref” is the reference implementation, which can be run on all reasonable 64/32-bit platforms. The implementation “clmul” is a specialized implementation that relies on the PCLMULQDQ instruction, i.e., the 64 × 64 → 128-bit carry-less multiplication instruction. The implementation “no-cache” is similar to “ref” except that it does not provide full protection against cache-timing attacks. Both “ref” and “clmul” are constant-time, even on platforms with caches; “no-cache” is constant-time only on platforms that do not have a cache for writable memory. Regarding previous works, neither the implementation in [11] for Haswell nor the one in [9] for Cortex-M4 is constant-time. [7] seems to provide constant-time encryption and decryption, even though the paper argues about resistance against simple power analysis instead of being constant-time.
On the Haswell microarchitecture, QcBits is about twice as fast as [11] for decryption and an order of magnitude faster for key-pair generation, even though the implementation of [11] is not constant-time. QcBits takes many more cycles for encryption. This is mainly because QcBits uses a slow source of randomness; see Sect. 3.1 for more discussion. A minor reason is that KEM/DEM encryption intrinsically requires somewhat more operations than McEliece encryption, e.g., hashing.
For tests on Cortex-M4, STM32F407 is used for QcBits and [7], while [9]
uses STM32F417. Note that there is no cache for writable memory (SRAM)
on these devices. QcBits is faster than [9] for encryption and decryption. The
difference is even bigger when compared to [7]. The STM32F407/417 product
lines provide from 512 kilobytes to 1 megabyte of flash. [9] reports a flash usage
of 16 kilobytes, while the implementation no-cache uses 62 kilobytes of flash
when the symmetric primitives are included and 38 kilobytes without symmetric
primitives. See Sect. 2.3 for more discussions on the symmetric primitives.
It is important to note that, since the decoding algorithm is probabilistic, each implementation of decryption comes with a failure rate. For QcBits no decryption failure occurred in 10^8 trials. I have not found “thresholds” for the decoding algorithm that achieve the same level of failure rate at a 2^128 security level, which is why QcBits uses a 2^80-security parameter set. For [11], no decryption failure occurred in 10^7 trials. For [9] the failure rate is not indicated, but the decoder seems to be the same as in [11]. It is unclear what level of failure rate [7] achieves. See Sect. 7 for more discussion of failure rates.
Table 2 shows performance results for 128-bit security. Using thresholds derived from the formulas in [3, Section A] leads to a failure rate of 6.9·10^−3 using 12 decoding iterations. Experiments show that there are some sets of thresholds that achieve a failure rate around 10^−5 using 19 decoding iterations, but this is still far from 10^−8; see Sect. 6 for the thresholds. Note that [9,11] did not specify the failure rates they achieved for 128-bit security, and [7] does not have an implementation for 128-bit security. It is reported in [3] that no decryption failure occurred in 10^7 trials for all the parameter sets presented in the paper (including the ones used for Tables 1 and 2), but they did not provide details such as how many decoding iterations are required to achieve this.
Table 2. Performance results for QcBits, [9], and the vectorized implementation in [11]
for 128-bit security (r = 9857, w = 142, t = 134; see Sect. 2.1). The cycle counts
for QcBits decryption are underlined to indicate that these are cycle counts for one
decoding iteration. Experiments show that QcBits can achieve a failure rate around
10^−5 using 19 decoding iterations (see Sect. 6).
2 Preliminaries
This section presents preliminaries for the following sections. Section 2.1 gives a brief review of the definition of QC-MDPC codes. Section 2.2 describes the “bit-flipping” algorithm for decoding QC-MDPC codes. Section 2.3 gives a specification of the KEM/DEM encryption scheme implemented by QcBits.
even though the original paper allows a row permutation on H. Note that being
quasi-cyclic implies that H has a fixed row weight w. The following is a small
parity-check matrix with r = 5, w = 4:

  1 0 1 0 0 0 1 0 0 1
  0 1 0 1 0 1 0 1 0 0
  0 0 1 0 1 0 1 0 1 0
  1 0 0 1 0 0 0 1 0 1
  0 1 0 0 1 1 0 0 1 0
As opposed to many other linear codes that allow efficient deterministic decoding, the most popular decoder for (QC-)MDPC codes, the “bit-flipping” algorithm, is a probabilistic one. The bit-flipping algorithm shares the same idea
• Flip all positions that violate at least max({u_i}) − δ parity checks, where δ
  is a small integer, say 5.
• Flip all positions that violate at least T_i parity checks, where T_i is a
  precomputed threshold for iteration i.
In previous works several variants have been invented. For example, one variant
based on the first approach simply restarts decoding with a new δ if decoding
fails in 10 iterations.
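Using the small 5 × 10 parity-check matrix from Sect. 2.1, the mechanics of one threshold-based iteration can be sketched as follows (an illustration only; a matrix this small is far too dense for reliable MDPC decoding):

```c
#include <stdint.h>

enum { R = 5, N = 10 };

/* Counts of unsatisfied parity checks:
   u[j] = #{ i : H[i][j] = 1 and s_i = 1 }, where s = H v over F2. */
static void upc_counts(const uint8_t H[R][N], const uint8_t v[N], int u[N]) {
    uint8_t s[R];
    for (int i = 0; i < R; i++) {
        s[i] = 0;
        for (int j = 0; j < N; j++) s[i] ^= H[i][j] & v[j];
    }
    for (int j = 0; j < N; j++) {
        u[j] = 0;
        for (int i = 0; i < R; i++) u[j] += H[i][j] & s[i];
    }
}

/* One bit-flipping iteration with a fixed threshold T. */
static void bitflip_iter(const uint8_t H[R][N], uint8_t v[N], int T) {
    int u[N];
    upc_counts(H, v, u);
    for (int j = 0; j < N; j++)
        if (u[j] >= T) v[j] ^= 1;
}
```

QcBits performs the same count-then-flip step, but with the counts computed via polynomial multiplications in Z[x]/(x^r − 1) as described in Sect. 5.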
QcBits uses precomputed thresholds. The number of decoding iterations is
set to be 6, and the thresholds are
32-byte authentication key are then generated as the first and second half
of the 64-byte hash value of the byte stream. The plaintext m is encrypted
and authenticated using the symmetric keys. The ciphertext for the whole
KEM/DEM scheme is then the concatenation of the public syndrome, the cipher-
text under symmetric encryption, and the tag. In total the ciphertext takes
(r + 7)/8 + |m| + 16 bytes.
When receiving an input stream, the decryption process parses it as the
concatenation of a public syndrome, a ciphertext under symmetric encryption,
and a tag. Then an error vector e is computed by feeding the public syndrome
into the decoding algorithm. If P e = s, decoding is successful. Otherwise, a
decoding failure occurs. The symmetric keys are then generated by hashing e
to perform symmetric decryption and verification. QcBits reports a decryption
failure if and only if the verification fails or the decoding fails.
3 Key-Pair Generation
This section shows how QcBits performs key-pair generation using multipli-
cations in F2 [x]/(xr − 1). Section 3.1 shows how the private key is gener-
ated. Section 3.2 shows how key-pair generation is viewed as multiplications in
F2 [x]/(xr − 1). Section 3.3 shows how multiplications in F2 [x]/(xr − 1) are imple-
mented. Section 3.4 shows how squarings in F2 [x]/(xr − 1) are implemented.
The Gaussian elimination induces a linear combination of the rows of H^(0) that
results in P^(0)_{0,:}. In other words, there exists a set I of indices such that

  1 = Σ_{i∈I} x^i H^(0)_{0,:}(x) = (Σ_{i∈I} x^i) H^(0)_{0,:}(x),

and consequently

  P^(1)_{0,:}(x) = Σ_{i∈I} x^i H^(1)_{0,:}(x) = (Σ_{i∈I} x^i) H^(1)_{0,:}(x).

In other words, the public key can be generated by finding the inverse of H^(0)_{0,:}(x)
in F2[x]/(x^r − 1) and then multiplying the inverse by H^(1)_{0,:}(x). The previous
Since

  |(F2[x]/(f^(i)(x))^{p_i})^*| = 2^{deg(f^(i))·p_i} · (2^{deg(f^(i))} − 1)/2^{deg(f^(i))}
                               = 2^{deg(f^(i))·p_i} − 2^{deg(f^(i))·(p_i−1)},
This takes 10 doublings and 5 additions. Using the same approach, it is easy to
find an addition chain for 2^109 − 1 that takes 108 doublings and 10 additions.
QcBits then combines the addition chains for 2^11 − 1 and 2^109 − 1 to form an
addition chain for 2^{11·109} − 1 = 2^1199 − 1, which takes 10·109 + 108 = 1198 doublings
and 5 + 10 = 15 additions. Once the (2^1199 − 1)-th power is computed, the
(2^1200 − 2)-th power can be computed using one squaring. In total, computation of the
(2^1200 − 2)-th power takes 1199 squarings and 15 multiplications in F2[x]/(x^4801 − 1).
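The exponent bookkeeping behind this chain composition can be checked on a toy analogue, with 2^3 − 1 and the chain 1 → 2 → 4 → 5 standing in for 2^11 − 1 and the chain for 109 (squaring doubles an exponent; multiplying by a stored power adds exponents):

```c
#include <stdint.h>

/* Compose the addition chain 1 -> 2 -> 4 -> 5 with a precomputed
   2^3 - 1, yielding 2^15 - 1.  Each step (i, j) computes
   (2^(3i) - 1) * 2^(3j) + (2^(3j) - 1) = 2^(3(i+j)) - 1. */
static uint64_t chain_15(void) {
    uint64_t t1 = (1ULL << 3) - 1;    /* 2^3  - 1                          */
    uint64_t t2 = (t1 << 3) + t1;     /* 2^6  - 1: 3 doublings, 1 addition */
    uint64_t t4 = (t2 << 6) + t2;     /* 2^12 - 1: 6 doublings, 1 addition */
    return (t4 << 3) + t1;            /* 2^15 - 1: 3 doublings, 1 addition */
}
```

The 2^1199 − 1 chain follows the same pattern, only with 11-bit blocks and the 109-step chain, which is why its doubling count is 10·109 + 108.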
Finally, with the inverse, P^(1)_{0,:}(x) can be computed using one multiplication.
The public key is defined to be a representation of P^(1)_{:,0} instead of P^(1)_{0,:}. QcBits
thus derives P^(1)_{:,0} from P^(1)_{0,:} by noticing

  P^(1)_{j,0} = P^(1)_{0,r−j}  if j > 0,
  P^(1)_{j,0} = P^(1)_{0,0}    if j = 0.

Note that the conversion between P^(1)_{0,:} and P^(1)_{:,0} does not need to be
constant-time, because it can be easily reversed from public data.
The user can choose b to be 32 or 64, but for best performance b should
be chosen according to the machine architecture. Let y = x^b. One can view
this representation as storing each coefficient of the radix-y representation of
f using one b-bit integer. In this paper this representation is called the “dense
representation”.
Using this representation, we can compute the coefficients (each being a 2b-bit
value) of the radix-y representation of h using carry-less multiplications
on the b-bit words of f and g. Once the 2b-bit values are obtained, the dense
representation of h can be computed with a bit of post-processing. To be precise,
given two b-bit numbers (α_{b−1} α_{b−2} ··· α_0)_2 and (β_{b−1} β_{b−2} ··· β_0)_2, a carry-less
multiplication computes the 2b-bit value (having actually only 2b − 1 bits)

  ( ⊕_{i+j=2b−2} α_i β_j   ⊕_{i+j=2b−3} α_i β_j   ···   ⊕_{i+j=0} α_i β_j )_2 .

In other words, the input values are considered as elements of F2[x], and the
output is the product in F2[x].
The implementation clmul uses the PCLMULQDQ instruction to perform carry-less multiplications between two 64-bit values. For the implementations ref and no-cache, the following C code is used to compute the higher and lower b bits of the 2b-bit value:
low = x * ((y >> 0) & 1);
v1 = x * ((y >> 1) & 1);
low ^= v1 << 1;
high = v1 >> (b-1);
for (i = 2; i < b; i+=2)
{
v0 = x * ((y >> i) & 1);
v1 = x * ((y >> (i+1)) & 1);
low ^= v0 << i;
low ^= v1 << (i+1);
high ^= v0 >> (b-i);
high ^= v1 >> (b-(i+1));
}
4 KEM Encryption
This section shows how QcBits performs the KEM encryption using multipli-
cations in F2 [x]/(xr − 1). Section 4.1 shows how the error vector is generated.
Section 4.2 shows how public-syndrome computation is viewed as multiplications
in F2 [x]/(xr − 1). Section 4.3 shows how these multiplications are implemented.
The error vector e is generated in essentially the same way as the private key.
The only difference is that for e we need t indices ranging from 0 to n − 1, and
there is only one list of indices instead of two. Note that for hashing it is still
required to generate the dense representation of e.
The task here is to compute the public syndrome P e. Let e^(0) and e^(1) be the
first and second half of e. The public syndrome is then

  P e = e^(0) + P^(1) e^(1).

In other words, the public syndrome can be computed using one multiplication
in F2[x]/(x^r − 1). The multiplication is not generic in the sense that e^(1)(x) is
sparse. See below for how the multiplication is implemented in QcBits.
The task here can be formalized as computing f^(0) + f^(1)g^(1) ∈ F2[x]/(x^r − 1),
where g^(1) is represented in the dense representation. f^(0) and f^(1) are represented
together using an array of indices in I = {i | f^(0)_i = 1} ∪ {i + r | f^(1)_i = 1},
where |I| = t.
One can of course perform this multiplication between f^(1) and g^(1) in a
generic way, as shown in Sect. 3.3. The implementation clmul indeed generates
the dense representation of f^(1) and then computes f^(1)g^(1) using the PCLMULQDQ
instruction; [11] uses essentially the same technique. The implementations ref
and no-cache, however, make use of the sparsity in f^(0) and f^(1); see below for
details.
Now consider the slightly simpler problem of computing h = f g ∈ F2[x]/(x^r − 1),
where f is represented as an array of indices in I = {i | f_i = 1}, and g is in
the dense representation. Then we have

  f g = Σ_{i∈I} x^i g.
Therefore, the implementations ref and no-cache first set h = 0. Then, for each
i ∈ I, x^i g is computed and added to h. Note that x^i g is represented as an
array of ⌈r/b⌉ b-bit words, so adding x^i g to h can be implemented using ⌈r/b⌉
bitwise-xor instructions on b-bit words.
Now the remaining problem is how to compute x^i g. It is obvious that x^i g
can be obtained by rotating g by i bits. In order to perform a constant-time
rotation, the implementation ref makes use of the idea of the barrel shifter [27].
The idea is to first write i in binary representation as

  (i_{k−1} i_{k−2} ··· i_0)_2.
(x^8 + x^10 + x^12 + x^14) + (x^16 + x^17 + x^20 + x^21) + (x^24 + x^25 + x^26 + x^27)
+ (x^36 + x^37 + x^38 + x^39),

Since the most significant bit is not set, the unshifted polynomial is chosen. Next
we proceed to perform a rotation of 010000_2 = 16 bits. The result is

Since the second most significant bit is set, we choose the rotated polynomial.
The polynomial is then shifted by 001000_2 = 8 bits. However, since the third
most significant bit is not set, the unshifted polynomial is chosen. To handle the
least significant lg b = 3 bits of i, a sequence of logical instructions is used to
combine the most significant 011_2 and the least significant 101_2 bits of the bytes,
resulting in

  01100001_2, 11111110_2, 00000000_2, 00001010_2, 10100110_2.
Note that in [3] r is required to be a prime (which means r is not divisible by b),
so the example shows an easier case. Roughly speaking, the implementation ref
performs a rotation as if the vector length were r − (r mod b) and then uses more
instructions to compensate for the effect of the r mod b extra bits. The
implementation no-cache essentially performs a rotation of (i_{k−1} i_{k−2} ··· i_{lg b} 0 ··· 0)_2
bits and then performs a rotation of (i_{lg b−1} i_{lg b−2} ··· i_0)_2 bits.
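The barrel-shifter selection can be sketched in portable C for the easier case where r is a multiple of b (a sketch with b = 32 and r = 128 chosen for illustration; the implementations handle r mod b ≠ 0 with extra masking, and since i is secret, only masks, never memory indices, depend on it):

```c
#include <stdint.h>

enum { B = 32, NW = 4 };   /* toy size: r = B*NW = 128 bits */

/* Constant-time left rotation of an r-bit vector g by i bits.
   For every bit of i, both the rotated and the unrotated copy are
   computed, and one is selected with an all-zero/all-one mask. */
static void ct_rotate(uint32_t g[NW], unsigned i) {
    uint32_t tmp[NW];
    unsigned wshift = (i / B) % NW, bshift = i % B;
    /* word-granular stages: rotate by 1, 2, ... words */
    for (unsigned s = 0; (1u << s) < NW; s++) {
        uint32_t mask = (uint32_t)0 - ((wshift >> s) & 1u);
        for (unsigned j = 0; j < NW; j++)
            tmp[j] = g[(j + NW - (1u << s)) % NW];
        for (unsigned j = 0; j < NW; j++)
            g[j] = (tmp[j] & mask) | (g[j] & ~mask);
    }
    /* bit-granular stages: rotate by 1, 2, ..., B/2 bits */
    for (unsigned s = 0; (1u << s) < B; s++) {
        uint32_t sh = 1u << s;
        uint32_t mask = (uint32_t)0 - ((bshift >> s) & 1u);
        for (unsigned j = 0; j < NW; j++)
            tmp[j] = (g[j] << sh) | (g[(j + NW - 1) % NW] >> (B - sh));
        for (unsigned j = 0; j < NW; j++)
            g[j] = (tmp[j] & mask) | (g[j] & ~mask);
    }
}
```

The sparse-times-dense product Σ_{i∈I} x^i g is then just this rotation followed by a word-wise xor into the accumulator, once per index in I.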
With the constant-time rotation, we can now deal with the original problem
of computing f^(0) + f^(1)g^(1) ∈ F2[x]/(x^r − 1). QcBits first sets h = 0. Then, for
each i ∈ I, either 1 or g^(1) is chosen according to whether i < r or not,
which has to be done in a constant-time way to hide all information about
i. The chosen polynomial is then rotated by i mod r bits, and the result is added
to h. Note that this means the implementations ref and no-cache perform a
dummy polynomial multiplication to hide information about f^(0) and f^(1).
5 KEM Decryption
This section shows how QcBits performs the KEM decryption using multiplica-
tions in F2 [x]/(xr − 1) and Z[x]/(xr − 1). The KEM decryption is essentially a
decoding algorithm. Each decoding iteration computes
• the private syndrome Hv and
• the counts of unsatisfied parity checks, i.e., the vector u, using the private
syndrome.
Positions in v are then flipped according the counts. Section 5.1 shows how
private-syndrome computation is implemented as multiplications in F2 [x]/(xr −
1). Section 5.2 shows how counting unsatisfied parity checks is viewed as mul-
tiplications in Z[x]/(xr − 1). Section 5.3 shows how these multiplications in
Z[x]/(xr −1) are implemented. Section 5.4 shows how bit flipping is implemented.
  H^(0)_{:,0}(x) v^(0)(x) + H^(1)_{:,0}(x) v^(1)(x) ∈ F2[x]/(x^r − 1).
The computations of the public syndrome and the private syndrome are still
a bit different. For encryption the matrix P is dense, whereas the vector e is
sparse. For decryption the matrix H is sparse, whereas the vector v is dense.
However, the multiplications H^{(i)}_{:,0}(x) v^{(i)}(x) are still sparse-times-dense multiplications.
QcBits thus computes the private syndrome using the techniques
described in Sect. 4.3.
Since the secret key is a sparse representation of H^{(i)}_{0,:}, we do not immediately
have H^{(i)}_{:,0}. This is similar to the situation in public-key generation, where P^{(1)}_{:,0}
is derived from P^{(1)}_{0,:}. QcBits thus computes H^{(i)}_{:,0} from H^{(i)}_{0,:} by adjusting each
index in the sparse representation in constant time.
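For a circulant block with H_{j,k} = h_{(k−j) mod r}, the adjustment amounts to index negation mod r: a 1 at column j of row 0 becomes a 1 at row (r − j) mod r of column 0. A sketch (hypothetical helper, not QcBits' constant-time code):

```python
def row0_to_col0(indices, r):
    """Sparse support of column 0 of an r x r circulant, given the support
    of row 0.  With H[j][k] = h[(k - j) % r], h[j] = 1 implies
    H[(r - j) % r][0] = 1, so each index is negated modulo r."""
    return [(r - j) % r for j in indices]
```

In the actual implementation this per-index adjustment is done branchlessly, since the indices are part of the private key.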
Let s = Hv. The vector u of counts of unsatisfied parity checks can be viewed
as

u_j = Σ_i H_{i,j} · s_i,   u ∈ Z^n.
Let u^{(0)} and u^{(1)} be the first and second half of u, respectively. Now we have:

(u^{(0)}(x), u^{(1)}(x)) = (Σ_i x^i H^{(0)}_{0,:}(x) · s_i, Σ_i x^i H^{(1)}_{0,:}(x) · s_i)
                        = (H^{(0)}_{0,:}(x) · s(x), H^{(1)}_{0,:}(x) · s(x)) ∈ (Z[x]/(x^r − 1))^2.
These are again sparse-times-dense multiplications: H^{(i)}_{0,:}(x) is
sparse, and the coefficients of H^{(i)}_{0,:}(x) and s(x) can only be 0 or 1. See below for
how such multiplications are implemented in QcBits.
Even though all the operations are now in Z[x]/(x^r − 1) instead of F_2[x]/(x^r − 1),
each x^i · g can still be computed using a constant-time rotation as in Sect. 4.3.
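Concretely, the sparse-times-dense product H^{(i)}_{0,:}(x) · s(x) in Z[x]/(x^r − 1) is a sum of rotations of s(x), one per index in the sparse support. A plain (non-constant-time, non-bitsliced) sketch with hypothetical helper names:

```python
def rotate(f, i, r):
    """x^i * f(x) in Z[x]/(x^r - 1): coefficient k of the result is f_{(k-i) mod r}."""
    return [f[(k - i) % r] for k in range(r)]

def sparse_dense_mul(support, s, r):
    """H_{0,:}(x) * s(x) in Z[x]/(x^r - 1), with H_{0,:} given by its sparse
    support: accumulate one cyclic rotation of s per nonzero index."""
    u = [0] * r
    for i in support:
        u = [a + b for a, b in zip(u, rotate(s, i, r))]
    return u
```

QcBits performs the same accumulation, but with constant-time rotations and bitsliced integer additions.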
QcBits: Constant-Time Small-Key Code-Based Cryptography 295
[Figure: the counts u_i stored in non-bitsliced form (one count per 8-bit byte) versus bitsliced form (b-bit words, each word holding one bit position of b counts).]
The speed that McBits [17] achieves relies on bitslicing as well. However, the
reader should keep in mind that QcBits, as opposed to McBits, makes use of
parallelism that lies intrinsically in one single decryption instance.
The last step in each decoding iteration is to flip the bits according to the
counts. Since QcBits stores the counts in a bitsliced format, bit flipping is also
accomplished in a bitsliced fashion. At the beginning of each decoding iteration,
the bitsliced form of b copies of −t is generated and stored in ⌊lg(w/2)⌋ + 1 b-bit
words. Once the counts are computed, −t is added to the counts in parallel using
logical instructions on b-bit words. These logical instructions simulate copies of a
circuit for adding (⌊lg(w/2)⌋ + 1)-bit numbers. Such a circuit takes ⌊lg(w/2)⌋ + 1
full adders. Therefore, each u_i + (−t) takes roughly 5(⌊lg(w/2)⌋ + 1)/b logical
instructions.
The additions are used to generate sign bits for all u_i − t, which are stored
in two arrays of ⌈r/b⌉ b-bit words. To flip the bits, QcBits simply XORs the
complement of the b-bit words in the two arrays into v^{(0)} and v^{(1)}. It then takes
roughly 1/b logical instructions to update each v_i.
For w = 90, we have 5(⌊lg(w/2)⌋ + 1)/b + 1/b = 31/b, which is smaller than
1 for either b = 32 or b = 64. In contrast, when the non-bitsliced format is
used, the naive approach is to use at least one subtraction instruction for each
u_i − t and one XOR instruction to flip the bit. One can argue that for the non-bitsliced
format there are probably better ways to compute u and perform bit
flipping; for example, one could perform several additions/subtractions of bytes
in parallel in one instruction. However, such an approach seems much more
expensive than one might expect, as changes of format between a sequence of
bits and a sequence of bytes are required.
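The parallel threshold comparison can be simulated in software. The following sketch (hypothetical helper name; b lanes packed into Python ints) ripple-carries b copies of −t through the bit planes of the counts, using the usual 5-gate full adder, and returns the complement of the sign plane as the flip mask:

```python
def bitsliced_flip_mask(u_planes, t, b):
    """Given B bit planes of b counts (u_planes[k] holds bit k of every
    count, one count per lane), return the b-bit mask whose bit i is set
    iff u_i >= t, i.e. the complement of the sign plane of u_i - t."""
    mask = (1 << b) - 1
    B = len(u_planes)
    neg_t = (-t) & ((1 << (B + 1)) - 1)   # -t in two's complement over B+1 bits
    t_planes = [mask if (neg_t >> k) & 1 else 0 for k in range(B + 1)]
    carry, sign = 0, 0
    for k in range(B + 1):                # one simulated full adder per plane
        a = u_planes[k] if k < B else 0   # counts are non-negative
        sign = a ^ t_planes[k] ^ carry    # sum bit of this plane
        carry = (a & t_planes[k]) | (carry & (a ^ t_planes[k]))
    return ~sign & mask                   # bit i set iff u_i - t >= 0
```

Each loop iteration uses five logical word operations (two XORs, two ANDs, one OR, reusing a ^ t_planes[k]), matching the 5-instructions-per-full-adder estimate above.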
tests that succeed after 1 iteration. The third number indicates the number of
tests that succeed after 2 iterations; etc. avg indicates the average number of
iterations for the successful tests.
r = 4801
w = 90
t = 84
sec = 80
T = [29, 27, 25, 24, 23, 23]
S = [0, 0, 752, 69732674, 30232110, 34417, 47]
avg = 3.30
The thresholds are obtained by interactive experiments. QcBits uses this
setting.
r = 4801
w = 90
t = 84
sec = 80
T = [28, 26, 24, 23, 23, 23, 23, 23, 23, 23]
S = [40060, 0, 9794, 87815060, 12079266, 51387, 3833, 519, 70, 10,
1]
avg = 3.12
r = 9857
w = 142
t = 134
sec = 128
T = [44, 42, 40, 37, 36, 36, 36, 36, 36, 36, 36, 36]
S = [689298, 0, 0, 86592, 53307303, 42797368, 2856446, 235479,
24501, 2651, 333, 26, 3]
avg = 4.46
r = 9857
w = 142
t = 134
sec = 128
T = [48, 47, 46, 45, 44, 43, 42, 42, 41, 41, 40, 40, 39, 39, 38,
38, 37, 37, 36]
S = [12, 0, 0, 0, 0, 0, 142, 78876, 578963, 290615, 43180, 6363,
1309, 336, 108, 54, 27, 7, 4, 4]
avg = 8.33
The thresholds are obtained by interactive experiments.
References
1. McEliece, R.J.: A public-key cryptosystem based on algebraic coding theory. JPL
DSN Progress Report, pp. 114–116 (1978). http://ipnpr.jpl.nasa.gov/progress
report2/42-44/44N.PDF
2. Niederreiter, H.: Knapsack-type cryptosystems and algebraic coding theory. Probl.
Control Inf. Theor. 15, 159–166 (1986)
3. Misoczki, R., Tillich, J.-P., Sendrier, N., Barreto, P.S.L.M.: MDPC-McEliece: new
McEliece variants from moderate density parity-check codes. In: IEEE International
Symposium on Information Theory, pp. 2069–2073 (2013). http://eprint.iacr.org/
2012/409.pdf
4. Heyse, S., von Maurich, I., Güneysu, T.: Smaller keys for code-based cryptogra-
phy: QC-MDPC McEliece implementations on embedded devices. In: Bertoni, G.,
Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 273–292. Springer, Heidelberg
(2013)
5. von Maurich, I., Güneysu, T.: Lightweight code-based cryptography: QC-
MDPC McEliece encryption on reconfigurable devices. In: DATE 2014 [6], pp.
1–6 (2014). https://www.sha.rub.de/media/sh/veroeffentlichungen/2014/02/11/
Lightweight Code-based Cryptography.pdf
6. Fettweis, G., Nebel, W. (eds.): Design, Automation and Test in Europe Conference
and Exhibition, DATE 2014, Dresden, Germany, 24–28 March 2014. European
Design and Automation Association (2014). ISBN 978-3-9815370-2-4, See [5]
7. von Maurich, I., Güneysu, T.: Towards side-channel resistant implementations
of QC-MDPC McEliece encryption on constrained devices. In: Mosca, M. (ed.)
PQCrypto 2014. LNCS, vol. 8772, pp. 266–282. Springer, Heidelberg (2014)
8. Mosca, M. (ed.): Post-Quantum Cryptography. LNCS, vol. 8772. Springer, Berlin
(2014). See [7]
9. von Maurich, I., Heberle, L., Güneysu, T.: IND-CCA secure hybrid encryption
from QC-MDPC Niederreiter. In: PQCrypto 2016 [10] (2016)
10. Takagi, T. (ed.): Post-Quantum Cryptography. LNCS, vol. 9606. Springer, Berlin
(2016). See [9]
11. von Maurich, I., Oder, T., Güneysu, T.: Implementing QC-MDPC McEliece
encryption. ACM Trans. Embed. Comput. Syst. 14, 44 (2015)
12. Anderson, S.E.: Bit Twiddling Hacks (1997–2005). https://graphics.stanford.edu/
~seander/bithacks.html
13. Persichetti, E.: Improving the efficiency of code-based cryptography. Ph.D. thesis,
University of Auckland (2012). http://persichetti.webs.com/publications
14. Persichetti, E.: Secure and anonymous hybrid encryption from coding theory. In:
Gaborit, P. (ed.) PQCrypto 2013. LNCS, vol. 7932, pp. 174–187. Springer, Heidel-
berg (2013)
15. Gaborit, P. (ed.): Post-Quantum Cryptography. LNCS, vol. 7932. Springer, Berlin
(2013). See [14]
16. Bernstein, D.J., Lange, T. (eds.): eBACS: ECRYPT Benchmarking of Crypto-
graphic Systems (2016). http://bench.cr.yp.to. Accessed 2 Feb 2016
17. Bernstein, D.J., Chou, T., Schwabe, P.: McBits: fast constant-time code-based
cryptography. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086,
pp. 250–272. Springer, Heidelberg (2013)
18. Bertoni, G., Coron, J.-S. (eds.): CHES 2013. LNCS, vol. 8086. Springer, Heidelberg
(2013). ISBN 978-3-642-40348-4
19. Al Jabri, A.K.: A statistical decoding algorithm for general linear block codes. In:
[20], pp. 1–8 (2001)
20. Honary, B. (ed.): Cryptography and Coding. LNCS, vol. 2260. Springer, Heidelberg
(2001). ISBN 3-540-43026-1, See [19]
21. Overbeck, R.: Statistical decoding revisited. In: Batten, L.M., Safavi-Naini, R.
(eds.) ACISP 2006. LNCS, vol. 4058, pp. 283–294. Springer, Heidelberg (2006)
22. Batten, L.M., Safavi-Naini, R. (eds.): ACISP 2006. LNCS, vol. 4058. Springer,
Heidelberg (2006). ISBN 3-540-35458-1, See [21]
23. Bertoni, G., Daemen, J., Peeters, M., Van Assche, G.: Keccak and the SHA-
3 standardization (2013). http://csrc.nist.gov/groups/ST/hash/sha-3/documents/
Keccak-slides-at-NIST.pdf
24. Bernstein, D.J.: The Poly1305-AES message-authentication code. In: FSE 2005
[28], pp. 32–49 (2005). http://cr.yp.to/papers.html#poly1305
25. Bernstein, D.J.: The Salsa20 family of stream ciphers. In: [29], pp. 84–97 (2008).
http://cr.yp.to/papers.html#salsafamily
26. Wikipedia: RdRand — Wikipedia, The Free Encyclopedia (2016). https://en.
wikipedia.org/wiki/RdRand. Accessed 2 Feb 2016
27. Wikipedia: Barrel Shifter — Wikipedia, The Free Encyclopedia (2016). https://
en.wikipedia.org/wiki/Barrel_shifter. Accessed 2 Feb 2016
28. Gilbert, H., Handschuh, H. (eds.): FSE 2005. LNCS, vol. 3557. Springer, Heidelberg
(2005). ISBN:3-540-26541-4, See [24]
29. Robshaw, M., Billet, O. (eds.): New Stream Cipher Designs. LNCS. Springer,
Heidelberg (2008). ISBN:978-3-540-68350-6, See [25]
30. Georgieva, M., de Portzamparc, F.: Toward secure implementation of McEliece
decryption. In: COSADE 2015 [31], pp. 141–156 (2015). http://eprint.iacr.org/
2015/271.pdf
31. Mangard, S., Poschmann, A.Y. (eds.): Constructive Side-Channel Analysis and
Secure Design. LNCS, vol. 9064. Springer, Heidelberg (2015). See [30]
32. Alkim, E., Ducas, L., Pöppelmann, T., Schwabe, P.: Post-Quantum Key Exchange—
A New Hope. Cryptology ePrint Archive, Report 2015/1092 (2015). https://eprint.iacr.org/2015/1092
µKummer: Efficient Hyperelliptic Signatures
and Key Exchange on Microcontrollers
1 Introduction
The current state of the art in asymmetric cryptography, not only on microcon-
trollers, is elliptic-curve cryptography; the most widely accepted reasonable secu-
rity is the 128-bit security level. All current speed records for 128-bit secure key
exchange and signatures on microcontrollers are held—until now—by elliptic-
curve-based schemes. Outside the world of microcontrollers, it is well known
that genus-2 hyperelliptic curves and their Kummer surfaces present an attrac-
tive alternative to elliptic curves [1,2]. For example, at Asiacrypt 2014 Bernstein,
Chuengsatiansup, Lange and Schwabe [3] presented speed records for timing-
attack-protected 128-bit-secure scalar multiplication on a range of architectures
L. Batina — This work has been supported by the Netherlands Organisation for
Scientific Research (NWO) through Veni 2013 project 13114 and by the Technology
Foundation STW (project 13499 - TYPHOON & ASPASIA), from the Dutch
government. Permanent ID of this document: b230ab9b9c664ec4aad0cea0bd6a6732.
Date: 2016-04-07.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 301–320, 2016.
DOI: 10.1007/978-3-662-53140-2 15
302 J. Renes et al.
with Kummer-based software. These speed records are currently only being sur-
passed by the elliptic-curve-based FourQ software by Costello and Longa [4]
presented at Asiacrypt 2015, which makes heavy use of efficiently computable
endomorphisms (i.e., of additional structure of the underlying elliptic curve).
The Kummer-based speed records in [3] were achieved by exploiting the compu-
tational power of vector units of recent “large” processors such as Intel Sandy
Bridge, Ivy Bridge, and Haswell, or the ARM Cortex-A8. Surprisingly, very little
attention has been given to Kummer surfaces on embedded processors. Indeed,
this is the first work showing the feasibility of software-only implementations
of hyperelliptic-curve based crypto on constrained platforms. There have been
some investigations of binary hyperelliptic curves targeting the much lower 80-
bit security level, but those are actually examples of software-hardware co-design
showing that using hardware acceleration for field operations was necessary to
get reasonable performance figures (see e.g. [5,6]).
In this paper we investigate the potential of genus-2 hyperelliptic curves for
both key exchange and signatures on the “classical” 8-bit AVR ATmega archi-
tecture, and the more modern 32-bit ARM Cortex-M0 processor. The former has
the most previous results to compare to, while ARM is becoming more relevant
in real-world applications. We show that not only are hyperelliptic curves competitive,
they clearly outperform state-of-the-art elliptic-curve schemes in terms
of speed and size. For example, our variable-basepoint scalar multiplication on a
127-bit Kummer surface is 31 % faster on AVR and 26 % faster on the M0 than
the recently presented speed records for Curve25519 software by Düll et al. [7];
our implementation is also smaller, and requires less RAM.
We use a recent result by Chung, Costello, and Smith [8] to also set new speed
records for 128-bit secure signatures. Specifically, we present a new signature
scheme based on fast Kummer surface arithmetic. It is inspired by the EdDSA
construction by Bernstein, Duif, Lange, Schwabe, and Yang [9]. On the ATmega,
it produces shorter signatures, achieves higher speeds and needs less RAM than
the Ed25519 implementation presented in [10].
Table 1. Cycle counts and stack usage in bytes of all functions related to the
signature and key exchange schemes, for the AVR ATmega and ARM Cortex M0
microcontrollers.

                 ATmega                     Cortex M0
                 Cycles       Stack bytes   Cycles       Stack bytes
  keygen         10 206 181   812           2 774 087    1 056
  sign           10 404 033   926           2 865 351    1 360
  verify         16 240 510   992           4 453 978    1 432
  dh_exchange     9 739 059   429           2 644 604      584
Our routines handling secret data are constant-time, and are thus naturally
resistant to timing attacks. These algorithms are built around the Montgomery ladder.
Source code. We place all of the software described in this paper into the public
domain, to maximize the reusability of our results. The software is available at
http://www.cs.ru.nl/~jrenes/.
2 High-Level Overview
2.1 Signatures
Our signature scheme, defined at the end of this section, adheres closely to the
proposal of [8, Sect. 8], which in turn is a type of Schnorr signature [11]. There
are however some differences and trade-offs, which we discuss below.
Group structure. We build the signature scheme on top of the group structure
from the Jacobian JC (Fq ) of a genus-2 hyperelliptic curve C. More specifically,
C is the Gaudry–Schost curve over the prime field Fq with q = 2127 − 1 (cf.
Sect. 3.2). The Jacobian is a group of order #J_C(F_q) = 2^4 · N, where

N = 2^250 − 0x334D69820C75294D2C27FC9F9A154FF47730B4B840C05BD

is a 250-bit prime. For more details on the Jacobian and its elements, see Sect. 3.3.
Hash function. We may use any hash function H with a 128-bit security level.
For our purposes, H(M) = SHAKE128(M, 512) suffices [12]. While SHAKE128 has
variable-length output, we only ever use it with 512-bit output.
Public generator. The public generator can be any element P of J_C(F_q) such
that [N]P = 0. In our implementation we have made the arbitrary choice
P = (X^2 + u_1 X + u_0, v_1 X + v_0), where

u_1 = 0x7D5D9C3307E959BF27B8C76211D35E8A, u_0 = 0x2703150F9C594E0CA7E8302F93079CE8,
v_1 = 0x444569AF177A9C1C721736D8F288C942, v_0 = 0x7F26CFB225F42417316836CFF8AEFB11.
This is the point which we use the most for scalar multiplication. Since it remains
fixed, we assume we have its decompressed representation precomputed, so as
to avoid having to perform the relatively expensive decompression operation
whenever we need a scalar multiplication; this gives a low-cost speed gain. We
further assume we have a “wrapped” representation of the projection of P to the
Kummer surface, which is used to speed up the xDBLADD function. See Sect. 4.1
for more details on the xWRAP function.
Public keys. In contrast to the public generator, we assume public keys are
compressed: they are communicated much more frequently, and we therefore
benefit much more from smaller keys. Moreover, we include the public key in
one of the hashes during the sign operation [13,14], computing h = H(R||Q||M )
instead of the h = H(R||M ) originally suggested by Schnorr [11]. This protects
against adversaries attacking multiple public keys simultaneously.
The scheme. We now define our signature scheme, taking the above into account.
Key generation (keygen). Let d be a 256-bit secret key, and P the public
generator. Compute (d′ ∥ d′′) ← H(d) (with d′ and d′′ both 256 bits), then
Q ← [16d′]P. The public key is Q.
Signing (sign). Let M be a message, d a 256-bit secret key, P the public
generator, and Q a compressed public key. Compute (d′ ∥ d′′) ← H(d)
(with d′ and d′′ both 256 bits), then r ← H(d′′ ∥ M), then R ← [r]P, then
h ← H(R ∥ Q ∥ M), and finally s ← (r − 16h_{128}d′) mod N. The signature is
(h_{128} ∥ s).
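The keygen/sign flow above can be exercised end to end with a toy group. In this sketch, Z_N with P = 1 stands in for J_C(F_q) (so [k]P is just k mod N), N is a small Mersenne prime rather than the real 250-bit prime, h_128 is taken as the first 128 bits of h, and the verification equation is the one implied by the signing equation; all of these stand-ins are assumptions for illustration only:

```python
import hashlib

def H(msg: bytes) -> bytes:
    return hashlib.shake_128(msg).digest(64)   # H(M) = SHAKE128(M, 512)

N = 2**61 - 1        # toy group order (the real N is a 250-bit prime)
P = 1                # toy public generator in Z_N

def smul(k, X):      # "[k]X" in the toy additive group
    return (k * X) % N

def enc(X):          # fixed-width encoding of a group element for hashing
    return X.to_bytes(32, "little")

def keygen(d: bytes):
    dp = int.from_bytes(H(d)[:32], "little")       # d'
    return smul(16 * dp, P)                        # Q = [16 d']P

def sign(M: bytes, d: bytes, Q):
    h_d = H(d)
    dp = int.from_bytes(h_d[:32], "little")        # d'
    dpp = h_d[32:]                                 # d''
    r = int.from_bytes(H(dpp + M), "little") % N
    R = smul(r, P)
    h128 = int.from_bytes(H(enc(R) + enc(Q) + M)[:16], "little")
    s = (r - 16 * h128 * dp) % N
    return (h128, s)

def verify(M: bytes, sig, Q):
    h128, s = sig
    R = (smul(s, P) + smul(h128, Q)) % N           # [s]P "+" [h128]Q = [r]P
    return int.from_bytes(H(enc(R) + enc(Q) + M)[:16], "little") == h128
```

The check works because [s]P ⊕ [h_128]Q = [r − 16h_128 d′]P ⊕ [h_128·16d′]P = [r]P = R, so recomputing h over (R, Q, M) reproduces h_128.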
Remark 1. We note that there may be faster algorithms to compute the “one-
and-a-half-dimensional” scalar multiplication in verify, especially since we do
not have to worry about being constant-time. One option might be to adapt
Montgomery’s PRAC [16, Sect. 3.3.1] to make use of the half-size scalar. But
while this may lead to a speed-up, it would also cause an increase in code size
compared to simply re-using the one-dimensional scalar multiplication. We have
chosen not to pursue this line, preferring the solid benefits of reduced code size
instead.
Remark 2. While it might be possible to reduce the key size even further to 256
bits, we would then have to pay the cost of compressing and decompressing,
and also wrapping for xDBLADD (see the discussion in [8, App. A]). We therefore
choose to keep the 384-bit representation, which is consistent with [3].
q := 2^127 − 1.
choosing the sign so that the square root has least significant bit b. Including
the gfe_powminhalf call, this costs 15M + 126S + 1neg.
Table 2. Cycle counts for our field implementation (including function-call overhead).
These are the Rosenhain invariants of the curve C, found by Gaudry and
Schost [18], which we are (finally!) ready to define as
f1 = 0x1EDD6EE48E0C2F16F537CD791E4A8D6E, f2 = 0x73E799E36D9FCC210C9CD1B164C39A35,
f3 = 0x4B9E333F48B6069CC47DC236188DF6E8, f4 = 0x219CC3F8BB9DFE2B39AD9E9F6463E172.
We store the squared theta constants (a : b : c : d), along with (1/a : 1/b :
1/c : 1/d), and (1/A : 1/B : 1/C : 1/D); the Rosenhain invariants λ, μ, and ν,
together with λμ and λν; and the curve constants f1 , f2 , f3 , and f4 , for use in our
Kummer and Jacobian arithmetic functions. Obviously, none of the Rosenhain
or curve constants are small; multiplying by these costs a full M.
(x_P : y_P : z_P : t_P) = ±P

¹ We only call ADD once in our algorithms, so for lack of space we omit its description.
for its image in K_C. To avoid subscript explosion, we make the following convention:
when points P and Q on J_C are clear from the context, we write
where
F = (a² − b² − c² + d²)/(ad − bc),  G = (a² − b² + c² − d²)/(ac − bd),  H = (a² + b² − c² − d²)/(ab − cd),

and E = 4abcd · (ABCD/((ad − bc)(ac − bd)(ab − cd)))² (see e.g. [21,22]). The
identity point ⟨1, 0⟩ of J_C maps to

±0_{J_C} = (a : b : c : d).
Algorithm 3 (Project) maps general points from JC (Fq ) into KC . The “special”
case where u is linear is treated in [8, Sect. 7.2]; this is not implemented, since
Project only operates on public generators and keys, none of which are special.
3.5 Pseudo-addition on KC
While the points of K_C do not form a group, we have a pseudo-addition operation
(differential addition), which computes ±(P ⊕ Q) from ±P, ±Q, and ±(P ⊖ Q).
The function xADD (Algorithm 4) implements the standard differential addition.
The special case where P = Q yields a pseudo-doubling operation.
To simplify the presentation of our algorithms, we define three operations on
points in P^3. First, M : P^3 × P^3 → P^3 multiplies corresponding coordinates:

M : ((x : y : z : t), (x′ : y′ : z′ : t′)) −→ (xx′ : yy′ : zz′ : tt′).

Second, the squaring S : P^3 → P^3 is M applied to a point and itself:

S : (x : y : z : t) −→ (x² : y² : z² : t²).

Third, the Hadamard transform H : P^3 → P^3 is defined by

H : (x : y : z : t) −→ (x + y + z + t : x + y − z − t : x − y + z − t : x − y − z + t).

Clearly M and S cost 4M and 4S, respectively. The Hadamard transform can
easily be implemented with 4a + 4s. However, the additions and subtractions are
relatively cheap, making function-call overhead a large factor. To minimize this
we inline the Hadamard transform, trading a bit of code size for efficiency.
² Note that (A : B : C : D) = H((a : b : c : d)) and (a : b : c : d) = H((A : B : C : D)).
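A sketch of the three coordinate operations over F_q with q = 2^127 − 1; the Hadamard transform here is the usual one, consistent with the footnote's relation between (a : b : c : d) and (A : B : C : D), so applying it twice multiplies every coordinate by 4:

```python
q = 2**127 - 1   # the field characteristic used throughout the paper

def M(P, Q):
    """Coordinate-wise product of two points of P^3 (cost 4M)."""
    return tuple(a * b % q for a, b in zip(P, Q))

def S(P):
    """Coordinate-wise squaring: (x : y : z : t) -> (x^2 : y^2 : z^2 : t^2)."""
    return M(P, P)

def Had(P):
    """Hadamard transform: 4 additions + 4 subtractions."""
    x, y, z, t = P
    return ((x + y + z + t) % q, (x + y - z - t) % q,
            (x - y + z - t) % q, (x - y - z + t) % q)
```

Since Had(Had(P)) scales every coordinate by 4, it acts as the identity on the projective point, which is why the theta constants can be recovered from each other as in the footnote.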
Lines 5 and 6 of Algorithm 4 only involve the third argument, ±(P ⊖ Q);
essentially, they compute the point (y′z′t′ : x′z′t′ : x′y′t′ : x′y′z′)
(which is projectively equivalent to (1/x′ : 1/y′ : 1/z′ : 1/t′), but requires
no inversions; note that this is generally not a point on K_C). In practice, the
pseudo-additions used in our scalar multiplication all use a fixed third argument,
so it makes sense to precompute this "inverted" point and to scale it by x′ so
that the first coordinate is 1, thus saving 7M in each subsequent differential
addition for a one-off cost of 1I. The resulting data can be stored as the 3-tuple
(x′/y′, x′/z′, x′/t′), ignoring the trivial first coordinate: this is the wrapped
form of ±(P ⊖ Q). The function xWRAP (Algorithm 5) applies this transformation.
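The wrapping computation can be sketched with a single shared inversion (Fermat inversion of the product y·z·t; this is an assumed rendering, not Algorithm 5 verbatim):

```python
q = 2**127 - 1

def xwrap(P):
    """Wrapped form of (x : y : z : t): the tuple (x/y, x/z, x/t),
    computed with a single field inversion of the product y*z*t."""
    x, y, z, t = P
    inv_yzt = pow(y * z * t % q, q - 2, q)   # 1/(y z t) by Fermat's little theorem
    return (x * z * t * inv_yzt % q,         # x/y
            x * y * t * inv_yzt % q,         # x/z
            x * y * z * inv_yzt % q)         # x/t
```

Multiplying the shared inverse by the complementary products yields all three quotients from one inversion, which is where the one-off cost of 1I comes from.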
4 Scalar Multiplication
All of our cryptographic routines are built around scalar multiplication in JC and
pseudo-scalar multiplication in KC . We implement pseudo-scalar multiplication
using the classic Montgomery ladder in Sect. 4.1. In Sect. 4.2, we extend this to
full scalar multiplication on JC using the point recovery technique proposed in [8].
Table 3. Operation and cycle counts of basic functions on the Kummer and Jacobian.
4.1 Pseudomultiplication on KC
Hence, at the end we return ±[m]P , and also ±[m + 1]P as a (free) byproduct.
We suppose we have a constant-time conditional swap routine CSWAP(b, (V1 , V2 )),
which returns (V1 , V2 ) if b = 0 and (V2 , V1 ) if b = 1. This makes the execution of
Algorithm 7 uniform and constant-time, and thus suitable for use with secret m.
Our implementation of crypto_scalarmult assumes that its input Kummer
point ±P is wrapped. This follows the approach of [3]. Indeed, many calls
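The ladder-with-CSWAP structure can be sketched generically; the group operations are abstracted (in the real code the combined doubling-and-addition step is xDBLADD on wrapped Kummer points), and the XOR-mask conditional swap mirrors the constant-time CSWAP described above:

```python
def cswap(b, x, y):
    """Conditional swap in constant-time style: returns (y, x) iff b == 1,
    selecting via an XOR mask rather than a branch."""
    t = (-b) & (x ^ y)
    return x ^ t, y ^ t

def ladder(m, P, add, dbl, zero):
    """Montgomery ladder: returns ([m]P, [m+1]P).  The loop body is the
    same for every bit of m; only the swaps depend on the scalar."""
    R0, R1 = zero, P
    for k in reversed(range(m.bit_length())):
        b = (m >> k) & 1
        R0, R1 = cswap(b, R0, R1)
        R0, R1 = dbl(R0), add(R0, R1)   # RHS uses the pre-assignment values
        R0, R1 = cswap(b, R0, R1)
    return R0, R1
```

For example, instantiating the group as plain integers, ladder(25, 1, lambda a, b: a + b, lambda a: 2 * a, 0) returns (25, 26), i.e. ([m]P, [m+1]P): the invariant R1 − R0 = P holds throughout, which is what makes differential addition applicable.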
Since the general Kummer K̃_C only appears briefly in our recovery procedure
(we never use its relatively slow arithmetic operations), we will not investigate
it in detail here; the curious reader may refer to [27] for the general theory.
For our purposes, it suffices to recall that K̃_C is, like K_C, embedded in P^3,
and the isomorphism K_C → K̃_C is defined (in e.g. [8, Sect. 7.4]) by the linear
transformation

(x_P : y_P : z_P : t_P) −→ (x̃_P : ỹ_P : z̃_P : t̃_P) := (x_P : y_P : z_P : t_P) · L,
where L is (any scalar multiple of) the matrix

L = ⎛ a⁻¹(ν − λ)   a⁻¹(μν − λ)   a⁻¹λν(μ − 1)   a⁻¹λν(μν − λ) ⎞
    ⎜ b⁻¹(μ − 1)   b⁻¹(μν − λ)   b⁻¹μ(ν − λ)    b⁻¹μ(μν − λ)  ⎟
    ⎜ c⁻¹(λ − μ)   c⁻¹(λ − μν)   c⁻¹λμ(1 − ν)   c⁻¹λμ(λ − μν) ⎟
    ⎝ d⁻¹(1 − ν)   d⁻¹(λ − μν)   d⁻¹ν(λ − μ)    d⁻¹ν(λ − μν)  ⎠,
which we precompute and store. If ±P is a point on K_C, then ±P̃ denotes its
image on K̃_C; we compute ±P̃ using Algorithm 10 (fast2genFull).
Sometimes we only require the first three coordinates of ±P̃. Algorithm 11
(fast2genPartial) saves 4M + 3a per point by not computing t̃_P.
The high-level cryptographic functions for our signature scheme are named
keygen, sign and verify. Their implementations contain no surprises: they do
exactly what was specified in Sect. 2.1, calling the lower-level functions described
in Sects. 3 and 4 as required. Our Diffie-Hellman key generation and key exchange
use only the function dh_exchange, which implements exactly what we specified
in Sect. 2.2: one call to crypto_scalarmult plus a call to xWRAP to convert to the
correct 384-bit representation. Table 1 (in the introduction) presents the cycle
counts and stack usage for all of our high-level functions.
Code and compilation. For our experiments, we compiled our AVR ATmega code
with avr-gcc -O2, and our ARM Cortex M0 code with clang -O2 (the opti-
mization levels -O3, -O1, and -Os gave fairly similar results). The total program
size is 20 242 bytes for the AVR ATmega, and 19 606 bytes for the ARM Cor-
tex M0. This consists of the full signature and key-exchange code, including the
reference implementation of the hash function SHAKE128 with 512-bit output.3
Results for ARM Cortex M0. As we see in Table 4, genus-2 techniques give great
results for Diffie–Hellman key exchange on the ARM Cortex M0 architecture.
Compared with the current fastest implementation [7], we reduce the number
of clock cycles by about 27 %, while roughly halving code size and stack space.
For signatures, the state-of-the-art is [31]: here we reduce the cycle count for the
underlying scalar multiplications by a very impressive 75 %, at the cost of an
increase in code size and stack usage.
Results for AVR ATmega. Looking at Table 5, on the AVR ATmega architecture
we reduce the cycle count for Diffie–Hellman by about 32 % compared with the
current record [7], again roughly halving the code size, and reducing stack usage
by about 80 %. The cycle count for Jacobian scalar multiplication (for signatures)
is reduced by 71 % compared with [31], while increasing the stack usage by 25 %.
³ We used the reference C implementation for the Cortex M0, and the assembly implementation
for AVR; both are available from [28]. The only change required is to the
padding, which must take domain separation into account according to [12, p. 28].
References
1. Bernstein, D.J.: Elliptic vs. hyperelliptic, part 1 (2006). http://cr.yp.to/talks/2006.
09.20/slides.pdf
2. Bos, J.W., Costello, C., Hisil, H., Lauter, K.: Fast cryptography in genus 2. In:
Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013. LNCS, vol. 7881, pp.
194–210. Springer, Heidelberg (2013). https://eprint.iacr.org/2012/670.pdf
3. Bernstein, D.J., Chuengsatiansup, C., Lange, T., Schwabe, P.: Kummer
strikes back: new DH speed records. In: Sarkar, P., Iwata, T. (eds.) ASI-
ACRYPT 2014. LNCS, vol. 8873, pp. 317–337. Springer, Heidelberg (2014).
https://cryptojedi.org/papers/#kummer
4. Costello, C., Longa, P.: FourQ: four-dimensional decompositions on a Q-curve
over the Mersenne prime. In: Iwata, T., Cheon, J.H. (eds.) ASI-
ACRYPT 2015. LNCS, vol. 9452, pp. 214–235. Springer, Heidelberg (2015).
https://eprint.iacr.org/2015/565
5. Batina, L., Hwang, D., Hodjat, A., Preneel, B., Verbauwhede, I.: Hard-
ware/Software Co-design for Hyperelliptic Curve Cryptography (HECC) on the
8051 μP . In: Rao, J.R., Sunar, B. (eds.) CHES 2005. LNCS, vol. 3659, pp. 106–
118. Springer, Heidelberg (2005). https://www.iacr.org/archive/ches2005/008.pdf
6. Hodjat, A., Batina, L., Hwang, D., Verbauwhede, I.: HW/SW co-design of
a hyperelliptic curve cryptosystem using a microcode instruction set coproces-
sor. Integr. VLSI J. 40, 45–51 (2007). https://www.cosic.esat.kuleuven.be/
publications/article-622.pdf
7. Düll, M., Haase, B., Hinterwälder, G., Hutter, M., Paar, C., Sánchez, A.H.,
Schwabe, P.: High-speed Curve25519 on 8-bit, 16-bit and 32-bit microcontrollers.
Des. Codes Crypt. 77, 493–514 (2015). http://cryptojedi.org/papers/#mu25519
8. Costello, C., Chung, P.N., Smith, B.: Fast, uniform, and compact scalar multi-
plication for elliptic curves and genus 2 Jacobians with applications to signature
schemes. Cryptology ePrint Archive, Report 2015/983 (2015). https://eprint.iacr.
org/2015/983
9. Bernstein, D.J., Duif, N., Lange, T., Schwabe, P., Yang, B.Y.: High-speed high-
security signatures. J. Cryptogr. Eng. 2, 77–89 (2012). https://cryptojedi.org/
papers/ed25519
10. Nascimento, E., López, J., Dahab, R.: Efficient and secure elliptic curve cryptogra-
phy for 8-bit AVR microcontrollers. In: Chakraborty, R.S., Schwabe, P., Solworth,
J. (eds.) SPACE 2015. LNCS, vol. 9354, pp. 289–309. Springer, Heidelberg (2015)
11. Schnorr, C.-P.: Efficient identification and signatures for smart cards. In: Brassard,
G. (ed.) CRYPTO 1989. LNCS, vol. 435, pp. 239–252. Springer, Heidelberg (1990)
12. Dworkin, M.J.: SHA-3 standard: permutation-based hash and extendable-output
functions. Technical report, National Institute of Standards and Technology
(NIST) (2015). http://www.nist.gov/manuscript-publication-search.cfm?pub
id=919061
13. Katz, J., Wang, N.: Efficiency improvements for signature schemes with tight
security reductions. In: Proceedings of the 10th ACM Conference on Computer and
Communications Security, CCS 2003, pp. 155–164. ACM (2003). https://www.
cs.umd.edu/∼jkatz/papers/CCCS03 sigs.pdf
14. Vitek, J., Naccache, D., Pointcheval, D., Vaudenay, S.: Computational alter-
natives to random number generators. In: Tavares, S., Meijer, H. (eds.)
SAC 1998. LNCS, vol. 1556, pp. 72–80. Springer, Heidelberg (1999).
https://www.di.ens.fr/ pointche/Documents/Papers/1998 sac.pdf
15. Bernstein, D.J.: Differential addition chains (2006). http://cr.yp.to/ecdh/
diffchain-20060219.pdf
16. Stam, M.: Speeding up subgroup cryptosystems. Ph.D. thesis, Technische
Universiteit Eindhoven (2003). http://alexandria.tue.nl/extra2/200311829.pdf?
q=subgroup
17. Hutter, M., Schwabe, P.: Multiprecision multiplication on AVR revisited. J. Cryp-
togr. Eng. 5, 201–214 (2015). http://cryptojedi.org/papers/#avrmul
18. Gaudry, P., Schost, E.: Genus 2 point counting over prime fields. J. Symb. Comput.
47, 368–400 (2012). https://cs.uwaterloo.ca/~eschost/publications/countg2.pdf
19. Hisil, H., Costello, C.: Jacobian coordinates on genus 2 curves. In: Sarkar, P., Iwata,
T. (eds.) ASIACRYPT 2014. LNCS, vol. 8873, pp. 338–357. Springer, Heidelberg
(2014). https://eprint.iacr.org/2014/385.pdf
20. Stahlke, C.: Point compression on Jacobians of hyperelliptic curves over F_q.
Cryptology ePrint Archive, Report 2004/030 (2004). https://eprint.
iacr.org/2004/030
1 Introduction
This work was supported in part by the Commission of the European Communi-
ties through the Horizon 2020 program under project number 645622 PQCRYPTO.
Permanent ID of this document: da245c8568290e4a0f45c704cc62a2b8.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 323–345, 2016.
DOI: 10.1007/978-3-662-53140-2 16
324 Leon Groot Bruinderink, Andreas Hülsing, Tanja Lange, and Yuval Yarom
On a high level, our attacks exploit cache access patterns of the implemen-
tations to learn a few coefficients of y per observed signature. We then develop
mathematical attacks to use this partial knowledge of different y_j's together with
the public signature values (zj , cj ) to compute the secret key, given observations
from sufficiently many signatures.
In detail, there is an interplay between requirements for the offline attack
and restrictions on the sampling. First, restricting to cache access patterns that
provide relatively precise information means that the online phase only allows
extraction of a few coefficients of y_j per signature. This means that trying all guesses
for the bits b per signature becomes a bottleneck. We circumvent this issue by
only collecting coefficients of y_j in situations where the respective coefficient of
s · c_j is zero, as in these cases the bit b_j has no effect.
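To see why the zero-coefficient restriction removes the unknown sign bit, consider the BLISS relation z = y + (−1)^b (s·c) in R = Z[x]/(x^n + 1) coefficient-wise. A toy sketch (small assumed parameters and values, chosen only for illustration):

```python
def mul_negacyclic(a, b, n):
    """Product in Z[x]/(x^n + 1): wrapping past degree n flips the sign."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < n:
                res[i + j] += ai * bj
            else:
                res[i + j - n] -= ai * bj
    return res

n = 4
s = [1, 0, -1, 1]                      # toy secret
c = [1, 1, 0, 0]                       # toy challenge
sc = mul_negacyclic(s, c, n)           # [0, 1, -1, 0]
y = [5, -2, 7, 3]                      # toy noise polynomial
z_b0 = [yi + v for yi, v in zip(y, sc)]    # b = 0
z_b1 = [yi - v for yi, v in zip(y, sc)]    # b = 1
# Wherever (s*c)_k == 0, the observed z_k equals y_k for either value of b:
for k in range(n):
    if sc[k] == 0:
        assert z_b0[k] == z_b1[k] == y[k]
```

At those positions a learned coefficient of y immediately yields a usable equation in the coefficients of s, with no guessing over b.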
Second, each such collected coefficient of yj leads to an equation with some
coefficients of s as unknowns. However, it turns out that for CDT sampling
the cache patterns do not give exact equations. Instead, we learn equations
which hold with high probability, but might be off by ±1 with non-negligible
probability. We managed to turn the computation of s into a lattice problem
and show how to solve it using the LLL algorithm [20]. For Bernoulli sampling
we can obtain exact equations but at the expense of requiring more signatures.
We first tweaked the BLISS implementation to provide us with the exact
cache lines used, modeling a perfect side-channel. For BLISS-I, designed for 128
bits of security, the attack on CDT needs to observe on average 441 signatures
during the online phase. Afterwards, the offline phase succeeds after 37.6 seconds
with probability 0.66. This corresponds to running LLL once. If the attack does
not succeed at first, a few more signatures (on average a total of 446) are sampled
and LLL is run with some randomized selection of inputs. The combined attack
succeeds with probability 0.96, taking a total of 85.8 seconds. Similar results
hold for other BLISS versions. In the case of Bernoulli sampling, we are given
exact equations and can use simple linear algebra to finalize the attack, yielding a
success probability of 1.0 after observing 1671 signatures on average and taking
14.7 seconds in total.
To remove the assumption of a perfect side-channel we performed a proof-of-
concept attack using the Flush+Reload technique on a modern laptop. This
attack achieves similar success rates, albeit requiring 3438 signatures on average
for BLISS-I with CDT sampling. For Bernoulli sampling, we now had to deal
with measurement errors. We did this again by formulating a lattice problem
and using LLL in the final step. The attack succeeds with a probability of 0.88
after observing an average of 3294 signatures.
1.3. Structure. In Section 2, we give brief introductions to lattices, BLISS, and
the used methods for discrete Gaussian sampling as well as to cache-attacks.
In Section 3, we present two information leakages through cache-memory for
CDT sampling and provide a strategy to exploit this information for secret key
extraction. In Section 4, we present an attack strategy for the case of Bernoulli
sampling. In Section 5, we present experimental results for both strategies assuming a perfect side-channel.
2 Preliminaries
This section describes the BLISS signature scheme and the used discrete
Gaussian samplers. It also provides some background on lattices and cache
attacks.
2.1. Lattices. We define a lattice Λ as a discrete subgroup of Rn : given m ≤ n
linearly independent vectors b1 , . . . , bm ∈ Rn , the lattice Λ is given by the set
Λ(b1 , . . . , bm ) of all integer linear combinations of the bi ’s:
$$\Lambda(b_1, \dots, b_m) = \left\{ \sum_{i=1}^{m} x_i b_i \;\middle|\; x_i \in \mathbb{Z} \right\}.$$
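As a concrete illustration of this definition (a hypothetical toy basis in Z², not BLISS parameters), the integer linear combinations can be enumerated over a finite coefficient window:

```python
from itertools import product

def lattice_points(basis, coeff_range):
    """Enumerate the integer linear combinations sum_i x_i * b_i for
    coefficients x_i drawn from a finite window (a patch of the lattice)."""
    dim = len(basis[0])
    pts = set()
    for xs in product(coeff_range, repeat=len(basis)):
        pts.add(tuple(sum(x * b[j] for x, b in zip(xs, basis)) for j in range(dim)))
    return pts

# Hypothetical basis of a full-rank lattice in Z^2.
B = [(2, 0), (1, 2)]
pts = lattice_points(B, range(-3, 4))
assert (0, 0) in pts      # the origin is always a lattice point
assert (3, 2) in pts      # 1*(2,0) + 1*(1,2)
assert (1, 0) not in pts  # not an integer combination of B
```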
2.2. BLISS. We provide the basic algorithms of BLISS, as given in [9]. Details of
the motivation behind the construction and associated security proofs are given
in the original work. All arithmetic for BLISS is performed in R and possibly
with each coefficient reduced modulo q or 2q. We follow notation of BLISS and
also use boldface notation for polynomials.
By Dσ we denote the discrete Gaussian distribution with standard deviation
σ. In the next subsection, we will zoom in on this distribution and how to
sample from it in practice. The main parameters of BLISS are dimension n,
modulus q and standard deviation σ. BLISS uses a cryptographic hash function
H, which outputs binary vectors of length n and weight κ; parameters d1 and
d2 determining the density of the polynomials forming the secret key; and d,
determining the length of the second signature component.
Note that when an attacker has a candidate for key s1 = f, he can validate
correctness by checking the distributions of f and aq · f ≡ 2g + 1 mod 2q, and
lastly verifying that a1 · f + a2 · (aq · f ) ≡ q mod 2q, where aq is obtained by
halving a1 .
Signature generation (Algorithm 2.2) uses $p = \lfloor 2q/2^d \rfloor$, which consists of the highest
order bits of the modulus 2q, and the constant $\zeta = (q-2)^{-1} \bmod 2q$. In general, with
$\lfloor \cdot \rfloor_d$ we denote the d highest order bits of a number. In Step 1 of Algorithm 2.2,
two integer vectors are sampled, where each coordinate is drawn independently
and according to the discrete Gaussian distribution Dσ . This is denoted by
y ← DZn ,σ .
In the attacks, we concentrate on the first signature vector z1 , since z†2 only
contains the d highest order bits and therefore lost information about s2 · c;
furthermore, A and f determine s2 as shown above. So in the following, we only
consider z1 , y1 and s1 , and thus will leave out the indices.
In lines 5 and 6 of Algorithm 2.2, we compute s · c over R2q . However, since
secret s is sparse and challenge c is sparse and binary, we have
‖s · c‖∞ ≤ 5κ ≪ 2q, with ‖·‖∞ the ℓ∞-norm. This means these computations
are simply additions over Z, and we can therefore model this computation as a
vector-matrix product
$$s \cdot c = sC,$$
where C ∈ {−1, 0, 1}n×n is the matrix whose columns are the rotations of chal-
lenge c (with minus signs matching reduction modulo xn + 1). In the attacks we
access individual coefficients of s · c; note that the jth coefficient equals ⟨s, cj ⟩,
where cj is the jth column of C.
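This negacyclic structure can be made concrete in a short sketch (toy dimension n = 4 and hypothetical coefficient values): the matrix C built below satisfies, column by column, the identity that the jth coefficient of s · c equals ⟨s, cj⟩.

```python
def poly_mul_mod(s, c):
    """Schoolbook product s(x)*c(x) mod (x^n + 1) over Z: wrapped-around
    terms pick up a minus sign because x^n = -1."""
    n = len(s)
    out = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                out[i + j] += s[i] * c[j]
            else:
                out[i + j - n] -= s[i] * c[j]
    return out

def rotation_matrix(c):
    """C whose jth column is the (negacyclic) rotation of c such that
    the jth coefficient of s*c equals <s, column j>."""
    n = len(c)
    return [[c[j - i] if i <= j else -c[n + j - i] for j in range(n)]
            for i in range(n)]

s = [1, -1, 0, 2]   # hypothetical sparse secret
c = [1, 0, 1, 0]    # hypothetical binary challenge
C = rotation_matrix(c)
prod = poly_mul_mod(s, c)
assert all(prod[j] == sum(s[i] * C[i][j] for i in range(4)) for j in range(4))
```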
For completeness, we also show the verification procedure (Algorithm 2.3),
although we do not use it further in this paper. Note that reductions modulo 2q
are done before truncating and reducing modulo p.
The discrete Gaussian distribution Dσ assigns to each x ∈ Z the probability
$$\frac{\rho_\sigma(x)}{\sum_{y=-\infty}^{\infty} \rho_\sigma(y)},$$
where $\rho_\sigma(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right)$. Note that the sum in the denominator ensures that
this is actually a probability distribution. We denote the denominator by $\rho_\sigma(\mathbb{Z})$.
To make sampling practical, most lattice-based schemes use three simplifi-
cations: First, a tail-cut τ is used, restricting the support of the Gaussian to
a finite interval [−τ σ, τ σ]. The tail-cut τ is chosen such that the probability
of a real discrete Gaussian sample landing outside this interval is negligible in
Flush, Gauss, and Reload 329
the security parameter. Second, values are sampled from the positive half of the
support and then a bit is flipped to determine the sign. For this the probability
of obtaining zero in [0, τ σ] needs to be halved. The resulting distribution on the
positive numbers is denoted by Dσ+ . Finally, the precision of the sampler is cho-
sen such that the statistical distance between the output distribution and the
exact distribution is negligible in the security parameter.
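The halving of the zero mass can be sketched as follows (illustrative σ and τ, not BLISS values): drawing from the folded table and then flipping a uniform sign bit reproduces the tail-cut distribution.

```python
import math

def halved_positive_table(sigma, tau):
    """D_sigma^+ on [0, tau*sigma]: probabilities proportional to
    rho_sigma(x), with the mass at zero halved so that a uniform sign
    flip afterwards reproduces the tail-cut discrete Gaussian."""
    bound = int(tau * sigma)
    rho = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in range(bound + 1)]
    rho[0] /= 2.0
    Z = sum(rho)
    return [p / Z for p in rho]

table = halved_positive_table(sigma=2.0, tau=3.0)
assert abs(sum(table) - 1.0) < 1e-12
# Unfold with a sign bit: P[x] = table[|x|]/2 for x != 0 and table[0] for x = 0.
full = {0: table[0]}
for x in range(1, len(table)):
    full[x] = full[-x] = table[x] / 2.0
assert abs(sum(full.values()) - 1.0) < 1e-12
assert full[1] > full[2] > full[3]  # probabilities decay away from zero
```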
There are two generic ways to sample from a discrete Gaussian distribution:
using the cumulative distribution function [25] or via rejection sampling [11].
Both these methods are deployed with some improvements which we describe
next. These modified versions are implemented in [8]. We note that there are
also other ways [5,10,30,31] of efficiently sampling discrete Gaussians.
CDT Sampling. The basic idea of using the cumulative distribution function
in the sampler is to approximate the probabilities py = P[x ≤ y | x ← Dσ ],
computed with λ bits of precision, and to save them in a large table. At sampling
time, one samples a uniformly random r ∈ [0, 1) and performs a binary search
through the table to locate y ∈ [−τ σ, τ σ] such that r ∈ [py−1 , py ). Restricting to
the non-negative part [0, τ σ] corresponds to using the probabilities p∗y = P[|x| ≤
y | x ← Dσ ], sampling r ∈ [0, 1) and locating y ∈ [0, τ σ]. While this is the most
efficient approach, it requires a large table. We denote the method that uses the
approximate cumulative distribution function with tail cut and the modifications
described next, as the CDT sampling method.
One can speed up the binary search for the correct sample y in the table,
by using an additional guide table I [6,19,29]. The BLISS implementation we
attack uses I with 256 entries. The guide table stores for each u ∈ {0, . . . , 255}
the smallest interval I[u] = (au , bu ) such that p∗au ≤ u/256 and p∗bu ≥ (u+1)/256.
The first byte of r is used to select I[u] leading to a much smaller interval for the
binary search. Effectively, r is picked byte-by-byte, stopping once a unique value
for y is obtained. The CDT sampling algorithm with guide table is summarized
in Algorithm 2.4.
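A minimal sketch of this sampler follows, under simplifying assumptions: illustrative σ and τ (not BLISS parameters), and the remaining bytes of r modeled by one uniform real instead of a λ-bit fixed-precision value.

```python
import bisect
import math
import random

def build_cdt(sigma, tau):
    """Cumulative probabilities p*_y = P[|x| <= y] on [0, tau*sigma]
    (mass at zero halved; the sign is drawn separately)."""
    bound = int(tau * sigma)
    rho = [math.exp(-y * y / (2.0 * sigma * sigma)) for y in range(bound + 1)]
    rho[0] /= 2.0
    Z = sum(rho)
    cdt, acc = [], 0.0
    for p in rho:
        acc += p / Z
        cdt.append(acc)
    return cdt

def build_guide(cdt):
    """Guide table I: for each first byte u, the smallest interval (a_u, b_u)
    with p*_{a_u} <= u/256 and p*_{b_u} >= (u+1)/256."""
    guide = []
    for u in range(256):
        a = max(bisect.bisect_right(cdt, u / 256.0) - 1, 0)
        b = min(bisect.bisect_left(cdt, (u + 1) / 256.0), len(cdt) - 1)
        guide.append((a, b))
    return guide

def sample(cdt, guide, rng):
    """The first byte of r selects the guide interval; a binary search over
    the remaining randomness locates y, then a sign bit is flipped."""
    u = rng.randrange(256)
    a, b = guide[u]
    r = (u + rng.random()) / 256.0  # remaining bytes of r, modeled as a real
    y = min(bisect.bisect_right(cdt, r, a, b + 1), len(cdt) - 1)
    return y if rng.random() < 0.5 else -y

rng = random.Random(1)
cdt = build_cdt(sigma=2.0, tau=3.0)
guide = build_guide(cdt)
xs = [sample(cdt, guide, rng) for _ in range(2000)]
assert all(abs(x) <= 6 for x in xs)    # tail cut at tau*sigma = 6
assert abs(sum(xs)) / len(xs) < 0.3    # empirical mean close to zero
```

The guide table shrinks the search range exactly as described in the text: the binary search runs only inside (a_u, b_u) instead of over the whole table.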
$K = \lfloor \sigma/\sigma_2 \rfloor + 1$, is then distributed according to the target discrete Gaussian dis-
tribution Dσ , by rejecting with a certain probability (Step 4 of Algorithm 2.5).
The number of rejections in this case is much lower than in the original method.
This step still requires computing a bit, whose probability is an exponential
value. However, it can be done more efficiently using Algorithm 2.7, where ET
is a small table.
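Algorithm 2.7 itself is not reproduced in this chunk; the following is a sketch of the standard table-based exponential-Bernoulli technique from the BLISS paper, assuming ET[i] = exp(−2^i/(2σ²)). Since exp(−t/(2σ²)) factors over the set bits of t, one Bernoulli coin per set bit suffices.

```python
import math
import random

def sample_ber_exp(t, ET, rng):
    """Return 1 with probability exp(-t / (2*sigma^2)) for integer t >= 0:
    exp(-t/f) = prod over set bits i of t of exp(-2^i/f), so accept iff
    the Bernoulli coin for every set bit succeeds."""
    i = 0
    while t:
        if t & 1 and rng.random() >= ET[i]:
            return 0  # one coin failed: reject
        t >>= 1
        i += 1
    return 1

sigma = 10.0                  # illustrative, not a BLISS parameter
f = 2.0 * sigma * sigma
ET = [math.exp(-(1 << i) / f) for i in range(16)]  # the small table ET
rng = random.Random(7)
t = 37
hits = sum(sample_ber_exp(t, ET, rng) for _ in range(20000))
assert abs(hits / 20000.0 - math.exp(-t / f)) < 0.02
```

The table stays small (one entry per bit of the exponent), which is exactly why its cache footprint leaks so much in the attack of Section 4.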
2.4. Cache Attacks. The cache is a small bank of memory which exploits
the temporal and the spatial locality of memory access to bridge the speed gap
between the faster processor and the slower memory. The cache consists of cache
lines, which, on modern Intel architectures, can store a 64-byte aligned block of
memory of size 64 bytes.
In a typical processor there are several cache levels. At the top level, closest
to the execution core, is the L1 cache, which is the smallest and the fastest of
the hierarchy. Each successive level (L2, L3, etc.) is bigger and slower than the
preceding level.
When the processor accesses a memory address it looks for the block con-
taining the address in the L1 cache. In a cache hit, the block is found in the
cache and the data is accessed. Otherwise, in a cache miss, the search continues
on lower levels, eventually retrieving the memory block from the lower levels or
from the memory. The cache then evicts a cache line and replaces its contents
with the retrieved block, allowing faster future access to the block.
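As a toy model of this lookup-and-evict behavior (direct-mapped for simplicity; real caches are set-associative and multi-level):

```python
class DirectMappedCache:
    """Toy direct-mapped cache: each 64-byte aligned block of memory maps
    to exactly one cache line; a miss evicts the line's previous block."""
    def __init__(self, n_lines=64, line_size=64):
        self.n_lines = n_lines
        self.line_size = line_size
        self.lines = [None] * n_lines

    def access(self, addr):
        block = addr // self.line_size  # which 64-byte block
        idx = block % self.n_lines      # which line it must occupy
        hit = (self.lines[idx] == block)
        self.lines[idx] = block         # after a miss, the block is cached
        return hit

cache = DirectMappedCache()
assert not cache.access(0x1000)            # cold miss
assert cache.access(0x1008)                # same 64-byte block: hit
assert not cache.access(0x1000 + 64 * 64)  # conflicting block evicts it
assert not cache.access(0x1000)            # original block is gone: miss
```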
Because cache misses require searches in lower cache levels, they are slower
than cache hits. Cache timing attacks exploit this timing difference to leak infor-
mation [2,13,22,24,27]. In a nutshell, when an attacker uses the same cache as a
victim, victim memory accesses change the state of the cache. The attacker can
then use the timing variations to check which memory blocks are cached and
from that deduce which memory addresses the victim has accessed. Ultimately,
the attacker learns the cache line of the victim’s table access: a range of possible
values for the index of the access.
In this work we use the Flush+Reload attack [13,36]. A Flush+Reload
attack uses the clflush instruction of the x86-64 architecture to evict a memory
block from the cache. The attacker then lets the victim execute before measuring
the time to access the memory block. If during its execution the victim has
accessed an address within the block, the block will be cached and the attacker’s
access will be fast. If, however, the victim has not accessed the block, the attacker
will reload the block from memory, and the access will take much longer. Thus,
the attacker learns whether the victim accessed the memory block during its execution.
This section presents the mathematical foundations of our cache attack on the
CDT sampling. We first explain the phenomena we can observe from cache
misses and hits in Algorithm 2.4 and then show how to exploit them to derive
the secret signing key of BLISS using LLL. Sampling of the first noise polynomial
y ∈ DZn ,σ is done coefficientwise. Similarly the cache attack targets coefficients
yi for i = 0, . . . , n − 1 independently.
3.1. Weaknesses in Cache. Sampling from a discrete Gaussian distribution
using both an interval table I and a table T with the actual values might leak
information via cache memory. The best we can hope for is to learn the cache
lines of index r of the interval and of index Iz of the table lookup in T . Note
that we cannot learn the sign of the sampled coefficient yi . Also, the cache line of
T [Iz ] always leaves a range of values for |yi |. However, in some cases we can get
more precise information combining cache-lines of table lookups in both tables.
Two observations, which we refer to as the Intersection and Last-Jump weaknesses below, narrow down the possibilities.
We will restrict ourselves to only look for cache access patterns that give
even more precision, at the expense of requiring more signatures:
1. The first restriction is to only look at cache weaknesses (of type Intersection
or Last-Jump), in which the number of possible values for sample |yi | is two.
Since we do a binary search within an interval, this is the most precision one
can get (unless an interval is unique): after the last comparisons (table lookup
in T ), one of two values will be returned. This means that by picking either
of these two values we limit the error of |yi | to at most 1.
2. The probabilities of sampling values using CDT sampling with guide table I
are known to match the following probability requirement:
$$\sum_{r=0}^{255} \frac{1}{256}\, \mathbb{P}[X = x \mid X \in I[r]] = \frac{\rho_\sigma(x)}{\rho_\sigma(\mathbb{Z})}. \qquad (1)$$
Due to the above condition, it is possible that adjacent intervals are partially
overlapping. That is, for some r, s we have that I[r] ∩ I[s] = ∅. In practice,
this only happens for r = s+1, meaning adjacent intervals might overlap. For
example, if the probability of sampling x is greater than 1/256, then x has to
be an element in at least two intervals I[r]. Because of this, it is possible that
for certain parts of an interval I[r], there is a biased outcome of the sample.
The second restriction is to only consider cache weaknesses for which addi-
tionally one of the two values is significantly more likely to be sampled, i.e., if
|yi | ∈ {γ1 , γ2 } ⊂ I[r] is the outcome of cache access patterns, then we further
insist on P[|yi | = γ1 | |yi | ∈ {γ1 , γ2 }] ≥ 1 − α for a small constant α.
We noticed that LLL works very well on these lattices, probably because the basis
used is sparse. This implies that the vectors are already relatively short and
orthogonal. The parameter α determines the shortness of the vector we look
for, and therefore influences if an algorithm like LLL finds our vector. For the
experiments described in Section 5, we required α ≤ 0.1. This made it possible
for every parameter set we used in the experiments to always have at least one
cache-access pattern to use.
Parameter β influences the probability that one makes a huge mistake when
comparing the values of yi and zi . However, for the parameters we used in the
experiments, we did not find recognizable cache-access patterns which corre-
spond to small yi . This means, we did not need to use this last restriction to
reject certain cache-access patterns.
3.2. Exploitation. For simplicity, we assume we have one specific cache access
pattern, which reveals if yi ∈ {γ1 , γ2 } for i = 0, . . . , n − 1 of polynomial y, and
if this is the case, yi has probability (1 − α) to be value γ1 , with small α. In
practice however, there might be more than one cache weakness, satisfying the
above requirements. This would allow the attacker to search for more than one
cache access pattern done by the victim. For the attack, we assume the victim
is creating N signatures1 (zj , cj ) for j = 1, . . . , N , and an attacker is gathering
these signatures with associated cache information for noise polynomial yj . We
assume the attacker can search for the specific cache access pattern, for which he
can determine if yji ∈ {γ1 , γ2 }. For the cases revealed by cache access patterns,
the attacker ends up with the following equation:
$$z_{ji} = y_{ji} + (-1)^{b_j} \langle s, c_{ji} \rangle, \qquad (2)$$
where the attacker knows coefficient zji of zj , rotated coefficient vectors cji of
challenge cj (both from the signatures) and yji ∈ {γ1 , γ2 } of noise polynomial
yj (from the side-channel attack). Unknowns to the attacker are bit bj and s.
If zji = γ1 , the attacker knows that ⟨s, cji ⟩ ∈ {−1, 0, 1}. Moreover, with high
probability (1 − α) the value will be 0, as by the second restriction yji is biased
to be value γ1 . So if zji = γ1 , the attacker adds ξk = cji to a list of good vectors.
The restriction zji = γ1 means that the attacker will in some cases not use the
information in Eq. (2), although he knows that yji ∈ {γ1 , γ2 }.
When the attacker collects enough of these vectors ξk = cji ; 0 ≤ i ≤ n−1, 1 ≤
j ≤ N, 1 ≤ k ≤ n, he can build a matrix L ∈ {−1, 0, 1}n×n , whose columns are
the ξk ’s. This matrix satisfies:
sL = v (3)
for some unknown but short vector v. The attacker does not know v, so he cannot
simply solve for s, but he does know that v has norm about $\sqrt{\alpha n}$ and lies in the
lattice spanned by the rows of L. He can use a lattice reduction algorithm, like
LLL, on L to search for v. LLL also outputs the unimodular matrix U satisfying
UL = L . The attack tests for each row of U (and its rotations) whether it is
¹ Here zj refers to the first signature polynomial zj1 of the jth signature (zj1 , z†j2 , cj ).
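A sketch of this construction on toy dimensions (simulated columns and a hypothetical secret; BLISS-I uses n = 512, and the final LLL step, e.g. via a library such as fpylll, is omitted here): good columns ξ_k give ⟨s, ξ_k⟩ = 0 with probability 1 − α, so v = sL is short.

```python
import random

n, alpha = 32, 0.1  # toy size and bias, not BLISS parameters
rng = random.Random(3)
s = [rng.choice([-1, 0, 1]) for _ in range(n)]

# Simulate collecting "good" columns: kept when <s, xi> = 0, and with
# probability alpha also when the attacker's guess is off by +-1.
cols = []
while len(cols) < n:
    xi = [rng.choice([-1, 0, 1]) for _ in range(n)]
    ip = sum(a * b for a, b in zip(s, xi))
    if ip == 0 or (abs(ip) == 1 and rng.random() < alpha):
        cols.append(xi)

# L has the xi_k as columns; the unknown short vector is v = sL.
L = [[cols[k][i] for k in range(n)] for i in range(n)]
v = [sum(s[i] * L[i][k] for i in range(n)) for k in range(n)]
assert all(x in (-1, 0, 1) for x in v)  # entries are the small inner products
nonzero = sum(1 for x in v if x != 0)
assert nonzero < n // 2                 # v is far shorter than a generic vector
```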
In this section, we discuss the foundations and strategy of our second cache
attack on the Bernoulli-based sampler (Algorithms 2.5, 2.6, and 2.7). We show
how to exploit the fact that this method uses a small table ET, leaking very
precise information about the sampled value.
² Again, zj refers to the first signature polynomial zj1 of the jth signature (zj1 , z†j2 , cj ).
Similar to the first attack, an attacker might also use vectors ξk = cji , where
⟨s, cji ⟩ ∈ {−1, 0, 1}, in combination with LLL and possibly randomization. This
approach might help if fewer signatures are available, but the easiest way is to
require exact knowledge, which comes at the expense of needing more signa-
tures, but has a very fast and efficient offline part. Section 6.3 deals with this
approximate information.
Our test machine is an AMD FX-8350 Eight-Core CPU running at 4.1 GHz.
We use the research oriented C++ implementation of BLISS, made available by
the authors on their webpage [8]. Both of the analyzed sampling methods are
provided by the implementation, where the tables T, I and ET are constructed
dependent on σ. We use the NTL library [33] for LLL and kernel calculations.
The authors of BLISS [9] proposed several parameter sets for the signature
scheme (see full version [4, Table A.1]). We present attacks against all combi-
nations of parameter sets and sampling methods; the full results of the perfect
side-channel attacks are given in the full version [4, Appendix B].
5.2. CDT Sampling. When the signing algorithm uses CDT sampling as
described in Algorithm 2.4, the perfect side-channel provides the values of ⌊r/8⌋
and ⌊Iz /8⌋ for the table accesses to r and Iz in tables I and T , i.e., the cache
line at eight table entries per line. We apply the attack strategy of Section 3.
We first need to find cache-line patterns, of type intersection or last-jump,
which reveal that |yi | ∈ {γ1 , γ2 } and P[|yi | = γ1 | |yi | ∈ {γ1 , γ2 }] = 1 − α with
α ≤ 0.1. One way to do that is to construct two tables: one table that lists
elements I[r], that belong to certain cache-lines of table I, and one table that
lists the accessed elements Iz inside these intervals I[r], that belong to certain
cache-lines of table T . We can then brute-force search for all cache weaknesses of
type intersection or last-jump. For example, in BLISS-I the first eight elements of
I (meaning I[0], . . . , I[7]) belong to the first cache-line of I, but for the elements
in I[7] = {7, 8}, the sampler accesses element Iz = 8, which is part of the
second cache-line of T . This is an intersection weakness: if the first cache-line of
I is accessed and the second cache-line of T is accessed, we know yi ∈ {7, 8}.
Similarly, one can find last-jump weaknesses, by searching for intervals I[r] that
access multiple cache-lines of T . Once we have these weaknesses, we need to use
the biased restriction with α ≤ 0.1. This can be done by looking at all bytes
except the first of the entry T [Iz ] (this is already used to determine interval I[r]).
If we denote the integer value of these 7 bytes by (T [Iz ])byte≥1 , then we need to
check if T [Iz ] has the property (T [Iz ])byte≥1 /(2^56 − 1) ≤ α
(or (T [Iz ])byte≥1 /(2^56 − 1) ≥ (1 − α)). If one of these properties holds, then we
have yi ∈ {Iz − 1, Iz } and P[|yi | = Iz | |yi | ∈ {Iz − 1, Iz }] = 1 − α (or with Iz
and Iz − 1 swapped). For each set of parameters we found at least one of these
weaknesses using the above method (see the full version [4, Table B.1] for the
values).
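The brute-force search can be sketched in a simplified model that considers only two-element intervals and assumes eight table entries per cache line (matching the BLISS-I example above); the search described in the paper also covers last-jump patterns inside longer intervals.

```python
def find_two_value_weaknesses(intervals, epl_i=8, epl_t=8):
    """Cache patterns (line of I, line of T) produced by two-element
    intervals (a, a+1) whose comparison element T[a+1] lies on a different
    cache line than T[a]: seeing the pattern narrows |y| to {a, a+1}."""
    weak = {}
    for r, (a, b) in enumerate(intervals):
        if b == a + 1 and a // epl_t != b // epl_t:
            weak[(r // epl_i, b // epl_t)] = {a, b}
    return weak

# Hypothetical guide table; I[7] = (7, 8) mirrors the BLISS-I example where
# the first cache line of I meets the second cache line of T.
intervals = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 5), (5, 6), (6, 7), (7, 8)]
weak = find_two_value_weaknesses(intervals)
assert weak[(0, 1)] == {7, 8}
```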
We collect m (possibly rotated) coefficient vectors cj and then run LLL at
most t = 2(m − n) + 1 times, each time searching for s in the unimodular
transformation matrix using the public key. We consider the experiment failed if
the secret key is not found after this number of trials; the randomly constructed
lattices have a lot of overlap in their basis vectors which means that increasing t
further is not likely to help. We performed 1000 repetitions of each experiment
(different parameters and sizes for m) and measured the success probability psucc ,
the average number of required signatures N to retrieve m usable challenges,
and the average length of v if it was found. The expected number of required
signatures E[N ] is also given, as well as the running time for the LLL trials. This
expected number of required signatures can be computed as:
$$E[N] = \frac{m}{n \cdot \mathbb{P}[\mathrm{CP}] \cdot \mathbb{P}[\langle s_1, c \rangle = 0]},$$
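Plugging hypothetical values into this formula (the measured BLISS probabilities appear in the full version, not in this chunk):

```python
def expected_signatures(m, n, p_cp, p_zero):
    """E[N] = m / (n * P[CP] * P[<s1, c> = 0]): each signature offers n
    coefficients, a coefficient shows the targeted cache pattern with
    probability P[CP], and it is usable only if the matching coefficient
    of s1 * c is zero."""
    return m / (n * p_cp * p_zero)

# Hypothetical plug-in values, not the measured BLISS-I probabilities.
EN = expected_signatures(m=1024, n=512, p_cp=0.01, p_zero=0.55)
assert round(EN) == 364
```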
for $K = \lfloor \sigma/\sigma_2 \rfloor + 1$ and tail-cut τ ≥ 1. Note that the number of required signatures
is smaller for BLISS-II than for BLISS-I. This might seem surprising as one might
expect it to increase or be about the same as BLISS-I because the dimensions
and security level are the same for these two parameter sets. However, σ is
chosen a lot smaller in BLISS-II, which means that the value K is also smaller. This
influences N significantly, as the probability to sample values x · K is larger for
small σ.
6 Proof-of-Concept Implementation
So far, the experimental results were based on the assumption of a perfect side-
channel: we assumed that we would get the cache-line of every table look-up
in the CDT sampling and Bernoulli sampling. In this section, we reduce the
assumption and discuss the results of more realistic experiments using the Flush+Reload technique.
When moving to real hardware some of the assumptions made in Section 5
no longer hold. In particular, allocation does not always ensure that tables are
aligned at the start of cache lines and processor optimizations may pre-load
memory into the cache, resulting in false positives. One such optimization is
the spatial prefetcher, which pairs adjacent cache lines into 128-byte chunks and
prefetches a cache line if an access to its pair results in a cache miss [16].
6.1. FLUSH+RELOAD on CDT Sampling. Due to the spatial prefetcher,
Flush+Reload cannot be used consistently to probe two paired cache lines.
Consequently, to determine access to two consecutive CDT table elements, we
must use a pair that spans two unpaired cache lines. In the full version [4, Table
C.3], we show that when the CDT table is aligned at 16 bytes, we can always
find such a pair for BLISS-I. Although this is not a proof that our attack works
in all scenarios, i.e. for all σ and all offsets, it would also not be a solid defence to
pick exactly those scenarios for which our attack would not work, e.g., because
α could be increased.
The attack was carried out on an HP Elite 8300 with an i5-3470 processor,
running CentOS 6.6. Before sampling each coordinate yi , for i = 0, . . . , n − 1, we
flush the monitored cache lines using the clflush instruction. After sampling the
coordinate, we reload the monitored cache lines and measure the response time.
We compare the response times to a pre-defined threshold value to determine
whether the cache lines were accessed by the sampling algorithm.
A visualization of the Flush+Reload measurements for CDT sampling
is given in Fig. 6.1. Using the intersection and last-jump weakness of the CDT
sampler in cache-memory, we can determine which value is sampled by the victim
by probing two locations in memory. To reduce the number of false positives, we
focus on one of the weaknesses (given in the full version [4, Table B.1]) as a target
for the Flush+Reload. This means that the other weaknesses are not detected
and we need to observe more signatures than with a perfect side-channel, before
we collect enough columns to start with the offline part of the attack.
We executed 50 repeated attacks against BLISS-I, probing the last-jump
weakness for {γ1 , γ2 } = {55, 56}. We completely recovered the private key in 46
out of the 50 cases. On average we required 3438 signatures per attack to
collect m = 2n = 1024 equations. We tried LLL five times after the collection
and considered the experiment a failure if we did not find the secret key in these
five times. We stress that this is not the optimal strategy to minimize the number
of required signatures or to maximize the success probability. However, it is an
indication that this proof-of-concept attack is feasible.
The experiment is considered a failure if we did not find the secret key after
trying LLL five times.
6.4. Conclusion. Our proof-of-concept implementation demonstrates that in
many cases we can overcome the limitations of processor optimizations and
perform the attack on BLISS. The attack, however, requires a high degree of
synchronization between the attacker and the victim, which we achieve by mod-
ifying the victim code. For a similar level of synchronization in a real attack
scenario, the attacker will have to be able to find out when each coordinate is
sampled. One possible approach for achieving this is to use the attack of Gul-
lasch et al. [13] against the Linux Completely Fair Scheduler. The combination
of a cache attack with the attack on the scheduler allows the attacker to monitor
each and every table access made by the victim, which is more than required for
our attacks.
Acknowledgements. The authors would like to thank Daniel J. Bernstein and Léo
Ducas for fruitful discussions and suggestions.
References
1. Alkim, E., Ducas, L., Pöppelmann, T., Schwabe, P.: Post-quantum key exchange -
a new hope. IACR Cryptology ePrint Archive 2015/1092 (2015)
2. Bernstein, D.J.: Cache-timing attacks on AES (2005). Preprint available at http://cr.yp.to/antiforgery/cachetiming-20050414.pdf
3. Bos, J.W., Costello, C., Naehrig, M., Stebila, D.: Post-quantum key exchange for
the TLS protocol from the ring learning with errors problem. In: S&P 2015, pp.
553–570. IEEE Computer Society (2015)
4. Groot Bruinderink, L., Hülsing, A., Lange, T., Yarom, Y.: Flush, Gauss, and
reload - a cache attack on the BLISS lattice-based signature scheme. IACR Cryp-
tology ePrint Archive 2016/300 (2016)
5. Buchmann, J., Cabarcas, D., Göpfert, F., Hülsing, A., Weiden, P.: Discrete Ziggu-
rat: a time-memory trade-off for sampling from a Gaussian distribution over the
integers. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS, vol. 8282,
pp. 402–418. Springer, Heidelberg (2014)
6. Chen, H.-C., Asau, Y.: On generating random variates from an empirical distrib-
ution. AIIE Trans. 6(2), 163–166 (1974)
7. Chen, L., Liu, Y.-K., Jordan, S., Moody, D., Peralta, R., Perlner, R., Smith-Tone,
D.: Report on post-quantum cryptography. NISTIR 8105, Draft, February 2016
8. Ducas, L., Durmus, A., Lepoint, T., Lyubashevsky, V.: BLISS: Bimodal Lattice
Signature Schemes (2013). http://bliss.di.ens.fr/
9. Ducas, L., Durmus, A., Lepoint, T., Lyubashevsky, V.: Lattice signatures and
bimodal Gaussians. In: Canetti, R., Garay, J.A. (eds.) CRYPTO 2013, Part I.
LNCS, vol. 8042, pp. 40–56. Springer, Heidelberg (2013)
10. Dwarakanath, N.C., Galbraith, S.D.: Sampling from discrete Gaussians for lattice-
based cryptography on a constrained device. Appl. Algebra Eng. Commun. Com-
put. 25(3), 159–180 (2014)
11. Gentry, C., Peikert, C., Vaikuntanathan, V.: Trapdoors for hard lattices and new
cryptographic constructions. In: Dwork, C. (ed.) STOC 2008, pp. 197–206. ACM
(2008)
12. Gruss, D., Spreitzer, R., Mangard, S.: Cache template attacks: Automating attacks
on inclusive last-level caches. In: Jung, J., Holz, T. (eds.) USENIX Security 2015,
pp. 897–912. USENIX Association (2015)
13. Gullasch, D., Bangerter, E., Krenn, S.: Cache games – bringing access-based cache
attacks on AES to practice. In: S&P 2011, pp. 490–505. IEEE Computer Society
(2011)
14. Güneysu, T., Lyubashevsky, V., Pöppelmann, T.: Practical lattice-based cryptog-
raphy: a signature scheme for embedded systems. In: Prouff, E., Schaumont, P.
(eds.) CHES 2012. LNCS, vol. 7428, pp. 530–547. Springer, Heidelberg (2012)
15. Hoffstein, J., Pipher, J., Silverman, J.H.: NTRU: a ring-based public key cryptosys-
tem. In: Buhler, J.P. (ed.) ANTS 1998. LNCS, vol. 1423, pp. 267–288. Springer,
Heidelberg (1998)
16. Intel Corporation. Intel 64 and IA-32 Architectures Optimization Reference Man-
ual, April 2012
17. Irazoqui, G., Inci, M.S., Eisenbarth, T., Sunar, B.: Wait a minute! A fast, cross-
VM attack on AES. In: Stavrou, A., Bos, H., Portokalidis, G. (eds.) RAID 2014.
LNCS, vol. 8688, pp. 299–319. Springer, Heidelberg (2014)
18. ETSI Quantum-Safe Cryptography (QSC) ISG: Quantum-safe cryptography. ETSI working group (2015). http://www.etsi.org/technologies-clusters/technologies/quantum-safe-cryptography
19. L’Ecuyer, P.: Non-uniform random variate generations. In: Lovric, M. (ed.) Inter-
national Encyclopedia of Statistical Science, pp. 991–995. Springer, Heidelberg
(2011)
20. Lenstra, A.K., Lenstra, H.W., Lovász, L.: Factoring polynomials with rational
coefficients. Math. Ann. 261(4), 515–534 (1982)
21. Lindner, R., Peikert, C.: Better key sizes (and attacks) for LWE-based encryp-
tion. In: Kiayias, A. (ed.) CT-RSA 2011. LNCS, vol. 6558, pp. 319–339. Springer,
Heidelberg (2011)
22. Liu, F., Yarom, Y., Ge, Q., Heiser, G., Lee, R.B.: Last-level cache side-channel
attacks are practical. In: S&P 2015, pp. 605–622. IEEE Computer Society (2015)
23. NSA: NSA Suite B Cryptography. NSA website (2015). https://www.nsa.gov/ia/programs/suiteb_cryptography/
24. Osvik, D.A., Shamir, A., Tromer, E.: Cache attacks and countermeasures: the
case of AES. In: Pointcheval, D. (ed.) CT-RSA 2006. LNCS, vol. 3860, pp. 1–20.
Springer, Heidelberg (2006)
25. Peikert, C.: An efficient and parallel Gaussian sampler for lattices. In: Rabin, T.
(ed.) CRYPTO 2010. LNCS, vol. 6223, pp. 80–97. Springer, Heidelberg (2010)
26. Peikert, C.: Lattice cryptography for the internet. In: Mosca, M. (ed.) PQCrypto
2014. LNCS, vol. 8772, pp. 197–219. Springer, Heidelberg (2014)
27. Percival, C.: Cache missing for fun and profit. In: BSDCan 2005 (2005)
28. van de Pol, J., Smart, N.P., Yarom, Y.: Just a little bit more. In: Nyberg, K. (ed.)
CT-RSA 2015. LNCS, vol. 9048, pp. 3–21. Springer, Heidelberg (2015)
29. Pöppelmann, T., Ducas, L., Güneysu, T.: Enhanced lattice-based signatures on
reconfigurable hardware. In: Batina, L., Robshaw, M. (eds.) CHES 2014. LNCS,
vol. 8731, pp. 353–370. Springer, Heidelberg (2014)
30. Pöppelmann, T., Güneysu, T.: Towards practical lattice-based public-key encryp-
tion on reconfigurable hardware. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC
2013. LNCS, vol. 8282, pp. 68–86. Springer, Heidelberg (2014)
31. Roy, S.S., Vercauteren, F., Verbauwhede, I.: High precision discrete Gaussian sam-
pling on FPGAs. In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC 2013. LNCS,
vol. 8282, pp. 383–401. Springer, Heidelberg (2014)
32. Saarinen, M.-J.O.: Arithmetic coding and blinding countermeasures for ring-LWE.
IACR Cryptology ePrint Archive 2016/276 (2016)
33. Shoup, V.: NTL: a library for doing number theory (2015). http://www.shoup.net/ntl/
34. strongSwan: strongSwan 5.2.2 released, January 2015. https://www.strongswan.org/blog/2015/01/05/strongswan-5.2.2-released.html
35. Yarom, Y., Benger, N.: Recovering OpenSSL ECDSA nonces using the Flush+
Reload cache side-channel attack. IACR Cryptology ePrint Archive 2014/140
(2014)
36. Yarom, Y., Falkner, K.: Flush+Reload: a high resolution, low noise, L3 cache
side-channel attack. In: Fu, K., Jung, J. (eds.) USENIX Security 2014, pp. 719–732.
USENIX Association (2014)
37. Zhang, J., Zhang, Z., Ding, J., Snook, M., Dagdelen, Ö.: Authenticated key
exchange from ideal lattices. In: Oswald, E., Fischlin, M. (eds.) EUROCRYPT
2015. LNCS, vol. 9057, pp. 719–751. Springer, Heidelberg (2015)
38. Zhang, Y., Juels, A., Reiter, M.K., Ristenpart, T.: Cross-tenant side-channel
attacks in PaaS clouds. In: CCS 2014, pp. 990–1003. ACM (2014)
CacheBleed: A Timing Attack on OpenSSL
Constant Time RSA
1 Introduction
1.1 Overview
Side-channel attacks are a powerful method for breaking theoretically secure
cryptographic primitives. Since the first works by Kocher [33], these attacks
have been used extensively to break the security of numerous cryptographic
implementations. At a high level, it is possible to distinguish between two types
of side-channel attacks, based on the methods used by the attacker: hardware-
based attacks, which monitor the leakage through measurements (usually using
dedicated lab equipment) of physical phenomena such as electromagnetic radia-
tion [43], power consumption [31,32], or acoustic emanation [22], and software-
based attacks, which do not require additional equipment but rely instead on
the attacker software running on or interacting with the target machine. Exam-
ples of the latter include timing attacks which measure timing variations of
cryptographic operations [7,16,17] and cache attacks which observe cache access
patterns [40,41,49].
In 2005, Percival [41] published a cache attack targeting the OpenSSL [39]
0.9.7c implementation of RSA. In this attack, the attacker and the victim pro-
grams are colocated on the same machine and processor, and thus share the
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 346–367, 2016.
DOI: 10.1007/978-3-662-53140-2_17
same processor cache. The attack exploits the structure of the processor cache by
observing minute timing variations due to cache contention. The cache consists
of fixed-size cache lines. When a program accesses a memory address, the cache-
line-sized block of memory that contains this address is stored in the cache and
is available for future use. The attack traces the changes that the victim program
execution makes in the cache and, from this trace, the attacker is able to recover
the private key used for the decryption.
In order to implement the modular exponentiation routine required for per-
forming RSA public and secret key operations, OpenSSL 0.9.7c uses a sliding-
window exponentiation algorithm [11]. This algorithm precomputes some values,
called multipliers, which are used throughout the exponentiation. The access
pattern to these precomputed multipliers depends on the exponent, which, in
the case of decryption and digital signature operations, should be kept secret.
Because each multiplier occupies a different set of cache lines, Percival [41] was
able to identify the accessed multipliers and from that recover the private key.
To mitigate this attack, Intel implemented a countermeasure that changes the
memory layout of the precomputed multipliers. The countermeasure, often called
scatter-gather, interleaves the multipliers in memory to ensure that the same
cache lines are accessed irrespective of the multiplier used [14]. While this coun-
termeasure ensures that the same cache lines are always accessed, the offsets of
the accessed addresses within these cache lines depend on the multiplier used
and, ultimately, on the private key.
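The layout trick can be sketched in a few lines of Python. This is an illustrative model only, not OpenSSL's actual code; the interleaving factor of 32 multipliers (matching a window size of 5) and the byte-level granularity are assumptions for the sketch:

```python
NUM_MULT = 32  # multipliers interleaved; matches a window size of 5 (assumed)

def scatter(buf, multiplier, k):
    # Byte i of multiplier k goes to offset k + i*NUM_MULT. Byte i of
    # *every* multiplier therefore lands in the same 64-byte cache line,
    # so the set of lines touched is independent of k -- but the offset
    # within each line (the low address bits) still depends on k.
    for i, b in enumerate(multiplier):
        buf[k + i * NUM_MULT] = b

def gather(buf, length, k):
    # Reassemble multiplier k from the interleaved buffer.
    return bytes(buf[k + i * NUM_MULT] for i in range(length))

buf = bytearray(NUM_MULT * 16)
for k in range(NUM_MULT):
    scatter(buf, bytes([k] * 16), k)
assert gather(buf, 16, 7) == bytes([7] * 16)
```

The low address bits that leak through cache-bank conflicts correspond to the `k` term in the offset computation above, which is exactly what CacheBleed recovers.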
Both Bernstein [7] and Osvik et al. [40] have warned that accesses to different
offsets within cache lines may leak information through timing variations due to
cache-bank conflicts. To facilitate concurrent access to the cache, the cache is
often divided into multiple cache banks. Concurrent accesses to different cache
banks can always be handled; however, each cache bank can only handle a limited
number of concurrent requests, often a single request at a time. A cache-bank
conflict occurs when too many requests are made concurrently to the same cache
bank. In the case of a conflict, some of the conflicting requests are delayed.
While timing variations due to cache-bank conflicts are documented in the Intel
Optimization Manual [28], no attack exploiting these has ever been published. In
the absence of a demonstrated risk, Intel continued to contribute code that uses
scatter-gather to OpenSSL [23,24] and to recommend the use of the technique
for side-channel mitigation [12,13]. Consequently, the technique is in widespread
use in the current versions of OpenSSL and its forks, such as LibreSSL [35]
and BoringSSL [10]. It is also used in other cryptographic libraries, such as the
Mozilla Network Security Services (NSS) [38].
2 Background
2.1 OpenSSL’s RSA Implementation
RSA [44] is a public-key cryptosystem which supports both encryption and dig-
ital signatures. To generate an RSA key pair, the user generates two prime
numbers p, q and computes N = pq. Next, given a public exponent e (OpenSSL
uses e = 65537), the user computes the secret exponent d ≡ e^(-1) mod φ(N).
The public key is the integers e and N and the secret key is d and N. In
textbook RSA encryption, a message m is encrypted by computing m^e mod N and
a ciphertext c is decrypted by computing c^d mod N.
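The textbook scheme can be checked with toy parameters (a sketch only: real keys use moduli of 2048 bits or more, and real encryption adds padding; the three-argument modular inverse form of `pow` requires Python 3.8+):

```python
p, q = 61, 53                # toy primes; real RSA uses ~1024-bit primes
N = p * q                    # public modulus (3233)
phi = (p - 1) * (q - 1)      # phi(N) = 3120
e = 65537                    # OpenSSL's default public exponent
d = pow(e, -1, phi)          # secret exponent: d = e^(-1) mod phi(N)

m = 42                       # message, must satisfy 0 <= m < N
c = pow(m, e, N)             # encryption: c = m^e mod N
assert pow(c, d, N) == m     # decryption: m = c^d mod N
```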
being either 0 or an odd number 0 < b_i < 2^w. The algorithm first precomputes
a_1, a_3, ..., a_{2^w - 1} as in the fixed-window case. It then scans the exponent from
the most significant to the least significant digit. For each digit, the algorithm squares
the intermediate result. For a non-zero digit b_i, it also multiplies the intermediate
result by a_{b_i}.
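A minimal left-to-right sliding-window exponentiation might look as follows (an illustrative sketch, not OpenSSL's bignum code):

```python
def sliding_window_modexp(a, e, n, w=5):
    # Precompute the odd multipliers a^1, a^3, ..., a^(2^w - 1) mod n.
    mult = {1: a % n}
    a2 = (a * a) % n
    for i in range(3, 1 << w, 2):
        mult[i] = (mult[i - 2] * a2) % n
    bits = bin(e)[2:]          # exponent, most significant bit first
    r, i = 1, 0
    while i < len(bits):
        if bits[i] == '0':
            r = (r * r) % n    # zero digit: square only
            i += 1
        else:
            # longest window of at most w bits that ends in a 1
            j = min(i + w, len(bits))
            while bits[j - 1] == '0':
                j -= 1
            d = int(bits[i:j], 2)      # odd multiplier index b_i
            for _ in range(j - i):
                r = (r * r) % n        # one squaring per window bit
            r = (r * mult[d]) % n      # multiply by a^(b_i)
            i = j
    return r

assert sliding_window_modexp(7, 65537, 3233) == pow(7, 65537, 3233)
```

For w = 5 this precomputes only the 16 odd multipliers a^1, a^3, ..., a^31, versus 32 multipliers for the fixed-window algorithm.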
The main advantages of the sliding-window algorithm over the fixed-window
algorithm are that, for the same window size, sliding window needs to precom-
pute half the number of multipliers, and that fewer multiplications are required
during the exponentiation. The sliding-window algorithm, however, leaks the
position of the non-zero multipliers to adversaries who can distinguish between
squaring and multiplication operations. Furthermore, the number of squaring
operations between consecutive multipliers may leak the values of some zero
bits. Up to version 0.9.7c, OpenSSL used sliding-window exponentiation. As part
of the mitigation of the Percival [41] cache attack, which exploits these leaks,
OpenSSL changed their implementation to use the fixed-window exponentiation
algorithm.
Since both algorithms precompute a set of multipliers and access them
throughout the exponentiation, a side-channel attack that can discover which
multiplier is used in the multiplication operations can recover the digits b_i and
from them obtain the secret exponent b.
We now turn our attention to the cache hierarchy in modern Intel processors.
The cache is a small, fast memory that exploits the temporal and spatial locality
of memory accesses to bridge the speed gap between the faster CPU and slower
memory. In the processors we are interested in, the cache hierarchy consists of
three levels of caching. The top level, known as the L1 cache, is the closest to the
execution core and is the smallest and the fastest cache. Each successive cache
level is larger and slower than the preceding one, with the last-level cache (LLC)
being the largest and slowest.
Cache Structure. The cache stores fixed-sized chunks of memory called cache
lines. Each cache line holds 64 bytes of data that come from a 64-byte aligned
block in memory. The cache is organized as multiple cache sets, each consisting
of a fixed number of ways. A block of memory can be stored in any of the ways of
a single cache set. For the higher cache levels, the mapping of memory blocks to
cache sets is done by selecting a range of address bits. For the LLC, Intel uses an
undisclosed hash function to map memory blocks to cache sets [30,37,50]. The
L1 cache is divided into two sub-caches: the L1 data cache (L1-D) which caches
the data the program accesses, and the L1 instruction cache (L1-I) which caches
the code the program executes. In multi-core processors, each of the cores has a
dedicated L1 cache. However, multithreaded cores share the L1 cache between
the two threads.
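For concreteness, the address-bit carving can be sketched as follows. The line and L1-set parameters match the Sandy Bridge figures given below; the bank-selection bits are an assumption for illustration (they are not fully documented), chosen so that addresses 64 bytes apart fall into the same bank, as exploited by the attack code in Listing 1:

```python
LINE_BITS = 6            # 64-byte cache lines
L1_SETS = 64             # Sandy Bridge L1-D: 64 sets x 8 ways
BANK_SHIFT, BANK_MASK = 3, 0x7   # assumed: 8 banks selected by bits 3-5

def decompose(addr):
    offset = addr & ((1 << LINE_BITS) - 1)    # byte offset within the line
    l1_set = (addr >> LINE_BITS) % L1_SETS    # L1 set index
    bank = (addr >> BANK_SHIFT) & BANK_MASK   # cache bank (illustrative)
    return l1_set, offset, bank

# Addresses 0x40 apart map to different L1 sets but to the same bank --
# the spacing used by the attack's access sequence.
s0, _, b0 = decompose(0x1000)
s1, _, b1 = decompose(0x1040)
assert s0 != s1 and b0 == b1
```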
Cache Sizes. In the Intel Sandy Bridge microarchitecture, each of the L1-D
and L1-I caches has 64 sets and 8 ways, for a total capacity of 64 · 8 · 64 = 32,768
bytes. The L2 cache has 512 sets and 8 ways, with a size of 256 KiB. The L2
cache is unified, storing both data and instructions. Like the L1 cache, each core
has a dedicated L2 cache. The L3 cache, or the LLC, is shared by all of the cores
of the processor. It has 2,048 sets per core, i.e. the LLC of a four core processor
has 8,192 cache sets. The number of ways varies between processor models and
ranges between 12 and 20. Hence the size of the LLC of a small dual-core
processor is 3 MiB, whereas the LLC of an 8-core processor can be on the order
of 20 MiB. The Intel Xeon E5-2430 processor we used for our experiments is a
6-core processor with a 20-way LLC of size 15 MiB. More recent microarchitectures
support more cores and more ways, yielding significantly larger LLCs.
Cache Lookup Policy. When the processor attempts to access data in mem-
ory, it first looks for the data in the L1 cache. In a cache hit, the data is found
in the cache. Otherwise, in a cache miss, the processor searches for the data in
the next level of the cache hierarchy. By measuring the time to access data, a
process can distinguish cache hits from misses and identify whether the data was
cached prior to the access.
1 https://github.com/openssl/openssl/commit/46a643763de6d8e39ecf6f76fa79b4d04885aa59.
victim accesses data in a monitored cache bank by measuring the delays caused
by contention on the cache bank.
In our attack scenario, we assume that the victim and the attacker run con-
currently on two hyperthreads of the same processor core. Thus, the victim and
the attacker share the L1 data cache. Recall that the Sandy Bridge L1 data cache
is divided into multiple banks and that the banks cannot handle concurrent load
accesses. The attacker issues a large number of load accesses to a cache bank
and measures the time to fulfill these accesses. If during the attack the victim
also accesses the same cache bank, the victim accesses will contend with the
attacker for cache bank access, causing delays in the attack. Hence, when the
victim accesses the monitored cache bank the attack will take longer than when
the victim accesses other cache banks.
To implement CacheBleed we use the code in Listing 1. The bulk of the code
(Lines 4–259) consists of 256 addl instructions that read data from addresses
that are all in the same cache bank. (The cache bank is selected by the low bits
of the memory address in register r9.) We use four different destination registers
to avoid contention on the registers themselves. Before starting the accesses, the
code takes the value of the current cycle counter (Line 1) and stores it in register
r10 (Line 2). After performing 256 accesses, the previously stored value of the
cycle counter is subtracted from the current value, resulting in the number of
cycles that passed during the attack.
  1 rdtscp
  2 movq %rax, %r10
  3
  4 addl 0x000(%r9), %eax
  5 addl 0x040(%r9), %ecx
  6 addl 0x080(%r9), %edx
  7 addl 0x0c0(%r9), %edi
  8 addl 0x100(%r9), %eax
  9 addl 0x140(%r9), %ecx
 10 addl 0x180(%r9), %edx
 11 addl 0x1c0(%r9), %edi
      ...
256 addl 0xf00(%r9), %eax
257 addl 0xf40(%r9), %ecx
258 addl 0xf80(%r9), %edx
259 addl 0xfc0(%r9), %edi
260
261 rdtscp
262 subq %r10, %rax
We run the attack code on an Intel Xeon E5-2430 processor—a six-core Sandy
Bridge processor, with a clock rate of 2.20 GHz. Figure 3 shows the histogram of
the running times of the attack code under several scenarios.2
Scenario 1: Idle. In the first scenario, idle hyperthread, the attacker is the only
program executing on the core. That is, one of the two hyperthreads executes
the attack code while the other hyperthread is idle. As we can see, the attack
takes around 230 cycles, clearly showing that the Intel processor is superscalar
and that the cache can handle more than one access in a CPU cycle.
Scenario 3: Pure Memory. At the other extreme is the pure memory victim,
which continuously accesses the cache bank that the attacker monitors. As we
can see, the attack code takes almost twice as long to run in this scenario. The
distribution of attack times is completely distinct from any of the other scenarios.
Hence identifying the victim in this scenario is trivial. This scenario is, however,
not realistic—programs usually perform some calculation.
Scenarios 4 and 5: Mixed Load. The last two scenarios capture a slightly
more realistic setting. In this case, one in four victim operations is a
memory access, where all of these memory accesses are to the same cache bank.
In this scenario we measure both the case that the victim accesses the monitored
cache line (mixed-load ) and when there is no cache-bank contention between
the victim and the attacker (mixed-load–NC ). We see that the two scenarios
are distinguishable, but there is some overlap between the two distributions.
Consequently, a single measurement may be insufficient to distinguish between
the two scenarios.
In practice, even this mixed-load scenario is not particularly realistic. Typical
programs will access memory in multiple cache banks. Hence the differences
between measurement distributions may be much smaller than those presented
in Fig. 3. In the next section we show how we overcome this limitation and
correctly identify a small bias in the cache-bank access patterns of the victim.
2 For clarity, the presented histograms show the envelope of the measured data.
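The gain from aggregation can be illustrated with a toy simulation (purely synthetic numbers, not measurements): two heavily overlapping timing distributions become cleanly separable once many samples are averaged:

```python
import random

random.seed(1)
# synthetic per-measurement timings with a small, ~2-cycle mean gap
contended = lambda: random.gauss(298.0, 3.0)
uncontended = lambda: random.gauss(296.0, 3.0)

# A single sample is ambiguous because the distributions overlap, but
# the mean of n samples has standard deviation 3/sqrt(n), so the gap
# dominates once n is large.
n = 10000
mean_c = sum(contended() for _ in range(n)) / n
mean_u = sum(uncontended() for _ in range(n)) / n
assert mean_c - mean_u > 1.0
```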
[Fig. 3. Histogram of the attack-code running times (number of cases vs. time in cycles) under the different scenarios, including the idle-hyperthread and pure-compute cases.]
[Figure: Measurement times (cycles, 260–320) for bins 0 and 1 over 100,000 consecutive measurements, covering a full decryption.]
[Figure: Measurement times (cycles) over measurements 1,000–1,500, with the individual multiplications marked.]
The figure clearly shows the two exponentiations executed as part of the
RSA-CRT calculation. Another interesting feature is that the measurements for
the two bins differ by about 4 cycles. The difference is the result of the OpenSSL
modular reduction algorithm, which accesses even bins more often than odd bins.
Consequently, there is more contention on even bins, and measurements on even
bins take slightly longer than those on odd bins.
[Figure: Measurement times (cycles) for bins 1, 3, 5 and 7 over measurements 41,100–41,600 of the trace.]
Identifying Multiplier Values. Note that in the second and fourth multipli-
cations, the measurements in the trace of bin 3 (yellow) take slightly longer than
the measurements of the other bins. This indicates that the three least signifi-
cant digits of the multiplier used in these multiplications are 011. Similarly, the
spike in the green trace observed during the third multiplication indicates that
the three least significant bits of the multiplier used are 001. This corresponds
to the ground truth where the multipliers used in the traced sections are 2, 11,
1, 11.
As we can see, we can extract the multipliers from the trace. However, there
are some practical challenges that complicate both the generation of the traces
and their analysis. We now discuss these issues.
Relative Clock Drift. Aligning the CacheBleed sequences at the start of the
exponentiation does not result in a clean signal. This is because both the victim
and the attacker are user processes, and they may be interrupted by the oper-
ating system. The most common interruption is due to timer interrupts, which
360 Y. Yarom et al.
[Fig. 8. Frequency spectrum (amplitude vs. frequency) of the traces for bins 1 and 2.]
[Figure: Normalised measurement times (cycles) for bins 0–7 over a section of the trace; the recovered multiplier digits (7 7 7 4 0 1 2 4 6 7 5 1 7 3 3) are marked along the axis.]
frequency domain. This effectively subtracts the trace's average from each trace
measurement, thereby bringing all the traces to the same length.
We then find the frequency of multiplications in the trace by looking at the
frequency domain of the trace. Figure 8 shows the frequency spectrum of two of
the traces. For a 4096-bit key, OpenSSL performs two exponentiations with 2048-
bit exponents. With a window size of 5, there are 2048/5 ≈ 410 multiplications.
As we can see, there is a spike around the frequency 410 matching the number
of multiplications. Using the frequency extracted from the trace, rather than the
expected number of multiplications, allows us to better adjust to the effects of
noise at the start and end of the exponentiation which might otherwise result in
a loss of some multiplications.
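The frequency-extraction step can be illustrated with a synthetic trace and a plain DFT (the trace shape is an assumption for the sketch; real traces are far noisier and the analysis would use an FFT):

```python
import cmath, math

N = 512                  # synthetic trace length
MULTS = 32               # multiplications hidden in the trace

# periodic bump at the multiplication rate, plus a weaker harmonic
trace = [math.cos(2 * math.pi * MULTS * i / N)
         + 0.3 * math.cos(2 * math.pi * 2 * MULTS * i / N)
         for i in range(N)]
mean = sum(trace) / N
trace = [x - mean for x in trace]        # drop the DC component

def dft_mag(x, k):
    # magnitude of frequency bin k of signal x
    return abs(sum(v * cmath.exp(-2j * math.pi * k * n / len(x))
                   for n, v in enumerate(x)))

# the dominant spike sits at the number of multiplications
peak = max(range(1, N // 2), key=lambda k: dft_mag(trace, k))
assert peak == MULTS
```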
We manage to recover the three least significant bits of almost all of the
multipliers. Due to noise at the start and the end of the exponentiations, we
miss one or two of the leading and trailing multiplications of each exponentiation.
Next, in Sect. 5, we show that the information we obtain about the three least
significant bits of almost all of the multipliers is enough for key extraction.
Branch and Prune Algorithm. For each candidate kp and kq , we will use
Eq. 1 to iteratively solve for dp and dq starting from the least or most significant
bits, branching to generate multiple potential solutions when bits are unknown
and pruning potential solutions when known bits contradict a given solution. In
contrast to [25], the bits we know are not randomly distributed. Instead, they
are synchronized to the three least significant bits of every five, with one or
two full windows of five missing at the least and most significant positions of
each exponent. This makes our analysis much simpler: when a bit of dp or dq is
unknown at a location i, we branch to generate two new solutions. When a bit of
dp or dq is known at a particular location, using the same heuristic assumption
as in [25], an incorrect solution will fail to match the known bit of dp or dq
with probability 0.5. When kp and kq are correct, we expect our algorithm to
generate four new solutions for every pair of unknown bits, and prune these to a
single correct solution at every string of three known bits. When kp and kq are
incorrect, we expect no solutions to remain after a few steps.
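The paper's algorithm prunes with Eq. 1 on dp and dq. As a self-contained illustration of the same branch-and-prune mechanics with the same known-bit pattern (three of every five bits), the sketch below recovers the factors of a toy N from partial bits of p and q, pruning with the constraint p·q ≡ N (mod 2^(i+1)); this is a simplified stand-in, not the paper's exact procedure:

```python
def branch_and_prune(N, nbits, p_bits, q_bits):
    # p_bits/q_bits map a bit position to its known value (0 or 1)
    sols = [(1, 1)]                                # p, q odd: bit 0 is 1
    for i in range(1, nbits):
        nxt = []
        for p, q in sols:
            for pb in ((p_bits[i],) if i in p_bits else (0, 1)):   # branch
                for qb in ((q_bits[i],) if i in q_bits else (0, 1)):
                    pc, qc = p | pb << i, q | qb << i
                    # prune: the low i+1 bits of p*q must agree with N
                    if (pc * qc - N) % (1 << (i + 1)) == 0:
                        nxt.append((pc, qc))
        sols = nxt
    return [(p, q) for p, q in sols if p * q == N]

# toy 8-bit primes; bits at positions i with i % 5 < 3 are "known",
# mimicking the three-least-significant-of-every-five leak
p, q = 251, 239
known = lambda x: {i: (x >> i) & 1 for i in range(8) if i % 5 < 3}
assert branch_and_prune(p * q, 8, known(p), known(q)) == [(p, q)]
```

With wrong known bits (the analogue of a wrong kp, kq guess) every branch is pruned within a few steps and the solution list comes back empty.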
Empirical Results. We tested key recovery on the output of our attack run
on a 4096-bit key, which correctly recovered the three least significant bits of
every window of five, but missed the two least significant windows and one most
significant window for both dp and dq . We implemented this algorithm in Sage
and ran it on a Cisco UCS Server with two 2.30 GHz Intel E5-2699 processors
and 128 GiB of RAM. For the correct values of kp and kq , our branch-and-
prune implementation recovered the full key in 1 second on a single core after
examining 6,093 candidate partial solutions, and took about 160 ms to eliminate
an incorrect candidate pair of kp and kq after examining an average of 1,500
candidate partial solutions. A full search of all 65,537 candidate pairs of kp
and kq parallelized across 36 hyperthreaded cores took 3.5 min. We assumed
the positions of the missing windows at the most and least significant bits were
known. If the relative positions are unknown, searching over more possible offsets
would increase the total search time by a factor of 9.
6 Mitigation
Countermeasures for the CacheBleed attack can operate at the hardware, the
system or the software level. Hardware-based mitigations include increasing the
bandwidth of the cache banks. Our attack does not work on Haswell processors,
which do not seem to suffer from cache-bank conflicts [20,28]. But, as Haswell
does show timing variations that depend on low address bits [20], it may be
vulnerable to similar attacks. Furthermore, this solution does not apply to the
Sandy Bridge processors currently on the market.
7 Conclusions
In this work, we presented CacheBleed, the first timing attack to recover low
address bits from secret-dependent memory accesses. We demonstrate that the
attack is effective against state-of-the-art cryptographic software, widely thought
to be immune to timing attacks.
The timing variations that underlie this attack and the risk associated with
them have been known for over a decade. Osvik et al. [40] warn that “Cache bank
collisions (e.g., in Athlon 64 processors) likewise cause timing to be affected by
low address bits.” Bernstein [7] mentions that “For example, the Pentium 1 has
similar cache-bank conflicts.” A specific warning about the cache-bank conflicts
and the scatter-gather technique appears in Footnote 38 of Tromer et al. [45].
Our research illustrates the risk to users when cryptographic software devel-
opers dismiss a widely hypothesized potential attack merely because no proof-of-
concept has yet been demonstrated. This is the prevailing approach for security
References
1. Acıiçmez, O.: Yet another microarchitectural attack: exploiting I-cache. In: CSAW,
Fairfax, VA, US (2007)
2. Acıiçmez, O., Gueron, S., Seifert, J.-P.: New branch prediction vulnerabilities in
openSSL and necessary software countermeasures. In: Galbraith, S.D. (ed.) Cryp-
tography and Coding 2007. LNCS, vol. 4887, pp. 185–203. Springer, Heidelberg
(2007)
3. Acıiçmez, O., Koç, Ç.K., Seifert, J.-P.: Predicting secret keys via branch predic-
tion. In: Abe, M. (ed.) CT-RSA 2007. LNCS, vol. 4377, pp. 225–242. Springer,
Heidelberg (2006)
4. Acıiçmez, O., Brumley, B.B., Grabher, P.: New results on instruction cache attacks.
In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 110–124.
Springer, Heidelberg (2010)
5. Acıiçmez, O., Seifert, J.-P.: Cheap hardware parallelism implies cheap security. In:
4th International Workshop on Fault Diagnosis and Tolerance in Cryptography,
Vienna, AT, pp. 80–91 (2007)
6. Alpert, D.B., Choudhury, M.R., Mills, J.D.: Interleaved cache for multiple accesses
per clock cycle in a microprocessor. US Patent 5559986, September 1996
7. Bernstein, D.J.: Cache-timing attacks on AES (2005). Preprint http://cr.yp.to/
papers.html#cachetiming
8. Bernstein, D.J., Schwabe, P.: A word of warning. In: CHES 2013 Rump Session,
August 2013
9. Blömer, J., May, A.: New partial key exposure attacks on RSA. In: Boneh, D. (ed.)
CRYPTO 2003. LNCS, vol. 2729, pp. 27–43. Springer, Heidelberg (2003)
10. BoringSSL. https://boringssl.googlesource.com/boringssl/
11. Bos, J.N.E., Coster, M.J.: Addition chain heuristics. In: Brassard, G. (ed.)
CRYPTO 1989. LNCS, vol. 435, pp. 400–407. Springer, Heidelberg (1990)
12. Brickell, E.: Technologies to improve platform security. In: CHES 2011 Invited Talk,
September 2011. http://www.iacr.org/workshops/ches/ches2011/presentations/
Invited%201/CHES2011_Invited_1.pdf
13. Brickell, E.: The impact of cryptography on platform security. In: CT-
RSA 2012 Invited Talk, February 2012. http://www.rsaconference.com/writable/
presentations/file_upload/cryp-106.pdf
14. Brickell, E., Graunke, G., Seifert, J.-P.: Mitigating cache/timing based side-
channels in AES and RSA software implementations. In: RSA Conference 2006
Session DEV-203, February 2006
15. Brumley, B.B., Hakala, R.M.: Cache-timing template attacks. In: Matsui, M. (ed.)
ASIACRYPT 2009. LNCS, vol. 5912, pp. 667–684. Springer, Heidelberg (2009)
16. Brumley, B.B., Tuveri, N.: Remote timing attacks are still practical. In: Atluri, V.,
Diaz, C. (eds.) ESORICS 2011. LNCS, vol. 6879, pp. 355–371. Springer, Heidelberg
(2011)
17. Brumley, D., Boneh, D.: Remote timing attacks are practical. In: 12th USENIX
Security, Washington, DC, US, pp. 1–14 (2003)
18. Fog, A.: How to optimize for the Pentium processor, August 1996. https://
notendur.hi.is/hh/kennsla/sti/h96/pentopt.txt
19. Fog, A.: How to optimize for the Pentium family of microprocessors, April 2004.
https://cr.yp.to/2005-590/fog.pdf
20. Fog, A.: The microarchitecture of Intel, AMD and VIA CPUs: an optimization
guide for assembly programmers and compiler makers, January 2016. http://www.
agner.org/optimize/microarchitecture.pdf
21. Garner, H.L.: The residue number system. IRE Trans. Electron. Comput. EC-8(2),
140–147 (1959)
22. Genkin, D., Shamir, A., Tromer, E.: RSA key extraction via low-bandwidth
acoustic cryptanalysis. In: Garay, J.A., Gennaro, R. (eds.) CRYPTO 2014, Part I.
LNCS, vol. 8616, pp. 444–461. Springer, Heidelberg (2014)
23. Gopal, V., Guilford, J., Ozturk, E., Feghali, W., Wolrich, G., Dixon, M.: Fast and
constant-time implementation of modular exponentiation. In: Embedded Systems
and Communications Security, Niagara Falls, NY, US (2009)
24. Gueron, S.: Efficient software implementations of modular exponentiation. J.
Crypt. Eng. 2(1), 31–43 (2012)
25. Heninger, N., Shacham, H.: Reconstructing RSA private keys from random key
bits. In: Halevi, S. (ed.) CRYPTO 2009. LNCS, vol. 5677, pp. 1–17. Springer,
Heidelberg (2009)
26. Hu, W.-M.: Reducing timing channels with fuzzy time. In: 1991 IEEE Computer
Society Symposium on Research in Security and Privacy, Oakland, CA, US,
pp. 8–20 (1991)
27. İnci, M.S., Gülmezoğlu, B., Irazoqui, G., Eisenbarth, T., Sunar, B.: Seriously, get
off my cloud! Cross-VM RSA key recovery in a public cloud. IACR Cryptology
ePrint Archive, Report 2015/898, September 2015
28. Intel 64 & IA-32 AORM: Intel 64 and IA-32 Architectures Optimization Reference
Manual. Intel Corporation, April 2012
29. Irazoqui, G., Eisenbarth, T., Sunar, B.: S$A: a shared cache attack that works
across cores and defies VM sandboxing - and its application to AES. In: S&P, San
Jose, CA, US (2015)
30. Irazoqui, G., Eisenbarth, T., Sunar, B.: Systematic reverse engineering of cache
slice selection in Intel processors. IACR Cryptology ePrint Archive, Report
2015/690, July 2015
31. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.)
CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999)
32. Kocher, P., Jaffe, J., Jun, B., Rohatgi, P.: Introduction to differential power analy-
sis. J. Cryptogr. Eng. 1, 5–27 (2011)
Cache Attacks Enable Bulk Key Recovery on the Cloud
1 Motivation
Cloud computing services are more popular than ever thanks to their ease of
access, low cost, and real-time scalability. With the increasing adoption of the
cloud, concerns over cloud-specific attacks have been rising, and so has the
number of research studies exploring potential security risks in the cloud
domain. A main enabler for cloud security research is the seminal work of
Ristenpart et al. [40], which demonstrated the possibility of co-location as well
as the security risks that come with it. Co-location is the result of resource
sharing between tenant Virtual Machines (VMs). Under certain conditions, the
same mechanism can also be
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 368–388, 2016.
DOI: 10.1007/978-3-662-53140-2_18
Our Contribution
Unlike earlier bulk key recovery attacks [10,21], we do not rely on faulty
random number generators but instead exploit hardware-level leakages.
2 Related Work
This work combines techniques needed for co-location in a public cloud with
state-of-the-art techniques in cache-based cross-VM side-channel attacks.
Co-location Detection: In 2009 Ristenpart et al. [40] demonstrated that a
potential attacker has the ability to co-locate and detect co-location in public
IaaS clouds. In 2011, Zhang et al. [47] demonstrated that a tenant can detect
co-location in the same core by monitoring the L2 cache. Shortly after, Bates
et al. [7] implemented a co-location test based on network traffic analysis. In
2015, Zhang et al. [48] demonstrated that de-duplication enables co-location
detection from co-located VMs in PaaS clouds. In a follow-up to Ristenpart et
al.'s work [40], Zhang et al. [44] and Varadarajan et al. [42] explored co-location
detection in commercial public clouds in 2015. Both studies use the memory bus
determined by the cache line size and the number of sets in the cache, respec-
tively. The more sets a cache has, the more bits are needed from the page frame
number to select the set that a memory block occupies in the cache.
The Prime and Probe attack has been widely studied in upper-level
caches [5,49], but was first introduced for the LLC in [14,26] with the use of
hugepages. Unlike regular memory pages, which reveal only 12 bits of the phys-
ical address, hugepages reveal 21 bits, allowing monitoring of the LLC. Also, pro-
filing the LLC, in contrast to the L1 or L2 cache, has various advantages. Firstly,
unlike the upper-level caches, the LLC is shared across cores, providing a cross-
core covert channel. Moreover, the time distinguishability of accesses in upper-
level caches is much lower than that between the LLC and memory. On the
other hand, due to the size of LLCs, we cannot simultaneously profile the whole
cache, but rather a small portion of it at a time. In addition, modern
processors divide their LLC into slices with a non-public hash algorithm, mak-
ing it difficult to predict where the data will be located. Taking all this into
account, the Prime and Probe attack is divided into two main stages:
Prime Stage: The attacker fills a portion of the LLC with his own data and
waits for a period of time for the victim to access the cache.
Probe Stage: The attacker probes (reloads) the primed data. If the victim
accessed the monitored set of the cache, one (or more) of the attacker’s lines will
not reside in the cache anymore, and will have to be retrieved from the memory.
As stated before, profiling a portion of the cache becomes more difficult when
the LLC is divided into slices. However, as observed by [14], we can create an
eviction set without knowing the algorithm implemented. This involves a step
prior to the attack in which the attacker finds the memory blocks colliding in a
specific set/slice. This can be done by creating a large pool of memory blocks
and accessing them until we observe that one of them is fetched from memory.
The procedure will be further explained in Sect. 4. A group of memory blocks
that fill one set/slice in the LLC will form an eviction set for that set/slice.
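The pool-reduction procedure can be simulated against a modeled cache. Everything here (the slice-hash stand-in, the slice count, and the oracle) is an assumption for illustration; a real attack replaces the oracle with the access-time measurement described above:

```python
ASSOC = 20      # LLC associativity of the Xeon E5-2670 v2
NSLICES = 8     # modeled number of LLC slices (illustrative)

def slice_of(addr):
    # stand-in for the processor's undisclosed slice-selection hash
    return ((addr * 2654435761) >> 13) % NSLICES

def evicts(candidates, target):
    # oracle: accessing `candidates` evicts `target` when at least ASSOC
    # of them land in the target's set/slice; in a real attack this is
    # decided by timing a reload of `target`
    t = slice_of(target)
    return sum(1 for a in candidates if slice_of(a) == t) >= ASSOC

def find_eviction_set(pool, target):
    # greedily drop every block that is not needed to keep the conflict
    working = list(pool)
    assert evicts(working, target)
    for a in list(working):
        trial = [b for b in working if b != a]
        if evicts(trial, target):
            working = trial
    return working

target = 12345
es = find_eviction_set(range(1, 1001), target)
assert len(es) == ASSOC
assert all(slice_of(a) == slice_of(target) for a in es)
```

The result is a minimal eviction set: removing any single block from it breaks the conflict with the target.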
with the target. Note that all co-locations were between machines from different
accounts. The experiments did not aim at obtaining co-location with a single
instance, for which the probability of obtaining co-location would be lower.
The LLC is shared across all cores of most modern Intel CPUs, including the
Intel Xeon E5-2670 v2 used (among others) in Amazon EC2. Accesses to LLC
are thus transparent to all VMs co-located on the same machine, making it the
perfect domain for covert communication and co-location detection.
Our LLC test is designed to detect cache lines that are needed to fill a
specific set in the cache. In order to control the location that our data will
occupy in the cache, the test allocates and works with hugepages.1 In normal
operation with moderate noise, the number of lines to fill one set is equal to
LLC associativity, which is 20 in Intel Xeon E5-2670 v2 used in our Amazon
EC2 instances. However, with more than one user trying to fill the same set at
the same time, one will observe that fewer than 20 lines are needed to fill one
set. By running this test concurrently on a co-located VM pair, both controlled
by the same user, it is possible to verify co-location with high certainty. The test
performs the following steps:
The LLC is not the only method we tried in order to verify co-location (see
the extended version of this paper for more information [25]). However,
the experiments show that the LLC test is the only decisive and reliable test
that can detect whether two of our instances are running on the same CPU in
Amazon EC2. We performed the LLC test in two steps as follows:
¹ The co-location test has to be implemented carefully, since the heavy usage of
hugepages may lead to performance degradation. In fact, while trying to achieve
a quadruple co-location, Amazon EC2 stopped our VMs due to performance issues.
For a more detailed explanation, see [25].
374 M.S. İnci et al.
1. Single Instance Elimination: The first step of the LLC test is the elimination
of single instances, i.e., the ones that are not co-located with any other instance
in the pool. To do so, we schedule the LLC test to run on all instances at
the same time. Instances not detecting co-location are retired. For the remaining
ones, the pairs need to be further processed as explained in the next step.
Note that without this preliminary step, one would have to perform n(n−1)/2
pair detection tests to find co-located pairs, i.e., 3160 tests for 80 instances.
This step yielded 22 co-located instances out of 80.
2. Pair Detection: Next we identify pairs for the possibly co-located instances.
The test is performed as a binary search tree where each instance is tested
against all the others for co-location.
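The two steps can be sketched as follows; the callbacks are hypothetical stand-ins for the concurrent LLC fill test, and a quadratic pair search is shown for clarity (the paper organizes it as a binary search):

```python
def eliminate_singles(instances, sees_interference):
    # Step 1: run the LLC test on all instances simultaneously and retire
    # the ones that observe no interference in their monitored set.
    return [v for v in instances if sees_interference(v)]

def detect_pairs(candidates, pair_test):
    # Step 2: confirm concrete pairs among the surviving candidates.
    pairs, pool = [], list(candidates)
    while pool:
        a = pool.pop()
        for b in pool:
            if pair_test(a, b):
                pairs.append((a, b))
                pool.remove(b)
                break
    return pairs

# Simulated ground truth: instances 0/1 and 2/3 are co-located pairs.
truth = [{0, 1}, {2, 3}]
survivors = eliminate_singles(range(8), lambda v: any(v in t for t in truth))
pairs = detect_pairs(survivors, lambda a, b: {a, b} in truth)
```

Step 1 cuts the candidate pool before the expensive pairwise phase, which is what reduces the n(n−1)/2 test count mentioned above.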
Fig. 1. LLC noise over time of day, by day (dotted lines) and on average (bold line).
Fig. 2. Average noise for the first 200 sets in a day. Red lines are the starting points
of pages. Sets 0 and 3 feature the highest amount of noise, with a repeating pattern
every 64 sets (which is the width of a page in the LLC). (Color figure online)
even after a year. If the instances were to reside in dual-socket machines and
the VM processes moved between CPUs, the co-location test would have failed.
However, even in that case, repeated experiments would still reveal co-location
by repeating the test after a period long enough to allow a socket migration.
This fact was used in [14,26] to perform LLC side channel attacks. This
peculiarity is not observed in non-linear slices, i.e., the same b1, b2, ..., bn will only
slice-collide for a small number of sets. The slice-colliding blocks can either be
empirically observed for each set, or guessed if the non-linear slice selection
algorithm is known. Our particular EC2 instance type utilizes an Intel Xeon E5-2670
v2, which features a 25 MB LLC distributed over 10 LLC slices (i.e., not a power
of two). We decided to reverse-engineer the non-linear slice selection algorithm to
speed up our eviction set creation. Note that the approach we
follow can be utilized to reverse-engineer any non-linear slice selection algorithm.
We describe the slice selection algorithm as

H(p) = h3(p) h2(p) h1(p) h0(p)   (1)

where H(p) is the concatenation of four functions, one per bit of the 4 bits
necessary to represent 10 slices. Note that H(p) will output results
from 0000 to 1001 if we label the slices 0–9. Thus, a non-linear function is
needed that excludes the outputs 10–15. Further note that p is the physical address,
represented as a bit string p = p0 p1 . . . p35. In order to recover the
non-linear hash function implemented by the Intel Xeon E5-2670 v2, we use
a fully controlled machine featuring the same Intel Xeon E5-2670 v2 found in
Amazon’s EC2 servers. We first generate ten equation systems (one per slice)
based on slice-colliding addresses, applying the same methodology used
to achieve co-location and generating up to 100,000 additional memory blocks.
Up to this point, one could solve the non-linear function after a re-linearization
step, given sufficiently many equations. Since we were not able to recover enough
addresses (due to RAM limitations), we take a different approach. Figure 3 shows
the distribution of the 100,000 addresses over the 10 slices. Note that 8 slices are
mapped to by 81.25 % of the addresses, while the other 2 slices get only about
18.75 %, i.e., a 3/16 proportion. We will refer to these two slices as the non-linear slices.
We proceed to solve for the first 8 slices and the last 2 slices separately
using linear functions. For each we try to find solutions to the equation systems

Pi · Ĥi = 0̂,   (2)
Pi · Ĥi = 1̂.   (3)

Here Pi is the equation system obtained by arranging the slice-colliding addresses
into matrix form, Ĥi is the matrix containing the slice selection functions, and
0̂ and 1̂ are the all-zero and all-one solutions, respectively. This outputs two sets
of linear solutions, both for the first 8 linear slices and for the last 2 slices.
Given that we can model the slice selection functions separately using linear
functions, and given that the distribution is non-uniform, we model the hash
function as being implemented in two levels. In the first level a non-linear function
chooses between either the 3 linear functions describing the 8 linear slices or
the linear functions describing the 2 non-linear slices. Therefore, we speculate
that the 4 bits selecting the slice look like:

H(p):  h3(p) = nl(p),  h2(p) = ¬(nl(p)) · h2(p),  h1(p) = ¬(nl(p)) · h1(p),  h0(p) = h0(p)
Cache Attacks Enable Bulk Key Recovery on the Cloud 377
Fig. 3. Number of addresses mapped to each slice, out of 100,000. The non-linear
slices receive fewer addresses than the linear ones.
Table 1. Results for the hash selection algorithm implemented by the Intel Xeon
E5-2670 v2
where h0, h1, and h2 are the hash functions selecting bits 0, 1, and 2, respectively,
h3 is the function selecting the 3rd bit, and nl is a non-linear function of
unknown degree. We recall that the proportion of occurrences of the last two
slices is 3/16. To obtain this proportion we need a degree-4 non-linear function
in which two inputs are negated, i.e.:
nl = v0 · v1 · ¬(v2 · v3 ) (4)
where nl is 0 for the 8 linear slices and 1 for the 2 non-linear slices. Observe that
nl will be 1 with probability 3/16 while it will be zero with probability 13/16,
matching the distributions seen in our experiments. Consequently, to find v0
and v1 we only have to solve Eq. (3) for slices 8 and 9 together to obtain a 1
output. To find v2 and v3 , we first separate those addresses where v0 and v1
output 1 for the linear slices 0–7. For those cases, we solve Eq. (3) for slices
0–7. The result is summarized in Table 1. We show both the non-linear function
vectors v0 , v1 , v2 , v3 and the linear functions h0 , h1 , h2 . These results describe the
behavior of the slice selection algorithm implemented in the Intel Xeon E5-2670
v2. With this result, we can now easily predict the slice selection on the target
processor in the EC2 cloud.
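The recovered two-level structure can be emulated with parity functions over the physical address bits. The masks below are random placeholders rather than the recovered vectors of Table 1, but they reproduce the 13/16 vs. 3/16 split between linear and non-linear slices:

```python
import random

def parity(x):
    return bin(x).count("1") & 1

rnd = random.Random(1)
v = [rnd.getrandbits(36) | 1 for _ in range(4)]  # placeholders for v0..v3
h = [rnd.getrandbits(36) | 1 for _ in range(3)]  # placeholders for h0..h2

def H(p):
    # First level: nl selects the two non-linear slices with probability 3/16.
    nl = parity(p & v[0]) & parity(p & v[1]) & (1 - (parity(p & v[2]) & parity(p & v[3])))
    if nl:
        return 8 | parity(p & h[0])              # slices 8 and 9
    # Second level: three linear parity bits select one of the 8 linear slices.
    return (parity(p & h[2]) << 2) | (parity(p & h[1]) << 1) | parity(p & h[0])

addrs = [rnd.getrandbits(36) for _ in range(100000)]
slices = [H(p) for p in addrs]
frac_nl = sum(s >= 8 for s in slices) / len(slices)
```

Over random addresses the two non-linear slices receive about 3/16 of the mappings, matching the distribution of Fig. 3.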
– We make use of the fact that the offset of the address of each table position
entry does not change when a new decryption process is executed. Therefore,
we only need to monitor a subsection of all possible sets, yielding a lower
number of traces.
– Instead of monitoring both the multiplication and the table entry set (as
in [14] for El-Gamal), we only monitor a table entry set in one slice. This
avoids the step in which the attacker has to locate the multiplication set, and
avoids an additional source of noise.
when we run repeated decryptions. Thanks to the knowledge of the non-linear
slice selection algorithm, we can easily change our monitored set/slice if we see a
high amount of noise in one particular set/slice. Since we also have to monitor a
different set per table entry, it also helps us to change our eviction set accordingly.
The threshold is different for each of the sets, since the time to access different
slices usually varies. Thus, the threshold for each of the sets has to be calculated
before the monitoring phase. To improve the applicability of the attack,
the LLC can first be monitored to detect whether RSA decryptions are taking place
in the co-located VMs, as proposed in [24]. Once RSA decryptions have been
confirmed, the attack can be performed.
In order to obtain high quality timing leakage, we synchronize the spy process
and the RSA decryption by initiating a communication between the victim and
attacker, e.g. by sending a TLS request. Note that we are looking for a particular
pattern observed for the RSA table entry multiplications, and therefore processes
scheduled before the RSA decryption will not be counted as valid traces. In
short, the attacker will communicate with the victim before the decryption.
After this initial communication, the victim will start the decryption while the
attacker starts monitoring the cache usage. In this way, we monitor 4,000 RSA
decryptions with the same key and same ciphertext for each of the 16 different
sets related to the 16 table entries.
We investigate a hypothetical case where a system with dual CPU sockets
is used. In such a system, depending on the hypervisor's CPU management, two
scenarios can play out: processes moving between sockets, or processes assigned
to specific CPUs. In the former scenario, we can observe the necessary number
of decryption samples simply by waiting over a longer period of time. In this
scenario, the attacker would collect traces and only use the information obtained
during the times the attacker and the victim share sockets and discard the rest
as missed traces. In the latter scenario, once the attacker achieves co-location,
as we have in Amazon EC2, the attacker will always run on the same CPU as
the target, and hence the attack will succeed in a shorter span of time.
Once the online phase of the attack has been performed, we proceed to analyze
the leakage observed. There are three main steps to process the obtained data.
The first step is to identify the traces that contain information about the key.
Then we need to synchronize and correct the misalignment observed in the chosen
traces. The last step is to eliminate the noise and combine different graphs
to recover the usage of the multiplication entries. Among the 4,000 observations
for each monitored set, only a small portion contains information about the
multiplication operations with the corresponding table entry. These are recognized
because their exponentiation trace pattern differs from that of unrelated sets. In
order to identify where each exponentiation occurs, we inspected 100 traces and
created the timeline shown in Fig. 4(b). It can be observed that the first
exponentiation starts after 37 % of the overall decryption time. Note that among
[Fig. 4 plots: reload time vs. timeslot; annotations mark the decryption start, the first secret exponent (dp), and the second secret exponent (dq).]
Fig. 4. Different sets of data where we find (a) trace that does not contain information
(b) trace that contains information about the key
Fig. 5. 10 traces from the same set where (a) they are divided into blocks for a
correlation alignment process (b) they have been aligned and the peaks can be extracted
all the traces recovered, only those that have more than 20 and fewer than 100
peaks are considered; the rest are discarded as noise. Figure 4 shows
measurements where no correct pattern was detected (Fig. 4(a)), and where a
correct pattern was measured (Fig. 4(b)).
In general, after the elimination step, there are 8–12 correct traces left per
set. We observe that the data obtained from each of these sets corresponds to 2
consecutive table positions. This is a direct result of CPU cache prefetching:
when a cache line that holds a table position is loaded into the cache, the
neighboring table position is also loaded due to the cache locality principle.
For each graph to be processed, we first need to align the creation of the
look-up table with the traces. Identifying the table creation step is trivial since
each table position is used twice, taking two or more time slots. Figure 5(a)
shows the table access position indexes aligned with the table creation. In the
figure, the top graph shows the true table accesses while the rest of the graphs
show the measured data. It can be observed that the measured traces suffer from
misalignment due to noise from various sources, e.g., RSA or co-located neighbors.
To fix the misalignment, we take the most common peaks as reference and apply
a correlation step. To increase efficiency, the graphs are divided into blocks
and processed separately, as seen in Fig. 5(a). At the same time, Gaussian
filtering is applied to the peaks. In our filter, the variance of the distribution is 1 and the
Fig. 6. Eliminating false detections using a threshold (red dashed line) on the combined
detection graph. (Color figure online)
Fig. 7. Comparison of the final obtained peaks with the correct peaks, with adjusted
timeslot resolution
mean is aligned to the peak position. Then, for each block, the cross-correlation is
calculated with respect to the most common hit graph, i.e., the intersection set of
all graphs. After that, all graphs are shifted to the position where they have the
highest correlation and aligned with each other. After the cross-correlation
calculation and the alignment, the common patterns are observable, as in Fig. 5(b).
Observe that the alignment step successfully aligns measured graphs with the
true access graph at the top, leaving only the combining and the noise removal
steps. We combine the graphs by simple averaging and obtain a single combined
graph.
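The block-wise alignment amounts to sliding each measured trace against a reference and keeping the shift with the highest correlation; a minimal sketch on synthetic peak trains:

```python
def best_shift(trace, reference, max_shift=50):
    # Return the shift of `trace` relative to `reference` that maximizes a
    # dot-product correlation (a stand-in for the cross-correlation step).
    def corr(s):
        return sum(reference[i] * trace[i - s]
                   for i in range(len(reference)) if 0 <= i - s < len(trace))
    return max(range(-max_shift, max_shift + 1), key=corr)

reference = [0.0] * 100
for peak in (10, 30, 70):            # true table-access peaks
    reference[peak] = 1.0
trace = [0.0] * 100                  # measured copy, delayed by 5 slots
for peak in (15, 35, 75):
    trace[peak] = 1.0

shift = best_shift(trace, reference)             # -5: the trace lags by 5
aligned = trace[-shift:] + [0.0] * (-shift)      # undo the delay (shift < 0 here)
```

Each block is shifted by its own best correlation offset before the graphs are averaged, which is why the common peaks line up in Fig. 5(b).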
In order to get rid of the noise in the combined graph, we applied a threshold
filter, as can be seen in Fig. 6. We used 35 % of the maximum peak value observed
in the graphs as the threshold value. Note that a simple threshold was sufficient to
remove noise terms, since they are not common between graphs.
Next, we convert the scaled time slots of the filtered graph to real time slot indexes.
We do so by dividing them by the spy process resolution ratio, obtaining
Fig. 7. In the figure, the top and the bottom graphs represent the true access
indexes and the measured graph, respectively. Also, note that even if additional
noise peaks are observed in the obtained graph, it is very unlikely that two
graphs monitoring consecutive table positions have noise peaks at the same
time slot. Therefore, we can filter out the noise stemming from the prefetching
by combining two graphs that belong to consecutive table positions. Thus, the
resulting indexes are the corresponding timing slots for look-up table positions.
The very last step of the leakage analysis is finding the intersections of two
graphs that monitor consecutive sets. By doing so, we obtain accesses to a single
[Fig. 8: accesses to a single table position over the normalized timeslots.]
table position, as seen in Fig. 8, with high accuracy. At the same time, we have
a total of three positions in two graphs; therefore, we also get the positions of
the neighbors. A summary of the result of the leakage analysis is presented in
Table 2. We observe that more than 92 % of the recovered peaks are in the correct
position. However, note that by combining two different sets, the wrong peaks
will disappear with high probability, since the chance of having wrong peaks in
the same time slot in two different sets is very low.
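The last two filtering steps, the 35 % threshold on the combined graph and the intersection of two sets monitoring consecutive table positions, can be sketched as:

```python
def threshold_peaks(combined, frac=0.35):
    # Keep time slots whose value exceeds 35 % of the maximum peak.
    cut = frac * max(combined)
    return {i for i, x in enumerate(combined) if x > cut}

def shared_position(peaks_a, peaks_b):
    # Peaks present in both graphs belong to the table position the two
    # monitored sets share; noise peaks rarely coincide in the same slot.
    return sorted(peaks_a & peaks_b)

graph_a = [0, 9, 0, 1, 8, 0, 2, 7, 0, 1]   # toy combined graphs
graph_b = [0, 9, 0, 0, 8, 1, 0, 0, 6, 0]
common = shared_position(threshold_peaks(graph_a), threshold_peaks(graph_b))
```

Here slots 1 and 4 survive both graphs, while the uncorrelated noise peaks (slot 7 in one graph, slot 8 in the other) are discarded.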
We divide this section into two different scenarios, i.e., the scenario where the
identity and public key of the target are known (targeted co-location) and the
scenario where we have no information about the public key (bulk key recovery).
In cases where the noise on dp and dq is too high for a direct recovery with
the above-mentioned method, their relation to the known public key can be
exploited if the used public exponent e is small [20].
Almost all RSA implementations currently use e = 2^16 + 1 due to the heavy
performance boost over a random, full-size e. For CRT exponents it holds
that e·dp ≡ 1 mod (p − 1) and hence e·dp = kp(p − 1) + 1 for some 1 ≤ kp < e,
and similarly for dq, yielding kp·p = e·dp + kp − 1 and kq·q = e·dq + kq − 1.
This means we have a simple technique to check the correctness of the least-significant
t bits of dp, dq for a choice of kp. We can
– Check parts of dp and dq by verifying if the test δ(dp (t), dq (t), t) = 0 holds
for t ∈ [1, log(p)].
– Fix alignment and minor errors by shifting and varying dp(t) and dq(t),
and then sieving the working cases by checking whether δ(dp(t), dq(t), t) = 0.
– Recover parts of dq given dp (and vice versa) by solving the error equation
δ(dp (t), dq (t), t) = 0 in case the data is missing or too noisy to correct.
Note that the algorithm may need to try all 2^16 values of kp in a loop.
Further, in the last case, where we recover a missing data part using the checking
equation, we need to speculatively continue the iteration for a few more steps.
If we observe too many mistakes, we may terminate the execution thread early,
without reaching the end of dp and dq.
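A toy instantiation with small primes, under the hedged assumption that δ compares kp·kq·N against (e·dp + kp − 1)(e·dq + kq − 1) modulo 2^t (the exact form used in [20] may differ); it requires Python 3.8+ for the modular inverse:

```python
# Toy CRT-RSA parameters (illustrative primes, not a real key).
p, q, e = 104723, 104729, 65537
N = p * q
dp = pow(e, -1, p - 1)               # CRT exponent d mod (p - 1)
dq = pow(e, -1, q - 1)               # CRT exponent d mod (q - 1)
kp = (e * dp - 1) // (p - 1)
kq = (e * dq - 1) // (q - 1)

def delta(dp_t, dq_t, kp, kq, t):
    # From kp*p = e*dp + kp - 1 and kq*q = e*dq + kq - 1, multiplying gives
    # kp*kq*N = (e*dp + kp - 1)(e*dq + kq - 1), which can be checked on the
    # least-significant t bits only.
    m = 1 << t
    return (kp * kq * N - (e * dp_t + kp - 1) * (e * dq_t + kq - 1)) % m

t = 16
ok = delta(dp % (1 << t), dq % (1 << t), kp, kq, t)

import random
rnd = random.Random(0)
rejected = sum(delta(rnd.getrandbits(t), rnd.getrandbits(t), kp, kq, t) != 0
               for _ in range(50))
```

The correct low bits give δ = 0 while random guesses are almost always rejected, which is what makes sieving over the kp candidates effective.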
To see how this approach can be adapted to our setting, we need to consider
the error distribution observed in dp and dq as recovered via the cache timing.
Furthermore, since the sliding window algorithm was used in the RSA
exponentiation operation, we are dealing with variable-size (1–5 bit) windows with
contents wp, wq and window positions ip, iq for dp and dq, respectively.
The windows are separated by 0 strings. We observed:
– The window wp contents for dp had no errors and were in the correct order.
There were slight misalignments in the window positions ip with extra or
missing zeros in between.
– In contrast, dq had not only alignment problems but also a few windows with
incorrect content, extra windows, and missing windows (overwritten by zeros).
The missing windows were detectable since we do not expect unusually long
zero strings in a random dq .
– Since the iterations proceed from the most significant windows to the least,
we observed more errors towards the least significant words, especially in dq.
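For reference, a left-to-right sliding-window recoding that produces such window contents and positions can be sketched as follows (illustrative only, not Libgcrypt's actual code):

```python
def sliding_windows(d, wmax=5):
    # Scan the exponent bits MSB to LSB and emit (start_bit, window_bits)
    # pairs: odd windows of at most `wmax` bits separated by runs of zeros,
    # i.e. the (ip, wp) / (iq, wq) data the attack recovers.
    bits = bin(d)[2:]
    windows, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            i += 1
            continue
        j = min(i + wmax, len(bits))
        while bits[j - 1] == "0":        # a window must end in 1 (odd value)
            j -= 1
        windows.append((i, bits[i:j]))
        i = j
    return windows

windows = sliding_windows(0b110000110101)
```

Missing windows then show up as implausibly long zero runs, which is how the overwritten windows in dq can be detected.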
In this scenario, the attacker spins up multiple instances and monitors the LLC,
looking for RSA leakage in all of them. If viable leakages are observed, the
attacker might not know the corresponding public key. However, she can build
up a database of public keys by mapping the entire IP range of the targeted
Amazon EC2 region and retrieving the public keys of all hosts that have the TLS
port open. The attacker then runs the above-described algorithm for each of
the recovered private keys against the entire public key database. Having the list
of 'neighboring' IPs with an open TLS port also allows the attacker to initiate
TLS handshakes to make the servers use their private keys with high frequency.
In the South America Amazon EC2 region, we found 36,000+ IP
addresses with the TLS port open (shown in more detail in [25]) using nmap.
With a public key database of that size, our algorithm takes between less than
a second (for noise-free dp) and 30 CPU hours (for noisy dp) to check each private
key against the public key database. This approach recovers the public/private key
pair and, consequently, the identity of the key owner.
9 Countermeasures
10 Conclusion
References
1. Fix Flush and Reload in RSA. https://lists.gnupg.org/pipermail/gnupg-announce/
2013q3/000329.html
2. Intel Xeon 2670–v2. http://ark.intel.com/es/products/75275/
Intel-Xeon-Processor-E5-2670-v2-25M-Cache-2 50-GHz
3. OpenSSL fix flush and reload ECDSA nonces. https://git.openssl.org/gitweb/?
p=openssl.git;a=commitdiff;h=2198be3483259de374f91e57d247d0fc667aef29
4. Transparent Page Sharing: Additional management capabilities and new default
settings. http://blogs.vmware.com/security/vmware-security-response-center/
page/2
5. Acıiçmez, O.: Yet another microarchitectural attack: exploiting I-cache. In: Pro-
ceedings of the 2007 ACM Workshop on Computer Security Architecture
6. Acıiçmez, O., Koç, Ç.K., Seifert, J.-P.: Predicting secret keys via branch predic-
tion. In: Abe, M. (ed.) CT-RSA 2007. LNCS, vol. 4377, pp. 225–242. Springer,
Heidelberg (2006)
7. Bates, A., Mood, B., Pletcher, J., Pruse, H., Valafar, M., Butler, K.: Detecting
co-residency with active traffic analysis techniques. In: Proceedings of the 2012
ACM Workshop on Cloud Computing Security Workshop
8. Benger, N., van de Pol, J., Smart, N.P., Yarom, Y.: “Ooh Aah... Just a Little
Bit”: a small amount of side channel can go a long way. In: Batina, L., Robshaw, M.
(eds.) CHES 2014. LNCS, vol. 8731, pp. 75–92. Springer, Heidelberg (2014)
9. Bernstein, D.J.: Cache-timing attacks on AES (2004). http://cr.yp.to/papers.
html#cachetiming
10. Bernstein, D.J., Chang, Y.-A., Cheng, C.-M., Chou, L.-P., Heninger, N., Lange,
T., van Someren, N.: Factoring RSA keys from certified smart cards: coppersmith
in the wild. In: Sako, K., Sarkar, P. (eds.) ASIACRYPT 2013, Part II. LNCS, vol.
8270, pp. 341–360. Springer, Heidelberg (2013)
11. Bhattacharya, S., Mukhopadhyay, D.: Who watches the watchmen?: utilizing per-
formance monitors for compromising keys of RSA on Intel platforms. In: Güneysu,
T., Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 248–266. Springer,
Heidelberg (2015)
12. Brumley, D., Boneh, D.: Remote timing attacks are practical. In: Proceedings of
the 12th USENIX Security Symposium, pp. 1–14 (2003)
13. Campagna, M.J., Sethi, A.: Key recovery method for CRT implementation of RSA.
Cryptology ePrint Archive, Report 2004/147. http://eprint.iacr.org/
14. Liu, F., Yarom, Y., Ge, Q., Heiser, G., Lee, R.B.: Last-level cache side channel
attacks are practical. In: 36th IEEE Symposium on Security and Privacy, S&P (2015)
15. Gandolfi, K., Mourtel, C., Olivier, F.: Electromagnetic analysis: concrete results.
In: Koç, Ç.K., Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp.
251–261. Springer, Heidelberg (2001)
16. Genkin, D., Pachmanov, L., Pipman, I., Tromer, E.: Stealing keys from PCs using
a radio: cheap electromagnetic attacks on windowed exponentiation. In: Güneysu,
T., Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 207–228. Springer,
Heidelberg (2015)
17. Genkin, D., Shamir, A., Tromer, E.: RSA key extraction via low-bandwidth
acoustic cryptanalysis. In: Garay, J.A., Gennaro, R. (eds.) CRYPTO 2014, Part I.
LNCS, vol. 8616, pp. 444–461. Springer, Heidelberg (2014)
18. Gruss, D., Spreitzer, R., Mangard, S.: Cache template attacks: automating attacks
on inclusive last-level caches. In: 24th USENIX Security Symposium, pp. 897–912.
USENIX Association (2015)
19. Gullasch, D., Bangerter, E., Krenn, S.: Cache games - bringing access-based cache
attacks on AES to practice. In: SP 2011, pp. 490–505
20. Hamburg, M.: Bit level error correction algorithm for RSA keys. Personal Com-
munication. Cryptography Research Inc. (2013)
21. Heninger, N., Durumeric, Z., Wustrow, E., Halderman, J.A.: Mining your Ps and
Qs: detection of widespread weak keys in network devices. In: Presented as Part
of the 21st USENIX Security Symposium (USENIX Security 2012), Bellevue, WA.
USENIX, pp. 205–220 (2012)
22. Hu, W.-M.: Lattice scheduling and covert channels. In: Proceedings of the 1992
IEEE Symposium on Security and Privacy
23. Hund, R., Willems, C., Holz, T.: Practical timing side channel attacks against
kernel space ASLR. In: Proceedings of the 2013 IEEE Symposium on Security and
Privacy, pp. 191–205
24. İnci, M.S., Gülmezoglu, B., Eisenbarth, T., Sunar, B.: Co-location detection on
the cloud. In: COSADE (2016)
25. İnci, M.S., Gülmezoglu, B., Irazoqui, G., Eisenbarth, T., Sunar, B.: Cache attacks
enable bulk key recovery on the cloud (extended version) (2016). http://v.wpi.
edu/wp-content/uploads/Papers/Publications/bulk extended.pdf
26. Irazoqui, G., Eisenbarth, T., Sunar, B.: S$A: a shared cache attack that works
across cores and defies VM sandboxing and its application to AES. In: 36th IEEE
Symposium on Security and Privacy, S&P (2015)
27. Irazoqui, G., Eisenbarth, T., Sunar, B.: Systematic reverse engineering of cache
slice selection in Intel processors. In: Euromicro DSD (2015)
28. Irazoqui, G., Eisenbarth, T., Sunar, B.: Cross processor cache attacks. In: Proceed-
ings of the 11th ACM Symposium on Information, Computer and Communications
Security, ASIA CCS 2016. ACM (2016)
29. Irazoqui, G., İnci, M.S., Eisenbarth, T., Sunar, B.: Know thy neighbor: crypto
library detection in cloud. Proc. Priv. Enhancing Technol. 1(1), 25–40 (2015)
30. Irazoqui, G., İnci, M.S., Eisenbarth, T., Sunar, B.: Wait a minute! A fast, cross-VM
attack on AES. In: RAID, pp. 299–319 (2014)
31. Irazoqui, G., İnci, M.S., Eisenbarth, T., Sunar, B.: Lucky 13 strikes back. In:
Proceedings of the 10th ACM Symposium on Information, Computer and
Communications Security, ASIA CCS 2015, pp. 85–96 (2015)
32. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.)
CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999)
33. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp.
104–113. Springer, Heidelberg (1996)
34. Libgcrypt: The Libgcrypt reference manual. http://www.gnupg.org/
documentation/manuals/gcrypt/
35. Lipp, M., Gruss, D., Spreitzer, R., Mangard, S.: ARMageddon: last-level cache
attacks on mobile devices. CoRR abs/1511.04897 (2015)
36. Maurice, C., Scouarnec, N.L., Neumann, C., Heen, O., Francillon, A.: Reverse
engineering intel last-level cache complex addressing using performance counters.
In: RAID 2015 (2015)
37. Oren, Y., Kemerlis, V.P., Sethumadhavan, S., Keromytis, A.D.: The spy in the
sandbox: practical cache attacks in JavaScript and their implications. In:
Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications
Security, CCS 2015, New York, NY, USA, pp. 1406–1418. ACM (2015)
38. Osvik, D.A., Shamir, A., Tromer, E.: Cache attacks and countermeasures: the case
of AES. In: Proceedings of the 2006 Cryptographers' Track at the RSA Conference
on Topics in Cryptology, CT-RSA 2006
39. Page, D.: Theoretical use of cache memory as a cryptanalytic side-channel (2002)
40. Ristenpart, T., Tromer, E., Shacham, H., Savage, S.: Hey you, get off of my cloud:
exploring information leakage in third-party compute clouds. In: Proceedings of
the 16th ACM Conference on Computer and Communications Security, CCS 2009,
pp. 199–212
41. Suzaki, K., Iijima, K., Toshiki, Y., Artho, C.: Implementation of a memory disclo-
sure attack on memory deduplication of virtual machines. IEICE Trans. Fundam.
Electron., Commun. Comput. Sci. 96, 215–224 (2013)
42. Varadarajan, V., Zhang, Y., Ristenpart, T., Swift, M.: A placement vulnerabil-
ity study in multi-tenant public clouds. In: 24th USENIX Security Symposium
(USENIX Security 2015), Washington, D.C., August 2015, pp. 913–928. USENIX
Association
43. Wu, Z., Xu, Z., Wang, H.: Whispers in the hyper-space: high-speed covert channel
attacks in the cloud. In: USENIX Security Symposium, pp. 159–173 (2012)
44. Xu, Z., Wang, H., Wu, Z.: A measurement study on co-residence threat inside the
cloud. In: 24th USENIX Security Symposium (USENIX Security 2015), Washing-
ton, D.C., August 2015, pp. 929–944. USENIX Association
45. Yarom, Y., Falkner, K.: FLUSH+RELOAD: a high resolution, low noise, L3 cache
side-channel attack. In: 23rd USENIX Security Symposium (USENIX Security
2014), pp. 719–732
46. Yarom, Y., Ge, Q., Liu, F., Lee, R.B., Heiser, G.: Mapping the Intel last-level
cache. Cryptology ePrint Archive, Report 2015/905 (2015). http://eprint.iacr.org/
47. Zhang, Y., Juels, A., Oprea, A., Reiter, M.K.: HomeAlone: co-residency detection
in the cloud via side-channel analysis. In: Proceedings of the 2011 IEEE Symposium
on Security and Privacy
48. Zhang, Y., Juels, A., Reiter, M.K., Ristenpart, T.: Cross-tenant side-channel
attacks in PaaS clouds. In: Proceedings of the 2014 ACM SIGSAC Conference
on Computer and Communications Security
49. Zhang, Y., Juels, A., Reiter, M.K., Ristenpart, T.: Cross-VM side channels and
their use to extract private keys. In: Proceedings of the 2012 ACM Conference on
Computer and Communications Security
Physical Unclonable Functions
Strong Machine Learning Attack Against PUFs
with No Mathematical Model
1 Introduction
Nowadays, it is broadly accepted that Integrated Circuits (ICs) are subject to
overbuilding and piracy due to the adoption of authentication methods relying on
insecure key storage techniques [24]. To overcome the problem of secure key
storage, Physically Unclonable Functions (PUFs) have been introduced as promising
solutions [15,30]. For PUFs, manufacturing process variations eventually lead
to instance-specific, inherent physical properties that can generate virtually
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 391–411, 2016.
DOI: 10.1007/978-3-662-53140-2 19
392 F. Ganji et al.
unique responses when the instance is given some challenges. Therefore, PUFs can
be utilized either as device fingerprints for secure authentication or as a source of
entropy in secure key generation scenarios. In this case, there is no need for
permanent key storage, since the desired key is generated instantly upon powering
up the device. Owing to their instance-specific, inherent physical properties,
PUFs are assumed to be unclonable and unpredictable, and therefore
trustworthy and robust against attacks [26]. However, more than a decade after
the invention of PUFs, the design of a truly unclonable physical function is still
a challenging task. Most of the security schemes relying on the notion of PUFs
are designed based on a “design-break-patch” rule, instead of a thorough
cryptographic approach.
Along with the construction of a wide variety of PUFs, several different types
of attacks, ranging from non-invasive to semi-invasive attacks [18,19,33,39],
have been launched on these primitives. Machine learning (ML) attacks are
one of the most common types of non-invasive attacks against PUFs, whose
popularity stems from their characteristics, namely being cost-effective and non-
destructive. Moreover, these attacks require the adversary only to observe the
input-output (i.e., so-called challenge-response) behavior of the targeted PUF.
In this attack scenario, a relatively small subset of challenges along with their
respective responses is collected by the adversary, attempting to come up with
a model describing the challenge-response behavior of the PUF. In addition to
heuristic learning techniques, e.g., what has been proposed in [33,34], the authors
of [12–14] have proposed the probably approximately correct (PAC) learning framework to ensure the delivery of a model for prespecified levels of accuracy and confidence. One of the key results reported in [12–14] is that knowing the mathematical model of the PUF functionality enables the adversary to establish a proper hypothesis representation (i.e., a mathematical model of the PUF), and then try to PAC learn this representation. This gives rise to the question of whether a PUF can be PAC learned without prior knowledge of a precise mathematical model of the PUF.
Bistable Ring PUFs (BR-PUFs) [7] and Twisted Bistable Ring PUFs (TBR-PUFs) [37] are examples of PUFs whose functionality cannot be easily translated into a precise mathematical model. In an attempt to do so, the authors of [37,41] suggested simplified mathematical models for BR-PUFs and TBR-PUFs. However, their models do not precisely reflect the physical behavior of these architectures.
In this paper, we present a sound mathematical machine learning framework, which enables us to PAC learn the BR-PUF family (i.e., including BR- and TBR-PUFs) without knowing their precise mathematical model. In particular, our framework contributes the following novel aspects related to the security assessment of PUFs in general:
Exploring the inherent mathematical properties of PUFs. One of the most natural and commonly accepted mathematical representations of a PUF is a Boolean function. This representation enables us to investigate properties of PUFs that are observed in practice, although they have not yet been precisely and mathematically described. One of these properties exhaustively studied in our paper is related to the "silent" assumption that each and every bit of a challenge has equal influence on the respective response of a PUF. We prove that this assumption is invalid for all PUFs. While this phenomenon has already been occasionally observed in practice and is most often attributed to implementation imperfections, we give a rigorous mathematical proof of the existence of influential bit positions, which holds for every PUF.
Strong ML attacks against PUFs without available mathematical
model. We prove that even in a worst-case scenario, where the internal functionality of the BR-PUF family cannot be mathematically modeled, the challenge-response behavior of these PUFs can be PAC learned for given levels of accuracy and confidence.
Evaluation of the applicability of our framework in practice. In order
to evaluate the effectiveness of our theoretical framework, we conduct extensive
experiments on BR-PUFs and TBR-PUFs, implemented on a commonly used
Field Programmable Gate Array (FPGA).
2.1 PUFs
Note that elaborate and formal definitions as well as formalizations of PUFs are beyond the scope of this paper; for more details on them we refer the reader to [3,4]. In general, PUFs are physical input-to-output mappings, which map given challenges to responses. Intrinsic properties of the physical primitive embodying the PUF determine the characteristics of this mapping. Two main classes of PUFs, namely strong PUFs and weak PUFs, have been discussed in the literature [16]. In this paper we consider strong PUFs, briefly called PUFs.
Here we focus only on two characteristics of PUFs, namely unclonability and unpredictability (i.e., so-called unforgeability). Let a PUF be described by the mapping fPUF : C → Y, where fPUF(c) = y. In this paper, we assume that the issue with noisy responses (i.e., the output is not stable for a given input) has been resolved by the PUF manufacturer. For an ideal PUF, unclonability means that for a given PUF fPUF it is virtually impossible to create another physical mapping gPUF ≠ fPUF, whose challenge-response behavior is similar to fPUF [3].
Moreover, an ideal PUF is unpredictable. This property of PUFs is closely related to the notion of learnability. More precisely, given a single PUF fPUF and a set of challenge-response pairs (CRPs) U = {(c, y) | y = fPUF(c) and c ∈ C}, it is (almost) impossible to predict y′ = fPUF(c′), where c′ is a random challenge such that (c′, ·) ∉ U. In this paper we stick to this (simple, but) classical definition of unpredictability of a PUF, and refer the reader to [3,4] for more refined definitions.
Defining PUFs as mappings (see Sect. 2.1), the most natural mathematical model for them is a Boolean function over the finite field F2. Let Vn = {c1, c2, . . . , cn} denote the set of Boolean attributes or variables, where each attribute can be true or false, commonly denoted by "1" and "0", respectively. In addition, Cn = {0, 1}^n contains all binary strings with n bits. We associate each Boolean attribute ci with two literals, i.e., ci and c̄i (the complement of ci). An assignment is a mapping from Vn to {0, 1}, i.e., a mapping from each Boolean attribute to either "0" or "1". In other words, an assignment is an n-bit string, where the ith bit of this string indicates the value of ci (i.e., "0" or "1").
An assignment is mapped by a Boolean formula into the set {0, 1}. Thus, each Boolean attribute can also be thought of as a formula, i.e., ci and c̄i are two possible formulas. If by evaluating a Boolean formula under an assignment we obtain "1", the assignment is called a positive example of the "concept represented by the formula", and otherwise a negative example. Each Boolean formula defines a respective Boolean function f : Cn → {0, 1}. The conjunction of Boolean attributes (i.e., a Boolean formula) is called a term, and it can be true or false ("1" or "0") depending on the values of its Boolean attributes. Similarly, a clause, i.e., the disjunction of Boolean attributes, can be defined. The number of literals forming a term or a clause is called its size. Size 0 is associated only with the term true and the clause false.
In the related literature several representations of Boolean functions have been introduced, e.g., juntas, Monomials (Mn), Decision Trees (DTs), and Decision Lists (DLs), cf. [29,31].
A Boolean function depending solely on an unknown set of k variables is called a k-junta. A monomial Mn,k defined over Vn is the conjunction of at most k clauses, each having only one literal. A DT is a binary tree whose internal nodes are labeled with Boolean variables, and each leaf with either "1" or "0". A DT can be built from a Boolean function as follows: for each assignment a unique path from the root to a leaf is defined. At each internal node, e.g., at the ith level of the tree, the labeled edge is chosen depending on the value of the ith literal. The leaf is labeled with the value of the function, given the respective assignment as the input. The depth of a DT is the maximum length of the paths from the root to the leaves. The set of Boolean functions represented by decision trees of depth at most k is denoted by k-DT. A DL is a list L that contains r pairs (f1, v1), . . . , (fr, vr), where the Boolean formula fi is a term and vi ∈ {0, 1} with 1 ≤ i ≤ r − 1. For i = r, the formula fr is the constant function true. A Boolean function can be transformed into a decision list, where for a string c ∈ Cn we have L(c) = vj, where j is the smallest index in L such that fj(c) = 1. k-DL denotes the set of all DLs, where each fi is a term of maximum size k.
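As an illustration of these semantics, a decision list can be evaluated with a few lines of code. The rules and challenges below are our own illustrative choices, not data from the paper:

```python
def eval_decision_list(dl, c):
    """Evaluate a decision list on a challenge bit-string c.

    dl is a list of pairs (term, v): term is a set of literals, each
    literal (i, b) requiring bit i of c to equal b; v is in {0, 1}.
    The last term is empty (the constant true), so some pair always fires.
    """
    for term, v in dl:
        if all(c[i] == b for (i, b) in term):
            return v  # the first satisfied term decides the output
    raise ValueError("decision list must end with the constant-true term")

# A 2-DL over 4-bit challenges: terms of size at most 2.
dl = [
    ({(0, 1), (2, 0)}, 1),  # if c0 = 1 and c2 = 0, output 1
    ({(3, 1)}, 0),          # else if c3 = 1, output 0
    (set(), 1),             # else output 1 (constant-true term)
]

print(eval_decision_list(dl, [1, 0, 0, 0]))  # first term fires -> 1
print(eval_decision_list(dl, [0, 0, 1, 1]))  # second term fires -> 0
print(eval_decision_list(dl, [0, 0, 0, 0]))  # default -> 1
```

Note how the order of the pairs matters: an assignment is handled by the first term it satisfies, which is what distinguishes a DL from a plain set of rules.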
A linear Boolean function f : {0, 1}^n → {0, 1} features the following equivalent properties, cf. [29]: f(c ⊕ c′) = f(c) ⊕ f(c′) for all c, c′ ∈ {0, 1}^n, or equivalently, f(c) = ⊕_{i∈S} ci for some set S ⊆ [n]. Every Boolean function can be expressed by its Fourier expansion f(c) = Σ_{S⊆[n]} f̂(S)χS(c), where [n] := {1, . . . , n}, χS(c) := ∏_{i∈S} ci, and f̂(S) := E_{c∈U}[f(c)χS(c)]. Here, E_{c∈U}[·] denotes the expectation over uniformly chosen random examples. The influence of variable i on f : F2^n → F2 is defined as

Inf_i(f) := Pr_{c∈U}[f(c) ≠ f(c^{⊕i})],

where c^{⊕i} is obtained by flipping the i-th bit of c. Note that Inf_i(f) = Σ_{S∋i} f̂(S)^2, cf. [29]. Next we define the average sensitivity of a Boolean function f as

I(f) := Σ_{i=1}^{n} Inf_i(f).
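For small n, the influence and average sensitivity defined above can be computed exactly by enumerating all 2^n inputs. A minimal sketch, with a purely illustrative example function:

```python
from itertools import product

def influence(f, n, i):
    """Inf_i(f) = Pr over uniform c of [f(c) != f(c with bit i flipped)]."""
    flips = 0
    for c in product((0, 1), repeat=n):
        c_flip = list(c)
        c_flip[i] ^= 1  # flip the i-th bit
        flips += f(c) != f(tuple(c_flip))
    return flips / 2 ** n

def average_sensitivity(f, n):
    """I(f) = sum of the influences of all n variables."""
    return sum(influence(f, n, i) for i in range(n))

# Example: f depends only on bits 0 and 1 (an AND); bits 2 and 3 are irrelevant.
f = lambda c: c[0] & c[1]
print([influence(f, 4, i) for i in range(4)])  # [0.5, 0.5, 0.0, 0.0]
print(average_sensitivity(f, 4))               # 1.0
```

The irrelevant variables have influence 0, which is exactly the junta structure exploited later in the paper.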
The Probably Approximately Correct (PAC) model provides a firm basis for
analyzing the efficiency and effectiveness of machine learning algorithms. We
briefly introduce the model and refer the reader to [23] for more details. In the
PAC model the learner, i.e., the learning algorithm, is given a set of examples to generate with high probability an approximately correct hypothesis. This can be formally defined as follows. Let F = ∪_{n≥1} Fn denote a target concept class, that is, a collection of Boolean functions defined over the instance space Cn = {0, 1}^n. Moreover, each example is drawn according to an arbitrary probability distribution D on the instance space Cn. A hypothesis h ∈ Fn, which is a Boolean function over Cn, is called an ε-approximator for f ∈ Fn, if

Pr_{c∈D Cn}[f(c) = h(c)] ≥ 1 − ε.
Let the mapping size : F → N associate a natural number size(f) with a target concept f ∈ F, which is a measure of the complexity of f under a target representation, e.g., k-DT. The learner is a polynomial-time algorithm denoted by A, which is given labeled examples (c, f(c)), where c ∈ Cn and f ∈ Fn. The examples are drawn independently according to the distribution D. Now we can define strong and weak PAC learning algorithms.
The weak learning framework was developed to answer the question whether
a PAC learning algorithm with constant but insufficiently low levels of ε and δ
can be useful at all. This notion is defined as follows.
Definition 2. For some constant δ > 0, let algorithm A return with probability at least 1 − δ a (1/2 − γ)-approximator for f, where γ > 0. A is called a weak PAC learning algorithm, if γ = Ω(1/p(n, size(f))) for some polynomial p(·).
The equivalence of weak PAC learning and strong PAC learning has been
proved by Freund and Schapire in the early nineties in their seminal papers [9,35].
For that purpose boosting algorithms have been introduced.
of fPUF, being a PUF. Hence, fPUF cannot be linear over F2. In other words, for every PUF fPUF we have deg(fPUF) ≥ 2. Moreover, in conjunction with the above-mentioned Siegenthaler theorem, we deduce that every PUF is at most an (n − 2)-correlation-immune function, which indeed means that not all of its challenge bits have an equal influence on the respective PUF response.
Theorem 2 states that every PUF has some challenge bits that have a larger influence on the responses than other challenge bits. We loosely call these bits influential bits¹.
3 PUF Architectures
In this section, we explain the architectures of two intrinsic silicon PUFs, namely the BR- and TBR-PUFs, whose internal mathematical models are more complicated than those of other intrinsic PUF constructions. As a first step, we apply simple models to describe the functionality of these PUFs. However, we believe that these models cannot completely reflect the real characteristics of the BR-PUF family, and that their concrete, yet unknown model should be much more complex.
Fig. 1. (a) The logical circuit of an SRAM cell. (b) The small-signal model of a bistable element in metastability
¹ Note that the existence of such influential bits has also been noticed by several other experimental research papers. However, none of them has been able to correctly and precisely pinpoint the mathematical origin of this phenomenon.
where V(0) is a small-signal offset from the metastable point. To derive V(t), we can write the equation of the circuit as follows.
Fig. 2. The schematic of a BR-PUF with n stages. The response of the PUF can be read between two arbitrary stages. For a given challenge, the reset signal can be set low to activate the PUF. After a transient period, the BR-PUF might settle into an allowed logical state.
challenge applied to the ith stage, one of the NOR gates is selected. Setting the reset signal to low, the signal propagates in the ring, which behaves like an SRAM cell with a larger number of inverters. The response of the PUF is a binary value, which can be read from a predefined location on the ring between two stages, see Fig. 2.
The final state of the inverter ring is a function of the gains and the propagation delays of the gates. Based on the model of the SRAM circuit in the metastable state provided in Sect. 3.1, one might be able to extend the electrical model and analyze the behavior of the inverter ring. Applying a challenge, the ring may settle at a stable state after an oscillation time period. However, for a specific set of challenges the ring might stay in the metastable state for an infinite time, and oscillation can be observed at the output of the PUF.
The analytical models of the metastable circuits introduced in Sect. 3.1 are valid for an ASIC implementation and respective simulations. Although a few simulation results of BR-PUFs are available in the literature, to the best of our knowledge there are no results for a BR-PUF implemented on an ASIC, and experimental results have been limited to FPGA implementations. In this case, the BR-PUF model can be further simplified by considering the internal architecture of the FPGAs. The NOR gates of the BR-PUF are realized by dedicated lookup tables (LUTs) inside an FPGA. The output of a LUT is read from one of its memory cells, which always have stable conditions. Hence, it can be assumed that there is almost no difference in the gains of different LUTs. As a result, the random behavior of the BR-PUF could be defined by the delay differences between the LUTs.
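This simplified delay-difference view can be turned into a toy model. The sketch below is our own illustrative abstraction (randomly drawn per-stage delay offsets, response taken as the sign of their sum), not the concrete model of [7] or [37]:

```python
import random

def toy_br_puf(n, seed=0):
    """Toy delay-difference model of a BR-PUF (illustrative only, not a
    validated model): each stage i has two random delay offsets, challenge
    bit c_i selects one of them, and the response is the sign of the
    accumulated delay difference."""
    inst = random.Random(seed)  # the seed plays the role of process variation
    d = [(inst.gauss(0.0, 1.0), inst.gauss(0.0, 1.0)) for _ in range(n)]
    def respond(c):
        total = sum(d[i][c[i]] for i in range(n))
        return 1 if total > 0 else 0
    return respond

puf = toy_br_puf(64)
rng = random.Random(1)
challenge = [rng.randrange(2) for _ in range(64)]
print(puf(challenge))  # deterministic, instance-specific 0/1 response
```

Two instances built with different seeds model two physical chips; the same seed reproduces the same "chip", which is convenient for experimenting with learning attacks on synthetic data.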
Fig. 3. The schematic of a TBR-PUF with n stages. The response of the PUF is read after the last stage. For a given challenge, the reset signal can be set low to activate the PUF. After a transient period, the TBR-PUF might settle into an allowed logical state.
family. Second, due to the lack of a precise mathematical model of the respective PUF functionality, a more sophisticated approach is required to learn the PUF. Therefore, the following question arises: is it possible to PAC learn a PUF family, even if we have no mathematical model of the physical functionality of the respective PUF family? We answer this question at least for the BR-PUF family.
Our roadmap for answering this question, more specifically, the steps taken to prove the PAC learnability of the BR-PUF family in the second scenario, is illustrated in Fig. 4. While theoretical insights into the notions related to the first two blocks have been presented in Sect. 2.4, which are valid for all PUF families, Sect. 4.1 provides more specific results for the BR-PUF family. Based on these new insights, in Sect. 4.2 we eventually prove that the BR-PUF family (which lacks a precise mathematical model) can nevertheless be PAC learned (see the last two blocks in Fig. 4).
Fig. 4. Our roadmap for proving the PAC learnability of the BR-PUF family, whose mathematical model is unknown
bits on the respective responses [42]. They have explicitly underlined the existence of influential bits, and found so-called prediction rules. Table 1 summarizes their results, where for each type of rule (monomials of different sizes) we report only the one with the highest estimated response prediction probability. In addition to providing evidence for the existence of influential bits, the size of the respective monomials is of particular importance for us. As shown in Table 1, their size is surprisingly small, i.e., only five.
Table 1. Statistical analysis of the 2048 CRPs, given to a 64-bit BR-PUF [42]. The
first column shows the rule found in the samples, whereas the second column indicates
the estimated probability of predicting the response.
Similarly, the authors of [37] translate the influence of the challenge bits into the weights needed in artificial neural networks that represent the challenge-response behavior of BR-PUFs and TBR-PUFs. They observed that there is a pattern in these weights, which models the influence of the challenge bits. It clearly reflects the fact that there are influential bits determining the response of the respective PUF to a given challenge. From the results presented in [37], we conclude that there is at least one influential bit; however, the precise number of influential bits has not been further investigated by the authors.
Inspired by the above results from [37,42], we conduct further experiments.
We collect 30000 CRPs from BR-PUFs and TBR-PUFs implemented on Altera
Cyclone IV FPGAs. In all of our PUF instances at least one influential bit is
found, and the maximum number of influential bits (corresponding to the size of
the monomials) is just a constant value in all cases. For the sake of readability,
we present here only the results obtained for one arbitrary PUF instance.
Our results shown in Table 2 are not only aligned with the results reported in [37,42], but also reflect our previous theoretical findings. We can conclude this section as follows. There is at least one influential bit determining the response of a BR-PUF (respectively, TBR-PUF) to a given challenge. However, for the purpose of our framework their existence alone is not enough; we also need an upper bound on the number of influential bits.
Looking more carefully into the three different datasets, namely our own and
the data reported in [37,42], we observe that the total number of influential
bits is always only a very small value. Motivated by this commonly observed
phenomenon, we compute for our PUFs (implemented on FPGAs) the average
Table 2. Our statistical analysis of the 30000 CRPs, given to a 64-bit BR-PUF. The
first column shows the rule found in the sample, whereas the second column indicates
the estimated probability of predicting the response.
² As explained in Sect. 2.2, for a Boolean function f, the influence of a variable and the total average sensitivity can be calculated by employing Fourier analysis. However, in practice this analysis is computationally expensive. Instead, it suffices to simply approximate the respective average sensitivity. This idea has been extensively studied in the learning-theory- and property-testing-related literature (see [22] for a survey). Here we describe how the average sensitivity of a Boolean function, representing a PUF, can be approximated. We follow the simple and effective algorithm explained in [32]. The central idea behind their algorithm is to collect enough random pairs of labeled examples from the Boolean function, which have the following property: (c, f(c)) and (c^{⊕i}, f(c^{⊕i})), i.e., the inputs differ in a single Boolean variable.
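The approximation procedure sketched in this footnote can be written out directly. The sample size below is an arbitrary choice for illustration, not a query bound from [32]:

```python
import random

def approx_average_sensitivity(f, n, samples=20000, seed=0):
    """Estimate I(f) from random pairs (c, c with bit i flipped):
    I(f) = n * Pr over uniform c and i of [f(c) != f(c flipped at i)]."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(samples):
        c = [rng.randrange(2) for _ in range(n)]
        i = rng.randrange(n)
        c_flip = c.copy()
        c_flip[i] ^= 1  # the pair differs in a single Boolean variable
        flips += f(c) != f(c_flip)
    return n * flips / samples

# Parity of all n bits: every variable has influence 1, so I(f) = n.
n = 8
parity = lambda c: sum(c) % 2
print(approx_average_sensitivity(parity, n))  # 8.0
```

For parity every single-bit flip changes the output, so the estimator returns exactly n; for functions with few influential bits the estimate is correspondingly small, which is the property the attack relies on.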
³ Note that it is a known folklore result, cf. [29], that randomly chosen n-bit Boolean functions have an expected average sensitivity of exactly n/2.
Finally, some relation between the average sensitivity and the strict avalanche
criterion (SAC) can be recognized, although we believe that the average sensi-
tivity is a more direct metric to evaluate the security of PUFs under ML attacks.
Theorem 3. Every Boolean function f : {0, 1}^n → {0, 1} with I(f) = k can be ε-approximated by another Boolean function h depending on only a constant number K of Boolean variables, where K = exp((2 + √(2ε log₂(4k/ε)/k)) · k/ε), and ε > 0 is an arbitrary constant.
5 Results
5.1 PUF Implementation
We implement BR- and TBR-PUFs with 64 stages on an Altera Cyclone IV FPGA, manufactured in a 60 nm technology [1]. It turns out that most PUF implementations are highly biased towards one of the responses. Therefore, we apply different manual routing and placement configurations to identify PUFs with a minimum bias in their responses. However, it is known that by reducing the bias in PUF responses, the number of noisy responses increases [27].
Finding and resolving the noisy responses are two of the main challenges in the CRP measurement process. In almost all PUF constructions it can be predicted at which point in time a generated response is valid and can be measured. For instance, for an arbiter PUF one can estimate the maximum propagation delay (evaluation period) between the enable point and the arbiter. After this time period the response is at a valid logical level (either "0" or "1") and does not change; afterwards, stable CRPs can be collected by doing majority voting on the responses generated for a given challenge. However, in the case of the BR-PUF family, for a given challenge the settling time of the response to a valid logical level is not known a priori, see Fig. 5. Furthermore, it is not known whether the response to a given challenge would
Fig. 5. The settling time of the BR-PUF response: (a) the PUF response after a tran-
sient time reaches a stable logical state “1”. (b) after a transient time the PUF response
is “0”. (c) the PUF response does not settle and oscillates for an undefined time period.
not become unstable again after a stable response has been observed during some time period (see Sect. 3.1). Therefore, the majority voting technique cannot be employed for BR-PUFs and TBR-PUFs. To deal with this problem, for a given challenge we read the response of the PUF at different points in time, where at each point in time 11 additional measurements are conducted. We consider a response stable if it is the same at all these measurement time points. Otherwise, the response is considered unstable, and the respective CRP is excluded from our dataset.
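The filtering procedure just described can be sketched as follows, with the hardware read-out abstracted behind a placeholder `measure` function (the stand-in below is synthetic; on real hardware it would trigger an actual PUF evaluation):

```python
def collect_stable_crps(measure, challenges, time_points=4, repeats=11):
    """Keep only CRPs whose response is identical across all repeated
    measurements at every read-out time point; discard the rest.
    `measure(challenge, t)` abstracts one read-out of the PUF at time t."""
    stable = {}
    for c in challenges:
        readings = {measure(c, t)
                    for t in range(time_points) for _ in range(repeats)}
        if len(readings) == 1:  # identical everywhere: response is stable
            stable[tuple(c)] = readings.pop()
    return stable

# Illustrative stand-in for hardware: challenges starting with bit 1 are noisy.
import random
rng = random.Random(0)
def fake_measure(c, t):
    return rng.randrange(2) if c[0] == 1 else 0

crps = collect_stable_crps(fake_measure, [[0, 1], [1, 0]])
print(crps)  # only the stable challenge (0, 1) survives
```

The number of time points and the 11 repetitions per point mirror the measurement setup described above; both are parameters one would tune against the observed noise level.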
In order to observe the impact of the existing influential bits on our PUF responses, we first apply a large set of challenges chosen uniformly at random, and then measure their respective responses. Afterwards, for both possible responses of the PUF (i.e., "0" and "1") we count the number of challenge bits which are set to either "0" or "1", see Fig. 6. It can be seen that some challenge bits are more influential towards a certain response. These results are the basis for our statistical analysis presented in Sect. 4.1. We also repeat this experiment in the scenario where the response of the PUF is unstable; in this case we observe almost no influential challenge bits. The most important conclusion that we can draw from these experiments is that a PUF with stable responses has at least one influential bit, which can already predict, with low probability, the response of the PUF to a respective challenge.
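The counting experiment described above can be sketched on synthetic CRPs (real data would come from the FPGA measurements; the biased bit position below is our own illustrative choice):

```python
def bit_frequencies(crps, n):
    """For each response class, the fraction of challenges with bit i = 1.
    A bit whose fraction differs strongly between the two classes is a
    candidate influential bit."""
    counts = {0: [0] * n, 1: [0] * n}
    totals = {0: 0, 1: 0}
    for c, r in crps:
        totals[r] += 1
        for i in range(n):
            counts[r][i] += c[i]
    return {r: [counts[r][i] / totals[r] for i in range(n)] for r in (0, 1)}

# Synthetic CRPs whose response is dominated by challenge bit 2.
import random
rng = random.Random(0)
crps = []
for _ in range(2000):
    c = [rng.randrange(2) for _ in range(8)]
    r = c[2] if rng.random() < 0.9 else 1 - c[2]  # bit 2 is influential
    crps.append((c, r))

freq = bit_frequencies(crps, 8)
print(freq[1][2], freq[0][2])  # bit 2 is skewed; other bits sit near 0.5
```

A plot of these per-bit fractions, one curve per response class, is essentially what Fig. 6 shows for the real implementation.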
Fig. 6. The impact of the influential bits on the responses of the PUF: (a) the response
of the PUF is “0”. (b) unstable responses. Here the y-axis shows the percentage of the
challenges, whose bits are set to either “0” or “1”, whereas the x-axis shows the bit
position.
5.2 ML Results
small constant. However, it is known that every efficient algorithm for learning K-DTs (i.e., DTs with 2^K leaves) is an efficient algorithm for learning K-juntas, see, e.g., [28]. Furthermore, it is known that DLs generalize K-DTs [31]. Moreover, a monomial Mn,K is a very simple type of K-junta, where only the conjunction of the relevant variables is taken into account. Therefore, for our experiments we decide to let our weak learning algorithms deliver DLs, Monomials, and DTs.
To learn the challenge-response behavior of BR- and TBR-PUFs using these representations, we use the open-source machine learning software Weka [17]. One may argue that more advanced tools might be available, but here we only aim to demonstrate that publicly accessible, off-the-shelf software can be used to launch our proposed attacks. All experiments are conducted on a MacBook Pro with a 2.6 GHz Intel Core i5 processor and 10 GB of RAM. To boost the prediction accuracy of the model established by our weak learners, we apply the Adaptive Boosting (AdaBoost) algorithm [10]; nevertheless, any other boosting framework can be employed as well. For AdaBoost, it is known that the error of the final model delivered by the boosted algorithm after T iterations is theoretically upper bounded by ∏_{t=1}^{T} √(1 − 4γ²), cf. [36]. To provide a better understanding of the relation between K, the number of iterations, and the theoretical bound on the error of the final model, a corresponding graph is shown in Fig. 7.
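This bound can be evaluated numerically; a minimal sketch for a constant edge γ:

```python
def adaboost_error_bound(gamma, T):
    """Upper bound on the training error of AdaBoost's final model after
    T iterations, for a weak learner with constant edge gamma:
    prod over t of sqrt(1 - 4*gamma**2) = (1 - 4*gamma**2)**(T/2), cf. [36]."""
    return (1 - 4 * gamma ** 2) ** (T / 2)

for T in (10, 50, 200):
    print(T, adaboost_error_bound(0.1, T))
# The bound decays exponentially in T: even a weak learner barely better
# than guessing (gamma = 0.1) is driven below 2 % error within 200 rounds.
```

This is why the experiments below need only a modest number of boosting iterations once the weak learner beats 50 % accuracy.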
Fig. 7. The relation between the theoretical upper bound on the error of the final model returned by AdaBoost, the number of iterations, and K. The graph is plotted for k = 2, ε = 0.01, and n = 64. Here, ε = 0.01 denotes the error of the K-junta learner.
Table 4. Experimental results for learning 64-bit BR-PUF and TBR-PUF, when m =
100. The accuracy (1 − ε) is reported for three weak learners. The first row shows the
accuracy of the weak learner, whereas the other rows show the accuracy of the boosted
learner.
Table 5. Experimental results for m = 1000 (the same setting as for Table 4).
experiments are 100 and 1000, whereas the test set contains 30000 CRPs. Our experiments demonstrate that weak learning always results in the delivery of a model with more than 50 % accuracy, as shown in the first rows of Tables 4 and 5.
By boosting the respective models with AdaBoost, the accuracy is dramatically increased, see Tables 4 and 5. It can be observed that after 50 iterations of AdaBoost applied to the weak model generated from 100 CRPs, the prediction accuracy of the boosted model is increased to more than 80 % for all three representations. By increasing the number of samples to 1000 CRPs, the prediction accuracy is further increased up to 98.32 % for learning the BR-PUFs, and 99.37 % for learning the TBR-PUFs under the DL representation. It is interesting to observe that the simplest representation class, i.e., Monomials, clearly benefits the most from the boosting technique. As explained in [36], this is due to avoiding any overfitting tendency.
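The boosting experiment can be reproduced in spirit with a self-contained sketch: decision stumps as weak learners combined by AdaBoost, trained on synthetic CRPs whose response is the majority of three influential bits. All data and parameters here are illustrative; the paper's actual experiments use Weka on real FPGA measurements:

```python
import math, random

def train_stump(X, y, w):
    """Weak learner: the single (bit, value) rule with lowest weighted error."""
    best = None
    for i in range(len(X[0])):
        for v in (0, 1):
            err = sum(wj for xj, yj, wj in zip(X, y, w)
                      if (xj[i] == v) != (yj == 1))
            if best is None or err < best[0]:
                best = (err, i, v)
    _, i, v = best
    return lambda x, i=i, v=v: 1 if x[i] == v else -1

def adaboost(X, y, T=30):
    """AdaBoost over stumps; labels y in {0,1} are mapped to {-1,+1}."""
    ypm = [1 if t == 1 else -1 for t in y]
    w = [1 / len(X)] * len(X)
    ensemble = []
    for _ in range(T):
        h = train_stump(X, y, w)
        err = sum(wj for xj, tj, wj in zip(X, ypm, w) if h(xj) != tj)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # Reweight: misclassified examples gain weight, then normalize.
        w = [wj * math.exp(-alpha * tj * h(xj))
             for xj, tj, wj in zip(X, ypm, w)]
        s = sum(w)
        w = [wj / s for wj in w]
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else 0

# Synthetic target: the response is the majority of three influential bits.
rng = random.Random(0)
X = [[rng.randrange(2) for _ in range(16)] for _ in range(1000)]
y = [1 if x[1] + x[5] + x[9] >= 2 else 0 for x in X]
model = adaboost(X[:500], y[:500])
acc = sum(model(x) == t for x, t in zip(X[500:], y[500:])) / 500
print(acc)  # well above the 0.5 accuracy of random guessing
```

Each stump alone is only a weak (1/2 − γ)-approximator, yet the boosted ensemble recovers the junta structure, mirroring the accuracy jump reported in Tables 4 and 5.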
6 Conclusion
As a central result, which speaks for itself, we have proved that in general the responses of all PUF families are not equally determined by each and every bit of their respective challenges. Moreover, the present paper has further addressed the issue of strong PAC learning of the challenge-response behavior of PUFs whose functionality lacks a precise mathematical model. We have demonstrated that by representing BR- and TBR-PUFs as Boolean functions, we are able to precisely describe the characteristics of these PUFs as observed in practice. This results in a new and generic machine learning framework that strongly PAC learns the challenge-response behavior of the BR-PUF family. The effectiveness and applicability of our framework have also been evaluated by conducting extensive experiments on BR-PUFs and TBR-PUFs implemented on FPGAs, similar to the experimental platforms used in the most relevant literature.
Last but not least, although our strong PAC learning framework has its own novelty value, we feel that our Theorem 3 and the precise mathematical description of the characteristics of BR-PUFs and TBR-PUFs are the most important aspects of our paper. We strongly believe that this description can help to fill the gap between the mathematical design of cryptographic primitives and the design of PUFs in the real world. As evidence thereof, we feel that the Siegenthaler theorem and the Fourier analysis that are well-known and widely used in modern cryptography may provide special insights into the physical design of secure PUFs in the future.
Acknowledgements. We would like to thank Prof. Dr. Frederik Armknecht for the fruitful discussion as well as for pointing out Siegenthaler's paper. Furthermore, the authors greatly appreciate the support that they received from the Helmholtz Research School on Security Technologies.
References
1. Altera: Cyclone IV Device Handbook. Altera Corporation, San Jose (2014)
2. Angluin, D.: Queries and concept learning. Mach. Learn. 2(4), 319–342 (1988)
3. Armknecht, F., Maes, R., Sadeghi, A., Standaert, O.X., Wachsmann, C.: A formal-
ization of the security features of physical functions. In: 2011 IEEE Symposium
on Security and Privacy (SP), pp. 397–412 (2011)
4. Armknecht, F., Moriyama, D., Sadeghi, A.R., Yung, M.: Towards a unified secu-
rity model for physically unclonable functions. In: Sako, K. (ed.) CT-RSA 2016.
LNCS, vol. 9610, pp. 271–287. Springer, Heidelberg (2016)
5. Arvind, V., Köbler, J., Lindner, W.: Parameterized learnability of k -juntas and
related problems. In: Hutter, M., Servedio, R.A., Takimoto, E. (eds.) ALT 2007.
LNCS (LNAI), vol. 4754, pp. 120–134. Springer, Heidelberg (2007)
6. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine
learning. Artif. Intell. 97(1), 245–271 (1997)
7. Chen, Q., Csaba, G., Lugli, P., Schlichtmann, U., Rührmair, U.: The bistable ring
PUF: a new architecture for strong physical unclonable functions. In: 2011 IEEE
International Symposium on Hardware-Oriented Security and Trust (HOST), pp.
134–141. IEEE (2011)
8. Fischer, P., Simon, H.U.: On learning ring-sum-expansions. SIAM J. Comput.
21(1), 181–192 (1992)
9. Freund, Y.: Boosting a weak learning algorithm by majority. Inf. Comput. 121(2),
256–285 (1995)
10. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning
and an application to boosting. J. Comp. Syst. Sci. 55(1), 119–139 (1997)
11. Friedgut, E.: Boolean functions with low average sensitivity depend on few coor-
dinates. Combinatorica 18(1), 27–35 (1998)
12. Ganji, F., Tajik, S., Seifert, J.P.: Let me prove it to you: RO PUFs are provably
learnable. In: The 18th Annual International Conference on Information Security
and Cryptology (2015)
13. Ganji, F., Tajik, S., Seifert, J.-P.: Why attackers win: on the learnability of XOR
arbiter PUFs. In: Conti, M., Schunter, M., Askoxylakis, I. (eds.) TRUST 2015.
LNCS, vol. 9229, pp. 22–39. Springer, Heidelberg (2015)
14. Ganji, F., Tajik, S., Seifert, J.P.: PAC learning of arbiter PUFs. J. Cryptographic
Eng. Spec. Sect. Proofs 2014, 1–10 (2016)
15. Gassend, B., Clarke, D., Van Dijk, M., Devadas, S.: Silicon physical random func-
Efficient Fuzzy Extraction of PUF-Induced
Secrets: Theory and Applications
1 Introduction
Cryptography relies on reproducible, uniformly distributed secret keys. However, obtaining affordable, physically secure key storage in embedded non-volatile memory is hard. Harvesting entropy from physically unclonable functions (PUFs) offers an alternative that lowers the vulnerability during the power-off
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 412–431, 2016.
DOI: 10.1007/978-3-662-53140-2 20
1.1 Contribution
The novelty of our work is twofold:
– First, we derive new bounds on the secure sketch min-entropy loss for PUF-
induced distributions with practical relevance. Our bounds are considerably
tighter than the well-known (n − k) formula, hereby improving the implemen-
tation efficiency of PUF-based key generators. The discrepancy is showcased
for two predominant PUF imperfections, i.e., biased and spatially correlated
response bits. It is important to note that a variety of commonly used codes
is covered, e.g., BCH and Reed-Muller codes, regardless of their algebraic
complexity. Furthermore, a large variety of distributions can be supported. Therefore, our scope reaches considerably further than related work in [8,22], which focuses on simple repetition codes and biased distributions only. As in the latter works, our bounds are easy to evaluate and able to support large codes.
– Second, the newly developed theory is applied to state-of-the-art error-
correction methods for PUFs. As such, we reveal a fundamental flaw in the
reverse fuzzy extractor, proposed by Van Herrewege et al. [28] at Financial
Crypto 2012. The latter lightweight primitive is gaining momentum and has
also been adopted in the CHES 2015 protocol of Aysu et al. [1]. We debunk
the main security claim that repeated helper data exposure does not result
in additional min-entropy loss. Furthermore, we contribute to the motiva-
tion of debiasing schemes such as the index-based syndrome (IBS) proposal
of Yu et al. [30], and the CHES 2015 proposal of Maes et al. [22]. The latter
proposals assume that a stand-alone sketch cannot handle biased distribu-
tions. We eliminate the need for an educated guess that originates from the
extrapolation of repetition code insights and/or the application of the overly
conservative (n − k) bound.
1.2 Organization
The remainder of this manuscript is organized as follows. Section 2 introduces
notation and preliminaries. Section 3 derives new tight bounds on the secure
2 Preliminaries
2.1 Notation
Binary vectors are denoted with a bold lowercase character, e.g., x. All vec-
tors are row vectors. All-zeros and all-ones vectors are denoted with 0 and
1 respectively. Binary matrices are denoted with a bold uppercase character,
e.g., H. A random variable and its corresponding set of outcomes are denoted
with an uppercase italic and calligraphic character respectively, e.g., X and X .
Variable assignment is denoted with an arrow, e.g., x ← X. Custom-defined
procedure names are printed in a sans-serif font, e.g., Hamming weight HW(x)
and Hamming distance HD(x, x ). The probability of an event A is denoted as
P(A). The expected value of a function g(X) of random variable X is denoted as
Ex←X [g(X)]. The probability density function and cumulative distribution func-
tion of a standard normal distribution N (0, 1) are denoted as fnorm (·) and Fnorm (·)
respectively. For a binomial distribution with n trials and success probability p,
we use fbino (·; n, p) and Fbino (·; n, p) respectively.
A binary [n, k, d] block code C restricts the message length k = log2(|M|) to an integer. For a linear block code, any linear combination of codewords is again a codeword. A k × n generator matrix G, having full rank, can then implement the encoding procedure, i.e., w = m · G. For any translation τ ∈ {0, 1}^{1×n} and linear code C, the set {τ ⊕ w : w ∈ W} is referred to as a coset. Two cosets are either disjoint or coincide. Therefore, the vector space {0, 1}^{1×n} is fully covered by 2^{n−k} cosets, referred to as the standard array. The minimum-weight vector in a coset is called the coset leader. In case of conflict, i.e., a common minimum HW(·) > t, an arbitrary leader can be selected.
linear code equals the minimum Hamming weight of its nonzero codewords. A
linear code C is cyclic if every circular shift of a codeword is again a codeword
belonging to C.
Several secure sketch constructions rely on a binary code C. For ease of under-
standing, we focus on the code-offset method of Dodis et al. [11] exclusively.
Nevertheless, equivalencies in the extended version of this manuscript (Cryptol-
ogy ePrint Archive, Report 2015/854) prove that all results apply to six other
constructions equally well. The code C that instantiates the code-offset method in Fig. 1 is not necessarily linear. Moreover, it is not required to be a block code either. Linear codes (BCH, Hamming, repetition, etc.) remain the most frequently used though, due to their efficient decoding algorithms [18]. Correctness
of reconstruction is guaranteed if HD(x, x ) ≤ t, with t the error-correcting capa-
bility of the code.
Min-entropy loss can be understood as a one-time-pad imperfection. Sketch input x is masked with a random codeword w that has an inherent entropy deficiency: H∞(W) = log2(|M|) < n. For linear codes in particular, we highlight a convenient interpretation using cosets. Helper data p then reveals in which coset reference x resides. It can be seen easily that p is equal to a random vector in the same coset as x. The residual min-entropy in (2) hence reduces to (4) for linear codes, with ε a coset leader. We emphasize that the min-entropy
Fig. 1. The code-offset secure sketch. Gen, p ← SSGen(x): pick a random codeword w ∈ C and output helper data p ← x ⊕ w. Rep, x̂ ← SSRep(x̃, p): compute w̃ ← x̃ ⊕ p = w ⊕ e and output x̂ ← p ⊕ Correct(w̃).
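As a concrete illustration of the Gen and Rep procedures above, the code-offset method can be sketched in a few lines of Python. The toy [n = 5, k = 1, d = 5] repetition code and the bit-list representation are assumptions made for this example only; any code with a decoder Correct works the same way.

```python
import secrets

# Toy [n = 5, k = 1, d = 5] repetition code: codewords 00000 and 11111.
CODE = [[0] * 5, [1] * 5]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def ss_gen(x):
    """Gen: mask reference response x with a random codeword w."""
    w = secrets.choice(CODE)
    return xor(x, w)  # helper data p = x XOR w

def correct(v):
    """Decode to the nearest codeword (majority vote for the repetition code)."""
    return CODE[1] if sum(v) > len(v) // 2 else CODE[0]

def ss_rep(x_noisy, p):
    """Rep: recover x from a noisy re-measurement x~ and helper data p."""
    w_noisy = xor(x_noisy, p)        # equals w XOR e
    return xor(p, correct(w_noisy))  # p XOR w = x, provided HW(e) <= t = 2

x = [1, 0, 1, 1, 0]
p = ss_gen(x)
x_noisy = xor(x, [0, 1, 0, 0, 0])  # one bit error, within t = 2
assert ss_rep(x_noisy, p) == x
```

Reconstruction succeeds for any error pattern e with HW(e) ≤ t, exactly as stated in the text.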
loss ΔH∞ does not depend on the decoding method, simply because the helper
data is not affected. For [n, k, d] block codes in particular, the well-known upper
bound ΔH∞ ≤ (n − k) holds, as proven in [11]. More generally, this extends to
ΔH∞ ≤ n − log2 (|M|).
H̃∞(X|P) = −log2 ( E_{ε←E} [ max_{w∈W} P((X = ε ⊕ w) | (E = ε)) ] ). (4)
3.1 Distributions
Our work is generic in the sense that a large variety of distributions X can be covered. We only require that X = {0, 1}^{1×n} can be partitioned into a limited number of subsets ϕj, with j ∈ [1, J], such that all elements of ϕj have the same probability of occurrence qj. Formally, P(X = x) = qj if and only if x ∈ ϕj.
These probabilities are strictly monotonically decreasing, i.e., q1 > q2 > . . . > qJ .
Occasionally, qJ = 0. The ingoing min-entropy is easily computed as H∞ (X) =
− log2 (q1 ).
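To make the ingoing min-entropy concrete, the closed-form q1 values of both proof-of-concept models (taken from Fig. 2) can be evaluated directly. The parameter values below are illustrative only.

```python
from math import log2

def h_inf_biased(n, b):
    """Ingoing min-entropy H_inf(X) = -log2(q1) for the biased model,
    where q1 = (1 - b')^n and b' = min(b, 1 - b)."""
    bp = min(b, 1 - b)
    return -log2((1 - bp) ** n)

def h_inf_correlated(n, c):
    """Same for the correlated model, where q1 = (1/2) * (1 - c')^(n-1)
    and c' = min(c, 1 - c)."""
    cp = min(c, 1 - c)
    return -log2(0.5 * (1 - cp) ** (n - 1))

print(round(h_inf_biased(15, 0.5), 3))      # uniform responses: 15.0 bits
print(round(h_inf_biased(15, 0.7), 3))      # bias lowers the ingoing entropy
print(round(h_inf_correlated(15, 0.5), 3))  # 15.0 bits as well
```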
We determine bounds on H̃∞(X|P). The runtime of the corresponding algorithms is roughly proportional to J. The crucial observation is that even a very small J might suffice to capture realistic PUF models. Below, we describe a parameterized distribution X for both biased and spatially correlated PUFs. Both distributions are to be considered proof-of-concept models, used to showcase the feasibility of a new research direction. In case a given PUF is not approximated accurately enough, one can opt for an alternative and possibly more complicated second-order distribution. As long as J is limited, the bounds can be evaluated in milliseconds to minutes on a standard desktop computer.
P(X(i) = X(j)) = Σ_{u=0}^{⌊|i−j|/2⌋} fbino(2u; |i − j|, 1 − c), with i, j ∈ [1, n]. (6)
Figure 2 specifies the subsets ϕj for both distributions. For the biased distribution, we partition according to HW(x). This corresponds to a binomial distribution with j − 1 successes for n Bernoulli trials, each having success probability b′ = min(b, 1 − b). For the correlated distribution, we partition according to HD(x(1 : n − 1), x(2 : n)), i.e., the number of transitions in x. Inputs in subset ϕj exhibit j − 1 transitions and obey one out of two forms, i.e., x = (010 . . .) or x = (101 . . .). A related observation is that if
j     |ϕj|             qj                              j     |ϕj|               qj
1     1                (1 − b′)^n                      1     2                  (1/2)(1 − c′)^(n−1)
2     n                b′(1 − b′)^(n−1)                2     2(n − 1)           (1/2)c′(1 − c′)^(n−2)
...   ...              ...                             ...   ...                ...
j     C(n, j−1)        (b′)^(j−1)(1 − b′)^(n−j+1)      j     2·C(n−1, j−1)      (1/2)(c′)^(j−1)(1 − c′)^(n−j)
...   ...              ...                             ...   ...                ...
n     n                (b′)^(n−1)(1 − b′)              n−1   2(n − 1)           (1/2)(c′)^(n−2)(1 − c′)
n+1   1                (b′)^n                          n     2                  (1/2)(c′)^(n−1)

Fig. 2. Subsets ϕj for a biased and a correlated distribution X, left and right respectively. We define b′ = min(b, 1 − b) and c′ = min(c, 1 − c).
H̃∞(X|P) = −log2 ( Σ_{p∈P} P(P = p) · max_{x∈X} [ P(X = x) P((P = p)|(X = x)) / P(P = p) ] ). (8)
For linear codes, the workload can be reduced substantially. With a similar derivation as before, we rewrite (4) as shown in (9). Up to 2^n operations suffice.

H̃∞(X|P) = −log2 ( Σ_{ε∈E} max_{w∈W} P(X = ε ⊕ w) ). (9)
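A brute-force sketch of this coset-based evaluation, assuming the biased model and a toy [n = 5, k = 1] repetition code (both illustrative choices, not the codes used in the paper's experiments):

```python
from itertools import product
from math import log2

def residual_min_entropy(n, b, codewords):
    """Exact evaluation for a linear code: partition {0,1}^n into cosets of
    the code and sum, per coset, the probability of its most likely element."""
    bp = min(b, 1 - b)
    prob = lambda x: (bp ** sum(x)) * ((1 - bp) ** (n - sum(x)))
    seen, total = set(), 0.0
    for x in product((0, 1), repeat=n):
        coset = frozenset(tuple(xi ^ wi for xi, wi in zip(x, w)) for w in codewords)
        if coset in seen:
            continue  # this coset was already counted
        seen.add(coset)
        total += max(prob(v) for v in coset)
    return -log2(total)

rep5 = [(0,) * 5, (1,) * 5]  # toy [n = 5, k = 1, d = 5] repetition code
for b in (0.5, 0.7, 0.9):
    print(b, round(residual_min_entropy(5, b, rep5), 3))
```

A sanity check: for the uniform case b = 0.5, every coset contributes 2^{−n} · 2^k, so the residual min-entropy equals k and the loss is exactly n − k, matching the classic bound.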
Equation (8) iterates over all p’s and selects each time the most likely x that
is within range, via the addition of a codeword w ∈ W. We now reverse the
roles, as shown in Fig. 3. We iterate over all x’s, from most likely to least likely,
i.e., from ϕ1 to ϕJ . Within a certain ϕj , the order of the x’s may be chosen
arbitrarily. Subsequently, we assign p's to each x, as represented by the black squares, until the set P of size 2^n is depleted. For each assigned p, we assume that the corresponding x is the most likely vector, according to (8). Let s^p_j denote the number of black squares assigned to set ϕj. The residual min-entropy is then easily computed as in (10).
H̃∞(X|P) = −log2 ( (1/|M|) Σ_{j=1}^{J} s^p_j qj ). (10)
Both linear and non-linear codes are supported by the former graphical representation. Nevertheless, we elaborate on linear codes as a special case due to their practical relevance. Figure 4 swaps the order of iteration in (9). Only one row suffices, i.e., each column of helper data vectors p in Fig. 3 is condensed to a single square. Black and white squares are now assigned to cosets, as represented by their coset leaders ε. Let s^ε_j denote the number of black squares assigned to set ϕj. The residual min-entropy is then easily computed as in (11), hereby dropping the denominator |M| compared to (10), given that s^p_j = 2^k · s^ε_j.
H̃∞(X|P) = −log2 ( Σ_{j=1}^{J} s^ε_j qj ). (11)
In the worst-case scenario, the most likely x's all map to unique p's, without overlap, resulting in a lower bound on H̃∞(X|P). For a linear code, this would be the case if the first 2^{n−k} x's all belong to different cosets. In the best-case scenario, our sequence of x's exhibits maximum overlap in terms of p, resulting in an upper bound on H̃∞(X|P). For a linear code, this would be the case if the first 2^k x's all map to the same coset, and this repeated for all 2^{n−k} cosets.
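The worst-case and best-case counting arguments above can be sketched generically from the subset sizes |ϕj| and probabilities qj. The code below is an illustrative reimplementation of that square-filling argument, not the authors' BoundWorstCase/BoundBestCase routines; parameters are illustrative.

```python
from math import comb, log2

def bounds(n, k, sizes, probs):
    """Worst/best-case bounds on the residual min-entropy for a linear
    [n, k] code; sizes[j] and probs[j] describe phi_j, most likely first."""
    cosets, group = 2 ** (n - k), 2 ** k
    # Worst case: the 2^(n-k) most likely inputs all lie in distinct cosets.
    worst, left = 0.0, cosets
    for sz, q in zip(sizes, probs):
        take = min(sz, left)
        worst += take * q
        left -= take
        if left == 0:
            break
    # Best case: inputs fill cosets 2^k at a time, so only every 2^k-th
    # input in the sorted sequence contributes to the sum.
    best, offset = 0.0, 0
    for sz, q in zip(sizes, probs):
        first = (-offset) % group  # first contributing index inside phi_j
        if first < sz:
            best += ((sz - first - 1) // group + 1) * q
        offset += sz
    return -log2(worst), -log2(best)

# Biased model for an [n = 15, k = 7] code (illustrative parameters).
n, k, b = 15, 7, 0.7
bp = min(b, 1 - b)
sizes = [comb(n, j) for j in range(n + 1)]
probs = [bp ** j * (1 - bp) ** (n - j) for j in range(n + 1)]
lo, hi = bounds(n, k, sizes, probs)
print(round(lo, 3), round(hi, 3))  # lower <= exact <= upper
```

For the uniform case both bounds collapse onto H∞(X) − (n − k), which is a useful consistency check.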
Fig. 3. Reversal of the roles in (8). (a) A lower bound on H̃∞(X|P). (b) An upper bound on H̃∞(X|P). Black squares represent terms that contribute to H̃∞(X|P), one for each p ∈ P. White squares represent non-contributing terms, overruled by the max operator. In general, there are few black squares but many white squares, 2^n versus (|M| − 1)2^n to be precise. For block codes, i.e., |M| = 2^k, the last column of black squares is completely filled.
Fig. 4. Reversal of the roles in (9), as applied to linear codes. (a) A lower bound on H̃∞(X|P). (b) An upper bound on H̃∞(X|P). Black squares represent terms that contribute to H̃∞(X|P), one for each ε ∈ E. White squares represent non-contributing terms, overruled by the max operator.
Worst-Case Bounds. We further tighten the lower bound on H̃∞(X|P) for the correlated distribution. The improvement applies to linear codes that have the all-ones vector 1 of length n as a codeword. This includes Reed-Muller codes of any order [18]. It also includes many BCH, Hamming, and repetition codes, on the condition that these are cyclic and have odd d, as is easily proven hereafter. Consider an arbitrary codeword with Hamming weight d. XORing all n circular shifts of this codeword results in the all-ones codeword, which ends the proof. As mentioned before, each set ϕj of the correlated distribution can be partitioned into pairs {x, x̄}, with x̄ the ones' complement of x. Paired inputs belong to the same coset, i.e., maximum overlap in terms of helper data p. Therefore, we impose
u^p_j = (|ϕj| / 2) · |M| = 2^{k−1} |ϕj|. (12)
Fig. 5. The exact residual min-entropy H̃∞(X|P) for the correlated distribution and an [n = 5, k = 1, d = 5] repetition code.
Best-Case Bounds. We improve the upper bound on H̃∞(X|P) for both the biased and the correlated distribution. In particular, we take the minimum distance d into account. The main insight is that two slightly differing inputs xu ≠ xv do not overlap in terms of helper data p. More precisely, if HD(xu, xv) ∈ [1, d − 1], then {xu ⊕ w | w ∈ W} ∩ {xv ⊕ w | w ∈ W} = ∅. For the biased distribution, the following holds: HD(xu, xv) ∈ [1, d − 1] if xu ≠ xv and xu, xv ∈ (ϕ1 ∪ ϕ2 ∪ . . . ∪ ϕ_{t+1}). Stated otherwise, the elements of the first t + 1 sets all result in unique p's. Therefore, we can impose the constraint given in (13). Figure 6 depicts the squares.

l^p_j = |ϕj| · |M| if j ∈ [1, t + 1], and l^p_j = 0 otherwise. (13)
H̃∞(X|P) = −log2 ( Σ_{j=1}^{t+1} |ϕj| qj ) = −log2 (Fbino(t; n, min(b, 1 − b))). (14)
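The closed-form best-case bound (14) is straightforward to evaluate numerically; the parameters below are illustrative, chosen to match a d = 5 code with t = 2.

```python
from math import comb, log2

def best_case_bound(n, t, b):
    """Upper bound (14): -log2(Fbino(t; n, b')) with b' = min(b, 1 - b).
    Fbino is the binomial CDF, i.e. the total mass of the first t+1 subsets."""
    bp = min(b, 1 - b)
    cdf = sum(comb(n, j) * bp ** j * (1 - bp) ** (n - j) for j in range(t + 1))
    return -log2(cdf)

# [n = 15, t = 2], as for a d = 5 BCH code.
print(round(best_case_bound(15, 2, 0.7), 3))
```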
Figure 7 presents numerical results for various BCH codes. We focus on small
codes, as these allow for an exact exhaustive evaluation of the residual min-
entropy using (8) and/or (9). As such, the tightness of various bounds can be
assessed adequately. Figure 7(d) nevertheless demonstrates that our algorithms
support large codes equally well, in compliance with a practical key generator.
Note that only half of the bias interval b ∈ [0, 1] is depicted. The reason is that all curves mirror around the vertical axis of symmetry b = 1/2. The same holds for the correlated distribution with parameter c.
Especially the lower bounds perform well, which benefits a conservative system provider. The best lower bounds in Fig. 7(a), (b) and (c) visually coincide with the exact result. The gap with the (n − k) bound is the most compelling around b, c ≈ 0.7, where the corresponding curves hit the horizontal axis H̃∞(X|P) = 0. Also our upper bounds are considerably tighter than their more general alternatives in (5). Nevertheless, the latter bounds remain open for further improvement, with the exception of Fig. 7(b). An [n = 7, k = 4, d = 3] code is perfect, and lower and upper bounds then converge to the exact result for a biased distribution.
4 Applications
The newly developed theory of Sect. 3 facilitates the design and analysis of error-
correction methods for PUFs, as exemplified in twofold manner. First, we point
out a fundamental security flaw in the reverse fuzzy extractor [28]. Second, we
provide a motivational framework for debiasing schemes [15,22,26,27,30].
Fig. 7. The secure sketch min-entropy loss for various BCH codes: (a) bias, [n = 15, k = 7, d = 5]; (b) bias, [n = 7, k = 4, d = 3]; (c) correlation, [n = 15, k = 7, d = 5]; (d) bias, [n = 127, k = 64, d = 21]. Dots correspond to an exact exhaustive evaluation of (8)/(9). The legend of the curves is as follows. (I) The ingoing min-entropy H∞(X) = −log2(q1). (II) The lower bound max(H∞(X) − (n − k), 0) on H̃∞(X|P). (III) The lower bound on H̃∞(X|P) according to BoundWorstCase. (IV) The upper bound on H̃∞(X|P) according to BoundBestCase. (V) The lower bound on H̃∞(X|P) according to BoundWorstCase2. (VI) The upper bound on H̃∞(X|P) according to BoundBestCase2.
The reverse fuzzy extractor, as proposed by Van Herrewege et al. [28] at Financial Crypto 2012, improves the lightweight perspectives of PUF-based authentication protocols. The construction was therefore also adopted in the CHES 2015 protocol of Aysu et al. [1]. Instead of a single helper data exposure only, p ← SSGen(x̃) is regenerated and transferred with each protocol run by a resource-constrained PUF-enabled device. A receiving resource-rich server, storing reference response x, can hence reconstruct x̃ ← SSRep(x, p) and establish a shared secret as such. The footprint of the device is reduced due to the absence of the heavyweight SSRep procedure.
We debunk the main security claim that repeated helper data exposure
does not result in additional min-entropy loss. The revealed flaw is attributed
to the misuse of a reusability proof of Boyen [6]. For the code-offset sketch
with linear codes, the exposure of p1 ← SSGen(x) and p2 ← SSGen(x ⊕ e),
with perturbation e known and fully determined by the attacker, is provably
equivalent. The latter helper data reveals that x belongs to an identical coset
{p1 ⊕ w : w ∈ W} = {p2 ⊕ e ⊕ w : w ∈ W}. However, perturbation e is deter-
mined by PUF noisiness rather than by the attacker and its release hence reveals
new information. Given a sequence of protocol runs, the attacker can approxi-
mate all individual bit error rates pE as well as the coset to which reference x
belongs.
Figure 8 quantifies the residual min-entropy of X with the exclusion and inclusion of revealed bit error rates pE respectively. In the latter case, we rely on a Monte Carlo evaluation of (16), as enabled by choosing a small [n = 15, k = 7, d = 5] BCH code, given that an analytical approach is not straightforward. Exposure of pE boils down to knowledge of the threshold discrepancy |v(i) − t|. For the biased distribution, the situation is identical to the flaw in the soft-decision decoding scheme of Maes et al. [21]. As pointed out by Delvaux et al. [8], there is a bit-specific bias bi = P(r(i) = 1) = fnorm(t + |v(i) − t|)/(fnorm(t + |v(i) − t|) + fnorm(t − |v(i) − t|)). For each x in the coset corresponding to p, we then compute P(X = x) = Π_{i=1}^{n} (x(i) bi + (1 − x(i))(1 − bi)). Similarly, for the spatially correlated distribution, we compute P(X = x) = fnorm(v; 0, Σ), with covariance matrix Σ exclusively depending on correlation parameter c, as detailed in the extended version of this manuscript.
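The bit-specific bias formula can be transcribed directly. Threshold t and the normalized process variable v(i) are model parameters; the concrete values used below are hypothetical, for illustration only.

```python
from math import exp, pi, sqrt

def f_norm(v):
    """Probability density function of the standard normal distribution."""
    return exp(-v * v / 2) / sqrt(2 * pi)

def bit_bias(t, v_i):
    """Bit-specific bias b_i = P(r(i) = 1) once the threshold discrepancy
    |v(i) - t| has leaked through repeated helper data exposure."""
    d = abs(v_i - t)
    return f_norm(t + d) / (f_norm(t + d) + f_norm(t - d))

# With a symmetric threshold t = 0, nothing leaks: b_i stays at 1/2.
print(bit_bias(0.0, 1.0))
# With a hypothetical t = 0.5, the leaked discrepancy skews the bias.
print(round(bit_bias(0.5, 1.5), 3))
```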
H̃∞(X|P) = −log2 ( E_{v←V} [ max_{w∈W} P((V = t + (1 − 2w)|v − t|) | |v − t|) ] ). (16)
The revealed flaw differs from the existing attacks by Delvaux et al. [9] and Becker [3], which apply to the original protocol [28] exclusively. The latter attacks comprise modeling of the highly correlated arbiter PUF via repeated helper data exposure; a preemptive fix can be found in the PhD thesis of Maes [19]. The newly revealed flaw is more fundamentally linked to the reverse fuzzy extractor primitive and applies to all existing protocols so far [1,19,28]. Observe in Fig. 8 that the overly conservative (n − k) bound would compensate for
Fig. 8. The residual min-entropy H̃∞(X|P) for an [n = 15, k = 7, d = 5] BCH code: (a) bias; (b) correlation. The solid lines that exclude revealed bit error rates are computed with BoundWorstCase2; Fig. 7 confirms the visual overlap with the exact result. Dots that include revealed bit error rates correspond to Monte Carlo evaluations of size 10^6.
Nevertheless, for high-bias situations, the new bounds clearly indicate the need for debiasing schemes. The benefit is amplified by choosing a sketch with a k-bit output, several of which are listed in the extended version of this manuscript. The uniform output is then directly usable as a key, hereby eliminating the hash function and its additional min-entropy loss in case the leftover hash lemma is applied.
Finally, we highlight that one of the von Neumann debiasing schemes in [22]
was claimed to be reusable. This claim holds, despite overlooking the misuse of
Boyen’s proof and stating that a stand-alone sketch is reusable. A side effect of
retaining pairs of alternating bits only, i.e., 01 and 10, is that the imbalance in
error rates between 0 and 1 cannot be observed in the helper data. The scheme
is considerably less efficient than other von Neumann variants though, showing
that reusability comes at a price.
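The pair-selection step underlying von Neumann debiasing can be sketched as follows. This is the classic scheme, which keeps only alternating pairs; it is not the specific (and more efficient) variants of [22].

```python
def von_neumann_pairs(bits):
    """Classic pair-based von Neumann debiasing: keep only alternating
    pairs, mapping 01 -> 0 and 10 -> 1; discard 00 and 11. For i.i.d.
    input bits the retained bits are unbiased, at the cost of discarding
    a fraction 1 - 2*b'*(1 - b') of the pairs."""
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):
        if a != b:
            out.append(a)  # a = 0 for pair 01, a = 1 for pair 10
    return out

assert von_neumann_pairs([0, 1, 1, 1, 1, 0, 0, 0]) == [0, 1]
```

Since the helper data only ever refers to retained alternating pairs, an imbalance in error rates between 0 and 1 indeed stays hidden, as argued above.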
5 Conclusion
Secure sketches are the main workhorse of modern PUF-based key generators.
The min-entropy loss of most sketches is upper-bounded by (n − k) bits and
designers typically instantiate system parameters accordingly. However, the lat-
ter bound tends to be overly pessimistic, resulting in an unfortunate imple-
mentation overhead. We showcased the proportions for a prominent category of
PUFs, with bias and spatial correlations acting as the main non-uniformities.
New considerably tighter bounds were derived, valid for a variety of popular
but algebraically complex codes. These bounds are unified in the sense of being
applicable to seven secure sketch constructions. Deriving tighter alternatives for
the (n − k) bound counts as unexplored territory and we established the first
significant stepping stone. New techniques may have to be developed in order to
tackle more advanced second-order distributions. Elaborating a wider range of
applications would be another area of progress. We hope to have showcased the
potential by debunking the main security claim of the reverse fuzzy extractor
and by providing proper quantitative motivation for debiasing schemes.
Acknowledgment. The authors greatly appreciate the support received: the European Union's Horizon 2020 research and innovation programme under grant number 644052 (HECTOR); the Research Council of KU Leuven, GOA TENSE (GOA/11/007); the Flemish Government through FWO G.0550.12N and the Hercules Foundation AKUL/11/19; and the national major development program for fundamental research of China (973 Plan) under grant number 2013CB338004. Jeroen Delvaux is funded by IWT-Flanders grant number SBO 121552. Matthias Hiller is funded by the German Federal Ministry of Education and Research (BMBF) in the project SIBASE through grant number 01IS13020A.
References
1. Aysu, A., Gulcan, E., Moriyama, D., Schaumont, P., Yung, M.: End-to-end design
of a PUF-based privacy preserving authentication protocol. In: Güneysu, T., Hand-
schuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 556–576. Springer, Heidelberg
(2015)
2. Barak, B., Dodis, Y., Krawczyk, H., Pereira, O., Pietrzak, K., Standaert, F.-X., Yu,
Y.: Leftover hash lemma, revisited. In: Rogaway, P. (ed.) CRYPTO 2011. LNCS,
vol. 6841, pp. 1–20. Springer, Heidelberg (2011)
3. Becker, G.T.: On the pitfalls of using arbiter-PUFs as building blocks. IEEE Trans.
CAD Integr. Circuits Syst. 34(8), 1295–1307 (2015)
4. Bhargava, M., Mai, K.: An efficient reliable PUF-based cryptographic key gen-
erator in 65nm CMOS. In: Design, Automation & Test in Europe Conference &
Exhibition, DATE 2014, Dresden, Germany, 24–28 March 2014, pp. 1–6 (2014)
5. Bösch, C., Guajardo, J., Sadeghi, A.-R., Shokrollahi, J., Tuyls, P.: Efficient helper
data key extractor on FPGAs. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008.
LNCS, vol. 5154, pp. 181–197. Springer, Heidelberg (2008)
6. Boyen, X.: Reusable cryptographic fuzzy extractors. In: Proceedings of the 11th
ACM Conference on Computer and Communications Security, CCS 2004, Wash-
ington, DC, USA, 25–29 October 2004, pp. 82–91 (2004)
7. Carter, L., Wegman, M.N.: Universal classes of hash functions. J. Comput. Syst.
Sci. 18(2), 143–154 (1979)
8. Delvaux, J., Gu, D., Schellekens, D., Verbauwhede, I.: Helper data algorithms for
PUF-based key generation: overview and analysis. IEEE Trans. CAD Integr. Circ.
Syst. 34(6), 889–902 (2015). http://dx.doi.org/10.1109/TCAD.2014.2370531
9. Delvaux, J., Peeters, R., Gu, D., Verbauwhede, I.: A survey on lightweight entity
authentication with strong PUFs. ACM Comput. Surv. 48(2), 26 (2015)
10. Delvaux, J., Verbauwhede, I.: Fault injection modeling attacks on 65nm arbiter
and RO sum PUFs via environmental changes. IEEE Trans. Circuits Syst. 61–
I(6), 1701–1713 (2014)
11. Dodis, Y., Ostrovsky, R., Reyzin, L., Smith, A.: Fuzzy extractors: how to generate
strong keys from biometrics and other noisy data. SIAM J. Comput. 38(1), 97–139
(2008)
12. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. 1, 3rd
edn. Wiley, New York (1968)
13. Håstad, J., Impagliazzo, R., Levin, L.A., Luby, M.: A pseudorandom generator
from any one-way function. SIAM J. Comput. 28(4), 1364–1396 (1999)
14. Van Herrewege, A., van der Leest, V., Schaller, A., Katzenbeisser, S., Verbauwhede, I.: Secure PRNG seeding on commercial off-the-shelf microcontrollers. In: TrustED 2013, Proceedings of the 2013 ACM Workshop on Trustworthy Embedded Devices, pp. 55–64 (2013)
15. Hiller, M., Merli, D., Stumpf, F., Sigl, G.: Complementary IBS: application specific
error correction for PUFs. In: 2012 IEEE International Symposium on Hardware-
Oriented Security and Trust, HOST 2012, 3–4 June 2012, pp. 1–6 (2012)
16. Holcomb, D.E., Burleson, W.P., Fu, K.: Power-up SRAM state as an identifying
fingerprint and source of true random numbers. IEEE Trans. Comput. 58(9), 1198–
1210 (2009)
17. Koeberl, P., Li, J., Rajan, A., Wu, W.: Entropy loss in PUF-based key generation
schemes: the repetition code pitfall. In: 2014 IEEE International Symposium on
Hardware-Oriented Security and Trust, HOST 2014, Arlington, VA, USA, 6–7 May
2014, pp. 44–49 (2014)
18. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error Correcting Codes. North-Holland Mathematical Library (Book 16). North Holland Publishing Co., New York (1977)
19. Maes, R.: Physically unclonable functions: constructions, properties and applica-
tions. Ph.D. thesis, KU Leuven (2012). Ingrid Verbauwhede (promotor)
20. Maes, R.: An accurate probabilistic reliability model for silicon PUFs. In: Bertoni,
G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 73–89. Springer, Heidel-
berg (2013)
21. Maes, R., Tuyls, P., Verbauwhede, I.: A soft decision helper data algorithm for
SRAM PUFs. In: ISIT 2009, IEEE International Symposium on Information The-
ory, pp. 2101–2105 (2009)
22. Maes, R., van der Leest, V., van der Sluis, E., Willems, F.: Secure key generation
from biased PUFs: extended version. J. Cryptogr. Eng. 6(2), 121–137 (2016)
23. Maes, R., Van Herrewege, A., Verbauwhede, I.: PUFKY: a fully functional PUF-
based cryptographic key generator. In: Prouff, E., Schaumont, P. (eds.) CHES
2012. LNCS, vol. 7428, pp. 302–319. Springer, Heidelberg (2012)
24. Reyzin, L.: Entropy loss is maximal for uniform inputs. Technical report BUCS-
TR-2007-011, Department of Computer Science, Boston University, September
2007
25. Tuyls, P., Schrijen, G.-J., Škorić, B., van Geloven, J., Verhaegh, N., Wolters, R.:
Read-proof hardware from protective coatings. In: Goubin, L., Matsui, M. (eds.)
CHES 2006. LNCS, vol. 4249, pp. 369–383. Springer, Heidelberg (2006)
26. van der Leest, V., Schrijen, G.-J., Handschuh, H., Tuyls, P.: Hardware intrinsic
security from D flip-flops. In: Proceedings of the Fifth ACM Workshop on Scalable
Trusted Computing, STC 2010, pp. 53–62 (2010)
27. Van Herrewege, A.: Lightweight PUF-based key and random number generation. Ph.D. thesis, KU Leuven (2015). Ingrid Verbauwhede (promotor)
28. Van Herrewege, A., Katzenbeisser, S., Maes, R., Peeters, R., Sadeghi, A.-R.,
Verbauwhede, I., Wachsmann, C.: Reverse fuzzy extractors: enabling lightweight
mutual authentication for PUF-enabled RFIDs. In: Keromytis, A.D. (ed.) FC 2012.
LNCS, vol. 7397, pp. 374–389. Springer, Heidelberg (2012)
29. Yu, H., Leong, P.H.W., Hinkelmann, H., Möller, L., Glesner, M., Zipf, P.: Towards
a unique FPGA-based identification circuit using process variations. In: FPL 2009,
International Conference on Field Programmable Logic and Applications, pp. 397–
402 (2009)
30. Yu, M., Devadas, S.: Secure and robust error correction for physical unclonable
functions. IEEE Des. Test Comput. 27(1), 48–65 (2010)
Run-Time Accessible DRAM PUFs
in Commodity Devices
1 Introduction
Continued miniaturization and cost reduction of processors and System-on-Chip
designs have enabled the creation of almost ubiquitous smart devices, from smart
thermostats and refrigerators, to smartphones and embedded car entertainment
systems. While there are numerous advantages to the proliferation of such smart
devices, they create new security vulnerabilities [1,6,8,12]. One major concern is
that they often lack the implementation of sufficient security mechanisms [34,46].
Critical challenges in securing these devices are to provide robust device authen-
tication and identification mechanisms, and means to store long-term crypto-
graphic keys in a secure manner that minimizes the chances of their illegitimate
extraction or access.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 432–453, 2016.
DOI: 10.1007/978-3-662-53140-2 21
Run-Time Accessible DRAM PUFs in Commodity Devices 433
¹ In the rest of the paper we will use the terms PUF response and PUF measurement interchangeably.
434 W. Xiong et al.
1.2 Contributions
– We extract decay-based DRAM PUF instances from unmodified commod-
ity devices, including the PandaBoard and the Intel Galileo platforms. Two
approaches are presented: (i) accessing the PUF at device startup using a cus-
tomized firmware, and (ii) querying the PUF with a kernel module, while the
Linux OS is running on the same hardware and is actively using the DRAM
chip wherein the PUF is located.
– Through extensive experiments, we show that DRAM PUFs exhibit robust-
ness, reliability, and in particular allow usage of the decay time as part of the
PUF challenge.
– We introduce new metrics for evaluating DRAM PUFs, based on the Jaccard
index, and show they are significantly better suited for the decay-based DRAM
PUF evaluation over the classic Hamming-distance based metrics.
– Finally, we exploit time-dependent decay characteristics of DRAM cells in
the design of PUF-enhanced protocols. In particular, we show protocols for
device identification and authentication that draw their security from the
time-dependent decay of DRAM cells.
In a DRAM cell, a single data bit is stored in a capacitor and can be accessed
through a transistor, as shown in Fig. 1. DRAM cells are grouped in arrays,
where each row of the array is connected to a horizontal word-line. Cells in
the same column are connected to a bit-line. All bit-lines are coupled to sense-
amplifiers that amplify small voltages on bit-lines to levels such that they can
be interpreted as logical zeros or ones. In order to access a row, all the bit-lines
will be precharged to half the supply voltage VDD /2; subsequently the word-line
is enabled, connecting every capacitor in that row with its bit-line. The sense
amplifier will then drive the bit-line to VDD or 0 V, depending on the charge on
the capacitor. The amplifiers are usually shared by two bit-lines [15], of which
only one can be accessed at the same time. This structure makes the two bit-lines
complementary, which results in two kinds of cells: true cells and anti-cells. True
cells store the value 1 as VDD and 0 as 0 V in the capacitor, whilst anti-cells
store the value 0 as VDD and 1 as 0 V.
Fig. 1. A single DRAM cell consists of a capacitor and a transistor, connected to a word-line (WL) and a bit-line (BL or BL*); arrows indicate leakage paths for dissipation of charges that lead to PUF behavior.
Fig. 2. Five steps required for run-time access of a DRAM PUF. Only during steps (b)–(d) is the memory associated with the PUF not usable for other processes.
DRAM cells require periodic refresh of the stored charges, as otherwise the
capacitors lose their charge over time, which is referred to as DRAM cell decay or
leakage. The hardware memory controller takes care of this periodic refresh, whose
interval is defined by the vendor and is usually 32 ms or 64 ms. Without this
periodic refresh, some of the cells will slowly decay to 0, while others decay to 1,
depending on whether they are a true cell or an anti-cell. Because of the man-
ufacturing variations among DRAM cells, some cells decay faster than others,
which can be exploited as a PUF.
In our evaluation, we use a fixed initialization value initval = 0 for all cells. The
entropy of our measurements could thus be improved even further.
Overall, the challenge of a DRAM PUF can be defined as a tuple (id, t),
where id denotes the logical PUF instance (addr and size) and t denotes the
decay time after which the memory content is read. We will not specify the
initval as we assume it is fixed.
Although SRAM and DRAM PUFs are both considered weak PUFs [30], the
DRAM PUF presented in this paper offers multiple challenges due to the ability
to vary decay times t. Given two PUF measurements mx and mx+1 , taken at
corresponding decay times tx and tx+1 (tx+1 ≥ tx ), both mx+1 and mx can serve
as PUF responses. With increasing decay times t, the number of DRAM cells
flipping is monotonically increasing. Thus, m_x+1 consists of a number of newly
flipped bits as well as the majority² of bits that already flipped in m_x. In general,
if t_x ≤ t_x+1 and addr_x = addr_x+1, size_x = size_x+1, we observe m_x ⊆ m_x+1,
up to noise. However, note that it is not possible to measure responses for several
decay times t_0, t_1, ..., t_n at once. In particular, reading the PUF response at one
decay time will cause the memory to be refreshed (the cells are re-charged as
the data is read from DRAM cells into row buffers). Querying a PUF response
with a different decay time thus requires one to restart the experiment.
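The decay behavior and the subset relation m_x ⊆ m_x+1 can be illustrated with a small simulation; the retention-time distribution and cell count below are illustrative assumptions, not measured values:

```python
import random

def make_puf(num_cells, seed):
    """Model manufacturing variation: each cell gets a random retention
    time (seconds) after which it flips once refresh is disabled."""
    rng = random.Random(seed)
    return [rng.uniform(60, 2000) for _ in range(num_cells)]

def measure(puf, t):
    """Response for challenge decay time t: the set of flipped cell indices."""
    return {i for i, retention in enumerate(puf) if retention <= t}

puf = make_puf(4096, seed=1)
m1 = measure(puf, t=120)
m2 = measure(puf, t=240)
# A longer decay time only adds flips: m1 is a subset of m2 (up to noise,
# which this toy model omits).
assert m1 <= m2 and len(m2) > len(m1)
```

Reading the response at t = 120 s would re-charge the cells, which is why m1 and m2 must come from two separate experiment runs on real hardware.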
Deactivating DRAM refresh for PUF access during device operation is a non-
trivial task: when DRAM refresh cycles are disabled, critical data (such as data
belonging to the OS or user-space programs) will start to decay and the system
will crash. In our experiments, the Intel Galileo board with Yocto Linux crashes
about a minute after DRAM refresh is disabled. Therefore, we present a cus-
tomized solution which allows us to refresh critical code, but leaves PUF areas
untouched. This solution is based on two techniques dubbed selective DRAM
refresh and memory ballooning. The former allows for selectively refreshing the
memory regions occupied by the OS and other critical applications so that they
run normally and do not crash. Memory ballooning, on the other hand, safely
reserves the memory region that corresponds to a logical PUF without corrupting
critical data, and also protects the memory region from OS and user-space
program accesses, to let the cells decay during PUF measurement.
Selective DRAM Refresh. On some devices, such as the PandaBoard, DRAM
consists of several physical modules or logical segments, where the refresh of
each module/segment can be controlled individually. In this case, the PUF can be
allocated in a different memory segment from the OS and user-space programs.
When querying the PUF, only the refresh of the segment holding the PUF is
deactivated, while the other segments remain functional.
2
Due to noise, the set of flipping cells for a fixed time tx will not be completely stable.
Nevertheless, our experiments in Sect. 4 show very low amounts of noise.
On other devices, e.g., the Intel Galileo, the refresh rate can only be controlled
at the granularity of the entire DRAM³. Refresh at segment granularity
is not possible. However, memory rows can be refreshed implicitly once they
are accessed due to a read or a write operation. When a word line is selected
because of a memory access, the sense amplifier drives the bit-lines to either
the full supply voltage VDD or back down to 0 V, depending on the value that
was in the cell. In this way, the capacitor charge is restored to the value it had
before the charges leaked. Using the above principle, even if refresh of the whole
memory is disabled, selected memory rows can be refreshed by issuing a read
to a word within each of those rows. This functionality can be
implemented in a kernel module by reading a word within each memory row to
be refreshed (Sect. 3).
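The read-implies-refresh principle can be sketched in a toy simulation (row granularity and timing are illustrative; on real hardware the reads are issued by the kernel module against physical addresses):

```python
class DramRow:
    def __init__(self):
        self.ms_since_refresh = 0  # time since last implicit/explicit refresh

    def read(self):
        # A read drives the bit-lines and restores the capacitor charge,
        # implicitly refreshing the whole row.
        self.ms_since_refresh = 0

def run_without_global_refresh(rows, critical, total_ms, interval_ms=64):
    """Global DRAM refresh is disabled; only rows in 'critical' are read
    (and thus implicitly refreshed) every interval_ms milliseconds."""
    for _ in range(total_ms // interval_ms):
        for row in rows:
            row.ms_since_refresh += interval_ms
        for r in critical:
            rows[r].read()          # selective refresh by reading one word

rows = [DramRow() for _ in range(8)]
run_without_global_refresh(rows, critical={0, 1}, total_ms=120_000)
assert rows[0].ms_since_refresh == 0       # OS rows stay fresh
assert rows[7].ms_since_refresh == 120_000  # PUF rows keep decaying
```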
Ballooning System Memory. To query a chosen logical PUF, the DRAM portion
given by addr and size is overwritten with the respective initialization value
(initval) and refresh is deactivated. To prohibit applications from accessing the
PUF and thus implicitly refreshing it, we use memory ballooning concepts
developed for virtual machines [47]. Memory ballooning is a mechanism for
reserving a portion of the memory so as to prevent the memory region from
being used by the kernel or any application. This approach allows one to specify
the physical address (addr) and size (size) of the memory region that will be
reserved, i.e., the PUF. Once the PUF memory is “ballooned”, DRAM refresh can
be disabled and selective refresh enabled for the non-PUF memory region. After
PUF access is finished, the balloon can be deflated and the memory restored to
normal use.
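Putting the two mechanisms together, a run-time PUF query follows the five steps of Fig. 2. The sketch below mirrors that control flow; all function names and the in-memory decay model are our own illustrative stand-ins for the real kernel-module operations:

```python
from contextlib import contextmanager

RESERVED = set()  # toy stand-in for the balloon's bookkeeping

def reserve_region(addr, size):          # (a) inflate the memory balloon
    region = (addr, size)
    RESERVED.add(region)
    return region

def release_region(region):              # (e) deflate the balloon
    RESERVED.discard(region)

@contextmanager
def refresh_disabled():                  # real code: write refresh registers
    yield

def selective_refresh(memory, os_pages, duration_s, decay_bits):
    # (c) keep OS pages alive by reading them; let reserved cells decay.
    for p in os_pages:
        _ = memory[p]                    # a read implicitly refreshes the row
    for addr, bit in decay_bits:         # toy decay model: flip listed bits
        memory[addr] |= bit

def query_dram_puf(memory, os_pages, addr, size, decay_bits, initval=0):
    region = reserve_region(addr, size)                  # step (a)
    memory[addr:addr + size] = bytes([initval]) * size   # step (b)
    with refresh_disabled():                             # step (c)
        selective_refresh(memory, os_pages, 120, decay_bits)
    response = bytes(memory[addr:addr + size])           # step (d)
    release_region(region)                               # step (e)
    return response

mem = bytearray(64)
resp = query_dram_puf(mem, os_pages={0}, addr=32, size=16,
                      decay_bits=[(33, 0x01), (40, 0x80)])
assert resp[1] == 0x01 and resp[8] == 0x80
```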
DRAM PUFs differ from classic memory-based PUFs in that they can be evaluated
during run-time. An attacker who wants to evaluate the PUF has limited
capabilities to do so, because disabling and enabling DRAM refresh requires
writing to hardware registers, a task which can only be performed by
the kernel. An attacker thus requires root privileges. Furthermore, access to the
memory dedicated to the PUF is restricted to the kernel as well. Thus, a crucial
security assumption is that the firmware and operating system are trusted and an
attacker never gains root privileges.
An attacker may try to change the ambient temperature in order to influence
the bit flip characteristics, but a legitimate user can compensate for the temperature
effect by adjusting the decay time (as discussed in Sect. 4). The attacker could
also try to adapt the “rowhammering” approach presented in [17], i.e., inducing
random bit flips into DRAM cells by repeatedly accessing adjacent rows.
³ Although the test boards do have multiple DRAM modules, DRAM refresh cannot be disabled individually. In particular, on the Galileo board, one DRAM chip is used to store the most significant 8 bits of every 16 bits, while the other chip is used to store the least significant 8 bits of every 16 bits. Disabling refresh on a single chip is not possible, as half of each memory word would be lost.
⁴ A key feature of Linux, so-called workqueues, which allow tasks to be scheduled at specific time intervals, is used for this purpose.
Table 1. Time needed to perform memory reads (i.e. the selective refresh) to refresh
varying sizes of memory regions on the Intel Galileo board with DDR3 memory.
⁵ One required change is disabling or limiting the journaling service. Other options are to reduce the size of the journal so that it does not take up much memory, or to use persistent storage for the journal.
J_inter(v_1, v_2) = |v_1 ∩ v_2| / |v_1 ∪ v_2|.   (1)
For an ideal PUF, the value of J_inter(v_1, v_2) should be close to zero, indicating
that two logical PUFs rarely share flipped bits. Indeed, as Table 2 shows, our
DRAM PUFs exhibit almost ideal behavior.
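With responses represented as sets of flipped cell indices, both metrics reduce to one function; the example sets below are made up for illustration:

```python
def jaccard(v1: set, v2: set) -> float:
    """Jaccard index of two PUF responses given as sets of flipped cells."""
    if not v1 and not v2:
        return 1.0  # two empty responses are identical by convention
    return len(v1 & v2) / len(v1 | v2)

# J_intra: the same logical PUF measured twice -> should be close to 1.
m_a = {3, 17, 99, 256, 300}
m_b = {3, 17, 99, 256, 412}
# J_inter: two different logical PUFs -> should be close to 0.
m_c = {5, 88, 1024}

assert jaccard(m_a, m_b) == 4 / 6   # high overlap (intra-style)
assert jaccard(m_a, m_c) == 0.0     # no shared flips (inter-style)
```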
Fig. 3. Distribution of Jintra and Jinter values for (left) the PandaBoard and (right)
the Intel Galileo.
Note that simply observing the number of bits decaying after time t has
elapsed is not sufficient for determining k, as the bit decay is due to two
effects: (i) short-term noise that must be corrected and (ii) stable long-term decay
characteristics. In order to approximate k, which indicates the stable PUF
characteristics, multiple measurements for a single PUF can be averaged in order to
eliminate the noise component. Table 2 lists the fractional entropy H_t/N computed
this way. We observe that the entropy is significantly higher on the PandaBoard,
indicating more bit flips than on the Intel Galileo. This is most likely due to the
different technologies used to implement DRAM cells.
It is worthwhile to compare the entropy that can be extracted from different
PUF implementations. While SRAM PUFs usually show min-entropy values of
around 0.7–0.9 bits per cell, the entropy of the proposed DRAM PUF is one order
of magnitude smaller. This can be explained as follows: whilst within SRAM the
majority of cells have a unique startup pattern, in the case of DRAM only some
cells will flip during the observed decay time. However, this lower entropy can
easily be compensated by the orders-of-magnitude larger (usually a thousand
times) number of DRAM cells available.
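As a rough back-of-the-envelope sketch of why the per-cell entropy is low: assuming the response is fully described by the positions of k stable flips among N cells (our simplification, not necessarily the paper's exact estimator), the entropy is at most log2 C(N, k):

```python
from math import lgamma, log

def log2_binom(n: int, k: int) -> float:
    """log2 of the binomial coefficient C(n, k), via log-gamma for large n."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def fractional_entropy(n_cells: int, k_stable_flips: int) -> float:
    """Entropy per cell if the response is the set of k flip positions."""
    return log2_binom(n_cells, k_stable_flips) / n_cells

# Few flipping cells => entropy well below SRAM's ~0.7-0.9 bit/cell.
# The cell and flip counts here are illustrative, not measured values.
h = fractional_entropy(n_cells=262_144, k_stable_flips=1_500)
assert 0.01 < h < 0.1
```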
Decay Dependency on Time and Temperature. Figure 4 shows the decay rate as a
function of decay time for both the PandaBoard and Intel Galileo. All measure-
ments were taken at ambient room temperature with DRAM chips operating at
around 40 ◦ C. Every data point shows the average of all logical PUFs. We see
that the decay rate significantly increases with time on the Galileo. The Panda-
Board shows an s-like decay that has a steep beginning and saturates towards
t = 360 s.
This plot allows us to estimate the number of time-dependent challenges
that a logical PUF can provide. In order to allow for unique identification at
any given decay time, the set of decay times t_1, t_2, ..., t_n must be chosen such
that the corresponding measurements show a minimum number of new bit
flips, referred to as ε_bits, which is greater than the inherent noise. Given ε_bits,
the set of viable decay times (and thus the challenges of a logical PUF) can be
Fig. 4. Time-dependency of decay rate for DRAM modules on the (left) PandaBoard
and the (right) Intel Galileo at room temperature.
Fig. 5. Relation between the temperature and the decay rate measured on (left) the
PandaBoard and (right) the Intel Galileo.
chosen accordingly. We used the maximum noise level previously observed for
each respective decay time t in order to get a conservative approximation of the
maximum number of challenges per logical PUF. We experimentally determined
the maximum number of decay times to be n = 7 for the Intel Galileo and n = 2
for the PandaBoard. The number assumes a maximum decay time tn ≤ 360 s
and possible challenges are indicated by vertical red lines in Fig. 4. The smaller
number for the PandaBoard is mainly due to higher noise. In particular, we
observe that for the PandaBoard the J_intra values can be comparatively low, e.g.,
J_intra = 0.3484 at t = 360 s. However, since J_inter is an order of magnitude different
from J_intra, the unique identification ability is preserved.
A second factor influencing the decay rate of DRAM cells is temperature. In
Fig. 5 we show the dependency between temperature and decay rate for DRAM
modules on the Intel Galileo and the PandaBoard. In order to control the tem-
perature, we used a metal ceramic heater to heat the surface of DRAM modules
to the desired temperature and took the measurements.
Although temperature affects the decay rate significantly, it does not change
the decay characteristics much; instead, it affects decay time: we observed that
by using a carefully chosen smaller decay time t_T' < t at a higher temperature
T' > T, the same PUF response can be obtained as with decay time t at
temperature T. In our experiments, we derive the following dependency for the Intel
Galileo boards:
t_T' = t · e^(−0.0662 · (T' − T)).   (3)
Hence, if the PUF is evaluated at a different temperature than during enroll-
ment, this can be compensated through adapting the decay time according
to Eq. (3). In order to support this statement, we calculated the noise Jintra
between an enrollment measurement at room temperature (40 ◦ C) and a mea-
surement taken at a different temperature by adjusting decay time. For this
purpose, we created reference measurements at room temperature with decay
times t_x = {120 s, 180 s, 240 s, 300 s, 360 s}. In a next step, we used equivalent
decay times t_T' that correspond to temperatures T' = {40 °C, 50 °C, 60 °C}
Fig. 6. J_intra values for decay times t_1–t_5 at 40 °C, 50 °C, and 60 °C.
Fig. 7. Distribution of the Jaccard index between pairs of measurements.
and measured the PUF accordingly. As shown in Fig. 6, for all measurements,
Jintra lies within the usual noise level. Thus, differences in temperature can be
accommodated by adjusting decay time accordingly.
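Equation (3) gives a direct recipe for this compensation. A sketch; the coefficient −0.0662 is the Galileo-specific fit from the text and would have to be re-derived for other devices:

```python
from math import exp

GALILEO_TEMP_COEFF = -0.0662  # per-degree-Celsius fit for the Intel Galileo

def equivalent_decay_time(t_enroll_s, temp_enroll_c, temp_now_c,
                          coeff=GALILEO_TEMP_COEFF):
    """Decay time at temp_now_c that reproduces the response enrolled
    with decay time t_enroll_s at temp_enroll_c (Eq. 3)."""
    return t_enroll_s * exp(coeff * (temp_now_c - temp_enroll_c))

# Same temperature -> unchanged decay time.
assert equivalent_decay_time(240, 40, 40) == 240
# A hotter chip decays faster, so a shorter decay time suffices.
assert equivalent_decay_time(240, 40, 50) < 240
```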
Stability over Time. During the extended lifetime of devices, DRAM aging effects
will begin to take place. Existing work on SRAM PUFs has explored aging
effects [23,25,32,38]. We are aware of limited work on aging-related effects in
DRAM cells with regard to security [36]. Figure 7 shows the histogram of J_intra
values for measurements of an Intel Galileo, taken 4 months apart. Three logical
PUFs were measured, and the results are combined. Note that the measurements also
include the noise introduced by temperature changes in our lab. J_intra values
were computed for measurement pairs, each comprising an enrollment and a
reconstruction measurement. The values are similar to the J_intra results shown
in Table 2, suggesting sufficient stability of DRAM PUFs over long-term usage.
In this section we propose two novel PUF-based protocols that draw their
security from the time-dependent decay characteristics of a DRAM PUF instance
when queried at different decay times. Both protocols involve two parties, a client
C and server S. Whilst the first protocol authenticates C towards an honest S,
the second protocol establishes a secure channel between C and S. The protocols
leverage PUF instances extracted from DRAM modules and thus require C to
own a device D that implements a DRAM PUF during the course of the protocol.
For the sake of clarity, we will refer to the PUF instance on the client’s device
as C itself. Further, we omit the full specification parameters of the logical PUF
instance to be queried. Instead of stating all parameters (addr, size) in every
protocol, we refer to one logical PUF instance as id.
Adversary and Threat Model. Our adversary model for the protocols considers a
passive attacker, who is able to observe the network traffic between client and
server and who can capture transmitted messages, in particular previous PUF
measurements that were sent by the client. Furthermore, we consider the Fuzzy
Extractor construction, in particular the ECC parameters as well as the Helper
Data, to be public and thus known by the attacker.
Enrollment. An enrollment phase precedes both protocols; it is assumed to
be conducted at a trusted party SYS, such as a manufacturer or a system integrator.
For each logical PUF instance, during the enrollment phase, SYS queries the
PUF n times in order to get a set of measurements M = {m_id,0, m_id,1, ..., m_id,n}
at a defined set of decay times T = {t_0, t_1, ..., t_n}, i.e., m_id,x = PUF(id, t_x).
Decay times t_0, t_1, ..., t_n are carefully chosen such that t_0 < t_1 < ... < t_n and,
for every pair of subsequent decay times, the number of newly introduced bit
flips in PUF measurements is always greater than a security parameter ε_bits. The
parameter ε_bits can be changed to adjust the security and usability of the protocol
(see the end of this section).
To generate keys for the secure channel establishment protocol, SYS chooses
a set K = {k_id,0, k_id,1, ..., k_id,n} containing uniformly distributed keys and uses
a Fuzzy Extractor to create a set of Helper Data W = {w_id,0, w_id,1, ..., w_id,n},
such that (k_id,x, w_id,x) = GEN(m_id,x), where GEN(·) denotes the generation
function of the Fuzzy Extractor. While the current Fuzzy Extractor constructions
[5,24] might leak entropy from the helper data in the case of biased PUFs, we
assume there is a construction tailored for DRAM PUFs. Eventually, T, M, W,
and K will be given to S, whilst the device will be handed to C in a secure
manner.
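The enrollment flow can be outlined as follows. The PUF model and GEN(·) are toy stand-ins (in particular, the helper-data stub is not a real Fuzzy Extractor), and all parameter values are illustrative:

```python
import os
import random
import hashlib

def puf(id_, t):
    """Toy stand-in for PUF(id, t): deterministic per id, grows with t."""
    rng = random.Random(id_)
    retention = [rng.uniform(60, 2000) for _ in range(4096)]
    return frozenset(i for i, r in enumerate(retention) if r <= t)

def gen(measurement):
    """Placeholder for the Fuzzy Extractor GEN: returns (key, helper data).
    A real GEN would let the key be reconstructed from a noisy re-measurement."""
    key = os.urandom(16)
    helper = hashlib.sha256(repr(sorted(measurement)).encode()).digest()
    return key, helper

def enroll(id_, decay_times, eps_bits):
    M, K, W = [], [], []
    prev = frozenset()
    for t in decay_times:
        m = puf(id_, t)
        # every step must introduce more than eps_bits new bit flips
        assert len(m - prev) > eps_bits, "decay times chosen too close"
        k, w = gen(m)
        M.append(m); K.append(k); W.append(w)
        prev = m
    return decay_times, M, K, W   # T, M, K, W handed to the server S

T, M, K, W = enroll(id_=7, decay_times=[120, 240, 480, 960], eps_bits=50)
assert len(M) == len(K) == len(W) == 4
```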
Device Authentication. In order to authenticate the client C towards an honest
server S, the server chooses the smallest decay time t_x not previously used for
logical PUF id in a run of the authentication protocol. Next, S transmits id and
t_x to C, who uses it as input to his or her PUF to retrieve a measurement m'_id,x,
which is sent back to S. S checks if m'_id,x is close enough to the stored measurement
m_id,x. This is done by checking whether the Jaccard index of m'_id,x and m_id,x is
larger than a given threshold ε_auth, defined based on the noise of measurement
m_x. This authentication protocol is depicted on the left side of Fig. 8. Note that
for subsequent authentication trials, decay times are monotonically increasing.
The authentication is designed to be lightweight for the client in terms of
computational overhead and memory footprint. It does not require C to store
any long-term Helper Data or perform expensive decoding that is usually part
of the key reconstruction process performed by classical Fuzzy Extractors. This
is especially useful in the context of highly resource-constrained low-cost devices
that have to be authenticated towards a server repeatedly.
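Server-side, the check is a Jaccard comparison against the enrolled measurement, with decay times consumed in increasing order. A minimal sketch with made-up enrollment data and an arbitrary threshold ε_auth = 0.6:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

class Server:
    def __init__(self, T, M, eps_auth=0.6):
        self.pending = list(zip(T, M))  # enrolled (t, m) pairs, ascending t
        self.eps_auth = eps_auth

    def challenge(self):
        """Smallest decay time not previously used for this logical PUF."""
        t, m = self.pending.pop(0)
        self._expected = m
        return t

    def verify(self, response):
        return jaccard(response, self._expected) > self.eps_auth

# Enrolled reference data and the device's slightly noisy re-measurement
# are made-up sets for illustration.
server = Server(T=[120, 240], M=[{1, 5, 9, 12}, {1, 5, 9, 12, 30, 41}])
t = server.challenge()        # server sends (id, t) to the client
client_resp = {1, 5, 9}       # client measures PUF(id, t); one flip missed
assert t == 120 and server.verify(client_resp)
```

The client never stores Helper Data or runs a decoder here; all it does is measure and transmit, which is what keeps the protocol lightweight.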
Secure Channel Establishment. Using similar ideas, a secure channel can be
established between C and S; see Fig. 8 (right side). Again, S sends the smallest
not previously used decay time t_x for logical PUF id, this time along with the
corresponding Helper Data w_id,x. The client evaluates his PUF instance id using
Fig. 8. Sequence diagram of (left) the device authentication protocol and (right) the
secure channel establishment protocol.
The probability that an attacker guesses M random bits and l of them happen
to be real new bit flips of the subsequent measurement is
C(ε_bits, l) · C(N − ε_bits, M − l) / C(N, M), where C(n, k) denotes the binomial
coefficient. Note that in this case, the Jaccard index of the attacker's guess and the
true measurement is J(m'_x+1, m_x+1) = (l + |m_x|) / (ε_bits + M − l + |m_x|), where
|m_x| is the number of bit flips in the previous measurement m_x. Assuming the
authentication and key generation is successful if J(m'_x+1, m_x+1) > Δ, the attacker
will only be successful if l is greater than
((ε_bits + M) · Δ)/(1 + Δ) − (|m_x| · (1 − Δ))/(1 + Δ). Thus, the probability for an
attacker to make a successful guess is:

P_M = Σ_{l > ((ε_bits + M)·Δ)/(1+Δ) − (|m_x|·(1−Δ))/(1+Δ)}^{ε_bits} C(ε_bits, l) · C(N − ε_bits, M − l) / C(N, M).   (4)

The attacker can choose any M which maximizes the success probability P_M.
If N is large and M is between M_min = ε_bits · Δ − (1 − Δ) · |m_x| and M_max =
(ε_bits + (1 − Δ) · |m_x|)/Δ, P_M is monotonically decreasing with M.⁶ Hence, the
attacker can choose M = M_min to maximize the success probability.
In order to provide 128-bit security, P = max_M {P_M} must be smaller
than 2^−128. Given Formula (4) and the PUF characteristics, one can fix N and Δ,
then derive ε_bits for different |m_x|, and subsequently estimate the feasible decay
time. For the Intel Galileo, as a conservative estimation, the space of potential
new bit flips is of size N = 30 KB (assuming that, out of a 32 KB logical PUF,
less than 2 KB are flipping in m_x), and the threshold is Δ = 0.6. To set |m_0|,
a point where the decay is larger than the noise should be found. To be conservative,
the minimum of the maximum intra-HD is used as a reference for |m_0|.⁷ Hence, we set
|m_0| = 80, and then we can get ε_bits,1 = 73 and |m_1| = |m_0| + ε_bits,1 = 153. Then,
with |m_1|, we can get ε_bits,2 = 122 and thus |m_2| = 275, etc. Consequently, on
the Galileo, a 32 KB logical PUF can provide 7 challenges, each with a decay
time shorter than 360 s.
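Formula (4) can be evaluated exactly with integer binomials. The sketch below uses the Galileo example from the text, under our reading that N = 30 KB of candidate positions means 30 · 1024 · 8 bit positions:

```python
from math import comb, floor
from fractions import Fraction

def p_success(N, eps_bits, m_prev, delta, M):
    """Formula (4): probability that an M-bit guess contains enough true
    new bit flips (l) to push the Jaccard index above the threshold delta."""
    bound = ((eps_bits + M) * delta - m_prev * (1 - delta)) / (1 + delta)
    l_min = floor(bound) + 1                     # l must strictly exceed bound
    hits = sum(comb(eps_bits, l) * comb(N - eps_bits, M - l)
               for l in range(max(l_min, 0), min(eps_bits, M) + 1))
    return Fraction(hits, comb(N, M))

N = 30 * 1024 * 8           # assumed: 30 KB of candidate positions, in bits
eps_bits, m_prev = 73, 80   # first step of the Galileo example in the text
delta = Fraction(3, 5)      # Jaccard threshold 0.6

# The attacker picks the M that maximizes the success probability.
p_best = max(p_success(N, eps_bits, m_prev, delta, M) for M in range(1, 40))
assert p_best < Fraction(1, 2**128)   # below the 128-bit security bound
```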
This novel work on run-time accessible DRAM PUFs still leaves a number of open
research issues and questions that need to be addressed. This creates opportuni-
ties for the community to refine and further improve the concept of DRAM PUFs.
Temperature dependency of the DRAM cell decay allows physical attackers
to control the decay rate by adjusting the ambient temperature. For example,
heating a DRAM chip may “speed up” the decay rate, shortening the time
needed for an attacker to observe certain bits flip. Further investigation on the
temperature dependence is needed and counter-measures need to be developed
to thwart such attacks.
⁶ P_M > 0 when M is between ε_bits · Δ − (1 − Δ) · |m_x| and (ε_bits + (1 − Δ) · |m_x|)/Δ.
⁷ If the PUF characteristic is better understood for t < 120 s, a smaller |m_0| may be chosen.
Voltage dependency of the DRAM cell decay was not considered in this paper,
as commodity devices usually give no control over DRAM voltages. However,
voltage dependency could be another viable characteristic used for the run-time
accessible DRAM PUFs, if future commodity devices provide interfaces that
allow for fine-grained control of DRAM voltages.
Readout time of the DRAM PUFs is in the order of minutes. This can be
seen as a disadvantage, although in many cases it can be compensated by the
advantage of being able to access the DRAM PUFs at run-time. Use cases that
allow for such relatively long readout times need to be better understood. At
the same time, improving the readout time is critical in order to broaden the
applicability of DRAM PUFs.
Security assumptions, e.g., the trusted firmware and the operating system,
may be considered too strong. While these are also required for the other
PUFs in commodity devices, one may look for solutions requiring a smaller
trusted computing base.
Fuzzy Extractor constructions are needed that are either specifically tailored
towards heavily biased PUF responses, found in decay-based DRAM PUFs, or
that use the introduced Jaccard distances. Classic Fuzzy Extractors are based on
Hamming distance-related metrics and are not secure for heavily biased PUFs.
Thus, new constructions for biased PUFs, such as [24,39], should be developed.
7 Conclusion
In this work we presented intrinsic PUFs that can be extracted from Dynamic
Random-Access Memory (DRAM) in commodity devices. An evaluation of the
DRAM PUFs found on unmodified commodity devices, in particular the PandaBoard
and the Intel Galileo, showed their robustness, uniqueness, and randomness, as
well as their stability over a period of at least a few months. Moreover, in contrast to
existing DRAM and SRAM PUFs, we demonstrated a system model that is able
to query the PUF instance directly during run-time using a Linux kernel module,
based on the ideas of selective DRAM refresh and memory ballooning. We further
presented protocols for device authentication and identification that draw their
security from time-dependent decay characteristics of our DRAM PUF. Our
intrinsic DRAM PUFs overcome two limitations of the popular intrinsic SRAM
PUFs: they can be accessed at run-time, and they have an expanded
challenge-response space due to the use of decay time t as part of the challenge.
Consequently, our work presents a new alternative for device authentication by
leveraging DRAM in commodity devices.
Acknowledgements. This work has been co-funded by the DFG as part of project P3
within the CRC 1119 CROSSING. This work was also partly funded by CASED. The
authors would like to thank Kevin Ryan and Ethan Weinberger for their help with
building the heater setup used in the experiments, and Intel for donating the Intel
Galileo boards used in this work. The authors would also like to thank anonymous
CHES reviewers, and especially our shepherd, Roel Maes, for numerous suggestions
and guidance in making the final version of this paper.
References
1. Hacking DefCon 23’s IoT Village Samsung fridge. https://www.pentestpartners.
com/blog/hacking-defcon-23s-iot-village-samsung-fridge/. Accessed Feb 2016
2. Armknecht, F., Maes, R., Sadeghi, A.R., Sunar, B., Tuyls, P.: Memory leakage-
resilient encryption based on physically unclonable functions. In: Sadeghi, A.-R.,
Naccache, D. (eds.) Towards Hardware-Intrinsic Security, pp. 135–164. Springer,
Heidelberg (2010)
3. Bacha, A., Teodorescu, R.: Authenticache: harnessing cache ECC for system
authentication. In: Proceedings of International Symposium on Microarchitecture,
pp. 128–140. ACM (2015)
4. Batra, P., Skordas, S., LaTulipe, D., Winstel, K., Kothandaraman, C., Himmel, B.,
Maier, G., He, B., Gamage, D.W., Golz, J., et al.: Three-dimensional wafer stacking
using Cu TSV integrated with 45 nm high performance SOI-CMOS embedded
DRAM technology. J. Low Power Electron. Appl. 4, 77–89 (2014)
5. Dodis, Y., Reyzin, L., Smith, A.: Fuzzy extractors: how to generate strong keys
from biometrics and other noisy data. In: Cachin, C., Camenisch, J.L. (eds.)
EUROCRYPT 2004. LNCS, vol. 3027, pp. 523–540. Springer, Heidelberg (2004)
6. Foster, I., Prudhomme, A., Koscher, K., Savage, S.: Fast and vulnerable: a story
of telematic failures. In: USENIX Workshop on Offensive Technologies (2015)
7. Gassend, B., Clarke, D., Van Dijk, M., Devadas, S.: Delay-based circuit authen-
tication and applications. In: Proceedings of the ACM Symposium on Applied
Computing, pp. 294–301. ACM (2003)
8. Greenberg, A.: Hackers remotely kill a jeep on the highway–with me in it. Wired
(2015). https://www.wired.com/2015/07/hackers-remotely-kill-jeep-highway/.
Accessed 08 July 16
9. Guajardo, J., Kumar, S.S., Schrijen, G.-J., Tuyls, P.: FPGA intrinsic PUFs and
their use for IP protection. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007.
LNCS, vol. 4727, pp. 63–80. Springer, Heidelberg (2007)
10. Guajardo, J., Kumar, S.S., Schrijen, G.J., Tuyls, P.: Brand and IP protection with
physical unclonable functions. In: IEEE International Symposium on Circuits and
Systems, pp. 3186–3189 (2008)
11. Hashemian, M.S., Singh, B., Wolff, F., Weyer, D., Clay, S., Papachristou, C.:
A robust authentication methodology using physically unclonable functions in
DRAM arrays. In: Proceedings of the Design, Automation and Test in Europe
Conference, pp. 647–652 (2015)
12. Hernandez, G., Arias, O., Buentello, D., Jin, Y.: Smart nest thermostat: a smart
spy in your home. Black Hat USA (2014)
13. Jaccard, P.: Étude comparative de la distribution florale dans une portion des Alpes
et du Jura. Impr. Corbaz (1901)
14. Katzenbeisser, S., Kocabaş, Ü., Rožić, V., Sadeghi, A.-R., Verbauwhede, I., Wachs-
mann, C.: PUFs: myth, fact or busted? A security evaluation of physically unclon-
able functions (PUFs) cast in silicon. In: Prouff, E., Schaumont, P. (eds.) CHES
2012. LNCS, vol. 7428, pp. 283–301. Springer, Heidelberg (2012)
15. Keeth, B.: DRAM Circuit Design: Fundamental and High-Speed Topics. Wiley,
Hoboken (2008)
16. Keller, C., Gurkaynak, F., Kaeslin, H., Felber, N.: Dynamic memory-based physi-
cally unclonable function for the generation of unique identifiers and true random
numbers. In: IEEE International Symposium on Circuits and Systems, pp. 2740–
2743. IEEE (2014)
17. Kim, Y., Daly, R., Kim, J., Fallin, C., Lee, J.H., Lee, D., Wilkerson, C., Lai, K.,
Mutlu, O.: Flipping bits in memory without accessing them: an experimental study
of DRAM disturbance errors. In: ACM SIGARCH Computer Architecture News,
pp. 361–372 (2014)
18. Kocabaş, Ü., Peter, A., Katzenbeisser, S., Sadeghi, A.-R.: Converse PUF-based
authentication. In: Camp, L.J., Volkamer, M., Reiter, M., Zhang, X., Katzen-
beisser, S., Weippl, E. (eds.) Trust 2012. LNCS, vol. 7344, pp. 142–158. Springer,
Heidelberg (2012)
19. Kohnhäuser, F., Schaller, A., Katzenbeisser, S.: PUF-based software protection
for low-end embedded devices. In: Conti, M., Schunter, M., Askoxylakis, I. (eds.)
TRUST 2015. LNCS, vol. 9229, pp. 3–21. Springer, Heidelberg (2015)
20. Kong, J., Koushanfar, F., Pendyala, P.K., Sadeghi, A.R., Wachsmann, C.:
PUFatt: embedded platform attestation based on novel processor-based PUFs.
In: ACM/EDAC/IEEE Design Automation Conference, pp. 1–6 (2014)
21. Liu, J., Jaiyen, B., Kim, Y., Wilkerson, C., Mutlu, O.: An experimental study
of data retention behavior in modern DRAM devices: implications for retention
time profiling mechanisms. In: ACM SIGARCH Computer Architecture News, pp.
60–71 (2013)
22. Liu, W., Zhang, Z., Li, M., Liu, Z.: A trustworthy key generation prototype based
on DDR3 PUF for wireless sensor networks. Sensors 14, 11542–11556 (2014)
23. Maes, R., van der Leest, V.: Countering the effects of silicon aging on SRAM PUFs.
In: IEEE International Symposium on Hardware-Oriented Security and Trust, pp.
148–153 (2014)
24. Maes, R., van der Leest, V., van der Sluis, E., Willems, F.: Secure key generation
from biased PUFs. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015. LNCS, vol.
9293, pp. 517–534. Springer, Heidelberg (2015)
25. Maes, R., Rožić, V., Verbauwhede, I., Koeberl, P., Van der Sluis, E., Van der Leest,
V.: Experimental evaluation of physically unclonable functions in 65 nm CMOS.
In: Proceedings of the ESSCIRC, pp. 486–489 (2012)
26. Phone as a Token - turn your phone into an authentication token. https://www.
intrinsic-id.com/technology/phone-as-a-token/. Accessed Feb 2016
27. Rahmati, A., Hicks, M., Holcomb, D.E., Fu, K.: Probable cause: the deanonymizing
effects of approximate DRAM. In: Proceedings of the International Symposium on
Computer Architecture, pp. 604–615 (2015)
28. Rosenblatt, S., Chellappa, S., Cestero, A., Robson, N., Kirihata, T., Iyer, S.S.:
A self-authenticating chip architecture using an intrinsic fingerprint of embedded
DRAM. IEEE J. Solid-State Circuits 48, 2934–2943 (2013)
29. Rosenblatt, S., Fainstein, D., Cestero, A., Safran, J., Robson, N., Kirihata, T.,
Iyer, S.S.: Field tolerant dynamic intrinsic chip ID using 32 nm high-K/metal gate
SOI embedded DRAM. IEEE J. Solid-State Circuits 48, 940–947 (2013)
30. Rührmair, U., Sölter, J., Sehnke, F.: On the foundations of physical unclonable
functions. IACR Cryptology ePrint Archive 2009, p. 277 (2009)
31. Schaller, A., Arul, T., van der Leest, V., Katzenbeisser, S.: Lightweight anti-
counterfeiting solution for low-end commodity hardware using inherent PUFs. In:
Holz, T., Ioannidis, S. (eds.) Trust 2014. LNCS, vol. 8564, pp. 83–100. Springer,
Heidelberg (2014)
32. Schaller, A., Škorić, B., Katzenbeisser, S.: On the systematic drift of physically
unclonable functions due to aging. In: Proceedings of the International Workshop
on Trustworthy Embedded Devices, pp. 15–20. ACM (2015)
Run-Time Accessible DRAM PUFs in Commodity Devices 453
33. Scheel, R.A., Tyagi, A.: Characterizing composite user-device touchscreen physical
unclonable functions (pufs) for mobile device authentication. In: Proceedings of the
International Workshop on Trustworthy Embedded Devices, pp. 3–13. ACM (2015)
34. Schneier, B.: The internet of things is wildly insecure—and
often unpatchable. Wired (2014). http://www.wired.com/2014/01/
theres-no-good-way-to-patch-the-internet-of-things-and-thats-a-huge-problem/.
Accessed 08 July 2016
35. Schrijen, G.J., van der Leest, V.: Comparative analysis of SRAM memories used
as PUF primitives. In: Proceedings of the Conference on Design, Automation and
Test in Europe, pp. 1319–1324. EDA Consortium (2012)
36. Schroeder, B., Pinheiro, E., Weber, W.D.: DRAM errors in the wild: a large-scale
field study. In: ACM SIGMETRICS Performance Evaluation Review, pp. 193–204
(2009)
37. Schulz, S., Sadeghi, A.R., Wachsmann, C.: Short paper: lightweight remote attesta-
tion using physical functions. In: Proceedings of the ACM Conference on Wireless
Network Security, pp. 109–114 (2011)
38. Selimis, G., Konijnenburg, M., Ashouei, M., Huisken, J., De Groot, H., Van der
Leest, V., Schrijen, G.J., Van Hulst, M., Tuyls, P.: Evaluation of 90 nm 6T-SRAM
as Physical Unclonable Function for secure key generation in wireless sensor nodes.
In: IEEE International Symposium on Circuits and Systems, pp. 567–570 (2011)
39. Skoric, B.: A trivial debiasing scheme for helper data systems. Cryptology ePrint
Archive, Report 2016/241 (2016)
40. Suh, G.E., Devadas, S.: Physical unclonable functions for device authentication
and secret key generation. In: Proceedings of the Design Automation Conference,
pp. 9–14 (2007)
41. Tehranipoor, F., Karimina, N., Xiao, K., Chandy, J.: DRAM based intrinsic phys-
ical unclonable functions for system level security. In: Proceedings of the Great
Lakes Symposium on VLSI, pp. 15–20 (2015)
42. Intrinsic-ID to Showcase TrustedSensor IoT Security Solution at InvenSense Devel-
opers Conference. https://www.intrinsic-id.com/intrinsic-id-to-showcase-trusted
sensor-iot-security-solution-at-invensense-developers-conference/. Accessed Feb
2016
43. Tuyls, P., Batina, L.: RFID-tags for anti-counterfeiting. In: Pointcheval, D. (ed.)
CT-RSA 2006. LNCS, vol. 3860, pp. 115–131. Springer, Heidelberg (2006)
44. Tuyls, P., Schrijen, G.J., Willems, F., Ignatenko, T., Skoric, B.: Secure key storage
with PUFs. In: Tuyls, P., Skoric, B., Kevenaar, T. (eds.) Security with Noisy Data-
On Private Biometrics, Secure Key Storage and Anti-Counterfeiting, pp. 269–292.
Springer, London (2007)
45. Tuyls, P., Škorić, B.: Secret key generation from classical physics: physical unclone-
able functions. In: Mukherjee, S., Aarts, R.M., Roovers, R., Widdershoven, F.,
Ouwerkerk, M. (eds.) AmIware Hardware Technology Drivers of Ambient Intelli-
gence, pp. 421–447. Springer, Netherlands (2006)
46. Viega, J., Thompson, H.: The state of embedded-device security (spoiler alert: it’s
bad). IEEE Secur. Priv. 10, 68–70 (2012)
47. Waldspurger, C.A.: Memory resource management in VMware ESX server. In:
ACM SIGOPS Operating Systems Review, pp. 181–194 (2002)
Side Channel Countermeasures II
On the Multiplicative Complexity of Boolean
Functions and Bitsliced Higher-Order Masking
1 Introduction
One of the most widely used strategies to protect software implementations of blockciphers against side-channel attacks consists in applying secret sharing at the implementation level. This strategy, also known as (higher-order) masking, notably achieves provable security in the probing security model [ISW03] and in the noisy leakage model [PR13,DDF14]. When designing a higher-order masking scheme for a given blockcipher, the main issue is the secure and efficient computation of the s-box. Most of the proposed solutions (see for instance [RP10,CRV14,CPRR15]) are based on a polynomial representation of the s-box over the finite field F_{2^n} (where n is the input bit-length), for which the field multiplications are secured using the ISW scheme due to Ishai et al. [ISW03].
An alternative approach has recently been put forward which consists in apply-
ing higher-order masking at the Boolean level by bitslicing the s-boxes within a
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 457–478, 2016.
DOI: 10.1007/978-3-662-53140-2_22
458 D. Goudarzi and M. Rivain
2 Preliminaries

2.1 Boolean Functions

Let F_2 denote the field with two elements and let n be a positive integer. A Boolean function f in n variables is a function from F_2^n to F_2. The set of such functions
On the Multiplicative Complexity 459
apply this approach to get highly efficient implementations of AES and PRESENT with masking order up to 10. In their implementations, bitslicing is applied at the s-box level. Specifically, based on a Boolean circuit for an s-box S, one can perform ℓ parallel evaluations of S in software by replacing each gate of the circuit with the corresponding bitwise instruction, where ℓ is the bit-size of the underlying CPU architecture. As a result, the only nonlinear operations in the parallel s-box processing are bitwise AND instructions between ℓ-bit registers, which can be efficiently secured using the ISW scheme. Such an approach achieves an important speedup compared to polynomial methods since (i) ISW-based ANDs are substantially faster than ISW-based field multiplications in practice, and (ii) all the s-boxes within a cipher round are computed in parallel. The authors of [GR16] propose an additional optimization. In their context, the target architecture (ARM) is of size ℓ = 32 bits, whereas the number of s-boxes per round is 16 (yielding 16-bit bitslice registers). Therefore, they suggest grouping the ANDs by pairs in order to perform a single ISW-based 32-bit AND where the standard method would have performed two ISW-based 16-bit ANDs. This roughly decreases the complexity by a factor of up to two.
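The ISW-secured AND on packed registers can be illustrated with a minimal Python sketch. The function names and the use of Python integers as ℓ-bit registers are ours, not the paper's; this is only a reference model of the scheme's correctness, not a side-channel-protected implementation:

```python
import secrets

def share(x, d, bits=32):
    """Split an ell-bit word x into d random XOR (Boolean) shares."""
    shares = [secrets.randbits(bits) for _ in range(d - 1)]
    last = x
    for s in shares:
        last ^= s                 # the last share completes the XOR to x
    return shares + [last]

def isw_and(a, b, bits=32):
    """ISW secure AND of two d-share bitwise sharings a and b."""
    d = len(a)
    c = [a[i] & b[i] for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbits(bits)          # fresh randomness per pair
            c[i] ^= r
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c

def unshare(shares):
    """Recombine a sharing by XORing all shares."""
    out = 0
    for s in shares:
        out ^= s
    return out
```

XOR-recombining the output shares yields the bitwise AND of the two masked words for any number of shares d, which is exactly why one ℓ-bit ISW-AND securely evaluates ℓ AND gates in parallel.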
and
$$f_1, f_2, \ldots, f_m \in \langle 1, x_1, x_2, \ldots, x_n, g_1 \cdot h_1, \ldots, g_t \cdot h_t \rangle. \qquad (3)$$
In [BPP00], Boyar et al. provide a constructive upper bound for any Boolean
function:
The particular case of Boolean functions in 4 and 5 variables has been investigated by Turan and Peralta in [TP14]. They give a complete characterization of the affine-equivalence classes of these functions, and they show that every f ∈ F_4 satisfies C(f) ≤ 3 and every f ∈ F_5 satisfies C(f) ≤ 4.
Other works have focused on the multiplicative complexity of particular kinds of Boolean functions. In [MS92], Mirwald and Schnorr investigate in depth the case of functions with quadratic ANF. In particular, they show that such functions have multiplicative complexity at most ⌊n/2⌋. Boyar et al. give further upper bounds for symmetric Boolean functions in [BPP00].
out of reach. Moreover, the method is not generic in the sense that the obtained decomposition holds for a single s-box and does not provide an upper bound for the multiplicative complexity of s-boxes of a given size.
When m = n, the min is achieved by k = (n − log₂ n)/2 for most n ∈ N, which gives C(S) ≤ B_n with
$$B_n \approx \sqrt{n}\, 2^{\frac{n}{2}+1} - \frac{3n + \log_2 n}{2} + 1. \qquad (8)$$
We further introduce in this paper a heuristic decomposition method achieving lower multiplicative complexity. Our general result is summarized in the following theorem:

Theorem 3. For every S ∈ F_{n,m} (the set of functions from F_2^n to F_2^m), we have C(S) ≤ C_{n,m} with
$$C_{n,m} \approx \sqrt{m}\, 2^{\frac{n}{2}+1} - m - n - 1. \qquad (9)$$
And in particular
$$C_{n,n} = \begin{cases} 17 & \text{for } n = 5\\ 31 & \text{for } n = 6\\ 50 & \text{for } n = 7 \end{cases} \quad\text{and}\quad C_{n,n} = \begin{cases} 77 & \text{for } n = 8\\ 122 & \text{for } n = 9\\ 190 & \text{for } n = 10 \end{cases} \qquad (10)$$
n                                  4   5   6   7   8   9   10
Theorem 2                          8  16  29  47  87  120  190
Our generic method (C_{n,n})       8  17  31  50  77  122  190
Our improved method (C*_{n,n})     7  13  23  38  61   96  145
n                 4   5   6   7   8    9   10
C_{n,n}           8  17  31  50  77  122  190
C^{(2)}_{n,n}     4   9  16  25  39   62   95
C^{(4)}_{n,n}     2   5   9  13  20   31   48
$$A \cdot c = b \qquad (14)$$
where b = (f(e_1), f(e_2), ..., f(e_{2^n}))^T with {e_i} = F_2^n, and where A is a matrix defined as the concatenation of t + 1 submatrices:
with
$$A_i = \begin{pmatrix}
\phi_1(e_1)\cdot g_i(e_1) & \phi_2(e_1)\cdot g_i(e_1) & \cdots & \phi_{|B|}(e_1)\cdot g_i(e_1)\\
\phi_1(e_2)\cdot g_i(e_2) & \phi_2(e_2)\cdot g_i(e_2) & \cdots & \phi_{|B|}(e_2)\cdot g_i(e_2)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_1(e_{2^n})\cdot g_i(e_{2^n}) & \phi_2(e_{2^n})\cdot g_i(e_{2^n}) & \cdots & \phi_{|B|}(e_{2^n})\cdot g_i(e_{2^n})
\end{pmatrix} \qquad (16)$$
for 0 ≤ i ≤ t − 1, and
$$A_t = \begin{pmatrix}
\phi_1(e_1) & \phi_2(e_1) & \cdots & \phi_{|B|}(e_1)\\
\phi_1(e_2) & \phi_2(e_2) & \cdots & \phi_{|B|}(e_2)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_1(e_{2^n}) & \phi_2(e_{2^n}) & \cdots & \phi_{|B|}(e_{2^n})
\end{pmatrix} \qquad (17)$$
It can be checked that the vector c, the solution of the system, gives the coefficients of the h_i's over the basis B. A necessary condition for this system to have a solution whatever the target vector b (i.e. whatever the Boolean function f) is that the matrix A has full rank. In particular, the following inequality must hold:
$$(t + 1) \cdot |B| \ge 2^n. \qquad (18)$$
$$B_0 = \{x \mapsto x^u,\; u \in U\} \quad\text{with}\quad U = \{(u_1, \ldots, u_\ell, 0, \ldots, 0)\} \cup \{(0, \ldots, 0, u_{\ell+1}, \ldots, u_n)\}, \qquad (19)$$
where ℓ = ⌈n/2⌉ and where u_i ∈ {0, 1} for every i ∈ [[1, n]]. Then, we clearly have B_0 × B_0 = M_n. We hence suggest taking B ⊇ B_0, with B possibly larger than B_0, since restricting ourselves to B = B_0 could be non-optimal in terms of multiplications for the underlying decomposition method. Indeed, (18) shows that the more elements in the basis, the smaller t, i.e. the fewer multiplications g_i · h_i. We might therefore derive a bigger basis by iterating B ← B ∪ {φ_j · φ_k}, where φ_j and φ_k are randomly sampled from B, until reaching a basis B with the desired cardinality.
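With ℓ = ⌈n/2⌉, the minimal basis has size |B_0| = 2^ℓ + 2^{n−ℓ} − 1. A small sketch (helper names ours) that enumerates the monomial supports of B_0 and checks this count:

```python
from itertools import combinations

def minimal_basis_size(n):
    """|B0| = 2^l + 2^(n-l) - 1 with l = ceil(n/2)."""
    l = (n + 1) // 2
    return 2**l + 2**(n - l) - 1

def minimal_basis_monomials(n):
    """Supports of the monomials in B0 (the constant 1 has empty support):
    all monomials over the first l variables, and all over the last n-l."""
    l = (n + 1) // 2
    halves = [tuple(range(l)), tuple(range(l, n))]
    mons = {()}
    for half in halves:
        for k in range(1, len(half) + 1):
            mons.update(combinations(half, k))
    return sorted(mons, key=lambda s: (len(s), s))
```

The counts 7, 11, 15, 23, 31, 47, 63 for n = 4, ..., 10 match the |B| row of the optimal parameters in Table 3.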
We then have r = |B| − n − 1, where we recall that r denotes the number of multiplications needed to derive B, since x ↦ 1, x ↦ x_1, ..., x ↦ x_n ∈ B require no multiplications. By construction, we have |B| ≥ |B_0| = 2^ℓ + 2^{n−ℓ} − 1, implying r ≥ 2^ℓ + 2^{n−ℓ} − (n + 2). Let C_n = r + t denote the number of multiplications achieved by our decomposition method. Then, by injecting C_n in (18) we get:
that is:
$$C_n \ge r + \frac{2^n}{n + 1 + r} - 1. \qquad (21)$$
It can be checked that the value of r minimizing the above bound is 2^{n/2} − (n + 1).
However, r must satisfy r ≥ 2^ℓ + 2^{n−ℓ} − (n + 2) where ℓ = ⌈n/2⌉, which is always greater than 2^{n/2} − (n + 1) for n ≥ 2. That is why we shall define the optimal value of the parameter r (for the single-Boolean-function case) as:
$$r_{\mathrm{opt}} = 2^\ell + 2^{n-\ell} - (n+2) = \begin{cases} 2^{\frac{n}{2}+1} - (n+2) & \text{if } n \text{ even},\\ 3 \cdot 2^{\frac{n-1}{2}} - (n+2) & \text{if } n \text{ odd}, \end{cases} \qquad (22)$$
which amounts to taking B = B_0. The corresponding optimal value for t is then defined as:
$$t_{\mathrm{opt}} = \left\lceil \frac{2^n}{r_{\mathrm{opt}} + n + 1} \right\rceil - 1 \qquad (23)$$
which gives t_opt ≈ 2^{n/2 − 1} for n even, and t_opt ≈ (1/3) · 2^{(n+1)/2} for n odd.
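Equations (22) and (23) can be evaluated directly; the sketch below (function name ours) reproduces the optimal (r, t) pairs listed in Table 3:

```python
import math

def single_function_params(n):
    """(r_opt, t_opt) from equations (22) and (23), taking B = B0."""
    l = (n + 1) // 2                       # split point of the variables
    r = 2**l + 2**(n - l) - (n + 2)        # equation (22)
    t = math.ceil(2**n / (r + n + 1)) - 1  # equation (23)
    return r, t
```

For n = 4, ..., 10 this yields (2,2), (5,2), (8,4), (15,5), (22,8), (37,10), (52,16), matching the optimal-parameter rows of Table 3.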
Table 3. Optimal and achievable parameters for the single-Boolean-function case.

n              4      5      6      7      8      9      10
Optimal parameters:
  (r, t)     (2,2)  (5,2)  (8,4) (15,5) (22,8) (37,10) (52,16)
  |B|          7     11     15     23     31     47      63
  C_n          4      7     12     20     30     46      68
Achievable parameters:
  (r, t)     (2,3)  (5,3)  (9,5) (16,6) (25,9) (41,11) (59,17)
  |B|          7     11     16     24     34     51      70
  C_n          5      8     14     22     34     52      78
In Table 3, we give the optimal values of (r, t) as well as the corresponding size of the basis B and the multiplicative complexity C_n for n ∈ [[4, 10]]. We also give the parameter values that we could actually achieve in practice to get a full-rank system. We observe a small gap between the optimal and the achievable parameters, which results from the heuristic nature of the method (since we cannot prove that the constructed matrix A is full-rank).
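The whole single-function decomposition can be prototyped in a few lines: build B_0 for n = 4, draw random g_i's, assemble the matrix A of (14)–(17), and solve A·c = b over GF(2). In the sketch below all names are ours; the g_i are resampled until the system is consistent, and the recovered coefficients are checked against f on all inputs:

```python
import random

random.seed(1)
N = 4
SIZE = 1 << N  # the 16 points of F_2^4

def monomial(support):
    """Truth table of x -> prod_{i in support} x_i over all inputs."""
    return [int(all((e >> i) & 1 for i in support)) for e in range(SIZE)]

# minimal basis B0 for n = 4: {1, x1, x2, x3, x4, x1*x2, x3*x4}, |B0| = 7
basis = ([monomial(())] + [monomial((i,)) for i in range(N)]
         + [monomial((0, 1)), monomial((2, 3))])

def rand_comb():
    """A random nonzero F2-linear combination of the basis functions."""
    while True:
        coef = [random.randrange(2) for _ in basis]
        if any(coef):
            return [sum(c * phi[e] for c, phi in zip(coef, basis)) % 2
                    for e in range(SIZE)]

def solve_gf2(A, b):
    """Gauss-Jordan elimination over GF(2); one solution, or None."""
    rows = [(list(r), v) for r, v in zip(A, b)]
    ncols, rank, pivots = len(A[0]), 0, []
    for c in range(ncols):
        piv = next((i for i in range(rank, len(rows)) if rows[i][0][c]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i, (ri, vi) in enumerate(rows):
            if i != rank and ri[c]:
                rows[i] = ([x ^ y for x, y in zip(ri, rows[rank][0])],
                           vi ^ rows[rank][1])
        pivots.append(c)
        rank += 1
    if any(v for _, v in rows[rank:]):
        return None  # inconsistent system for this f
    x = [0] * ncols
    for i, c in enumerate(pivots):
        x[c] = rows[i][1]
    return x

T = 3  # number of products g_i * h_i: (T + 1) * |B| = 28 >= 2^N = 16
f = [random.randrange(2) for _ in range(SIZE)]  # a random target function
sol = None
while sol is None:  # resample the g_i until the system is solvable
    g = [rand_comb() for _ in range(T)]
    A = [[phi[e] & gi[e] for gi in g for phi in basis]
         + [phi[e] for phi in basis] for e in range(SIZE)]
    sol = solve_gf2(A, f)

nb = len(basis)
def h(i, e):  # h_i(e) read back from the solved coefficient vector
    return sum(sol[i * nb + k] * basis[k][e] for k in range(nb)) % 2

recon = [(sum(g[i][e] & h(i, e) for i in range(T)) + h(T, e)) % 2
         for e in range(SIZE)]
```

The reconstruction `recon` equals f on all 16 points, i.e. f = Σ g_i·h_i + h_t with every g_i, h_i spanned by the 7-element basis, using t = 3 multiplications beyond the r = 2 needed for B_0 — the achievable (r, t) = (2, 3) of Table 3.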
for 1 ≤ i ≤ m. Here the g_j's are randomly sampled from B until obtaining a full-rank system, which is then used to decompose every coordinate function f_i. The total number of multiplications is C_{n,m} = r + m · t. Then, (18) gives:
$$C_{n,m} \ge r + m \left( \frac{2^n}{n + 1 + r} - 1 \right). \qquad (25)$$
It can be checked that the value of r minimizing the above bound is √(m 2^n) − n − 1. We hence define
$$r_{\mathrm{opt}} = \sqrt{m\, 2^n} - n - 1, \qquad (26)$$
which minimizes (25) for every n ∈ [[2, 10]] and every m ∈ [[1, n]]. Moreover, this value satisfies the constraint r_opt ≥ 2^ℓ + 2^{n−ℓ} − (n + 2) for every m ≥ 4, and in practice we shall only consider s-boxes with m ≥ 4. The corresponding optimal value t_opt is then defined w.r.t. r_opt as in (23), which satisfies
$$t_{\mathrm{opt}} = \frac{2^{\frac{n}{2}}}{\sqrt{m}} - 1 \qquad (27)$$
for every n ∈ [[2, 10]] and every m ∈ [[1, n]]. We hence get
$$C_{n,m} \ge r_{\mathrm{opt}} + m \cdot t_{\mathrm{opt}} \approx \sqrt{m}\, 2^{\frac{n}{2}+1} - (n + m + 1). \qquad (28)$$
In Table 4, we give the optimal values for the parameters (r, t) as well as the
corresponding size of the basis B and multiplication complexity Cn,n for n × n s-
boxes with n ∈ [[4, 10]]. We also give the parameter values that we could actually
achieve in practice to get a full-rank system.
Table 4. Optimal and achievable parameters for n × n s-boxes.

n              4      5      6      7      8      9      10
Optimal parameters:
  (r, t)     (3,1)  (7,2) (13,3) (22,4) (36,5) (58,7) (90,10)
  |B|          8     13     20     30     45     68     101
  C_{n,n}      7     17     31     50     76    121     190
Achievable parameters:
  (r, t)     (4,1)  (7,2) (13,3) (22,4) (37,5) (59,7) (90,10)
  |B|          9     13     20     30     46     69     101
  C_{n,n}      8     17     31     50     77    122     190
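The optimal parameters of Table 4 follow from (26) and (23); in the sketch below (function name ours), rounding r_opt to the nearest integer is our own choice, since (26) defines a real value:

```python
import math

def sbox_params(n, m):
    """Rounded optimal (r, t) from equations (26) and (23) for an n -> m s-box."""
    r = round(math.sqrt(m * 2**n)) - n - 1  # equation (26), rounded (our choice)
    t = math.ceil(2**n / (r + n + 1)) - 1   # equation (23)
    return r, t
```

For n = m in [[4, 10]] this yields (3,1), (7,2), (13,3), (22,4), (36,5), (58,7), (90,10), matching the optimal-parameter rows of Table 4.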
4.3 Improvements
We present hereafter some improvements of the above method which can be applied to get a decomposition with better multiplicative complexity for a given s-box. In contrast with the above results, the obtained system and the associated multiplicative complexity depend on the target s-box and do not apply to all s-boxes.
Basis Update. Our first improvement of the above method is based on a dynamic update of the basis each time a coordinate function f_i(x) is computed. Indeed, the terms g_j(x) · h_{i,j}(x) involved in the computation of f_i(x) can be reused in the computation of the following f_{i+1}(x), ..., f_n(x). In our decomposition process, this means that the g_j · h_{i,j} functions can be added to the basis for the decomposition of the next coordinate functions f_{i+1}, ..., f_n. Basically, we start with some basis B_1 ⊇ B_0, where B_0 is the minimal basis as defined in (19). Then, for every i ≥ 1, we look for a decomposition
$$f_i(x) = \sum_{j=0}^{t_i - 1} g_{i,j}(x) \cdot h_{i,j}(x) + h_{i,t_i}(x), \qquad (29)$$
where t_i ∈ N and g_{i,j}, h_{i,j} ∈ ⟨B_i⟩. Once such a decomposition has been found, we carry on with the new basis B_{i+1} defined as:
$$B_{i+1} = B_i \cup \{g_{i,j} \cdot h_{i,j}\}_{j=0}^{t_i - 1}. \qquad (30)$$
Compared to the former approach, we use different functions g_{i,j} and we get a different matrix A for every coordinate function f_i. On the other hand, for each decomposition the basis grows, and hence the number t_i of multiplicative terms in the decomposition of f_i might decrease. In this context, we obtain a new condition for every i, namely:
$$t_i \ge \frac{2^n}{|B_i|} - 1. \qquad (31)$$
The lower bound on t_i hence decreases as B_i grows. The total multiplicative complexity of the method is then:
$$C^{*}_{n,m} = r + \sum_{i=1}^{m} t_i, \qquad (32)$$
for every i ≥ 1, where Id denotes the identity function. The obtained optimal complexity is then:
$$C^{*}_{n,m} = s_1 - (n+1) + \sum_{i=1}^{m} \psi_n \circ (\psi_n + \mathrm{Id})^{(i-1)}(s_1)$$
Table 5. Optimal parameters with basis update.

n    |B_1|   r    t_1, t_2, ..., t_n           C*_{n,n}
4      7     2    2,1,1,1                          7
5     11     5    2,2,2,1,1                       13
      12     6    2,2,1,1,1                       13
6     15     8    4,3,2,2,2,2                     23
      16     9    3,3,2,2,2,2                     23
7     23    15    5,4,3,3,3,3,2                   38
8     31    22    8,6,5,5,4,4,4,3                 61
      32    23    7,6,5,5,4,4,4,3                 61
      33    24    7,6,5,5,4,4,3,3                 61
      34    25    7,6,5,4,4,4,3,3                 61
9     47    37    10,8,7,7,6,6,5,5,5              96
      48    38    10,8,7,7,6,5,5,5,5              96
      49    39    10,8,7,6,6,5,5,5,5              96
10    63    52    16,12,11,10,9,8,7,7,7,6        145
      64    53    15,12,11,10,9,8,7,7,7,6        145
      65    54    15,12,11,9,9,8,7,7,7,6         145
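The t_i sequences of Table 5 can be regenerated by iterating the bound (31) while letting the basis size grow by t_i at each step; a sketch (function name ours):

```python
import math

def basis_update_sequence(n, b1):
    """r and the t_i sequence under dynamic basis update:
    t_i = ceil(2^n / s_i) - 1, then s_{i+1} = s_i + t_i."""
    r = b1 - (n + 1)                   # multiplications needed to build B_1
    s, ts = b1, []
    for _ in range(n):                 # one t_i per coordinate function
        t = math.ceil(2**n / s) - 1    # condition (31)
        ts.append(t)
        s += t                         # the basis grows by t_i products
    return r, ts, r + sum(ts)
```

Starting from the minimal |B_1| = 7 for n = 4 gives r = 2, the sequence 2,1,1,1 and C*_{4,4} = 7; from |B_1| = 31 for n = 8 it gives 22, the sequence 8,6,5,5,4,4,4,3 and C*_{8,8} = 61, as in Table 5.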
Rank Drop. Our second improvement is based on the observation that even if the matrix A is not full-rank, the obtained system can still have a solution for some given s-box. Specifically, if A is of rank 2^n − δ, then we should get a solution for one s-box out of 2^δ on average. Hence, instead of having t_i satisfy the condition (t_i + 1) · |B| ≥ 2^n, we allow a rank drop in the system of equations by taking t_i ≥ (2^n − δ)/|B_i| − 1 for some integer δ for which solving 2^δ systems is affordable. We hence hope to get smaller values of t_i by trying 2^δ systems. Note that, heuristically, we can only hope to achieve the above bound if δ is (a few times) lower than the maximal rank 2^n (e.g. δ ≤ 2^n/4). We can then define the (theoretical) optimal sequence (s_i, t_i) and the corresponding multiplicative complexity C*_{n,m} from s_1 = |B_1| as in (33) and (35), by replacing the function ψ_n with ψ_{n,δ}: x ↦ ⌈(2^n − δ)/x⌉ − 1. As an illustration, Table 6 provides the obtained parameters.
Table 6. Optimal parameters with basis update and rank drop.

n     δ   |B_1|   r    t_1, t_2, ..., t_n           C*_{n,n}
4     4     7     2    1,1,1,1                          6
5     8    11     5    2,1,1,1,1                       11
      8    12     6    1,1,1,1,1                       11
6    16    15     8    3,2,2,2,1,1                     19
     16    16     9    2,2,2,2,1,1                     19
7    32    23    15    4,3,3,2,2,2,2                   33
     32    24    16    3,3,3,2,2,2,2                   33
8    32    31    22    7,5,5,4,4,3,3,3                 56
     32    32    23    6,5,5,4,4,3,3,3                 56
9    32    47    37    10,8,7,6,6,5,5,5,4              93
     32    48    38    9,8,7,6,6,5,5,5,4               93
10   32    63    52    15,12,11,9,9,8,7,7,7,6         143
     32    64    53    15,12,10,9,9,8,7,7,7,6         143
     32    65    54    15,12,10,9,8,8,7,7,7,6         143
     32    66    55    15,12,10,9,8,8,7,7,6,6         143
     32    67    56    14,12,10,9,8,8,7,7,6,6         143
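Allowing a rank drop δ only changes the numerator 2^n into 2^n − δ in the iteration; a sketch (function name ours) reproducing rows of Table 6:

```python
import math

def rank_drop_sequence(n, delta, b1):
    """t_i sequence when a rank drop of delta is allowed, i.e.
    t_i = ceil((2^n - delta) / s_i) - 1 (the function psi_{n,delta})."""
    r = b1 - (n + 1)
    s, ts = b1, []
    for _ in range(n):
        t = math.ceil((2**n - delta) / s) - 1
        ts.append(t)
        s += t
    return r, ts, r + sum(ts)
```

For instance (n, δ, |B_1|) = (4, 4, 7) yields the sequence 1,1,1,1 and C*_{4,4} = 6, and (8, 32, 31) yields 7,5,5,4,4,3,3,3 and C*_{8,8} = 56, as in Table 6.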
constant in the o(·) is the average number of increments of t_i (i.e. the average number of times Step 13 is executed per i). In our experiments, we observed that the optimal value t_1 = ψ_{n,δ}(s_1) is rarely enough to get a solvable system for f_1. This is because we start with the minimal basis, as in the single-Boolean-function case. We hence need a few increments for i = 1. On the other hand, the next optimal t_i's are often sufficient, or need to be incremented only once.
We used Algorithm 1 to compute the decomposition of various n × n s-boxes for n ∈ [[4, 8]], namely the eight 4 × 4 s-boxes of Serpent [ABK98], the s-boxes S5 (5 × 5) and S6 (6 × 6) of SC2000 [SYY+02], the 8 × 8 s-boxes S0 and S1 of CLEFIA [SSA+07], and the 8 × 8 s-box of Khazad [BR00]. The obtained results are summarized in Table 7. Note that we chose these s-boxes to serve as examples for our decomposition method. Some of them may have a mathematical structure allowing a more efficient decomposition (e.g. the CLEFIA S0 s-box is based on the inversion over F_{256} and can therefore be computed with a 32-multiplication circuit, as for the AES).
We observe that Algorithm 1 achieves improved parameters compared to the optimal ones with basis update but without the rank-drop improvement (see Table 5) for n ∈ {4, 5, 6}. For n = 8, we only get parameters close to the optimal ones for the basis update (C*_{n,n} = 62 instead of 61). This can be explained by the fact that when n increases, the value of δ becomes small compared to 2^n and the impact of the exhaustive search is reduced. Thus, Algorithm 1 can close the gap and (almost) achieve the optimal parameters even in the presence of a minimal starting basis; however, it does not go beyond them.
4.4 Parallelization
The proposed decomposition method is highly parallelizable. In practice, most
SPN blockciphers have a nonlinear layer applying 16 or 32 s-boxes and most
processors are based on a 32-bit or a 64-bit architecture. Therefore we shall
Table 7. Parameters achieved by Algorithm 1 for various s-boxes.

                              |B_1|   r    t_1, t_2, ..., t_n       C*_{n,n}
n = 4
  Serpent S1–S5                 7     2    1,1,1,1                      6
  Serpent S6, S7                7     2    1,2,1,1                      7
n = 5
  SC2000 S5                    11     5    2,1,1,1,1                   11
                               12     6    1,1,1,1,1                   11
n = 6
  SC2000 S6                    15     8    4,2,2,2,2,1                 21
                               16     9    3,2,2,2,2,1                 21
n = 8
  Khazad & CLEFIA (S0, S1)     31    22    11,6,5,4,4,4,3,3            62
                               33    24    9,6,5,4,4,4,3,3             62
                               32    23    10,6,5,4,4,4,3,3            62
focus our study on the k-parallel multiplicative complexity of our method for k ∈ {2, 4}.

General Method. In the general method (without improvement) described in Sect. 4.2, the multiplications between the g_j's and the h_{i,j}'s can clearly be processed in parallel. Specifically, they can be done with exactly ⌈m·t/k⌉ k-multiplications. The multiplications involved in the minimal basis B_0 = {x ↦ x^u, u ∈ U} can also be fully parallelized at degree k = 2 and k = 4 for every n ≥ 4. In other words, the k-multiplicative complexity for deriving B_0 equals ⌈r_0/k⌉ for k ∈ {2, 4}, where r_0 = C(B_0) = |B_0| − (n + 1) (see Sect. 4.1). One just has to compute the x^u by increasing order of the Hamming weight of u ∈ U (where U is the set defined in (19)), taking the lexicographical order inside a Hamming weight class. As an illustration, the 4-parallel evaluation of B_0 is given for n ∈ {4, 6, 8} in Table 8.
Once all the elements of B_0 have been computed, and before getting to the multiplicative terms g_j · h_{i,j}, we have to extend it to a basis B ⊇ B_0 with the target cardinality (see Table 4). This is done by feeding the basis with |B| − |B_0| products of random linear combinations of the current basis elements. In order to parallelize this step, these new products are generated 4-by-4 from previous elements of the basis. We could validate that, by following such an approach, we still obtain full-rank systems with the achievable parameters given in Table 4. This means that for every n ∈ [[4, 10]], the k-multiplicative complexity of the general method is ⌈r/k⌉ + ⌈m·t/k⌉. The obtained results (achievable parameters) are summarized in Table 9.
Improved Method. The parallelization of the improved method is slightly trickier since all the multiplicative terms g_{i,j} · h_{i,j} cannot be computed in parallel. Indeed, the resulting products are fed to the basis so that they can be reused in the subsequent decompositions.
Table 8. 4-parallel evaluation of the minimal basis B_0 (each row is one parallel step).

n = 4:
  x1x2 ← x2 · x1      x3x4 ← x4 · x3

n = 6:
  x1x2 ← x2 · x1      x1x3 ← x3 · x1      x2x3 ← x3 · x2        x4x5 ← x5 · x4
  x4x6 ← x6 · x4      x5x6 ← x6 · x5      x1x2x3 ← x3 · x1x2    x4x5x6 ← x6 · x4x5

n = 8:
  x1x2 ← x2 · x1      x1x3 ← x3 · x1      x1x4 ← x4 · x1        x2x3 ← x3 · x2
  x2x4 ← x4 · x2      x3x4 ← x4 · x3      x5x6 ← x6 · x5        x5x7 ← x7 · x5
  x5x8 ← x8 · x5      x6x7 ← x7 · x6      x6x8 ← x8 · x6        x7x8 ← x8 · x7
  x1x2x3 ← x3 · x1x2  x1x2x4 ← x4 · x1x2  x1x3x4 ← x4 · x1x3    x2x3x4 ← x4 · x2x3
  x5x6x7 ← x7 · x5x6  x5x6x8 ← x8 · x5x6  x5x7x8 ← x8 · x5x7    x6x7x8 ← x8 · x6x7
  x1x2x3x4 ← x4 · x1x2x3                  x5x6x7x8 ← x8 · x5x6x7
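The multiplication schedule of Table 8 can be generated mechanically: monomials by increasing Hamming weight, lexicographic within a weight class, each obtained by multiplying its last variable with an already-computed prefix. A sketch (function name ours):

```python
from itertools import combinations

def b0_schedule(n):
    """Products needed for B0, ordered by increasing Hamming weight of the
    monomial and lexicographically within a weight class. Each entry is
    (support, last_variable, prefix_support): the monomial x^support is
    computed as x_{last_variable} times the previously built x^prefix."""
    l = (n + 1) // 2
    halves = [tuple(range(l)), tuple(range(l, n))]
    sched = []
    for w in range(2, max(l, n - l) + 1):
        for half in halves:
            for sup in combinations(half, w):
                sched.append((sup, sup[-1], sup[:-1]))
    return sched
```

The schedule lengths 2, 8 and 22 for n = 4, 6, 8 match r_0 = |B_0| − (n + 1) and the row counts of Table 8.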
Table 9. k-parallel multiplicative complexities (achievable parameters).

n                 4      5      6      7      8      9      10
(r, t)          (4,1)  (7,2) (13,3) (22,4) (37,5) (59,7) (90,10)
|B|               9     13     20     30     46     69     101
C_{n,n}           8     17     31     50     77    122     190
C^{(2)}_{n,n}     4      9     16     25     39     62      95
C^{(4)}_{n,n}     2      5      9     13     20     31      48
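The last two rows of Table 9 amount to C^{(k)}_{n,n} = ⌈r/k⌉ + ⌈n·t/k⌉; a minimal check (function name ours):

```python
import math

def k_parallel_complexity(n, r, t, k):
    """Number of k-multiplications: ceil(r/k) for the basis, ceil(n*t/k)
    for the n coordinate functions with t products each."""
    return math.ceil(r / k) + math.ceil(n * t / k)
```

For instance, the achievable parameters (r, t) = (37, 5) for n = 8 give C^{(2)} = 39 and C^{(4)} = 20, as in Table 9.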
We obtain the exact same parameters as in Table 7 for all the tested s-boxes (Serpent, SC2000, CLEFIA, and Khazad) for a parallelization degree of k = 2, except for the s-box S3 of Serpent, which requires one more multiplication.
5 Implementations
This section describes our implementations of a bitsliced s-box layer protected with higher-order masking based on our decomposition method. Our implementations evaluate 16 n × n s-boxes in parallel, where n ∈ {4, 8}, and they are developed in generic 32-bit ARM assembly. They take n input sharings [x_1], [x_2], ..., [x_n] defined as
$$[x_i] = (x_{i,1}, x_{i,2}, \ldots, x_{i,d}) \quad \text{such that} \quad \bigoplus_{j=1}^{d} x_{i,j} = x_i \qquad (36)$$
where x_i is a 16-bit register containing the i-th bit of the 16 s-box inputs. Our implementations then output n sharings [y_1], [y_2], ..., [y_n] corresponding to the bitsliced output bits of the s-box. Since we are on a 32-bit architecture with 16-bit bitslice registers, we use a degree-2 parallelization for the multiplications. Namely, the 16-bit ANDs are packed by pairs and replaced by 32-bit ANDs, which are applied on shares using the ISW scheme, as explained in [GR16].
The computation is then done in three stages. First, we need to construct the shares of the elements of the minimal basis B_0, specifically [x^u] for every u ∈ U, where x^u denotes the bitsliced register for the bit x^u, and where U is the set defined in (19). This first stage requires r_0/2 32-bit ISW-ANDs, where r_0 = 2 for n = 4 and r_0 = 22 for n = 8 (see Table 8).
Once the first stage is completed, all the remaining multiplications are done between linear combinations of the elements of the basis. Let us denote by [t_i] the sharings corresponding to the elements of the basis, which are stored in memory. After the first stage we have {[t_i]} = {[x^u] | u ∈ U}. Each new t_i is defined as
$$t_i = \Big( \bigoplus_{j<i} a_{i,j}\, t_j \Big) \odot \Big( \bigoplus_{j<i} b_{i,j}\, t_j \Big) \qquad (37)$$
where ⊙ denotes the bitwise multiplication, and where {a_{i,j}}_j and {b_{i,j}}_j are the binary coefficients obtained from the s-box decomposition (namely the coefficients of the functions g_{i,j} and h_{i,j} in the span of the basis). The second stage hence consists in a loop over the remaining multiplications that
1. computes the linear-combination sharings [r_i] = ⊕_{j<i} a_{i,j} [t_j] and [s_i] = ⊕_{j<i} b_{i,j} [t_j],
2. refreshes the sharing [r_i],
3. computes the sharing [t_i] such that t_i = r_i ⊙ s_i,
where the last step is performed for two successive values of i at the same time by a call to a 32-bit ISW-AND. The sums in Step 1 are performed on each share
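Steps 1 and 2 above operate share-wise; only Step 3 needs the ISW-AND. The linear combinations and the mask refresh can be sketched as follows (d-sharings modeled as Python lists of 16-bit words; all names ours):

```python
import secrets

def refresh(x, bits=16):
    """Re-randomize a d-sharing without changing the shared value:
    XOR a fresh random word into the first share and into each other share."""
    y = list(x)
    for i in range(1, len(y)):
        r = secrets.randbits(bits)
        y[0] ^= r
        y[i] ^= r
    return y

def linear_comb(coeffs, sharings):
    """Share-wise XOR of the sharings selected by binary coefficients,
    as in Step 1: [r_i] = XOR_{j} a_{i,j} [t_j]."""
    out = [0] * len(sharings[0])
    for c, t in zip(coeffs, sharings):
        if c:
            out = [o ^ tj for o, tj in zip(out, t)]
    return out

def unshare(x):
    v = 0
    for s in x:
        v ^= s
    return v
```

Because XOR is share-wise, the linear combinations cost no ISW-ANDs; the refresh only costs d − 1 fresh random words per sharing.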
We observe that our implementations are asymptotically faster than the optimized implementations of the CRV and AD methods (3.6 times faster for n = 8 and 5.7 times faster for n = 4). However, we also see that the linear coefficient is significantly greater for our implementations, which comes from the computation of the linear combinations at the input of the ISW-ANDs (i.e. the sharings [r_i] and [s_i]). As an illustration, Figs. 1 and 2 plot the obtained timings with respect to d. We see that for n = 4, our implementation is always faster than the optimized AD and CRV. On the other hand, for n = 8, our implementation is slightly slower for d ≤ 8. We stress that our implementations could probably be improved by optimizing the computation of the linear combinations.
[Figs. 1 and 2: timings (clock cycles) of the implementations with respect to the masking order d.]

The RAM consumption and code size of our implementations are given in Table 11 and compared to those of the CRV and AD implementations from [GR16]. We believe these memory requirements to be affordable for not-too-constrained embedded devices. In terms of code size, our implementations are always the best. This is especially significant for n = 8, where CRV and AD need
a high amount of storage for the lookup tables of the linearized polynomials (see [GR16]). On the other hand, we observe a big gap between our implementations and those from [GR16] regarding RAM consumption. Our method indeed consumes more RAM because of all the [t_i] sharings that must be stored, whereas such a large basis is not required for the CRV and AD methods, and because of some optimizations in the computation of the linear combinations (see the full version).
References
[ABK98] Anderson, R., Biham, E., Knudsen, L.: Serpent: a proposal for the
advanced encryption standard. NIST AES Propos. (1998)
[BGRV15] Balasch, J., Gierlichs, B., Reparaz, O., Verbauwhede, I.: DPA, bitslicing
and masking at 1 GHz. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015.
LNCS, vol. 9293, pp. 599–619. Springer, Heidelberg (2015)
[BMP13] Boyar, J., Matthews, P., Peralta, R.: Logic minimization techniques with
applications to cryptology. J. Cryptol. 26(2), 280–312 (2013)
[BPP00] Boyar, J., Peralta, R., Pochuev, D.: On the multiplicative complexity of
Boolean functions over the basis (∧, ⊕, 1). Theor. Comput. Sci. 235(1),
43–57 (2000)
[BR00] Barreto, P., Rijmen, V.: The Khazad legacy-level block cipher. In: First
Open NESSIE Workshop (2000)
[Can05] Canright, D.: A very compact S-box for AES. In: Rao, J.R., Sunar, B.
(eds.) CHES 2005. LNCS, vol. 3659, pp. 441–455. Springer, Heidelberg
(2005)
[CGP+12] Carlet, C., Goubin, L., Prouff, E., Quisquater, M., Rivain, M.: Higher-
order masking schemes for S-boxes. In: Canteaut, A. (ed.) FSE 2012.
LNCS, vol. 7549, pp. 366–384. Springer, Heidelberg (2012)
[CJRR99] Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches
to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999.
LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999)
[CMH13] Courtois, N., Mourouzis, T., Hulme, D.: Exact logic minimization and
multiplicative complexity of concrete algebraic and cryptographic circuits.
Adv. Intell. Syst. 6(3–4), 43–57 (2013)
[Cou07] Courtois, N.T.: CTC2 and fast algebraic attacks on block ciphers revisited.
Cryptology ePrint Archive, Report 2007/152 (2007). http://eprint.iacr.org/2007/152
[CPRR14] Coron, J.-S., Prouff, E., Rivain, M., Roche, T.: Higher-order side channel
security and mask refreshing. In: Moriai, S. (ed.) FSE 2013. LNCS, vol.
8424, pp. 410–424. Springer, Heidelberg (2014)
[CPRR15] Carlet, C., Prouff, E., Rivain, M., Roche, T.: Algebraic decomposition for
probing security. In: Gennaro, R., Robshaw, M.J.B. (eds.) CRYPTO 2015.
LNCS, vol. 9215, pp. 742–763. Springer, Heidelberg (2015)
[CRV14] Coron, J.-S., Roy, A., Vivek, S.: Fast evaluation of polynomials over
binary finite fields and application to side-channel countermeasures. In:
Batina, L., Robshaw, M. (eds.) CHES 2014. LNCS, vol. 8731, pp. 170–
187. Springer, Heidelberg (2014)
[DDF14] Duc, A., Dziembowski, S., Faust, S.: Unifying leakage models: from prob-
ing attacks to noisy leakage. In: Nguyen, P.Q., Oswald, E. (eds.) EURO-
CRYPT 2014. LNCS, vol. 8441, pp. 423–440. Springer, Heidelberg (2014)
[DPV01] Daemen, J., Peeters, M., Van Assche, G.: Bitslice ciphers and power analy-
sis attacks. In: Schneier, B. (ed.) FSE 2000. LNCS, vol. 1978, pp. 134–149.
Springer, Heidelberg (2001)
[GLSV15] Grosso, V., Leurent, G., Standaert, F.-X., Varıcı, K.: LS-designs: bitslice
encryption for efficient masked software implementations. In: Cid, C.,
Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 18–37. Springer,
Heidelberg (2015)
[GR16] Goudarzi, D., Rivain, M.: How fast can higher-order masking be in soft-
ware? Cryptology ePrint Archive (2016). http://eprint.iacr.org/
[ISW03] Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware
against probing attacks. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol.
2729, pp. 463–481. Springer, Heidelberg (2003)
[MS92] Mirwald, R., Schnorr, C.P.: The multiplicative complexity of quadratic
Boolean forms. Theor. Comput. Sci. 102(2), 307–328 (1992)
[PLW10] Poschmann, A., Ling, S., Wang, H.: 256 bit standardized crypto for 650
GE – GOST revisited. In: Mangard, S., Standaert, F.-X. (eds.) CHES
2010. LNCS, vol. 6225, pp. 219–233. Springer, Heidelberg (2010)
[PR13] Prouff, E., Rivain, M.: Masking against side-channel attacks: a formal
security proof. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013.
LNCS, vol. 7881, pp. 142–159. Springer, Heidelberg (2013)
478 D. Goudarzi and M. Rivain
[RP10] Rivain, M., Prouff, E.: Provably secure higher-order masking of AES. In:
Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp.
413–427. Springer, Heidelberg (2010)
[SSA+07] Shirai, T., Shibutani, K., Akishita, T., Moriai, S., Iwata, T.: The 128-
bit blockcipher CLEFIA (extended abstract). In: Biryukov, A. (ed.) FSE
2007. LNCS, vol. 4593, pp. 181–195. Springer, Heidelberg (2007)
[Sto16] Stoffelen, K.: Optimizing S-box implementations for several criteria using
sat solvers. In: Fast Software Encryption (2016)
[SYY+02] Shimoyama, T., Yanami, H., Yokoyama, K., Takenaka, M., Itoh, K.,
Yajima, J., Torii, N., Tanaka, H.: The block cipher SC2000. In: Matsui,
M. (ed.) FSE 2001. LNCS, vol. 2355, pp. 312–327. Springer, Heidelberg
(2002)
[TP14] Turan Sönmez, M., Peralta, R.: The multiplicative complexity of Boolean
functions on four and five variables. In: Eisenbarth, T., Öztürk, E. (eds.)
LightSec 2014. LNCS, vol. 8898, pp. 21–33. Springer, Heidelberg (2015)
Reducing the Number of Non-linear
Multiplications in Masking Schemes
1 Introduction
Side-channel attacks are a realistic and serious threat for cryptographic implementations [Koc96,KJJ99]. These attacks have the potential to leak one or more sensitive intermediate variables that would otherwise be unavailable in a black-box execution of a cryptographic primitive. Block ciphers are typical targets of such attacks. Secret sharing, a.k.a. masking, is a popular technique to protect block cipher implementations against leakage of one or more sensitive intermediate variables. Depending on how a sensitive variable is split into shares, processed and then re-combined, and on the formal leakage model used for security analysis, there are several generic higher-order masking schemes that can secure block cipher computations, with the secrets shared into as many shares as desired [ISW03,GM11,PR11,CGP+12,BFGV12,Cor14,BFG15]. Indeed, these schemes can be used to secure any circuit.
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 479–497, 2016.
DOI: 10.1007/978-3-662-53140-2_23
480 J. Pulkus and S. Vivek
The most popular among existing generic masking schemes for block cipher implementations are those where the secrets are additively shared. This is in part due to the effectiveness, efficiency and simplicity of additive masking [CJRR99,ISW03,PR13,DDF14,DFS15,BFG15]. Over binary fields this type of masking has also been called Boolean masking. In fact, the very first generic higher-order masking scheme, due to Ishai, Sahai and Wagner [ISW03] (henceforth referred to as the ISW method), is based on additive masking. Their method can be used to secure arbitrary Boolean circuits in the so-called probing model, where an adversary can choose to leak, say, t intermediate variables, and the scheme is secure so long as the number of shares s ≥ 2t + 1. Though working with Boolean circuits is probably well-suited to hardware implementations, representing a computation as a Boolean circuit leads to huge overheads in software implementations. Nonetheless, this method and the probing security framework introduced in their work formed the basis for most of the later masking schemes. Rivain and Prouff [RP10] adapted the ISW method to secure AES by representing its S-box as an arithmetic circuit over F_{2^8}.
CGPQR Method. Carlet et al. [CGP+12] adapted the ISW method to secure software implementations of arbitrary block ciphers over binary finite fields F_{2^n} (hereafter referred to as the CGPQR method). For an additive masking scheme, processing F_2-linear or affine functions in the presence of shares is straightforward. Hence the main challenge is to securely process non-linear functions. Since in a block cipher the only non-linear operations are the S-box table lookups, the technique used in the CGPQR method to securely mask such table lookups is to first represent a d-to-r-bit S-box function (d ≥ r) as a univariate polynomial over a binary finite field F_{2^d}. Then this polynomial is evaluated in the presence of shares using the following operations: addition (of two polynomials over F_{2^d}), scalar multiplication (i.e., multiplication of a polynomial by a constant from F_{2^d}), squaring (of a polynomial over F_{2^d}), and multiplication of two distinct polynomials (a.k.a. non-linear multiplication). While additions, scalar multiplications and squarings are F_2-linear operations, the non-linear multiplications, as the name suggests, are not F_2-linear. To process a non-linear multiplication in the CGPQR method, an adaptation of the technique used in the ISW method to mask (non-linear) AND gates is utilised. The overhead caused by the CGPQR method (relative to unshared evaluation), in terms of both the time and the randomness required, to securely mask a non-linear multiplication is O(s^2), where s is the number of input shares. For a linear or affine function the overhead is only O(s).
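To make the quadratic overhead concrete, here is a minimal Python sketch of an ISW-style secure multiplication of two additively shared values over F_{2^8} (the AES polynomial x^8 + x^4 + x^3 + x + 1 is assumed purely for illustration; the same idea applies over any F_{2^n}):

```python
import secrets

def gf_mul(a, b):
    """Carry-less multiplication in F_{2^8}, reduced by x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sec_mult(a_sh, b_sh):
    """ISW-style multiplication of two s-share values: the XOR of the
    returned shares equals gf_mul(XOR(a_sh), XOR(b_sh)).
    Cost: s^2 field multiplications and s(s-1)/2 random bytes."""
    s = len(a_sh)
    c = [gf_mul(a_sh[i], b_sh[i]) for i in range(s)]
    for i in range(s):
        for j in range(i + 1, s):
            r = secrets.randbelow(256)
            c[i] ^= r
            # the bracketing of these XORs matters for security in practice;
            # for correctness alone the order is irrelevant
            c[j] ^= r ^ gf_mul(a_sh[i], b_sh[j]) ^ gf_mul(a_sh[j], b_sh[i])
    return c
```

Mask refreshing and composition issues are deliberately ignored here; the point is only the O(s^2) time and randomness cost per non-linear multiplication.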
Relation to Polynomial Evaluation. One of the relatively well-understood
approaches to analysing and improving the efficiency of the CGPQR method is
to investigate the problem of evaluating polynomials over binary finite fields. The
goal is to minimise the number of non-linear multiplications needed to evaluate a
polynomial over F2d , while ignoring the cost of additions, scalar multiplications
and squarings. As the works of Carlet et al. [CGP+12], Roy and Vivek [RV13],
and Coron, Roy and Vivek [CRV14,CRV15] demonstrate, this cost model of
minimising the non-linear multiplications while evaluating an S-box polynomial
has turned out to be a reasonably effective way to model the overall cost of
processing a block cipher in software implementations, as long as one makes
sure that the use of linear operations is not made “unreasonably” large.
In [CGP+12], two methods to evaluate arbitrary polynomials over F_{2^d} are presented that are tailored to the non-linear cost model: the cyclotomic-class method (having complexity Ω(2^d/d)) and the parity-split method (having proven complexity O(2^{d/2})). These two methods were applied to various S-box polynomials to understand their complexity in terms of non-linear multiplications. In
[RV13], improved evaluation techniques for various specific S-box polynomials
were presented. In particular, it was shown that the 6-to-4-bit DES S-boxes
can be evaluated with 7 non-linear multiplications, while 8-bit (i.e., 8-to-8-bit)
CAMELLIA and CLEFIA S-boxes can be evaluated with 15 or 16 non-linear
multiplications. The work of [RV13] also initiated a formal analysis of this cost model and established lower bounds on the number of non-linear multiplications required to evaluate any polynomial over F_{2^d}. In particular, they showed that, under a certain representation over F_{2^6}, the DES S-box polynomials need at least 3 non-linear multiplications, while the PRESENT S-box polynomial over F_{2^4} needs at least 2 non-linear multiplications.
CRV Method. In [CRV14,CRV15], Coron et al. proposed an improved method (henceforth referred to as the CRV method) to evaluate arbitrary polynomials over F_{2^d}. Their method has a heuristic worst-case complexity of O(2^{d/2}/√d) non-linear multiplications. They also show that the complexity O(2^{d/2}/√d) is optimal for any method to evaluate arbitrary polynomials over F_{2^d}. Currently, w.r.t. the non-linear multiplications cost model, the CRV method is the most efficient way to implement the CGPQR countermeasure.
In the CRV method, a d-to-r-bit S-box S is represented by a polynomial P(X) ∈ F_{2^d}[X] that is actually computed in the process. The d-bit and the r-bit strings are identified with the elements of F_{2^d}. The polynomial P(X) satisfies the property that its evaluation on the elements of F_{2^d} produces output elements of F_{2^d} that agree in the lower-order r bits with the corresponding S-box outputs. Briefly, the CRV method for a generic d-to-r-bit S-box is as follows:

Step 1: Pre-compute a collection of monomials L in F_{2^d}[X] that (a) is closed w.r.t. squaring (because squarings are free) and (b) has the property that L · L generates all the monomials X^i (i = 0, 1, …, 2^d − 1).

Step 2: Find a decomposition

    P(X) = \sum_{i=1}^{t-1} p_i(X) · q_i(X) + p_t(X)  mod (X^{2^d} + X)    (1)
for some chosen parameter t, where the polynomials p_i(X) and q_i(X) have monomials only from the set L, the polynomials q_i(X) are randomly chosen, and the coefficients of the p_i(X) (and hence those of P(X)) are unknown. Next they write down a set of r · 2^d linear equations over F_2 (in the unknown bits), corresponding to each S-box output bit, by evaluating the above relation at the elements of F_{2^d}. Finally, the unknown bits are obtained by solving the resulting linear system over F_2, whose matrix has dimension r · 2^d × d · t · |L|, which is approximately r · 2^d × r · 2^d. The total number of non-linear multiplications required is about t − 1 + |L|/d.
It is shown in [CRV15] that any 4-bit S-box can be evaluated with 2 non-linear multiplications in the worst case (which is optimal), any 6-bit S-box with at most 5, any 6-to-4-bit S-box (in particular, the DES S-boxes) with at most 4, and any 8-bit S-box with at most 10 non-linear multiplications (cf. Table 1). As, in a block cipher, the time required for masked S-box computation grows quadratically with the number of shares, seemingly marginal reductions in the count of non-linear multiplications per S-box evaluation indeed lead to significant gains in the overall execution time, as demonstrated in [Cor14,CRV15].
One obvious approach to improving the CRV method is to solve simultaneously for the unknown coefficients of both sets of polynomials p_i(X) and q_i(X) (and hence of P(X)) in Step 2 of the CRV method described above, instead of linearising (1) by choosing random polynomials q_i(X). This results in r · 2^d multivariate homogeneous quadratic equations over F_2 in approximately d · 2^d variables. To our knowledge, determining the roots of such a system of equations seems infeasible with current techniques even for small values of d = 6 or d = 8.
Hence it is of interest to find alternative ways to reduce the parameters of the
CRV method (particularly, the parameters t and L) that affect the total number
of non-linear multiplications for the S-box polynomials. This is one of the main
themes of this paper.
From a technical point of view, apart from the problem of encoding mentioned above, the main and only other difference between our method and the CRV method is in the selection of the following two parameters: L (the pre-computed monomial list) and t (the number of summands in the decomposition (1)). Once these parameters are carefully determined, the remaining steps to obtain a decomposition of the form (1) by setting up a linear system of equations are exactly the same. Since in the matrix step of the CRV method (cf. page 3) we heuristically need n · t · |L| ≈ r · 2^d, it is evident that we could end up with smaller values of t, and hence a reduction in the total number of non-linear multiplications required. Some technical hurdles arise from the fact that we would not gain anything if we insisted, as in the CRV method, that the pre-computed set of monomials L span all monomials in F_{2^n}[X]. Our generic method and its analysis are presented in Sect. 2.
Our method leads to improvements for most of the S-boxes found in practice. Table 1 lists the (worst-case) cost of processing arbitrary d-to-r-bit S-boxes using our method over F_{2^8} and F_{2^16}, and compares these with those of the previous methods. In particular, any 6-to-4-bit S-box, including all the DES S-boxes, now needs at most 3 non-linear multiplications instead of the previous best of 4 non-linear multiplications required by the CRV method, which works over F_{2^6} in this case (cf. Table 2). We discuss how to select suitable parameters for various S-box dimensions in Sect. 2.2.
Table 2. Estimated (cf. (11)) vs. observed values of M_{d,r,n}.

d                   | 4      | 5      | 6      | 6      | 7      | 8
r                   | 4      | 5      | 4      | 6      | 7      | 8
n                   | 4 8 16 | 5 8 16 | 6 8 16 | 6 8 16 | 7 8 16 | 8 16
Estimated M_{d,r,n} | 3 0 0  | 4 2 0  | 4 2 1  | 5 3 1  | 7 6 3  | 10 5
Observed M_{d,r,n}  | 2 2 2  | 4 3 3  | 4 3 3  | 5 4 3  | 7 6 4  | 10 6
results in [RV13,CRV15], while relatively little is known about the other cost
models. Also, this cost model and its variant where the circuit depth w.r.t. non-
linear multiplications also matters have found applications in fully homomorphic
encryption and multi-party computation settings [GHS12a,GHS12b,ARS+15].
We do not consider such applications in this work, and hence, prefer to work in
the non-linear multiplications cost model.
The functions E_{r,n} : {0,1}^r → F_{2^n} and D_{n,r} : F_{2^n} → {0,1}^r are similarly defined, as are E_{n,n} : {0,1}^n → F_{2^n} and D_{n,n} : F_{2^n} → {0,1}^n.

Remark 1. The composition map E_{d,n} ∘ D_{d,d} : F_{2^d} → F_{2^n} is a group homomorphism w.r.t. addition. But, in general, this map is not homomorphic w.r.t. multiplication.
We say that a polynomial P(X) ∈ F_{2^n}[X] evaluates a d-to-r-bit S-box S if the trailing r bits of its evaluation on the encoding of every d-bit string match the output of S. Formally,

    S(i) = D_{n,r}(P(E_{d,n}(i))),  ∀i ∈ {0,1}^d.    (2)

Our goal is to find a polynomial representation of a given S-box whose evaluation requires as few non-linear multiplications as possible.
Let C_α^n denote the cyclotomic class of α w.r.t. n (n ≥ 1, 0 ≤ α < 2^n) [CGP+12,RV13], that is, C_0^n = {0}, C_{2^n−1}^n = {2^n − 1} and

    C_α^n := { α · 2^i (mod 2^n − 1) : i = 0, 1, …, n − 1 }  for 0 < α < 2^n − 1.

For any subset Λ ⊆ {0, 1, …, 2^n − 1}, let X^Λ denote the set X^Λ := { X^i : i ∈ Λ } ⊆ F_{2^n}[X]. Define X^Λ · X^Λ := { X^i · X^j : i, j ∈ Λ }. Finally, P(X^Λ) ⊆ F_{2^n}[X] denotes the set of all polynomials (of degree at most 2^n − 1) that have their monomials only from X^Λ.
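These definitions are easy to experiment with. The following Python sketch (an illustration, not the paper's code) computes cyclotomic classes and checks whether the products of monomials from a union of cyclotomic classes yield every monomial X^0, …, X^{2^d−1} after reduction modulo X^{2^d} + X (the spanning property used in Heuristic 1):

```python
def cyclo_class(alpha, n):
    """Cyclotomic class C_alpha^n; C_0^n = {0} and C_{2^n-1}^n = {2^n-1}."""
    if alpha in (0, 2**n - 1):
        return {alpha}
    return {(alpha << i) % (2**n - 1) for i in range(n)}

def mul_exponent(i, j, d):
    """Exponent of X^i * X^j reduced modulo X^(2^d) + X."""
    e = i + j
    return e if e < 2**d else (e - 1) % (2**d - 1) + 1

def spans(reps, d):
    """Does the union of the classes of `reps` (w.r.t. d), squared as a set
    of monomials, contain {X^0, ..., X^(2^d - 1)}?"""
    lp = set().union(*(cyclo_class(a, d) for a in reps))
    return {mul_exponent(i, j, d) for i in lp for j in lp} >= set(range(2**d))
```

For example, the four classes with representatives 0, 1, 3, 7 already span everything for 6-bit inputs, while three classes do not.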
Let

    L′ = ∪_{C_{α_i}^d ∈ T′} C_{α_i}^d.    (4)

Now "lift" the above collection of cyclotomic classes w.r.t. d to a collection w.r.t. n. That is, for every C_{α_i}^d, we choose C_{α_i}^n for some representative α_i ∈ C_{α_i}^d. Define

    T = { C_{α_1=0}^n, C_{α_2=1}^n, C_{α_3}^n, …, C_{α_ℓ}^n }.    (5)

Let

    L = ∪_{C_{α_i}^n ∈ T} C_{α_i}^n.    (6)
Note that we will be using only the collection T and the set L in the decompo-
sition step of our method (cf. (8)).
Heuristic 1. We assume that it is possible to choose a T as specified above (for any ℓ "sufficiently smaller" than 2^d) in such a way that:

1. each cyclotomic class (except C_0^n) in T has (maximal) length n,
2. X^L can be computed using only ℓ − 2 non-linear multiplications,
3. X^{{0,1,2,…,2^d−1}} ⊆ X^{L′} · X^{L′} ⊆ F_{2^d}[X]. We refer to this property by saying that X^{L′} spans the set {1, X, X^2, …, X^{2^d−1}} in F_{2^d}[X].
The first two conditions above are also used in the CRV method. The difference is in the third condition (Heuristic 1.3). Note that this condition is only on the set L′, not on L. In the CRV method it is required that X^L spans {1, X, X^2, …, X^{2^n−1}} in F_{2^n}[X] (in their case n = d). But as we prescribe the values only on F_{2^d}, not on all of F_{2^n}, we do not need such a strong condition. Indeed, if we used this (stronger) condition from the CRV method, then we could not expect any improvement over the CRV method (it would actually be worse since we are working in a bigger field).
Remark 2. In general, X^L spans neither {1, X, X^2, …, X^{2^n−1}} nor {1, X, X^2, …, X^{2^d−1}} in F_{2^n}[X].
So we will make another assumption that turns out to be true experimentally
for instances of practical relevance.
Heuristic 2. Corresponding to any d-to-r-bit S-box S, there exists a polynomial in P(X^L · X^L) ⊆ F_{2^n}[X] that evaluates S.

The CRV method does not need to make the above assumption as the condition is implied by Heuristic 1.3 when n = d.
Remark 3. As noted in [RV13, Proof of Theorem 1], if d | n, then the cyclotomic classes C_u^n "lie above" C_z^d for every u ∈ C_z^d. That is, (δ mod (2^d − 1)) ∈ C_z^d for every δ ∈ C_v^n and every v ∈ C_z^d.
Note that

    |L| = 1 + n · (ℓ − 1).    (7)

We would like to choose as small a value for ℓ as possible while still satisfying Heuristic 1.3 (as we shall soon see, ℓ must also satisfy another (relatively milder) condition in Heuristic 4). We use the following heuristic from the CRV method for choosing a value of ℓ.
Heuristic 3. There exists a collection of cyclotomic classes T (w.r.t. d) satisfying Heuristic 1.3 such that ℓ ≈ \sqrt{2^d/d}.
Step 2. Then, as in the CRV method [CRV15, Sect. 4.3], we choose t − 1 random polynomials q_i(X) ←$ P(X^L) ⊆ F_{2^n}[X], for some parameter t to be determined later, that have their monomials only from X^L. Then we try to find t polynomials p_i(X) ∈ P(X^L) such that

    P(X) = \sum_{i=1}^{t-1} p_i(X) · q_i(X) + p_t(X)  mod (X^{2^n} + X)    (8)
evaluates S.
Note that Heuristic 2 guarantees that a decomposition of the form (8) exists for every d-to-r-bit S-box S for some t ≤ |L| · (|L| − 1). But we need to find as small a value of t as possible for a chosen L.
The unknown coefficients of the polynomials p_i(X) are obtained by evaluating P(X) at E_{d,n}(j) for all j ∈ {0,1}^d and then writing the resulting set of linear equations over F_2 instead of F_{2^n}. That is, we obtain a system of linear equations over F_2 with each equation corresponding to an output bit of S(j). The unknowns in these equations correspond to the (unknown) n "bits" of the unknown coefficients (from F_{2^n}) of the p_i(X). Denote the resulting system of linear equations as

    A · c = b,    (9)

where the matrix A over F_2 has r · 2^d rows and t · |L| · n columns, the F_2-vector c corresponds to the unknown bits of the (to-be-determined) coefficients of the p_i(X), and the F_2-vector b corresponds to the bits of the outputs of the S-box S. We can solve the above linear system for any b if A has rank r · 2^d. We make the following assumption, similar to the CRV method, which says that if the number of columns exceeds the number of rows, then the matrix A has full rank r · 2^d.

Heuristic 4. The condition t · |L| · n ≥ r · 2^d suffices for A to have (full) rank r · 2^d.
Once the solution vector c is computed, the unknown coefficients (from F_{2^n}) of the polynomials p_i(X), and hence the polynomial P(X), are readily obtained. This completes the description of our method.
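Solving (9) is plain Gaussian elimination over F_2. A compact Python sketch (rows stored as integer bitmasks; illustrative only, not the paper's implementation) that computes the rank of such a matrix:

```python
def gf2_rank(rows):
    """Rank over F_2 of a matrix whose rows are given as int bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        lsb = pivot & -pivot  # lowest set bit of the pivot row
        # eliminate that bit from every remaining row
        rows = [r ^ pivot if r & lsb else r for r in rows]
    return rank
```

By Heuristic 4, the r·2^d × t·|L|·n matrix A is expected to reach full rank r·2^d once t·|L|·n ≥ r·2^d, so that (9) is solvable for any right-hand side b.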
Remark 4. If the matrix A has full rank (r · 2^d) for a randomly chosen set of polynomials q_i(X) ∈ P(X^L), then this same set of polynomials will yield a decomposition of the form (8) for any d-to-r-bit S-box.
The total number of non-linear multiplications required by our method is

    M_{d,r,n} = (ℓ − 2) + (t − 1) = ℓ + t − 3.    (10)

From Heuristic 4, we need

    t ≥ r · 2^d / (|L| · n).

Together with (7), this gives

    M_{d,r,n} ≥ ℓ − 3 + r · 2^d / ((1 + n · (ℓ − 1)) · n).

Since, from Heuristic 3, we can set ℓ ≈ \sqrt{2^d/d}, we obtain from the above inequality

    M_{d,r,n} ≈ \sqrt{2^d/d} − 3 + r · 2^d / (n · (1 + n · (\sqrt{2^d/d} − 1))).    (11)
Hence in the limiting case the complexity of our method is half that of the CRV
method.
Numerical Experiments. In Table 2, we compare the estimate of (11) (rounded up to the next integer) with the observed complexity for various cases of practical interest. It turns out that the observed values are close to the estimated values.
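The estimate (11) is straightforward to evaluate; the following sketch reproduces the "estimated" row of Table 2 (rounding up and flooring at zero):

```python
import math

def estimated_M(d, r, n):
    """Estimated number of non-linear multiplications for a d-to-r-bit
    S-box over F_{2^n}, cf. (11)."""
    l = math.sqrt(2**d / d)  # number of cyclotomic classes, per Heuristic 3
    m = l - 3 + r * 2**d / (n * (1 + n * (l - 1)))
    return max(0, math.ceil(m))
```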
Remark 5. Experiments tend to indicate that the value of t cannot be made arbitrarily small with increasing values of n. The resulting ranks of the matrices seem to saturate after a certain value of n. This, of course, has to do with the structure of the pre-computed set X^L. But the dependency is currently unclear, and hence we are unable to give a lower bound on the value of t, unlike the case of ℓ.
Table 3. Choosing parameters ℓ, t and L for evaluating d-to-r-bit S-boxes over F_{2^n}, where L is always the union of the first ℓ elements of {C_0^n, C_1^n, C_3^n, C_7^n, C_{29}^n, C_{87}^n}.

d   | 4     | 5     | 6     | 6        | 7        | 8
r   | 4     | 5     | 4     | 6        | 7        | 8
n   | 4 8   | 5 8   | 6 8   | 6 8 16   | 7 8 16   | 8 16
ℓ   | 3 3   | 4 4   | 4 4   | 4 4 4    | 5 5 5    | 6 6
t   | 2 2   | 3 2   | 3 2   | 4 3 2    | 5 4 2    | 7 3
|L| | 9 17  | 16 25 | 19 25 | 19 25 49 | 29 33 65 | 41 81
where p_1, q_1, p_2 ∈ P(X^L) ⊆ F_{2^8}[X], L = C_0^8 ∪ C_1^8 ∪ C_3^8 ∪ C_7^8, and the coefficients of q_1 are randomly chosen from F_{2^8} (cf. Table 3 and Algorithm 1). Table 4 describes a polynomial q_1 that will yield the above decomposition for each of the 8 S-boxes of DES.
Table 4. A polynomial q_1 that could be used in common for all the DES S-boxes. The irreducible polynomial used to represent F_{2^8} is a^8 + a^4 + a^3 + a + 1.

(a^2)·x^224 + (a^7+a^6+a^5+a^4+a^2+1)·x^193 + (a^7+a^4+a+1)·x^192 + (a^7+a^5+a^3+a^2+a+1)·x^131 + (a^6+a^3+1)·x^129 + (a^7+a^5+a)·x^128 + (a^6+a^5+a^4+a)·x^112 + (a^7+a^5+a^4+a^2+a)·x^96 + (a^7+a^5+a^4+a^3+a^2+1)·x^64 + (a^5+a^4)·x^56 + (a^7+a^6+a^3+a^2+a)·x^48 + (a^6+a^3+a^2+a)·x^32 + (a^6+a^3+1)·x^28 + (a^5+a)·x^24 + (a^7+a^5+a^4+a^3+a+1)·x^16 + (a^7+a^6+a^5+a^4+a+1)·x^14 + (a^5+a^4+a^3)·x^12 + (a^7+a^4+a+1)·x^8 + (a^3+1)·x^7 + (a^7+a^4+a+1)·x^6 + (a^6+a^4+a^3)·x^4 + (a^6+a^5+a^4+a^3+a^2+a+1)·x^3 + (a^7+a^3+a)·x^2 + (a^6+a^4+a^3+a^2+a)·x + (a^7+a^3+a^2+a+1)
the full security model of [ISW03], and n = 2t + 1 is the number of shares. The reported RAM usage (in bytes) is only for the S-box computations, and the total CPU time for a DES encryption is measured in milliseconds. The penalty factor (PF) is the ratio of the total execution time of a given method to that of an unprotected implementation. The total number of calls made to the PRG that outputs random bytes is 1000 times the reported quantity.
p(X) := \prod_{z ∈ F_2^d} (X − z) without changing these. So we can work with polynomials of degree < 2^d instead of 2^n, as is the case when the table is defined on all of F_{2^n}. However, in general, the polynomial p does not have a nice structure. But if F_2^d = F_{2^d} is the unique subfield of order 2^d of F_{2^n}, then p(X) = X^{2^d} + X, and the equation x^{2^d} = x for all elements x ∈ F_{2^d} (≤ F_{2^n}) implies

    f(x) = \sum_{0 ≤ i ≤ deg f} f_i x^i = f_0 + \sum_{0 < j < 2^d} ( \sum_{i ≥ 1, i ≡ j mod 2^d − 1} f_i ) x^j.
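The collapse of exponents can be checked numerically. A small Python sketch, using F_{2^3} represented by the irreducible polynomial x^3 + x + 1 (an arbitrary illustrative choice) in the role of the subfield:

```python
def gf_mul(a, b, poly, deg):
    """Multiplication in F_{2^deg} represented by the irreducible `poly`."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if (a >> deg) & 1:
            a ^= poly
        b >>= 1
    return r

def evaluate(f, x, poly, deg):
    """Evaluate f given as {exponent: coefficient}; addition is XOR."""
    total = 0
    for i, fi in f.items():
        xi = 1
        for _ in range(i):
            xi = gf_mul(xi, x, poly, deg)
        total ^= gf_mul(fi, xi, poly, deg)
    return total

def collapse(f, d):
    """Reduce f modulo X^(2^d) + X: add up coefficients of exponents that
    are congruent mod 2^d - 1 (the constant term is kept separate)."""
    out = {}
    for i, fi in f.items():
        j = 0 if i == 0 else (i - 1) % (2**d - 1) + 1
        out[j] = out.get(j, 0) ^ fi
    return out
```

For instance, over F_{2^3} the monomial X^8 collapses to X, and X^9 + X^2 collapses to the zero polynomial; the collapsed polynomial agrees with the original on every subfield element.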
Working over a bigger field than in the original CRV method has two benefits. The cyclotomic classes over F_{2^n} have sizes up to n, and hence more elements than the at most d possible over F_{2^d}, so that one gathers more degrees of freedom per non-linear multiplication in Step 1.¹ Additionally, some extra power is given by being able to choose the coefficients of the polynomials in Step 2 from a bigger field:
Lemma 1. Given 2k polynomials f_i, g_i ∈ F_{2^n}[X] (0 ≤ i < k), there exists an extension field F_{2^{n′}} of F_{2^n} and elements a_i, b_i ∈ F_{2^{n′}} such that for every i the function x ↦ f_i(x) · g_i(x) defined on F_{2^n} is an F_2-linear image of the single

Since, for x ∈ F_{2^n}, we also have f_i(x)g_j(x) ∈ F_{2^n}, the claim is proved.
Remark 6. The technique in the proof of Lemma 1 can be used to evaluate the non-linear part of the AES S-box, given by the monomial X^254 (over F_{2^8}), with 3 non-linear multiplications over F_{2^16}. The first non-linear multiplication is spent to get X^3, the second to multiply X^2 + z · X^3 by (X^3)^4, where z is any element of F_{2^16} \ F_{2^8}. From the result X^14 + z · X^15, one can F_2-linearly extract the functions x ↦ x^14 and x ↦ x^15 defined over the subfield F_{2^8}, which enables one to finally obtain X^254 = X^14 · (X^15)^16.
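This 3-multiplication chain can be verified directly. The Python sketch below builds F_{2^16} as the quadratic extension F_{2^8}[w]/(w^2 + w + c) of the AES-polynomial representation of F_{2^8} (both representation choices are illustrative assumptions, not taken from the paper) and uses z = w, so that extracting x^14 and x^15 from x^14 + z·x^15 amounts to reading off coordinates, which is F_2-linear:

```python
import functools
import operator

def mul8(a, b):
    """Multiplication in F_{2^8} with the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def pow8(x, e):
    r = 1
    while e:
        if e & 1:
            r = mul8(r, x)
        x = mul8(x, x)
        e >>= 1
    return r

# w^2 + w + c is irreducible over F_{2^8} iff the absolute trace of c is 1.
C = next(c for c in range(256)
         if functools.reduce(operator.xor, (pow8(c, 2**i) for i in range(8))) == 1)

def mul16(x, y):
    """(a1 + b1*w)(a2 + b2*w) in F_{2^16}, using w^2 = w + C."""
    (a1, b1), (a2, b2) = x, y
    bb = mul8(b1, b2)
    return (mul8(a1, a2) ^ mul8(bb, C), mul8(a1, b2) ^ mul8(a2, b1) ^ bb)

def x254_in_3_mults(x):
    """x^254 over F_{2^8} using 3 non-linear multiplications (cf. Remark 6)."""
    x2 = mul8(x, x)              # squaring: F_2-linear, free
    x3 = mul8(x, x2)             # non-linear multiplication 1
    s = (x2, x3)                 # x^2 + z*x^3 with z = w
    t = (pow8(x3, 4), 0)         # (x^3)^4: two squarings, free
    a, b = mul16(s, t)           # non-linear multiplication 2: x^14 + z*x^15
    # a = x^14 and b = x^15; here the F_2-linear extraction is a projection
    return mul8(a, pow8(b, 16))  # non-linear multiplication 3: x^14 * x^240
```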
¹ Lemma 2 generalizes this statement about monomials to polynomials.
Proof. F is the image of the F_{2^n}-linear map ϕ : g ↦ g ∘ f from the set End_{F_2}(F_{2^n}) of F_2-linear maps F_{2^n} → F_{2^n} to the set F_{2^n}^Z of functions Z → F_{2^n}. End_{F_2}(F_{2^n}) = F_{2^n} ⊗_{F_2} F_{2^n}^* has dimension n over F_{2^n} (with * denoting the dual F_2-vector space). The kernel of ϕ is the F_{2^n}-subspace of F_2-linear maps whose restriction to the image of f in F_{2^n} is 0. This is the tensor product with F_{2^n} (over F_2) of the annihilator (≤ F_{2^n}^*) of the image {f(x) | x ∈ Z} of f in F_{2^n}, proving the claim.
Example 1. For monomials X^α the dimension of the set F from Lemma 2 is the cardinality of the cyclotomic class containing α. For example, in the field F_64 the cyclotomic classes of 9 = 1001_2 and 21 = 10101_2 have size 3 resp. 2, so the dimension of the corresponding F over F_64 is 3 resp. 2. On the other hand, the images under f(x) = x^9 resp. g(x) = x^21 of the multiplicative group F_64^× ≅ Z_63 have order 7 resp. 3, and are therefore the multiplicative groups of the subfields F_8 resp. F_4. Their dimensions over F_2 are 3 resp. 2, as claimed by the lemma.
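The numbers in this example follow from elementary group theory, since the image of x ↦ x^α on the cyclic group F_64^× of order 63 has order 63/gcd(α, 63); a quick check:

```python
import math

def cyclo_size(alpha, n):
    """Cardinality of the cyclotomic class of alpha w.r.t. n."""
    if alpha % (2**n - 1) == 0:
        return 1
    return len({(alpha * 2**i) % (2**n - 1) for i in range(n)})

def image_order(alpha, n):
    """Order of the image of x -> x^alpha on a cyclic group of order 2^n - 1."""
    return (2**n - 1) // math.gcd(alpha, 2**n - 1)
```

The images of order 7 and 3 are F_8^× and F_4^×, and the F_2-dimensions of those subfields, log_2(7 + 1) = 3 and log_2(3 + 1) = 2, match the class sizes.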
A criterion for having enough degrees of freedom in Step 2 is given by:
Lemma 3. Let F be an F_{2^n}-subspace of F_{2^n}[X]/(X^{2^n} + X) that is closed under taking squares. Then the F_{2^n}-subspace ⟨F · F⟩_{F_{2^n}} generated by the products of pairs of elements of F contains F, is also closed under taking squares, and has dimension at most dim F + \binom{dim F}{2}.

Proof. As squaring is a field automorphism, only the statement about the dimensions needs to be proved. But this follows from the commutativity of multiplication, as for any basis (f_i) of F the set (f_i · f_j)_{i≤j} generates ⟨F · F⟩_{F_{2^n}}.
The remainder of this section is devoted to proving a lower worst-case bound on the number of non-linear multiplications over F_{2^n} needed for functions from F_2^d to F_2^r with d, r ≤ n but not necessarily d | n. The proof is an adaptation of [CRV14, Proposition 3] to our situation with minor improvements.

² For a polynomial f = \sum_l f_l · X^l this is max{Hamming weight(l) | f_l ≠ 0}.
³ Corresponding to choosing L in Algorithm 1 as the union of cyclotomic classes that have as many elements as possible, to get as many degrees of freedom as possible for the linear equation system being constructed.
Proposition 1. For d, r ≤ n and fixed subspaces F_2^d, F_2^r ≤ F_{2^n} there is a function f : F_2^d → F_2^r such that any polynomial in F_{2^n}[X] representing f requires at least

    ( \sqrt{r(2^d − 1 − d) + (d + (r−n)/2)^2} − (d + (r−n)/2) ) / n

non-linear multiplications for evaluation. In case n = r = d this term simplifies to \sqrt{(2^n − 1)/n} − 1.
Proof. Without loss of generality, we may look only at functions that map 0 to 0: the only monomial not fixing zero is 1, and on F_{2^n} \ {0} the monomial X^{2^n−1} is constant 1. This allows us to work with linear functions where the authors of [CRV14] used affine functions instead. Starting with z_0 = id|_{F_2^d} one can get all F_2-linear functions F_2^d → F_{2^n} without using any non-linear multiplication. Having obtained z_0, …, z_j using exactly j non-linear multiplications, one can choose F_2-linear maps λ_{0,j}, λ′_{0,j} : F_2^d → F_{2^n} and λ_{1,j}, λ′_{1,j}, …, λ_{j,j}, λ′_{j,j} : F_{2^n} → F_{2^n} to get

    z_{j+1} = ( \sum_{i=0}^{j} λ_{i,j} ∘ z_i ) · ( \sum_{i=0}^{j} λ′_{i,j} ∘ z_i ).

(Adding a constant to either factor changes z_{j+1} by a summand that can be represented already by the z_i with i ≤ j.) With the help of z_0, …, z_k we then can evaluate

    f = \sum_{0 ≤ i ≤ k} λ_i ∘ z_i

for F_2-linear maps λ_0 : F_2^d → F_2^r and λ_1, …, λ_k : F_{2^n} → F_2^r without further non-linear multiplication. Conversely, any f : F_2^d → F_2^r fixing 0 that can be evaluated using at most k non-linear multiplications is of this form.
In total we have to choose 2k F_2-linear maps from F_2^d to F_{2^n}, 2 · \sum_{i=0}^{k−1} i = k(k − 1) maps from F_{2^n} to F_{2^n}, one map from F_2^d to F_2^r, and k maps from F_{2^n} to F_2^r, giving us

    ((2^n)^d)^{2k} · ((2^n)^n)^{k(k−1)} · (2^r)^d · ((2^r)^n)^k = 2^{2ndk + n^2 k(k−1) + rd + rnk}

choices. As there are (2^r)^{2^d−1} = 2^{r(2^d−1)} functions from F_2^d to F_2^r mapping 0 to 0, to get enough functions we need

    2ndk + n^2 k(k−1) + rd + rnk ≥ r(2^d − 1).

Via (nk)^2 + (2d + r − n) · nk ≥ r(2^d − 1) − rd and (nk + (d + (r−n)/2))^2 = (nk)^2 + (2d + r − n) · nk + (d + (r−n)/2)^2, this is equivalent to

    k ≥ ( \sqrt{r(2^d − 1 − d) + (d + (r−n)/2)^2} − (d + (r−n)/2) ) / n.
Remark 7. As the images of the z_j's in the proof of Proposition 1 can span at most a (2^d − 1)-dimensional F_2-subspace of F_{2^n}, Lemma 2 shows that for n ≥ 2^d − 1 the maps λ_{i,j}, λ′_{i,j} and λ_i with i > 0 have to be defined only on these (2^d − 1)-dimensional subspaces, reducing the degrees of freedom for obtaining the next z_j resp. f. With n̄ := min{n, 2^d − 1} the number of choices reduces to 2^{2ndk + n·n̄·k(k−1) + rd + r·n̄·k}, but as one gets better lower bounds by using the algebraic degree,² we do not expand upon this.
References
[ARS+15] Albrecht, M.R., Rechberger, C., Schneider, T., Tiessen, T., Zohner, M.:
Ciphers for MPC and FHE. In: Oswald, E., Fischlin, M. (eds.) EURO-
CRYPT 2015. LNCS, vol. 9056, pp. 430–454. Springer, Heidelberg (2015)
[Bar86] Barrett, P.: Implementing the Rivest Shamir and Adleman public key
encryption algorithm on a standard digital signal processor. In: Odlyzko,
A.M. (ed.) CRYPTO 1986. LNCS, vol. 263, pp. 311–323. Springer,
Heidelberg (1987)
[BFG15] Balasch, J., Faust, S., Gierlichs, B.: Inner product masking revisited. In:
Oswald, E., Fischlin, M. (eds.) EUROCRYPT 2015. LNCS, vol. 9056, pp.
486–510. Springer, Heidelberg (2015)
[BFGV12] Balasch, J., Faust, S., Gierlichs, B., Verbauwhede, I.: Theory and prac-
tice of a leakage resilient masking scheme. In: Wang, X., Sako, K. (eds.)
ASIACRYPT 2012. LNCS, vol. 7658, pp. 758–775. Springer, Heidelberg
(2012)
[CGP+12] Carlet, C., Goubin, L., Prouff, E., Quisquater, M., Rivain, M.: Higher-order
masking schemes for S-boxes. In: Canteaut, A. (ed.) FSE 2012. LNCS, vol.
7549, pp. 366–384. Springer, Heidelberg (2012)
[CJRR99] Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches
to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999.
LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999)
[Cor13] Coron, J.-S. (2013). https://github.com/coron/htable/
[Cor14] Coron, J.-S.: Higher order masking of look-up tables. In: Nguyen, P.Q.,
Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol. 8441, pp. 441–458.
Springer, Heidelberg (2014)
[CPRR13] Coron, J.-S., Prouff, E., Rivain, M., Roche, T.: Higher-order side channel
security and mask refreshing. In: Moriai, S. (ed.) FSE 2013. LNCS, vol.
8424, pp. 410–424. Springer, Heidelberg (2014)
[CPRR15] Carlet, C., Prouff, E., Rivain, M., Roche, T.: Algebraic decomposition for
probing security. In: Gennaro, R., Robshaw, M. (eds.) CRYPTO 2015.
LNCS, vol. 9215, pp. 742–763. Springer, Heidelberg (2015)
[CRV14] Coron, J.-S., Roy, A., Vivek, S.: Fast evaluation of polynomials over binary
finite fields and application to side-channel countermeasures. In: Batina, L.,
Robshaw, M. (eds.) CHES 2014. LNCS, vol. 8731, pp. 170–187. Springer,
Heidelberg (2014)
[CRV15] Coron, J.-S., Roy, A., Vivek, S.: Fast evaluation of polynomials over binary
finite fields and application to side-channel countermeasures. J. Crypto-
graphic Eng. 5(2), 73–83 (2015)
[DDF14] Duc, A., Dziembowski, S., Faust, S.: Unifying leakage models: from probing
attacks to noisy leakage. In: Nguyen, P.Q., Oswald, E. (eds.) EUROCRYPT
2014. LNCS, vol. 8441, pp. 423–440. Springer, Heidelberg (2014)
[DFS15] Dziembowski, S., Faust, S., Skorski, M.: Noisy leakage revisited. In: Oswald,
E., Fischlin, M. (eds.) EUROCRYPT 2015. LNCS, vol. 9057, pp. 159–188.
Springer, Heidelberg (2015)
[GHS12a] Gentry, C., Halevi, S., Smart, N.P.: Homomorphic evaluation of the AES
circuit. IACR Cryptology ePrint Archive 2012:99 (2012)
[GHS12b] Gentry, C., Halevi, S., Smart, N.P.: Homomorphic evaluation of the AES
circuit. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS, vol.
7417, pp. 850–867. Springer, Heidelberg (2012)
[GM11] Goubin, L., Martinelli, A.: Protecting AES with Shamir’s secret sharing
scheme. In: Takagi, T., Preneel, B. (eds.) CHES 2011. LNCS, vol. 6917,
pp. 79–94. Springer, Heidelberg (2011)
[GMPT15] Longo, J., De Mulder, E., Page, D., Tunstall, M.: SoC it to EM: electromag-
netic side-channel attacks on a complex system-on-chip. In: Güneysu, T.,
Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp. 620–640. Springer,
Heidelberg (2015)
[GPS14] Grosso, V., Prouff, E., Standaert, F.-X.: Efficient masked S-boxes processing – a step forward –. In: Pointcheval, D., Vergnaud, D. (eds.) AFRICACRYPT 2014. LNCS, vol. 8469, pp. 251–266. Springer, Heidelberg (2014)
[ISW03] Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware against
probing attacks. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol. 2729, pp.
463–481. Springer, Heidelberg (2003)
[KJJ99] Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M.
(ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg
(1999)
[Koc96] Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA,
DSS, and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol.
1109, pp. 104–113. Springer, Heidelberg (1996)
[Lim13] ARM Limited. NEON Programmer’s Guide (2013)
[NO14] Nguyen, P.Q., Oswald, E.: EUROCRYPT 2014. LNCS, vol. 8441. Springer,
Heidelberg (2014)
[OF15] Oswald, E., Fischlin, M.: EUROCRYPT 2015. LNCS, vol. 9056. Springer,
Heidelberg (2015)
[oST93] National Institute of Standards and Technology. FIPS 46-3: Data Encryption Standard, March 1993. http://csrc.nist.gov
[PR11] Prouff, E., Roche, T.: Higher-order glitches free implementation of the AES
using secure multi-party computation protocols. In: Preneel, B., Takagi, T.
(eds.) CHES 2011. LNCS, vol. 6917, pp. 63–78. Springer, Heidelberg (2011)
[PR13] Prouff, E., Rivain, M.: Masking against side-channel attacks: a formal security proof. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT 2013.
LNCS, vol. 7881, pp. 142–159. Springer, Heidelberg (2013)
[PT11] Preneel, B., Takagi, T.: CHES 2011. LNCS, vol. 6917. Springer, Heidelberg
(2011)
[RP10] Rivain, M., Prouff, E.: Provably secure higher-order masking of AES. In:
Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp.
413–427. Springer, Heidelberg (2010)
[RV13] Roy, A., Vivek, S.: Analysis and improvement of the generic higher-order
masking scheme of FSE 2012. In: Bertoni, G., Coron, J.-S. (eds.) CHES
2013. LNCS, vol. 8086, pp. 417–434. Springer, Heidelberg (2013)
Reducing the Number of Non-linear Multiplications in Masking Schemes 497
[SP06] Schramm, K., Paar, C.: Higher order masking of the AES. In: Pointcheval,
D. (ed.) CT-RSA 2006. LNCS, vol. 3860, pp. 208–225. Springer, Heidelberg
(2006)
[Wie99] Wiener, M.J.: CRYPTO 1999. LNCS, vol. 1666. Springer, Heidelberg
(1999)
[WVGX15] Wang, J., Vadnala, P.K., Großschädl, J., Xu, Q.: Higher-order masking
in practice: a vector implementation of masked AES for ARM NEON. In:
Nyberg, K. (ed.) CT-RSA 2015. LNCS, vol. 9048, pp. 181–198. Springer,
Heidelberg (2015)
Faster Evaluation of SBoxes via Common Shares
1 Introduction
E. Prouff—Part of this work has been done at Safran Identity and Security, and
while the author was at ANSSI, France.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 498–514, 2016.
DOI: 10.1007/978-3-662-53140-2 24
Faster Evaluation of SBoxes via Common Shares 499
The ISW Probing Model. Ishai, Sahai and Wagner [ISW03] initiated the theoretical study of securing circuits against an adversary who can probe a fraction of their wires. They showed how to transform any circuit of size |C| into a circuit of size O(|C| · t^2) secure against any adversary who can probe at most t wires. The ISW construction consists in secret-sharing every variable x into x = x_1 ⊕ x_2 ⊕ · · · ⊕ x_n, where x_2, . . . , x_n are uniformly and independently distributed bits, with n ≥ 2t + 1 to get security against t probes. Processing an XOR gate
is straightforward as the shares can be xored separately. The processing of an
AND gate z = xy is based on writing:
z = xy = (⊕_{i=1}^{n} x_i) · (⊕_{j=1}^{n} y_j) = ⊕_{1 ≤ i,j ≤ n} x_i y_j        (1)
where the cross-products x_i y_j are first computed and then randomly recombined to get an n-sharing of the output z. This construction, called the ISW gadget in the rest of this paper, enables, in its general form, the secure evaluation of a multiplication at the cost of n^2 multiplications, 2n(n − 1) additions and n(n − 1)/2 random values. Its complexity is therefore O(n^2), which implies that the new circuit with security against t probes has O(|C| · t^2) gates.
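To make the ISW gadget concrete, here is a minimal Python sketch of the AND gadget over F_2 (the function names `share_bit` and `isw_and` are ours, not from the paper; `secrets` stands in for a cryptographic RNG):

```python
import secrets
from functools import reduce

def xor_all(shares):
    return reduce(lambda u, v: u ^ v, shares, 0)

def share_bit(x, n):
    """Split bit x into n shares with x_1 ^ ... ^ x_n = x."""
    s = [secrets.randbits(1) for _ in range(n - 1)]
    return s + [x ^ xor_all(s)]

def isw_and(xs, ys):
    """ISW AND gadget: from n-sharings of x and y, compute an n-sharing of
    z = x AND y by randomly recombining the cross-products x_i y_j of Eq. (1)."""
    n = len(xs)
    z = [xs[i] & ys[i] for i in range(n)]       # diagonal products x_i y_i
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(1)             # fresh mask r_{i,j}
            z[i] ^= r
            # r cancels in the xor of all shares, so correctness is preserved
            z[j] ^= (xs[i] & ys[j]) ^ r ^ (xs[j] & ys[i])
    return z
```

Xoring all output shares recovers x AND y, since every mask r is added to exactly two shares and cancels, leaving the full sum of cross-products of Eq. (1).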
A proof of security in the ISW framework is usually simulation-based: one must show that any set of t probes can be perfectly simulated without the knowledge of the original variables of the circuit. In [ISW03] and subsequent work this is done by progressively generating a subset I of input shares such that the knowledge of those input shares is sufficient to simulate all the t probes. For example, in the above AND gate, if the adversary probes x_i · y_j, one would put both indices i and j in I, so that the simulator gets the input shares x_i and y_j, and can therefore simulate the product x_i · y_j. More generally, in the ISW construction every probe adds at most two indices to I, which implies |I| ≤ 2t. Therefore if the number of shares n is such that n ≥ 2t + 1, then |I| < n, which implies that only a proper subset of the input shares is required for the simulation; those input shares can in turn be generated as independently uniformly distributed bits. Therefore, the knowledge of the original circuit variables is not required to generate a perfect simulation of the t probes, hence these probes do not bring any additional information to the attacker (since he could perform that simulation by himself).
500 J.-S. Coron et al.
Existing Work. In the last decade, several masking countermeasures have been proposed for block-ciphers, together with security proofs in the ISW probing model, based on the original notion of private circuits introduced in [ISW03]. Except for [Cor14], which extends the original idea of [KJJ99] to any order, these proposals are based on the ISW gadget recalled above. The core idea of the latter works is to split the processing into a short sequence of field multiplications and F_2-linear operations, and then to secure these operations independently, while ensuring that the local security proofs can be combined to prove the security of the entire processing. When parametrized at order n, as recalled above, the complexity of the ISW gadget for the field multiplication is O(n^2), but only O(n) for F_2-linear operations.1 Therefore, an interesting problem is to minimize the number of field multiplications required to evaluate an SBox.
In the Rivain-Prouff countermeasure [RP10], the authors showed how to adapt the ISW circuit construction to a software implementation of AES, by working in F_{2^8} instead of F_2. Namely, as illustrated in Fig. 1, the non-linear part S(x) = x^254 of the AES SBox can be evaluated with only 4 non-linear multiplications over F_{2^8} and a few linear squarings. Each of those 4 multiplications can in turn be evaluated with the previous ISW gadget based on Eq. (1), by working over F_{2^8} instead of F_2.
Fig. 1. (a) Sequential computation of x^254 as used in [RP10, BBD+15a]. (b) Alternative computation of x^254; the multiplications x^14 = x^12 · x^2 and x^15 = x^12 · x^3 can be computed in parallel [GHS12].
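To make the two addition chains of Fig. 1 concrete, the following plain (unmasked) Python sketch evaluates both over the AES field F_{2^8} (the function names are ours; the reduction polynomial x^8 + x^4 + x^3 + x + 1, i.e. 0x11B, is the standard AES one):

```python
def gf_mul(a, b):
    """Multiplication in the AES field F_2[x]/(x^8 + x^4 + x^3 + x + 1)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def sq(v):
    return gf_mul(v, v)   # squaring is F_2-linear (Frobenius)

def exp254_sequential(x):
    """Fig. 1(a): the sequential chain of [RP10] / Algorithm 3."""
    z = sq(x)                 # x^2
    y = gf_mul(z, x)          # x^3
    w = sq(sq(y))             # x^12
    y = gf_mul(y, w)          # x^15
    y = sq(sq(sq(sq(y))))     # x^240
    y = gf_mul(y, w)          # x^252
    return gf_mul(y, z)       # x^254

def exp254_parallel(x):
    """Fig. 1(b): x^14 = x^12 * x^2 and x^15 = x^12 * x^3 share the
    operand x^12 and can be computed in parallel."""
    z = sq(x)                 # x^2
    y = gf_mul(z, x)          # x^3
    w = sq(sq(y))             # x^12
    z = gf_mul(w, z)          # x^14  } common operand w
    y = gf_mul(w, y)          # x^15  }
    y = sq(sq(sq(sq(y))))     # x^240
    return gf_mul(y, z)       # x^254 = x^240 * x^14
```

Since the multiplicative group of F_{2^8} has order 255, x^254 is the field inverse of x for x ≠ 0, which both chains compute with exactly 4 non-linear multiplications.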
Refined Security Model: t-SNI Security. Since in this paper we are interested in efficiency improvements, we would like to use the optimal number of shares n = t + 1 instead of n = 2t + 1 as in the original ISW countermeasure. For n ≥ 2t + 1 shares the security proof for the single ISW multiplication gadget easily extends to the full circuit [ISW03]; however, for n ≥ t + 1 shares only, one must be extra careful. For example, for the Rivain-Prouff countermeasure, it was originally claimed in [RP10] that only n ≥ t + 1 shares were required, but an attack of order ⌊(n − 1)/2⌋ + 1 was later described in [CPRR13]; the security proof in [RP10] with n ≥ t + 1 shares actually applies only when the ISW multiplication is used in isolation, but not for the full block-cipher.
To prove security with n ≥ t + 1 shares only for the full block-cipher, a refined security model against probing attacks was recently introduced in [BBD+15a], called t-SNI security. As shown in [BBD+15a], this stronger definition of t-SNI security enables to prove that a gadget can be used in a full construction with n ≥ t + 1 shares, instead of n ≥ 2t + 1 for the weaker definition of t-NI security (corresponding to the original ISW security proof). The authors show that the ISW multiplication gadget does satisfy this stronger t-SNI security definition. They also show that with some additional mask refreshing, the Rivain-Prouff countermeasure for the full AES can be made secure with n ≥ t + 1 shares.
Due to its power and simplicity, the t-SNI notion appears to be the “right” security definition against probing attacks. Therefore, in this paper, we always prove the security of our algorithms under this stronger t-SNI notion, so that our algorithms can be used within a larger construction (typically a full block-cipher) with n ≥ t + 1 shares only.
1 A multiplication over a field of characteristic 2 is F_2-linear if it corresponds to a Frobenius automorphism, i.e. to a series of squarings.
502 J.-S. Coron et al.
Our Contribution. Our goal in this paper is to further improve the efficiency of the masking countermeasure. As recalled above, until now the strategy followed by the community has been to reduce the number of calls to the ISW multiplication gadget. In this paper, we follow a complementary approach consisting in reducing the complexity of the ISW multiplication gadget itself. Our core idea is to use common shares between the inputs of multiple ISW multiplication gadgets, up to the first n/2 shares; in that case, a given processing performed in the first ISW gadget can be re-used in subsequent gadgets.
Consider for example the alternative evaluation circuit for x^254 in AES used in [GHS12], as illustrated in Fig. 1. It still has 4 non-linear multiplications as in the original circuit [RP10], but now the two multiplications x^14 ← x^12 · x^2 and x^15 ← x^12 · x^3 can be evaluated in parallel, moreover with a common operand x^12. Denote by d ← c · a and e ← c · b those two multiplications with common operand c. In the ISW multiplication gadget, one must compute all cross-products c_i · a_j and c_i · b_j for all 1 ≤ i, j ≤ n. Now if we can ensure that half of the shares of a and b are the same, that is a_j = b_j for all 1 ≤ j ≤ n/2, then the products c_i · a_j and c_i · b_j for 1 ≤ j ≤ n/2 are the same and can be computed only once; see Fig. 2 for an illustration. This implies that when processing the second multiplication gadget for e ← c · b, we only have to compute n^2/2 finite field multiplications instead of n^2. For two multiplications as above, this saves the equivalent of 0.5 multiplication.
Fig. 2. When half of the shares in a and b are the same, the multiplications corresponding to the left-hand blocks are the same. This saves the equivalent of 0.5 multiplications out of 2.
To ensure that the two inputs have half of their shares in common, we introduce a new gadget called CommonShares with complexity O(n), taking as input two independent n-sharings of data and outputting two new n-sharings, but with their first n/2 shares in common. Obviously this must be achieved without degrading the security level; we show that this is indeed the case by proving the security of the full SBox evaluation in the previous t-SNI model, with n ≥ t + 1 shares. Note that we cannot have more than n/2 shares in common between two variables a and b, since otherwise there would be a straightforward attack with fewer than n probes: namely if a_i = b_i for all 1 ≤ i ≤ k, then we can probe the 2(n − k) remaining shares a_i and b_i for k + 1 ≤ i ≤ n; if k > n/2 this gives strictly fewer than n shares, whose xor gives the secret variable a ⊕ b. Hence having half of the shares in common is optimal.
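This optimality argument can be checked numerically. The Python sketch below (function names and parameters are our own illustration) builds sharings of a and b whose first k shares coincide, and shows that when k > n/2, xoring only the 2(n − k) < n non-common shares already reveals a ⊕ b:

```python
import secrets
from functools import reduce

def xor_all(s):
    return reduce(lambda u, v: u ^ v, s, 0)

def share_with_common(a, b, n, k):
    """Return n-sharings of bytes a and b whose first k shares coincide."""
    common = [secrets.randbits(8) for _ in range(k)]
    ta = [secrets.randbits(8) for _ in range(n - k - 1)]
    tb = [secrets.randbits(8) for _ in range(n - k - 1)]
    sa = common + ta + [a ^ xor_all(common) ^ xor_all(ta)]
    sb = common + tb + [b ^ xor_all(common) ^ xor_all(tb)]
    return sa, sb

# With k = 5 > n/2 = 4, probing the 2(n - k) = 6 < n = 8 non-common
# shares leaks a ^ b: the common prefix cancels in the xor.
n, k = 8, 5
a, b = 0x3A, 0xC5
sa, sb = share_with_common(a, b, n, k)
assert xor_all(sa[k:] + sb[k:]) == a ^ b
```

The common prefix cancels because it contributes the same value to both sharings, so the xor of the two tails equals a ⊕ b directly.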
More generally, the 16 SBoxes of AES can be processed in parallel, and therefore each of the 4 non-linear multiplications in x^254 can be processed in parallel. As opposed to the previous case those multiplications do not share any operand, but we show that by using a generalization of the CommonShares algorithm between m operands instead of 2, for every multiplication in the original circuit one can still save the equivalent of roughly 1/4 multiplication. This also applies to other block-ciphers, since in most block-ciphers the SBoxes are applied in parallel. One can therefore apply the technique from [CRV14] based on fast polynomial evaluation, and using our CommonShares algorithm between the inputs of the evaluated polynomials, we again save roughly 1/4 of the number of finite field multiplications. Our results for various block-ciphers are summarized in Table 1, in which we give the equivalent number of non-linear multiplications for a single SBox evaluation; we refer to Sect. 5 for a detailed description. Finally, we show in the full version of this paper [CGPZ16] how to apply our common shares technique to the Threshold Implementations (TI) approach for securing implementations against side-channel attacks, even in the presence of glitches.
Table 1. Equivalent number of non-linear multiplications for a single SBox evaluation.

Method                  AES   DES   PRESENT   SERPENT   CAMELLIA   CLEFIA
Parity-Split [CGP+12]   4     10    3         3         22         22
Roy-Vivek [RV13]        4     7     3         3         15         15, 16
[CRV14]                 4     4     2         2         10         10
Our method              2.8   3.1   1.5       1.5       7.8        7.8
2 Security Definitions
Given a variable x ∈ F_{2^k} and an integer n, we say that the vector (x_1, . . . , x_n) ∈ (F_{2^k})^n is an n-sharing of x if x = ⊕_{i=1}^n x_i. We recall the security definitions
from [BBD+15a], which we make slightly more explicit. For simplicity we only provide the definitions for a simple gadget taking as input a single variable x (given by n shares x_i) and outputting a single variable y (given by n shares y_i). We provide the generalization to multiple inputs and outputs in the full version of this paper [CGPZ16]. Given a vector (x_i)_{1 ≤ i ≤ n}, we denote by x|I := (x_i)_{i∈I} the sub-vector of shares x_i with i ∈ I.
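As a concrete reading of the n-sharing definition, a minimal Python sketch (the helper names `share` and `unshare` are ours):

```python
import secrets
from functools import reduce

def share(x, n, k=8):
    """n-sharing of x in F_{2^k}: n-1 uniform shares, last share fixes the xor."""
    s = [secrets.randbits(k) for _ in range(n - 1)]
    return s + [x ^ reduce(lambda u, v: u ^ v, s, 0)]

def unshare(s):
    """Recombine an n-sharing: x = x_1 ^ ... ^ x_n."""
    return reduce(lambda u, v: u ^ v, s, 0)

assert unshare(share(0xAB, 5)) == 0xAB
```

Any n − 1 of the shares are uniformly distributed and independent of x; only the xor of all n shares recovers it.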
The t-NI security notion corresponds to the original security definition in the ISW probing model; it allows to prove the security of a full construction with n ≥ 2t + 1 shares. The stronger t-SNI notion allows to prove the security of a full construction with n ≥ t + 1 shares only [BBD+15a]. The difference is that in the stronger t-SNI notion, the size of the input shares subset I can only depend on the number of internal probes t_1, and must be independent of the number of output variables |O| that must be simulated (as long as the condition t_1 + |O| ≤ t is satisfied). Intuitively, this provides an “isolation” between the output shares and the input shares of a given gadget, and for composed constructions this enables to easily prove that a full construction is t-SNI secure, based on the t-SNI security of its components.
Algorithm 1. SecMult
Require: shares (a_i) satisfying ⊕_{i=1}^n a_i = a, shares (b_i) satisfying ⊕_{i=1}^n b_i = b
Ensure: shares (c_i) satisfying ⊕_{i=1}^n c_i = a · b
1: for i = 1 to n do
2:   c_i ← a_i · b_i
3: end for
4: for i = 1 to n do
5:   for j = i + 1 to n do
6:     r ←$ F_{2^k}                          ▷ referred to as r_{i,j}
7:     c_i ← c_i ⊕ r                         ▷ referred to as c_{i,j}
8:     r ← (a_i · b_j ⊕ r) ⊕ a_j · b_i       ▷ referred to as r_{j,i}
9:     c_j ← c_j ⊕ r                         ▷ referred to as c_{j,i}
10:  end for
11: end for
12: return (c_1, . . . , c_n)
Lemma 1 (t-SNI of SecMult). Let (a_i)_{1 ≤ i ≤ n} and (b_i)_{1 ≤ i ≤ n} be the input shares of the SecMult operation, and let (c_i)_{1 ≤ i ≤ n} be the output shares. For any set of t_1 intermediate variables and any subset O of output shares with |O| ≤ t_2, such that t_1 + t_2 < n, there exist two subsets I and J of indices with |I| ≤ t_1 and |J| ≤ t_1, such that those t_1 intermediate variables as well as the output shares c|O can be perfectly simulated from a|I and b|J.
To obtain security against t probes with n ≥ t + 1 shares instead of n ≥ 2t + 1, the previous SecMult algorithm is usually not sufficient; one must also use a mask refreshing algorithm. The following RefreshMask operation is used in [BBD+15a] to get the t-SNI security of a full construction.
The following lemma, proven in [BBD+15a], shows the t-SNI security of RefreshMask. In the full version of this paper [CGPZ16] we also provide a modular proof, using the same approach as in Lemma 1; namely, the RefreshMask algorithm can be viewed as a SecMult with multiplication by 1, with shares (1, 0, . . . , 0); therefore the same proof technique applies.
Algorithm 2. RefreshMask
Input: a_1, . . . , a_n
Output: c_1, . . . , c_n such that ⊕_{i=1}^n c_i = ⊕_{i=1}^n a_i
1: for i = 1 to n do c_i ← a_i
2: for i = 1 to n do
3:   for j = i + 1 to n do
4:     r ←$ {0, 1}^k
5:     c_i ← c_i ⊕ r
6:     c_j ← c_j ⊕ r
7:   end for
8: end for
9: return c_1, . . . , c_n

Lemma 2 (t-SNI of RefreshMask). Let (a_i)_{1 ≤ i ≤ n} be the input shares of the RefreshMask operation, and let (c_i)_{1 ≤ i ≤ n} be the output shares. For any set of t_1 intermediate variables and any subset O of output shares with |O| ≤ t_2, such that t_1 + t_2 < n, there exists a subset I of indices with |I| ≤ t_1, such that the t_1 intermediate variables as well as the output shares c|O can be perfectly simulated from a|I.
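A straightforward Python sketch of Algorithm 2 (naming is ours), checking the only functional property it must preserve, namely the xor of the shares:

```python
import secrets
from functools import reduce

def refresh_mask(a, k=8):
    """RefreshMask (Algorithm 2 sketch): re-randomize an n-sharing with fresh
    pairwise masks; each mask r is xored into both c_i and c_j, so the xor
    of all shares is unchanged."""
    n = len(a)
    c = list(a)
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(k)
            c[i] ^= r
            c[j] ^= r
    return c

xor = lambda s: reduce(lambda u, v: u ^ v, s, 0)
a = [0x12, 0x34, 0x56, 0x78]
c = refresh_mask(a)
assert xor(c) == xor(a)
```

The output sharing encodes the same secret but with freshly randomized shares, which is what the composition proofs rely on.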
Algorithm 3. SecExp254
Input: shares x_1, . . . , x_n satisfying x = ⊕_{i=1}^n x_i
Output: shares y_1, . . . , y_n such that ⊕_{i=1}^n y_i = x^254
1: for i = 1 to n do z_i ← x_i^2                    ▷ ⊕_i z_i = x^2
2: (z_i)_{1 ≤ i ≤ n} ← RefreshMask((z_i)_{1 ≤ i ≤ n})
3: (y_i)_{1 ≤ i ≤ n} ← SecMult((z_i)_{1 ≤ i ≤ n}, (x_i)_{1 ≤ i ≤ n})    ▷ ⊕_i y_i = x^3
4: for i = 1 to n do w_i ← y_i^4                    ▷ ⊕_i w_i = x^12
5: (w_i)_{1 ≤ i ≤ n} ← RefreshMask((w_i)_{1 ≤ i ≤ n})
6: (y_i)_{1 ≤ i ≤ n} ← SecMult((y_i)_{1 ≤ i ≤ n}, (w_i)_{1 ≤ i ≤ n})    ▷ ⊕_i y_i = x^15
7: for i = 1 to n do y_i ← y_i^16                   ▷ ⊕_i y_i = x^240
8: (y_i)_{1 ≤ i ≤ n} ← SecMult((y_i)_{1 ≤ i ≤ n}, (w_i)_{1 ≤ i ≤ n})    ▷ ⊕_i y_i = x^252
9: (y_i)_{1 ≤ i ≤ n} ← SecMult((y_i)_{1 ≤ i ≤ n}, (z_i)_{1 ≤ i ≤ n})    ▷ ⊕_i y_i = x^254
10: return y_1, . . . , y_n
Using the two previous lemmas, one can prove the t-SNI security of SecExp254; we refer to [BBD+15a] for the proof.
Lemma 3 (t-SNI of x^254). Let (x_i)_{1 ≤ i ≤ n} be the input shares of SecExp254, and let (y_i)_{1 ≤ i ≤ n} be the output shares. For any set of t_1 intermediate variables and any subset O of output shares with |O| ≤ t_2, such that t_1 + t_2 < n, there exists a subset I of indices with |I| ≤ t_1, such that those t_1 intermediate variables as well as the output shares y|O can be perfectly simulated from x|I.
d ← c · a
e ← c · b
The SecMult algorithm will compute the cross-products c_i · a_j and c_i · b_j for all 1 ≤ i, j ≤ n. Now assume that half of the shares of a and b are the same, that is a_j = b_j for all 1 ≤ j ≤ n/2. In that case the products c_i · a_j for 1 ≤ j ≤ n/2 have to be computed only once, and therefore when processing e ← c · b, we only have to compute n^2/2 multiplications instead of n^2; see Fig. 2 for an illustration. For an arithmetic circuit with 4 multiplications as above, this saves the equivalent of 0.5 multiplication.
Algorithm 4. CommonShares
Require: shares (a_i) satisfying ⊕_{i=1}^n a_i = a, shares (b_i) satisfying ⊕_{i=1}^n b_i = b
Ensure: shares (a'_i) and (b'_i) satisfying ⊕_{i=1}^n a'_i = a and ⊕_{i=1}^n b'_i = b, with a'_i = b'_i for all 1 ≤ i ≤ n/2
1: for i = 1 to n/2 do
2:   r_i ←$ F_{2^k}
3:   a'_i ← r_i,  a'_{n/2+i} ← (a_{n/2+i} ⊕ r_i) ⊕ a_i      ▷ a'_i ⊕ a'_{n/2+i} = a_i ⊕ a_{n/2+i}
4:   b'_i ← r_i,  b'_{n/2+i} ← (b_{n/2+i} ⊕ r_i) ⊕ b_i      ▷ b'_i ⊕ b'_{n/2+i} = b_i ⊕ b_{n/2+i}
5: end for
6: return (a'_i)_{1 ≤ i ≤ n} and (b'_i)_{1 ≤ i ≤ n}
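A Python sketch of CommonShares (function name ours; n assumed even), checking both invariants, namely that the secrets are unchanged and that the first n/2 output shares coincide:

```python
import secrets
from functools import reduce

def common_shares(a, b, k=8):
    """CommonShares (Algorithm 4 sketch): fresh n-sharings of the same
    secrets whose first n/2 shares coincide."""
    n = len(a)
    h = n // 2
    a2, b2 = list(a), list(b)
    for i in range(h):
        r = secrets.randbits(k)
        # a'_i ^ a'_{h+i} = a_i ^ a_{h+i}, so the xor of all shares is kept
        a2[i], a2[h + i] = r, (a[h + i] ^ r) ^ a[i]
        b2[i], b2[h + i] = r, (b[h + i] ^ r) ^ b[i]
    return a2, b2

xor = lambda s: reduce(lambda u, v: u ^ v, s, 0)
a = [secrets.randbits(8) for _ in range(6)]
b = [secrets.randbits(8) for _ in range(6)]
a2, b2 = common_shares(a, b)
assert xor(a2) == xor(a) and xor(b2) == xor(b)   # same secrets
assert a2[:3] == b2[:3]                          # first n/2 shares coincide
```

Note that the same fresh mask r_i becomes share i of both outputs, which is exactly what lets a subsequent pair of SecMult calls reuse half of their cross-products.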
It is easy to see that we still get as output an n-sharing of the same variables a and b, since for each 1 ≤ i ≤ n/2 we have a'_i ⊕ a'_{n/2+i} = a_i ⊕ a_{n/2+i}, and similarly for b. As explained previously, we cannot have more than n/2 shares in common between a and b, since otherwise there would be a straightforward attack with fewer than n probes: namely if a_i = b_i for all 1 ≤ i ≤ k, then we can probe the 2(n − k) remaining shares a_i and b_i for k + 1 ≤ i ≤ n; if k > n/2 this gives strictly fewer than n shares, whose xor gives the secret variable a ⊕ b. Hence having half of the shares in common is optimal.
The following lemma shows the security of the CommonShares algorithm; as will be shown later, for this algorithm we only need the weaker t-NI security property (instead of t-SNI).
Lemma 4 (t-NI of CommonShares). Let (a_i)_{1 ≤ i ≤ n} and (b_i)_{1 ≤ i ≤ n} be the input shares of the algorithm CommonShares, and let (a'_i)_{1 ≤ i ≤ n} and (b'_i)_{1 ≤ i ≤ n} be the output shares. For any set of t_1 intermediate variables and any subsets of indices I, J ⊂ [1, n], there exists a subset S' ⊂ [1, n] with |S'| ≤ |I| + |J| + t_1, such that those t_1 variables as well as the output shares a'|I and b'|J can be perfectly simulated from a|S' and b|S'.
Proof. The proof intuition is as follows. If for a given i with 1 ≤ i ≤ n/2 the adversary requests only one of the variables r_i, a_{n/2+i} ⊕ r_i, b_{n/2+i} ⊕ r_i, a'_{n/2+i} or b'_{n/2+i}, then this variable can be perfectly simulated without knowing any of the input shares a_i, b_i, a_{n/2+i} and b_{n/2+i}, thanks to the mask r_i. On the other hand, if two such variables (or more) are requested, then we can provide a perfect simulation from the 4 previous input shares, whose knowledge is obtained by adding the two indices i and n/2 + i to S'. Therefore we never add more than one index to S' per probe (or per output index in I or J), which implies that the size of the subset S' of input shares is upper-bounded by |I| + |J| + t_1, as required.3
3 Note that the proof would not work without the masks r_i; namely with r_i = 0 we would need to know both a_i and a_{n/2+i} to simulate a'_{n/2+i}; hence with t probes we would need at least n ≥ 2t + 1 shares, which would make CommonShares useless.
Algorithm 5. CommonMult
Input: shares satisfying c = ⊕_{i=1}^n c_i, a = ⊕_{i=1}^n a_i and b = ⊕_{i=1}^n b_i
Output: shares (d_i) such that ⊕_{i=1}^n d_i = c · a, and shares (e_i) such that ⊕_{i=1}^n e_i = c · b
1: (a'_i)_{1 ≤ i ≤ n}, (b'_i)_{1 ≤ i ≤ n} ← CommonShares((a_i)_{1 ≤ i ≤ n}, (b_i)_{1 ≤ i ≤ n})
2: (d_i)_{1 ≤ i ≤ n} ← SecMult((c_i)_{1 ≤ i ≤ n}, (a'_i)_{1 ≤ i ≤ n})
3: (e_i)_{1 ≤ i ≤ n} ← SecMult((c_i)_{1 ≤ i ≤ n}, (b'_i)_{1 ≤ i ≤ n})
4: return (d_i)_{1 ≤ i ≤ n} and (e_i)_{1 ≤ i ≤ n}
Lemma 5 (t-SNI of CommonMult). Let (a_i)_{1 ≤ i ≤ n}, (b_i)_{1 ≤ i ≤ n} and (c_i)_{1 ≤ i ≤ n} be the input shares of the CommonMult operation, and let (d_i)_{1 ≤ i ≤ n} and (e_i)_{1 ≤ i ≤ n} be the output shares. For any set of t_1 intermediate variables and any subsets O_1 and O_2 of output shares with |O_1| ≤ t_2 and |O_2| ≤ t_2, such that t_1 + t_2 < n, there exist two subsets I and J of indices such that |I| ≤ t_1 and |J| ≤ t_1, and those t_1 intermediate variables as well as the output shares d|O_1 and e|O_2 can be perfectly simulated from a|J, b|J and c|I.
We are now ready to describe the full computation of y = x^254 based on the CommonShares algorithm; the algorithm SecExp254', a variant of Algorithm 3, is described below.
Algorithm 6. SecExp254'
Input: shares x_1, . . . , x_n satisfying x = ⊕_{i=1}^n x_i
Output: shares y_1, . . . , y_n such that ⊕_{i=1}^n y_i = x^254
1: for i = 1 to n do z_i ← x_i^2                    ▷ ⊕_i z_i = x^2
2: (x_i)_{1 ≤ i ≤ n} ← RefreshMask((x_i)_{1 ≤ i ≤ n})
3: (y_i)_{1 ≤ i ≤ n} ← SecMult((z_i)_{1 ≤ i ≤ n}, (x_i)_{1 ≤ i ≤ n})    ▷ ⊕_i y_i = x^3
4: for i = 1 to n do w_i ← y_i^4                    ▷ ⊕_i w_i = x^12
5: (w_i)_{1 ≤ i ≤ n} ← RefreshMask((w_i)_{1 ≤ i ≤ n})
6: (z_i)_{1 ≤ i ≤ n}, (y_i)_{1 ≤ i ≤ n} ← CommonMult((w_i)_{1 ≤ i ≤ n}, (z_i)_{1 ≤ i ≤ n}, (y_i)_{1 ≤ i ≤ n})    ▷ ⊕_i z_i = x^14, ⊕_i y_i = x^15
7: for i = 1 to n do y_i ← y_i^16                   ▷ ⊕_i y_i = x^240
8: (y_i)_{1 ≤ i ≤ n} ← SecMult((y_i)_{1 ≤ i ≤ n}, (z_i)_{1 ≤ i ≤ n})    ▷ ⊕_i y_i = x^254
9: return y_1, . . . , y_n
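Putting the gadgets together, here is a hedged end-to-end Python sketch of SecExp254' over the AES field (all function names are ours; `secrets` stands in for a proper RNG, and this purely functional model checks correctness only, not the physical countermeasure):

```python
import secrets
from functools import reduce

def gf_mul(a, b):                       # multiplication in the AES field F_2^8
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

sq = lambda v: gf_mul(v, v)
xor_all = lambda s: reduce(lambda u, v: u ^ v, s, 0)
rand_byte = lambda: secrets.randbits(8)

def share(x, n):
    s = [rand_byte() for _ in range(n - 1)]
    return s + [x ^ xor_all(s)]

def sec_mult(a, b):                     # Algorithm 1 over F_2^8
    n = len(a)
    c = [gf_mul(a[i], b[i]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = rand_byte()
            c[i] ^= r
            c[j] ^= gf_mul(a[i], b[j]) ^ r ^ gf_mul(a[j], b[i])
    return c

def refresh_mask(a):                    # Algorithm 2
    n = len(a)
    c = list(a)
    for i in range(n):
        for j in range(i + 1, n):
            r = rand_byte()
            c[i] ^= r
            c[j] ^= r
    return c

def common_mult(c, a, b):               # Algorithm 5, with Algorithm 4 inlined
    n = len(a)
    h = n // 2                          # n assumed even
    a2, b2 = list(a), list(b)
    for i in range(h):                  # CommonShares
        r = rand_byte()
        a2[i], a2[h + i] = r, (a[h + i] ^ r) ^ a[i]
        b2[i], b2[h + i] = r, (b[h + i] ^ r) ^ b[i]
    return sec_mult(c, a2), sec_mult(c, b2)

def sec_exp254(x):                      # Algorithm 6 (SecExp254')
    z = [sq(xi) for xi in x]            # z = x^2 (share-wise: squaring is F_2-linear)
    x = refresh_mask(x)
    y = sec_mult(z, x)                  # y = x^3
    w = [sq(sq(yi)) for yi in y]        # w = x^12
    w = refresh_mask(w)
    z, y = common_mult(w, z, y)         # z = x^14, y = x^15
    y = [sq(sq(sq(sq(yi)))) for yi in y]  # y = x^240
    return sec_mult(y, z)               # y = x^240 * x^14 = x^254
```

In this model the second SecMult inside `common_mult` would recompute the products for the first n/2 shares; an optimized implementation caches them, which is exactly the n^2/2 saving analyzed above.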
The following lemma proves the t-SNI security of our new algorithm; therefore our new algorithm achieves exactly the same security level as Algorithm 3. That is, it can be used in the computation of a full block-cipher, with n ≥ t + 1 shares against t probes. We provide the proof in the full version of this paper [CGPZ16].
Lemma 6 (t-SNI of x^254). Let (x_i)_{1 ≤ i ≤ n} be the input shares of the x^254 operation, and let (y_i)_{1 ≤ i ≤ n} be the output shares. For any set of t_1 intermediate variables and any subset O of output shares with |O| ≤ t_2, such that t_1 + t_2 < n, there exists a subset I of indices with |I| ≤ t_1, such that those t_1 intermediate variables as well as the output shares y|O can be perfectly simulated from x|I.
Table 2. Complexity of CommonMult and SecExp254’; for simplicity we omit the O(n)
terms.
In the previous section, we have shown that by using a different arithmetic circuit for x^254, two multiplications in F_{2^8} could be processed in parallel, moreover with a common operand, and then by using half common shares we could save the equivalent of 1/2 multiplication out of 4 in the evaluation of an AES SBox.
In the full version of this paper [CGPZ16], we consider the case of parallel multiplications that do not necessarily share an operand. Previously we have focused on a single evaluation of an AES SBox, but in AES the 16 SBoxes can actually be processed in parallel, and therefore each of the 4 multiplications in x^254 can be processed in parallel. As opposed to the previous case those multiplications do not share any operand, but we show that by using a generalization of the CommonShares algorithm between m operands instead of 2, for every multiplication one can still save the equivalent of roughly 1/4 multiplication.
4 The algebraic degree of a function h is the integer value max_{a_i ≠ 0} HW(i), where the a_i's are the coefficients of the polynomial representation of h and where HW(i) denotes the Hamming weight of i.
which holds for any s_{ij} ∈ F_{2^k}. From the above equation, any function h of algebraic degree 2 can be securely processed with n-th order security.
In the full version of this paper [CGPZ16], we recall the algorithm from [CPRR15] for the secure evaluation of the quadratic function h(x), and its application to AES. We then show how to use our common shares technique for m parallel evaluations of h(x).
7 Implementation
We have implemented our algorithms for the AES SBox in practice. More precisely, we have implemented the n-shared evaluation of x^254 in four different ways:
• RP10: using the Rivain-Prouff algorithm, as described in Algorithm 3;
• CM: using our common shares technique, as described in Algorithm 6;
• GPS14: using quadratic functions, as described in the full version of this paper [CGPZ16];
• GPS14CS: using quadratic functions and common shares, as explained in the full version of this paper [CGPZ16].
Table 3. Performance comparison of the RP10, CM, GPS14 and GPS14CS algorithms, on the ATmega and ARM platforms.

                  8 shares                          16 shares
         RP10    CM      GPS14   GPS14CS    RP10    CM      GPS14   GPS14CS
ATmega   20360   18244   11076   12447      70966   57644   39554   40086
ARM      20333   18156   13796   13156      77264   65556   54133   50560
For portability, the code is written in C, except the field multiplication in F_{2^8}, which is written in assembly for ATmega1284P (8-bit AVR microcontroller) and
Acknowledgments. We wish to thank Sonia Belaïd, who applied the EasyCrypt verification tool [BBD+15b] on our AES SBox algorithm with common shares, at order n = 6.
References
[BBD+15a] Barthe, G., Belaïd, S., Dupressoir, F., Fouque, P.-A., Grégoire, B.: Compositional verification of higher-order masking: application to a verifying masking compiler. Cryptology ePrint Archive, Report 2015/506 (2015). http://eprint.iacr.org/
[BBD+15b] Barthe, G., Belaïd, S., Dupressoir, F., Fouque, P.-A., Grégoire, B., Strub, P.-Y.: Verified proofs of higher-order masking. In: Oswald, E., Fischlin, M. (eds.) EUROCRYPT 2015. LNCS, vol. 9056, pp. 457–485. Springer, Heidelberg (2015)
[BGN+14] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Higher-order
threshold implementations. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT
2014, Part II. LNCS, vol. 8874, pp. 326–343. Springer, Heidelberg (2014)
[Bla79] Blakley, G.R.: Safeguarding cryptographic keys. In: National Computer Conference, vol. 48, pp. 313–317. AFIPS Press, New York (1979)
[CGP+12] Carlet, C., Goubin, L., Prouff, E., Quisquater, M., Rivain, M.: Higher-
order masking schemes for S-Boxes. In: Canteaut, A. (ed.) FSE 2012.
LNCS, vol. 7549, pp. 366–384. Springer, Heidelberg (2012)
[CGPZ16] Coron, J.-S., Greuet, A., Prouff, E., Zeitoun, R.: Faster evaluation of SBoxes via common shares. Cryptology ePrint Archive, Report 2016/572 (2016). http://eprint.iacr.org/. Full version of this paper
[CJRR99] Chari, S., Jutla, C.S., Rao, J.R., Rohatgi, P.: Towards sound approaches
to counteract power-analysis attacks. In: Wiener, M. (ed.) CRYPTO 1999.
LNCS, vol. 1666, pp. 398–412. Springer, Heidelberg (1999)
[Cor14] Coron, J.-S.: Higher order masking of look-up tables. In: Nguyen, P.Q.,
Oswald, E. (eds.) EUROCRYPT 2014. LNCS, vol. 8441, pp. 441–458.
Springer, Heidelberg (2014)
[CPRR13] Coron, J.-S., Prouff, E., Rivain, M., Roche, T.: Higher-order side channel
security and mask refreshing. In: Moriai, S. (ed.) FSE 2013. LNCS, vol.
8424, pp. 410–424. Springer, Heidelberg (2014)
[CPRR15] Carlet, C., Prouff, E., Rivain, M., Roche, T.: Algebraic decomposition for
probing security. In: Gennaro, R., Robshaw, M. (eds.) CRYPTO 2015,
Part I. LNCS, vol. 9215, pp. 742–763. Springer, Heidelberg (2015)
[CRV14] Coron, J.-S., Roy, A., Vivek, S.: Fast evaluation of polynomials over
binary finite fields and application to side-channel countermeasures. In:
Batina, L., Robshaw, M. (eds.) CHES 2014. LNCS, vol. 8731, pp. 170–187.
Springer, Heidelberg (2014)
[DDF14] Duc, A., Dziembowski, S., Faust, S.: Unifying leakage models: from prob-
ing attacks to noisy leakage. In: Nguyen, P.Q., Oswald, E. (eds.) EURO-
CRYPT 2014. LNCS, vol. 8441, pp. 423–440. Springer, Heidelberg (2014)
[GHS12] Gentry, C., Halevi, S., Smart, N.P.: Homomorphic evaluation of the AES
circuit. In: Safavi-Naini, R., Canetti, R. (eds.) CRYPTO 2012. LNCS,
vol. 7417, pp. 850–867. Springer, Heidelberg (2012)
[GPS14] Grosso, V., Prouff, E., Standaert, F.-X.: Efficient masked S-Boxes processing – a step forward –. In: Pointcheval, D., Vergnaud, D. (eds.) AFRICACRYPT 2014. LNCS, vol. 8469, pp. 251–266. Springer, Heidelberg (2014)
[ISW03] Ishai, Y., Sahai, A., Wagner, D.: Private circuits: securing hardware
against probing attacks. In: Boneh, D. (ed.) CRYPTO 2003. LNCS, vol.
2729, pp. 463–481. Springer, Heidelberg (2003)
[KJJ99] Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M.
(ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg
(1999)
[MS06] Mangard, S., Schramm, K.: Pinpointing the side-channel leakage of
masked AES hardware implementations. In: Goubin, L., Matsui, M. (eds.)
CHES 2006. LNCS, vol. 4249, pp. 76–90. Springer, Heidelberg (2006)
[NRS11] Nikova, S., Rijmen, V., Schläffer, M.: Secure hardware implementation
of nonlinear functions in the presence of glitches. J. Cryptology 24(2),
292–321 (2011)
[PR11] Prouff, E., Roche, T.: Higher-order glitches free implementation of the
AES using secure multi-party computation protocols. In: Preneel, B.,
Takagi, T. (eds.) CHES 2011. LNCS, vol. 6917, pp. 63–78. Springer, Hei-
delberg (2011)
[PR13] Prouff, E., Rivain, M.: Masking against side-channel attacks: a formal
security proof. In: Johansson, T., Nguyen, P.Q. (eds.) EUROCRYPT
2013. LNCS, vol. 7881, pp. 142–159. Springer, Heidelberg (2013)
[RBN+15] Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., Verbauwhede, I.: Consolidating masking schemes. In: Gennaro, R., Robshaw, M. (eds.) CRYPTO 2015, Part I. LNCS, vol. 9215, pp. 764–783. Springer, Heidelberg (2015)
[RP10] Rivain, M., Prouff, E.: Provably secure higher-order masking of AES. In:
Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp.
413–427. Springer, Heidelberg (2010)
[RV13] Roy, A., Vivek, S.: Analysis and improvement of the generic higher-order
masking scheme of FSE 2012. In: Bertoni, G., Coron, J.-S. (eds.) CHES
2013. LNCS, vol. 8086, pp. 417–434. Springer, Heidelberg (2013)
[Sha79] Shamir, A.: How to share a secret. Commun. ACM 22(11), 612–613 (1979)
Hardware Implementations
FourQ on FPGA: New Hardware Speed Records
for Elliptic Curve Cryptography over Large
Prime Characteristic Fields
1 Introduction
With the growing deployment of elliptic curve cryptography (ECC) [15,24] in
place of traditional cryptosystems such as RSA, compact, high-performance
ECC-based implementations have become crucial for embedded systems and
hardware applications. In this setting, field-programmable gate arrays (FPGAs)
A. Miele—This work was performed while the second author was a post-doctoral
researcher at EPFL, Lausanne, Switzerland.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 517–537, 2016.
DOI: 10.1007/978-3-662-53140-2 25
518 K. Järvinen et al.
multiplications per second versus 32304 scalar multiplications per second). Even
when comparing the case without endomorphisms, our FourQ-based FPGA
implementation is faster: the laddered variant is about 1.3 times faster than
Curve25519 in terms of computing time. All these results were obtained on the
same Xilinx Zynq-7020 FPGA model used by [29].
The paper is organized as follows. In Sect. 2, the relevant mathematical back-
ground and general architectural details of the proposed design are provided. In
Sect. 3, the field arithmetic unit (called “the core”) is presented. In Sect. 4, we
describe the scalar unit consisting of the decomposition and recoding units. In
Sect. 5, three architecture variants are detailed: single-core, multi-core and the
Montgomery ladder implementation. We present the performance analysis and
carry out a detailed comparison with relevant work in Sect. 6. Finally, we con-
clude the paper and give directions for future work in Sect. 7.
2 Preliminaries: FourQ
FourQ is a high-performance elliptic curve recently proposed by Costello and
Longa [6]. Given the quadratic extension field Fp2 = Fp(i) with p = 2^127 − 1 and
i^2 = −1, FourQ is defined as the complete twisted Edwards [4] curve given by

E/Fp2 : −x^2 + y^2 = 1 + dx^2y^2,    (1)

where d := 125317048443780598345676279555970305165·i + 4205857648805777768770.
The set of Fp2 -rational points lying on Eq. (1), which includes the neutral
point OE = (0, 1), forms an additive abelian group. The cardinality of this group
is given by #E(Fp2 ) = 392 · ξ, where ξ is a 246-bit prime, and thus, the group
E(Fp2 )[ξ] can be used in cryptographic systems.
The fastest set of explicit formulas for the addition law on E are due to Hisil
et al. [12] using the so-called extended twisted Edwards coordinates: any tuple
(X : Y : Z : T ) with Z ≠ 0 and T = XY /Z represents a projective point
corresponding to an affine point (x, y) = (X/Z, Y /Z). Since d is non-square over
Fp2 , this set of formulas is also complete on E, i.e., they work without exceptions
for any point in E(Fp2 ).
Since FourQ is a degree-2 Q-curve with complex multiplication [10,30], it
comes equipped with two efficiently computable endomorphisms, namely, ψ and
φ. In [6], it is shown that these two endomorphisms enable a four-dimensional
decomposition m → (a1 , a2 , a3 , a4 ) ∈ Z^4 for any integer m ∈ [0, 2^256 − 1] such
that 0 ≤ ai < 2^64 for i = 1, 2, 3, 4 (which is optimal in the context of multi-
scalar multiplication) and such that a1 is odd (which facilitates efficient, side-
channel protected scalar multiplications); see [6, Proposition 5] for details about
FourQ’s decomposition procedure. This in turn induces a four-dimensional scalar
multiplication with the form

[m]P = [a1]P + [a2]φ(P) + [a3]ψ(P) + [a4]ψ(φ(P)).
Our core design follows the same methodology described above and computes
FourQ’s scalar multiplication as in [6, Algorithm 2]. However, there is a slight
variation: since the negative of a precomputed point (X + Y, Y − X, 2Z, 2dT ) is
given by (Y − X, X + Y, 2Z, −2dT ), we precompute the values −2dT and store
each precomputed point using the tuple (X + Y, Y − X, 2Z, 2dT, −2dT ). This
representation is referred to as R5 . During scalar multiplication, we simply read
coordinates in the right order and assemble either (X + Y, Y − X, 2Z, 2dT ) (for
positive digit-columns) or (Y −X, X +Y, 2Z, −2dT ) (for negative digit-columns).
This approach completely eliminates the need for point negations during scalar
multiplication at the cost of storing only 8 extra elements in Fp2 . The slightly
modified scalar multiplication algorithm is presented in Algorithm 1.
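The four-dimensional multi-scalar structure that Algorithm 1 exploits can be illustrated with a Straus-style joint loop that shares one doubling chain across the four scalars. The sketch below models the group additively as integers modulo a prime and uses fixed multiples of P as hypothetical stand-ins for φ(P), ψ(P) and ψ(φ(P)); it ignores FourQ's signed-digit recoding and is only meant to show why 64 doublings suffice once a 256-bit scalar is decomposed into four 64-bit scalars.

```python
# Straus/Shamir-style 4-dimensional multi-scalar multiplication, sketched
# over the additive group of integers modulo a prime N (a stand-in for
# E(Fp2)[xi]); `points` holds P and three hypothetical endomorphism images.

def multi_scalar_mul(scalars, points, N, width=64):
    """Compute sum_j [a_j]P_j with a single shared doubling chain."""
    Q = 0  # neutral element of the toy group
    for i in reversed(range(width)):
        Q = (Q + Q) % N                  # one doubling per digit column
        for a, P in zip(scalars, points):
            if (a >> i) & 1:             # one bit of each 64-bit scalar
                Q = (Q + P) % N          # at most 4 additions per column
    return Q

N = (1 << 61) - 1                        # a Mersenne prime for the toy group
P = 123456789
points = [P, 3 * P % N, 7 * P % N, 21 * P % N]   # hypothetical phi/psi images
scalars = [0x1F2E3D4C5B6A7988, 0x0123456789ABCDEF,
           0x0F0E0D0C0B0A0908, 0x1122334455667788]
result = multi_scalar_mul(scalars, points, N)
```

With the real curve and endomorphisms, the same joint loop runs 64 iterations instead of the 256 that a plain double-and-add over m would need.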
In Algorithm 2, we detail the conversion of the multi-scalars to digit-columns
di . During a scalar multiplication, the three least significant bits of these digits
(values “vi ”) are used to select one out of eight points from the precomputed
table. The top bit (values “si ”) is then used to select between the coordinate
value 2dT (if the bit is 1) and −2dT (if the bit is 0), as described above for a
point using representation R5 .
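A small sketch of this selection logic follows, with point coordinates modeled as opaque values. The tuple layout mirrors the R5 description above; this is illustrative code, not the actual table hardware.

```python
# Read a precomputed point stored in representation
# R5 = (X+Y, Y-X, 2Z, 2dT, -2dT) and assemble either the point or its
# negative purely by reordering coordinates, as selected by the sign bit.

def decode_digit(d):
    """Split a digit 0 <= d < 16 into sign bit s (top bit) and index v (3 LSBs)."""
    return (d >> 3) & 1, d & 0b111

def select_point(table, d):
    s, v = decode_digit(d)
    x_plus_y, y_minus_x, z2, dt2, neg_dt2 = table[v]
    if s == 1:   # positive digit column: (X+Y, Y-X, 2Z, 2dT)
        return (x_plus_y, y_minus_x, z2, dt2)
    # negative digit column: (Y-X, X+Y, 2Z, -2dT), i.e. the point's negative
    return (y_minus_x, x_plus_y, z2, neg_dt2)
```

No field arithmetic is involved: negation costs only a different read order, which is exactly why the 8 extra stored elements eliminate point negations from the main loop.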
The structure of Algorithm 1 leads to a natural division of operations in
our ECC processor. The processor consists of two main building blocks: (a) a
scalar unit and (b) a field arithmetic unit. The former carries out the scalar
decomposition and recoding (steps 3 and 4 in Algorithm 1), and the latter—
referred to simply as "the core"—is responsible for computing the endomorphisms,
precomputation, and the main loop through a fixed series of operations over Fp2 .
We describe these units in detail in Sects. 3 and 4.
3.1 Datapath
The datapath computes operations in Fp and thus operates on 127-bit operands.
The datapath supports basic operations that allow the implementation of field
arithmetic.
[Figure: Architecture of the core. The interface logic exchanges 64-bit commands/responses and data (di/do) with the environment; a 127-bit dual-port RAM holds the operands; the control logic drives the datapath, which is built around a pipelined 64 × 64-bit multiplier and a 127-bit adder/subtractor.]
The control logic controls the datapath and memory and, as a consequence, imple-
ments all the hierarchical levels required by scalar multiplications on FourQ. The
control logic consists of a program ROM that includes instructions for the data-
path and memory addresses, a small finite state machine (FSM) that controls the
read addresses of the program ROM, and a recoder for recoding the instructions
in the program ROM to control signals for the datapath and memory.
Field operations consist of multiple instructions that are issued by the control
logic, as discussed in Sect. 3.1. Because of the pipelined multiplier, multiplica-
tions in Fp take several clock cycles (20 clock cycles including memory reads and
writes). Fortunately, pipelining allows computing independent multiplications
simultaneously and thus enables efficient operations over Fp2 .
Let a = (a0 , a1 ), b = (b0 , b1 ) ∈ Fp2 . Then, results (c0 , c1 ) of operations in Fp2
are given by
a + b = (a0 + b0 , a1 + b1 )
a − b = (a0 − b0 , a1 − b1 )
a × b = (a0 · b0 − a1 · b1 , (a0 + a1 ) · (b0 + b1 ) − a0 · b0 − a1 · b1 )
a^2 = ((a0 + a1 ) · (a0 − a1 ), 2a0 · a1 )
a^(−1) = (a0 · (a0^2 + a1^2)^(−1), −a1 · (a0^2 + a1^2)^(−1))
where operations on the right are in Fp . Operations in Fp2 are directly computed
using the equations above: multiplication requires three field multiplications,
two field additions and three field subtractions, whereas squaring requires only
two field multiplications, two field additions and one field subtraction. Field
inversions are computed via Fermat's little theorem (a^(−1) = a^(p−2) = a^(2^127 − 3))
using 138 multiplications in Fp .
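These formulas translate directly into software. The sketch below mirrors the stated operation counts (three Fp multiplications per product, two per square), though it uses Python's modular exponentiation for the Fp inversion instead of the 138-multiplication chain used by the core.

```python
# Arithmetic in Fp2 = Fp(i) with p = 2^127 - 1; an element a0 + a1*i is
# represented as the pair (a0, a1).

P127 = (1 << 127) - 1

def f2add(a, b):
    return ((a[0] + b[0]) % P127, (a[1] + b[1]) % P127)

def f2sub(a, b):
    return ((a[0] - b[0]) % P127, (a[1] - b[1]) % P127)

def f2mul(a, b):
    # 3 multiplications, 2 additions, 3 subtractions in Fp
    t0 = a[0] * b[0] % P127
    t1 = a[1] * b[1] % P127
    t2 = (a[0] + a[1]) * (b[0] + b[1]) % P127
    return ((t0 - t1) % P127, (t2 - t0 - t1) % P127)

def f2sqr(a):
    # 2 multiplications, 2 additions, 1 subtraction in Fp
    return ((a[0] + a[1]) * (a[0] - a[1]) % P127, 2 * a[0] * a[1] % P127)

def f2inv(a):
    # (a0 - a1*i) / (a0^2 + a1^2); the Fp inversion is n^(p-2) by Fermat
    n = (a[0] * a[0] + a[1] * a[1]) % P127
    n_inv = pow(n, P127 - 2, P127)
    return (a[0] * n_inv % P127, -a[1] * n_inv % P127)
```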
An example of how the control logic implements c = a × b with a = (a0 , a1 )
and b = (b0 , b1 ) ∈ Fp2 using the datapath is shown in Fig. 3. The multiplication
begins by computing t1 = a0 ·b0 in Fp followed by t2 = a1 ·b1 . The additions t3 =
a0 +a1 and t4 = b0 +b1 are interleaved with these multiplications. As soon as they
are ready and the multiplier path becomes idle, the last multiplication t3 ← t3 ·t4
is computed. The multiplication a × b ends with three successive subtractions
c0 = t1 − t2 and c1 = t3 − t1 − t2 . The operation sequence was designed to allow
the interleaving of successive multiplications over Fp2 . A preceding multiplication
f = d × e and subsequent multiplications g × h and i × j are depicted in gray
color in Fig. 3. A multiplication finishes in 45 clock cycles but allows the next
multiplication to start after only 21 clock cycles. For every other multiplication
one must use t5 in place of t3 in order to avoid writing to t3 before it is read.
This operation sequence also allows interleaving further additions/subtractions
in Fp with the interleaved multiplications. E.g., if we read operands from the
memory in line 14, then we can compute an addition followed by a reduction
in lines 16 and 17 and write the result back in line 18. There is also a variant
addresses). Each line is 25 bits wide: 3 bits for the multiplier path, 5 bits for the
adder/subtractor path, one bit for write enable and two 8-bit memory addresses
for the RAM. Execution of each instruction line takes one clock cycle. We tested
implementing the program ROM using both distributed memory and BlockRAM
blocks. The latter resulted in slightly better timing results, arguably because of
an easier place-and-route process. Accordingly, we chose to implement the program
ROM using 6 BlockRAM blocks.
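As an illustration of the 25-bit instruction layout (3 + 5 + 1 + 8 + 8 bits), the helpers below pack and unpack one such line; the field order chosen here is an assumption, since the text does not specify it.

```python
# Hypothetical encoding of one 25-bit program-ROM line:
# [3b mul-path | 5b add/sub-path | 1b write-enable | 8b addr A | 8b addr B]

def encode_line(mul_op, add_op, we, addr_a, addr_b):
    assert mul_op < 8 and add_op < 32 and we < 2
    assert addr_a < 256 and addr_b < 256
    return (mul_op << 22) | (add_op << 17) | (we << 16) | (addr_a << 8) | addr_b

def decode_line(instr):
    return ((instr >> 22) & 0x7, (instr >> 17) & 0x1F,
            (instr >> 16) & 0x1, (instr >> 8) & 0xFF, instr & 0xFF)
```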
There are in total seven separate routines in the program ROM. Given a base-
point P = (x, y) and following Algorithm 1, initialization (lines 1–14) assigns
X ← x, Y ← y, Z ← 1, Ta ← x and Tb ← y (i.e., it maps the affine point P to
representation R1 ; see Sect. 2.1). Precomputation (lines 15–4199) produces the
table T containing 8 points using the endomorphisms and point additions. Pre-
computed points are stored using representation R5 . Initialization of the main
loop (lines 4200–4214) initializes the point accumulator by loading a point from
the table T using the first digit of the recoded multi-scalar and by mapping it to
representation R4 . In the main loop (lines 4215–4568), point doublings Q ← [2]Q
and additions Q ← Q + T [di ] are computed using the representations R1 ← R4
and R1 ← R1 × R2 , respectively. As explained in Sect. 2.1, converting precom-
puted points from representation R5 to R2 is simply done by reading values
from memory in the right order. The main loop consists of 64 iterations and
significant effort was devoted to optimizing its latency. Affine conversion (lines
4569–7437) maps the resulting point in representation R1 to affine coordinates
by computing x = X/Z and y = Y /Z. The bulk of this computation consists
of an inversion in Fp . Point validation (lines 7438–7561) checks if the basepoint
P = (x, y) is in E(Fp2 ), i.e., it verifies that −x^2 + y^2 − 1 − dx^2y^2 = 0. Cofactor
clearing (lines 7562–8014) kills the cofactor by computing 392P . This is done
with an R2 ← R1 map (lines 7562–7643) followed by eight point doublings (lines
7644–7799) and two point additions (lines 7800–8014).
The control FSM sets the address for the program ROM depending on the
phase of the scalar multiplication. The FSM includes a counter and hardcoded
pointers to the routines in the program ROM. The value of the counter is used
as the address to the program ROM. Depending on the operation, the FSM sets
the counter to the address of the first line of the appropriate routine and, then,
lets the counter count up by one every clock cycle until it reaches the end pointer
of that routine. After that, the FSM jumps to the next routine or to the wait
state (line 0 is no-operation).
The instruction recoder recodes instructions from the program ROM to control
signals for the datapath. The memory addresses from the program ROM are fed
into an address recoding circuit, which recodes the address if it is needed to access
a precomputed point (otherwise, it passes the address unchanged). The address
from the program ROM simply specifies the coordinate of the precomputed point
and the recoding unit replaces this placeholder address with a real RAM memory
address by recoding it using the value and sign of the current digit-column di of
the scalar.
4 Scalar Unit
This unit is in charge of decomposing the input scalar m into four 64-bit multi-
scalars a1 , a2 , a3 , a4 , which are then recoded to a sequence of digit-columns
(d64 , . . . , d0 ) with 0 ≤ di < 16. These digits are used during scalar multiplica-
tion to extract the precomputed points that are to be added. In our design, this
unit is naturally split into the decompose and recode units, which are described
below.
[Fig. 4: The row multiplier of the decompose unit: an FSM drives a chain of 11 DSP blocks (each computing a 17 × 24-bit product plus an addition), which consume one 17-bit word Yi and the 24-bit words X0 , . . . , X10 and produce a 281-bit row result.]
ROM, and the 256-bit input scalar m, which is stored in a register. The core of
the decompose unit is a truncated multiplier: on input integers 0 ≤ X < 2^256 and
0 ≤ Y < 2^195, it calculates the integer ZH = ⌊X · Y /2^256⌋ mod 2^64. This oper-
ation is needed to compute each of the four values α̂1 , α̂2 , α̂3 and α̂4 from
[6, Proposition 5] modulo 2^64. The truncated multiplier computes ZH as
described in Algorithm 3. In addition, this multiplier can be adapted to compu-
tations with the form ZL = XY mod 2^64 by simply reducing the two for-loop
counters in Algorithm 3 from 11 to 3 and from 10 to 2, respectively. Thus, we
reuse the truncated multiplier for the 14 multiplications modulo 2^64 that are
needed to produce the final values a1 , a2 , a3 and a4 as per [6, Proposition 5].
The main building block of the truncated multiplier is a 17 × 264-bit row
multiplier that is used to compute the product of Yj · X for some j ∈ [0, 11]
(lines 4–5 of Algorithm 3). The row multiplier is implemented using a chain of
11 DSPs as shown in Fig. 4. Note that the DSP blocks available on the Xilinx
Zynq FPGA family allow 17 × 24 unsigned integer multiplication plus addition
of the result with an additional 47-bit unsigned integer. In order to comply with
the operand size imposed by the DSP blocks, we split the input integer X into
24-bit words and the input Y into 17-bit words (the most significant words
are zero-padded). Both X and Y are then represented as X10 , X9 , . . . , X0 in
radix 2^24 and Y11 , Y10 , . . . , Y0 in radix 2^17, respectively.
The row multiplier computes the full 17 × 264-bit product after 11 clock
cycles. Its 281-bit result is then added to the 281-bit partial result right-shifted
by 17 bits (line 6 of Algorithm 3). This operation is performed by an adder-
shifter component. In our current design, the addition has been split into 3
steps to reduce the critical path. Finally, a shift register outputs the result (line
9 of Algorithm 3).
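Algorithm 3 itself is not reproduced here, but its effect can be modeled as a row-by-row accumulation over the 17-bit words of Y. This sketch keeps full precision in the accumulator, whereas the hardware retires low words through the adder-shifter and shift register; it also returns the ZL variant from the same accumulation.

```python
# Software model of the truncated multiplier:
# ZH = floor(X * Y / 2^256) mod 2^64 for 0 <= X < 2^256, 0 <= Y < 2^195.

MASK64 = (1 << 64) - 1

def trunc_mul(X, Y):
    acc = 0
    for j in range(12):                 # words Y_0..Y_11, 17 bits each (top padded)
        Yj = (Y >> (17 * j)) & 0x1FFFF
        acc += (Yj * X) << (17 * j)     # one pass of the 17 x 264-bit row multiplier
    ZH = (acc >> 256) & MASK64          # high part: discard the low 256 bits
    ZL = acc & MASK64                   # the mod-2^64 variant reuses the same rows
    return ZH, ZL
```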
[Fig. 5: Overview of the truncated multiplier: the inputs Y (195 bits, fed 17 bits at a time) and X (264 bits) enter the row multiplier, whose 281-bit result is accumulated by the adder-shifter; the 64-bit outputs ZH and ZL are read out through a shift register.]
The recode unit is very simple, as the operations it performs are just bit manip-
ulations and 64-bit additions. The unit is designed as an FSM performing 64
iterations according to Algorithm 2, where each iteration is split into 6 steps
(corresponding to 6 states of the FSM). The first 4 states implement lines 3 to
8 of Algorithm 2, whereas the last 2 states implement line 9.
5 Architectures
Fig. 6. The multi-core architecture with one scalar unit and N cores.
¹ The scalar unit outputs digits in the order d0 , d1 , . . . , d64 and the core uses them in
reversed order (see Algorithm 1).
requirements but also to a lower performance. Because BlockRAMs are not the
critical resource, we opted to keep the current memory structure.
We derived hand-optimized routines for the scalar multiplication initializa-
tion and the double-and-add step using the formulas from [25]. The accumulator
is initialized with Q = (X : Z) = (1 : 0). One double-and-add step of the
Montgomery ladder takes 228 clock cycles. Because the scalar is either 256 or
246 bits long, a scalar multiplication involves 256 or 246 double-and-add steps,
which take exactly 58368 or 56088 clock cycles, respectively. A final conversion to
extract x from (X : Z) takes 2855 clock cycles. The total cost of scalar multiplica-
tion (without cofactor clearing) is 61235 or 58967 cycles for 256-bit and 246-bit
scalars, respectively. Cofactor clearing is computed with nine double-and-add
steps followed by an extraction of x from (X : Z) and takes 4932 cycles.
The three architectures from Sect. 5 were compiled with Xilinx Vivado 2015.4
to a Xilinx Zynq-7020 XC7Z020CLG484-3 FPGA, which is an all programmable
system-on-chip for embedded systems. All the given results were obtained after
place-and-route. Table 2 presents the area requirements of the designs. Table 3
collects latencies, timings and throughputs of the different operations supported
by the designs.
The single-core design requires less than 13 % of all the resources available
in the targeted Zynq-7020 FPGA. Timing closure was successful with a clock
constraint of 190 MHz (clock period of 5.25 ns). Hence, one scalar multiplication
(without cofactor clearing) takes 156.52 μs, which means 6389 operations per
second. Using Vivado tools, we analyzed the power consumption of the single-
core with signal activity from post-synthesis functional simulations of ten scalar
multiplications. The power estimate was 0.359 W (with high confidence level),
and the energy required by one scalar multiplication was about 56.2 μJ.
The multi-core design was implemented by selecting the largest N that fitted
in the Zynq-7020 FPGA. Since the DSP blocks are the critical resource and
there are 220 of them in the targeted FPGA, one can estimate room for up to 13
cores. However, Vivado was unable to place-and-route a multi-core design with
N = 13. In practice, the largest number of admissible cores was N = 11 (85 %
DSP utilization). Even in that case timing closure was successful only with a
clock constraint of 175 MHz (clock period of 5.714 ns). This results in a small
increase in the computing time for one scalar multiplication, which then takes
169.94 μs (without cofactor clearing). Throughput of the multi-core design is
64730 operations per second, which is more than ten times larger than the single-
core’s throughput. Hence, the multi-core design offers a significant improvement
for high-demand applications in which throughput is critical.
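The reported figures can be cross-checked with a few lines of arithmetic (timings and power taken from the text above):

```python
# Sanity check of the reported single-core and multi-core figures.

single_core_time = 156.52e-6          # s per scalar multiplication at 190 MHz
single_core_ops = 1 / single_core_time            # ~6389 op/s

multi_core_time = 169.94e-6           # slightly longer due to the 175 MHz clock
multi_core_ops = 11 / multi_core_time             # 11 cores, ~64730 op/s

speedup = multi_core_ops / single_core_ops        # more than 10x throughput
energy_uJ = 0.359 * single_core_time * 1e6        # ~56.2 uJ per operation
```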
The single-core design based on the Montgomery ladder is significantly
smaller than the basic single-core design mainly because there is no scalar unit.
The area requirements reduce to only 7.3 % of resources (DSP blocks) at the
simpler arithmetic in Fp2 over a Mersenne prime; the simpler inversion alone
saves more than 10000 clock cycles. Our architecture computes scalar multipli-
cations on FourQ with a 1.35 times lower clock-cycle count than [29] but, because
of the lower clock frequency, computation time and throughput are only 1.28
times better. These results showcase FourQ's great performance even when endo-
morphisms are not used (e.g., in some applications with very strict memory
constraints).
7 Conclusions
We presented three FPGA designs for the recently proposed elliptic curve FourQ.
These architectures are able to compute one scalar multiplication in only 157 μs
or, alternatively, with a maximum throughput of up to 64730 operations per sec-
ond by applying parallel processing in a single Zynq-7020 FPGA. The designs
are the fastest FPGA implementations of elliptic curve cryptography over large
prime characteristic fields at the 128-bit security level. This extends the soft-
ware results from [6] by showing that FourQ also offers significant speedups in
hardware when compared to other elliptic curves with similar strength such as
Curve25519 or NIST P-256.
Our designs are inherently protected against SSCA and timing attacks.
Recent horizontal attacks (such as horizontal collision correlations [3]) can break
SSCA-protected implementations by exploiting leakage from partial multiplica-
tions. Our designs compute these operations with a large 64-bit word size in
a highly pipelined and parallel fashion. Nevertheless, resistance against these
attacks, and other attacks that apply to scenarios in which an attacker can
exploit traces from multiple scalar multiplications (e.g., differential power analy-
sis), requires further analysis. Future work involves the inclusion of strong coun-
termeasures against such attacks.
References
1. Azarderakhsh, R., Reyhani-Masoleh, A.: Efficient FPGA implementations of point
multiplication on binary Edwards and generalized Hessian curves using Gaussian
normal basis. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20(8), 1453–1466
(2012)
2. Azarderakhsh, R., Reyhani-Masoleh, A.: Parallel and high-speed computations of
elliptic curve cryptography using hybrid-double multipliers. IEEE Trans. Parallel
Distrib. Syst. 26(6), 1668–1677 (2015)
3. Bauer, A., Jaulmes, E., Prouff, E., Reinhard, J.R., Wild, J.: Horizontal collision
correlation attack on elliptic curves. Crypt. Commun. 7(1), 91–119 (2015)
4. Bernstein, D.J., Birkner, P., Joye, M., Lange, T., Peters, C.: Twisted Edwards
curves. In: Vaudenay, S. (ed.) AFRICACRYPT 2008. LNCS, vol. 5023, pp. 389–
405. Springer, Heidelberg (2008)
5. Bernstein, D.J.: Curve25519: new Diffie-Hellman speed records. In: Yung, M.,
Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 207–
228. Springer, Heidelberg (2006)
6. Costello, C., Longa, P.: FourQ: four-dimensional decompositions on a Q-curve over
the Mersenne prime. In: Iwata, T., et al. (eds.) ASIACRYPT 2015. LNCS, vol.
9452, pp. 214–235. Springer, Heidelberg (2015). https://eprint.iacr.org/2015/565
7. Faz-Hernández, A., Longa, P., Sánchez, A.H.: Efficient and secure algorithms for
GLV-based scalar multiplication and their implementation on GLV-GLS curves
(extended version). J. Cryptographic Eng. 5(1), 31–52 (2015)
8. Gallant, R.P., Lambert, R.J., Vanstone, S.A.: Faster point multiplication on elliptic
curves with efficient endomorphisms. In: Kilian, J. (ed.) CRYPTO 2001. LNCS,
vol. 2139, pp. 190–200. Springer, Heidelberg (2001)
9. Guillermin, N.: A high speed coprocessor for elliptic curve scalar multiplications
over Fp . In: Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp.
48–64. Springer, Heidelberg (2010)
10. Guillevic, A., Ionica, S.: Four-dimensional GLV via the Weil restriction. In: Sako,
K., Sarkar, P. (eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 79–96.
Springer, Heidelberg (2013)
11. Güneysu, T., Paar, C.: Ultra high performance ECC over NIST primes on com-
mercial FPGAs. In: Oswald, E., Rohatgi, P. (eds.) CHES 2008. LNCS, vol. 5154,
pp. 62–78. Springer, Heidelberg (2008)
12. Hisil, H., Wong, K.K.-H., Carter, G., Dawson, E.: Twisted Edwards curves revis-
ited. In: Pieprzyk, J. (ed.) ASIACRYPT 2008. LNCS, vol. 5350, pp. 326–343.
Springer, Heidelberg (2008)
13. Järvinen, K., Skyttä, J.: On parallelization of high-speed processors for elliptic
curve cryptography. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 16(9),
1162–1175 (2008)
14. Järvinen, K., Skyttä, J.: Optimized FPGA-based elliptic curve cryptography
processor for high-speed applications. Integr. VLSI J. 44(4), 270–279 (2011)
15. Koblitz, N.: Elliptic curve cryptosystems. Math. Comput. 48, 203–209 (1987)
16. Kocher, P.C.: Timing attacks on implementations of Diffie-Hellman, RSA, DSS,
and other systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp.
104–113. Springer, Heidelberg (1996)
17. Kocher, P.C., Jaffe, J., Jun, B.: Differential power analysis. In: Wiener, M. (ed.)
CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999)
18. Loi, K.C.C., Ko, S.B.: High performance scalable elliptic curve cryptosystem
processor for Koblitz curves. Microprocess. Microsyst. 37(4–5), 394–406 (2013)
19. Loi, K.C.C., Ko, S.B.: Scalable elliptic curve cryptosystem FPGA processor for
NIST prime curves. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 23(11),
2753–2756 (2015)
20. Ma, Y., Liu, Z., Pan, W., Jing, J.: A high-speed elliptic curve cryptographic proces-
sor for generic curves over GF(p). In: Lange, T., Lauter, K., Lisoněk, P. (eds.) SAC
2013. LNCS, vol. 8282, pp. 421–437. Springer, Heidelberg (2014)
21. McIvor, C.J., McLoone, M., McCanny, J.V.: An FPGA elliptic curve cryptographic
accelerator over GF (p). Proc. Irish Signals Syst. Conf. 2004, 589–594 (2004)
22. McIvor, C.J., McLoone, M., McCanny, J.V.: Hardware elliptic curve cryptographic
processor over GF (p). IEEE Trans. Circuits Syst. I Regul. Pap. 55(9), 1946–1957
(2006)
23. Mentens, N.: Secure and efficient coprocessor design for cryptographic applications
on FPGAs. Ph.D. thesis, Katholieke Universiteit Leuven, July 2007
24. Miller, V.S.: Use of elliptic curves in cryptography. In: Williams, H.C. (ed.)
CRYPTO 1985. LNCS, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
25. Montgomery, P.L.: Speeding the Pollard and elliptic curve methods of factorization.
Math. Comput. 48(177), 243–264 (1987)
26. Rebeiro, C., Roy, S.S., Mukhopadhyay, D.: Pushing the limits of high-speed
GF (2m ) elliptic curve scalar multiplication on FPGAs. In: Prouff, E., Schaumont,
P. (eds.) CHES 2012. LNCS, vol. 7428, pp. 494–511. Springer, Heidelberg (2012)
27. Roy, D.B., Mukhopadhyay, D., Izumi, M., Takahashi, J.: Tile before multiplication:
an efficient strategy to optimize DSP multiplier for accelerating prime field ECC for
NIST curves. In: Proceedings of the 51st Annual Design Automation Conference–
DAC 2014, pp. 177: 1–177: 6. ACM (2014)
28. Sasdrich, P., Güneysu, T.: Efficient elliptic-curve cryptography using Curve25519
on reconfigurable devices. In: Goehringer, D., Santambrogio, M.D., Cardoso,
J.M.P., Bertels, K. (eds.) ARC 2014. LNCS, vol. 8405, pp. 25–36. Springer, Hei-
delberg (2014)
29. Sasdrich, P., Güneysu, T.: Implementing Curve25519 for side-channel-protected
elliptic curve cryptography. ACM Trans. Reconfigurable Technol. Syst. 9(1),
(2015). Article 3
30. Smith, B.: Families of fast elliptic curves from Q-curves. In: Sako, K., Sarkar, P.
(eds.) ASIACRYPT 2013, Part I. LNCS, vol. 8269, pp. 61–78. Springer, Heidelberg
(2013)
31. Sutter, G.D., Deschamps, J.P., Imaña, J.L.: Efficient elliptic curve point multi-
plication using digit-serial binary field operations. IEEE Trans. Industr. Electron.
60(1), 217–225 (2013)
A High Throughput/Gate AES Hardware
Architecture by Compressing Encryption
and Decryption Datapaths
— Toward Efficient CBC-Mode Implementation
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 538–558, 2016.
DOI: 10.1007/978-3-662-53140-2 26
1 Introduction
Cryptographic applications have been essential for many systems with secure
communications, authentication, and digital signatures. In accordance with the
rapid increase in Internet of Things (IoT) applications, many cryptographic
algorithms are required to be implemented in resource-constrained devices and
embedded systems with a high throughput and efficiency. Since 2001, many hard-
ware implementations for AES have been proposed and evaluated for CMOS logic
technologies. Studies of AES design are important from both practical and aca-
demic perspectives since AES employs an SPN structure and the major compo-
nents (i.e., an 8-bit S-box and permutation used in ShiftRows and MixColumns)
are followed by many other security primitives.
AES encryption and decryption are commonly used in block-chaining modes
such as CBC, CMAC, and CCM (e.g., for SSL/TLS, IEEE802.11 wireless
LAN, and IEEE802.15.4 wireless sensor networks). Therefore, AES architectures
that efficiently perform both encryption and decryption in the above block-
chaining modes are highly demanded. However, many conventional architec-
tures employ pipelining techniques to enhance the throughput and efficiency
[13,15,17], although such block-wise parallelism is not available in the above
block-chaining modes. For example, the highest throughput of 53 Gbps was
achieved in the previous best encryption/decryption architecture [17], but it
only worked in the ECB mode. In addition, these previous studies assumed
offline key scheduling owing to the difficulty of on-the-fly scheduling. On-the-
fly key scheduling should be implemented in most resource-constrained devices
because an offline key scheduling implementation requires additional memory to
store expanded round keys. Thus, it is valuable to investigate an efficient AES
architecture with on-the-fly key scheduling without any pipelining technique.
In this paper, we present a new round-based AES architecture for both
encryption and decryption with on-the-fly key scheduling, which achieves the
lowest critical path delay (the least number of serially connected gates in the
critical path) with less area overhead compared to conventional architectures
with tower-field S-boxes. Our architecture employs new operation-reordering
and register-retiming techniques to unify the inversion circuits for encryption
and decryption without any selectors. In addition, these techniques make it pos-
sible to unify the affine transformation and linear mappings (i.e., the isomor-
phism and constant multiplications) to reduce the total number of logic gates.
The proposed and conventional AES encryption/decryption datapaths are syn-
thesized and evaluated with the TSMC standard-cell and NanGate open-cell
libraries. The evaluation results show that our architecture can perform both
(CBC-) encryption and decryption more efficiently. For example, the through-
put per gate of the proposed architecture in the NanGate 15-nm process is 72 %
larger than that of the best conventional architecture.
The rest of this paper is organized as follows: Sect. 2 introduces related works
on AES hardware architectures, especially those with round-based encryption
and decryption. Section 3 presents a new AES hardware architecture based on
our operation-reordering, register-retiming, and affine-transformation unification
540 R. Ueno et al.
techniques. Section 4 evaluates the proposed datapath by the logic synthesis com-
pared with conventional round-based datapaths. Section 5 discusses variations of
the proposed architecture. Finally, Sect. 6 contains our conclusion.
2 Related Works
2.1 Unified AES Datapath for Encryption and Decryption
Architectures that perform one round of encryption or decryption per clock cycle
without pipelining are the most typical for AES design and are called round-
based architectures in this paper. Round-based architectures can be implemented
more efficiently in terms of throughput per area than other architectures by
utilizing the inherent parallelism of symmetric key ciphers. For example, the
byte-serial architecture [16,18] is intended for the most compact and low-power
implementations such as in RFID but is not intended for the high throughput
and efficiency. In contrast, round-based architectures are suitable for a high
throughput per gate, which leads to a low-energy implementation [29].
To design such round-based encryption/decryption architectures in an effi-
cient manner, we consider how to unify the resource-consuming components such
as the inversion circuits in SubBytes/InvSubBytes for the encryption and decryp-
tion datapaths. There are two conventional approaches for designing such unified
datapaths. The first approach is to place two distinct datapaths for encryption
and decryption and select one of the datapaths with multiplexers as in [15].
Figure 1 shows an overview of the datapath flow in [15], where the inversion
circuit is shared by both paths, and additional multiplexers are used at the
input and output of the encryption and decryption paths. In [15], a reordered
decryption operation was introduced as shown in Fig. 2. The intermediate value
is stored in a register after InvMixColumns instead of AddRoundKey. Such reg-
ister retiming was suitable for pipelined architectures. The main drawbacks of
such approaches are the false critical path delay and the required area and delay
overheads caused by three multiplexers. The critical path of the datapath in
Fig. 1 is denoted in bold, which would never be active because it passes from
the decryption path to the encryption path. This false critical path reduces the
maximum operating frequency obtained from logic synthesis, since timing is
constrained by the false longest logic chain. The overhead caused by the multiplexers is also nonnegligible for
common standard-cell-based designs.
The second approach is to unify the circuits of the functions SubBytes,
ShiftRows, and MixColumns with their inverse functions, respectively. Figure 3
shows the datapath in [29] where encryption and decryption paths are com-
bined using the second approach, where the reordering technique is given in
Fig. 4. The order of the decryption operations is changed to be the same as
that of the encryption operations. Note that the order of (Inv)SubBytes and
(Inv)ShiftRows can be changed without any overhead, and the datapath in [29]
changes the order of SubBytes and ShiftRows in the encryption. The reordering
of AddRoundKey and InvMixColumns utilizes the linearity of InvMixColumns
as follows: MC^{-1}(M_r + K_r) = MC^{-1}(M_r) + MC^{-1}(K_r), where MC^{-1} is
A High Throughput/Gate AES Hardware Architecture 541
the function InvMixColumns, and M_r and K_r are the intermediate value after
InvShiftRows and the round key at the r-th round, respectively. Here, InvMixColumns
requires the round keys, whereas MixColumns and InvMixColumns
can be unified to reduce the area. Therefore, this type of architecture requires
an additional InvMixColumns to compute M C −1 (Kr ) for decryption. In addi-
tion, the false path and multiplexer overhead exist because each function and
its inverse function are implemented in a partially serial manner with multiplex-
ers like SubBytes and InvSubBytes in Fig. 1, where the critical path consists of
Affine, Inversion, InvAffine, and an additional multiplexer.
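The reordering in [29] rests only on the GF(2^8)-linearity of InvMixColumns stated above. A minimal sketch (Python, not the paper's hardware description) that checks MC^{-1}(M_r + K_r) = MC^{-1}(M_r) + MC^{-1}(K_r) on a single column:

```python
from functools import reduce

def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

INV_MC = [0x0E, 0x0B, 0x0D, 0x09]  # first row of the InvMixColumns matrix

def inv_mix_column(col):
    """Apply InvMixColumns to a single 4-byte column."""
    return [reduce(lambda x, y: x ^ y,
                   (gf_mul(INV_MC[(j - i) % 4], col[j]) for j in range(4)))
            for i in range(4)]

# Linearity: MC^-1 applied to (M + K) equals MC^-1(M) + MC^-1(K), which is
# why the reordering only costs a precomputed MC^-1 of the round key.
m = [0xDB, 0x13, 0x53, 0x45]   # arbitrary state column
k = [0xD4, 0xBF, 0x5D, 0x30]   # arbitrary round-key column
lhs = inv_mix_column([a ^ b for a, b in zip(m, k)])
rhs = [a ^ b for a, b in zip(inv_mix_column(m), inv_mix_column(k))]
assert lhs == rhs
```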
The architecture in [17] employs a reordering technique similar to [29]. The
major difference is the intermediate value stored in the register. The architecture
in [14] also employs the same approach that combines the encryption and decryp-
tion datapaths, but does not change the order of AddRoundKey and InvMix-
Columns to remove InvMixColumns to compute M C −1 (Kr ). As a result, an
additional selector is required to unify MixColumns and InvMixColumns.
542 R. Ueno et al.
The design of the inversion circuit used in (Inv)SubBytes has a significant impact
on the performance of AES implementations. Many inversion circuit designs have
been proposed. There are two major approaches using direct mapping and tower-
field arithmetic. Inversion circuits based on direct mapping such as table-lookup,
Binary Decision Diagram (BDD), and Positive-Polarity Reed-Muller (PPRM)
[15,19,20] are faster but larger than those based on a tower field. On the other
hand, tower-field arithmetic enables us to design more compact and more area-time-efficient
inversion circuits than direct mapping. Therefore,
we focus on inversion circuits based on tower-field arithmetic in this paper.
The performance of tower-field-based inversion circuits varies with the field
towering and Galois field (GF) representation. After the introduction of tower-field
inversion over GF(((2^2)^2)^2) based on a polynomial basis (PB) by Satoh
et al. [29], Canright reduced the gate count using a normal-basis-(NB-)based
GF(((2^2)^2)^2), which was long known as the smallest [7]. Nogami
et al. showed that a mixture of a PB and an NB was useful for a more efficient
design [23]. On the other hand, Rudra et al., Jeon et al., and Mathew et al.
designed inversion circuits using PB-based GF((2^4)^2), which have a smaller critical
path delay than those based on GF(((2^2)^2)^2) [12,17,27]. Nekado et al. showed
that a redundantly represented basis (RRB) was useful for an efficient design
[21]. Recently, Ueno et al. designed an inversion circuit based on the combination
of an NB, an RRB, and a polynomial ring representation (PRR), which is known
as the most area-time efficient inversion [31]. In addition, a logic minimization
technique was applied to Canright’s S-box, which resulted in a more compact
S-box [6].
To embed such a tower-field-based inversion circuit in AES hardware, an isomorphic
mapping between the AES field and the tower field is required because
the inversion and MixColumns are performed over the AES field (i.e., PB-based
GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1). Typically, the input
into the inversion circuit (in the AES field) is initially mapped to the tower field
by the isomorphic mapping. After the inversion operation over the tower field, an
inverse isomorphic mapping (and affine transformation) are applied [29]. On the
other hand, some architectures perform all of the AES subfunctions (i.e., Sub-
Bytes as well as ShiftRows, MixColumns, and AddRoundKey) over the tower
field, where isomorphic mapping and its inverse mappings are performed at the
timings of the data (i.e., plaintext and ciphertext) input and output, respectively
[10,16–18,27]. In other words, the cost of field conversion is suppressed when the
conversion is performed only once during encryption or decryption. However, the
cost of constant multiplications in MixColumns over a tower field is worse than
that over the AES field, while inversion is efficiently performed over the tower
field. More precisely, in tower-field architectures, such linear mappings including
constant multiplications usually require a 3T_XOR delay, where T_XOR indicates the
delay of an XOR gate [21]. The XOR gate count used in (Inv)MixColumns over
a tower field is also worse than that over the AES field.
3 Proposed Architecture
This section presents a new round-based AES architecture that unifies the
encryption and decryption paths in an efficient manner. The key ideas for reduc-
ing the critical path delay are summarized as follows: (1) to merge linear map-
pings such as MixColumns and isomorphic mappings as much as possible by
reordering subfunctions, (2) to minimize the number of selectors to unify the
encryption and decryption paths by the above merging and a register retiming,
and (3) to perform isomorphic mapping and its inverse mappings only once in
the pre- and post-round datapaths. We can reduce the number of linear map-
pings to at most one for each round operation as the effect of (1). Moreover,
we can reduce the number of selectors to only one (4-to-1 multiplexer) in the
unified datapath as the effect of (2) while the inversion circuit is shared by the
encryption and decryption paths. From the idea of (3), we can remove the iso-
morphic mapping and its inverse mappings from the critical path. Figure 5 shows
the overall architecture that consists of the round function and key scheduling
parts. Our architecture performs all of the subfunctions over a tower field for
both the round function and key scheduling parts and therefore applies iso-
morphic mappings between the AES and tower fields in the datapaths of the
pre- and post-round operations, which are represented as the blocks “Pre-round
datapath” and “Post-round datapath” in Fig. 5. “Round datapath” performs
one round operation for either encryption or decryption.
where m_{i,j}^{(r)} and k_{i,j}^{(r)} are the intermediate value and round key at the i-th row and
j-th column in the r-th round, except for the final round. Note that the subscripts
of each variable are members of Z/4Z. The function S indicates the 8-bit
S-box, and u_0, u_1, u_2, and u_3 are the coefficients of the matrix of MixColumns.
The round operation is given by

m_{i,j}^{(r+1)} = \sum_{e=0}^{3} u_{e-i}(A((m_{e,i+j}^{(r)})^{-1}) + c) + k_{i,j}^{(r)},   (2)

and, with the inversion performed over the tower field, by

m_{i,j}^{(r+1)} = \sum_{e=0}^{3} u_{e-i}(A(Δ′((Δ(m_{e,i+j}^{(r)}))^{-1})) + c) + k_{i,j}^{(r)},   (3)
where Δ is the isomorphic mapping from the AES field to a tower field, and Δ′
is the inverse isomorphic mapping.
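For reference, the decomposition S(x) = A(x^{-1}) + c underlying Eq. (2) can be checked directly over the AES field; the sketch below performs the inversion over GF(2^8) itself rather than over the tower field used in the paper:

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(x):
    """Inversion as x^254 (0 maps to 0, as in the AES S-box)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def sbox(x):
    """S(x) = A(x^{-1}) + c, with A the AES affine linear part and c = 0x63."""
    y = gf_inv(x)
    return y ^ rotl8(y, 1) ^ rotl8(y, 2) ^ rotl8(y, 3) ^ rotl8(y, 4) ^ 0x63

assert sbox(0x00) == 0x63 and sbox(0x53) == 0xED  # known S-box values
```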
The linear mappings, which include an isomorphism and constant multiplications
over the GF, are performed as multiplication by the corresponding matrix
over GF(2). Therefore, we can merge such mappings to reduce
the critical path delay and the number of XOR gates. In addition, we consider
the variable d_{i,j}^{(r)} of the tower field derived from m_{i,j}^{(r)}. Substituting m_{i,j}^{(r)} with
Δ′(d_{i,j}^{(r)}) (= m_{i,j}^{(r)}), we can merge the linear mappings as follows:

d_{i,j}^{(r+1)} = \sum_{e=0}^{3} U_{e-i}((d_{e,i+j}^{(r)})^{-1}) + Δ(c) + Δ(k_{i,j}^{(r)}),   (4)
where U_e(x) = Δ(u_e(A(Δ′(x)))). Note that an arbitrary linear mapping L satisfies
L(a + b) = L(a) + L(b). Thus, the linear mappings of a round in Eq. (4) can
be merged into at most one, even with a tower-field S-box, whereas the linear
mappings in Eq. (3) cannot be.
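Since each linear mapping is multiplication by a GF(2) matrix, a chain of such mappings collapses into a single matrix, which is exactly how U_e is obtained. A sketch with stand-in 8x8 matrices as row bitmasks (these are not the paper's actual Δ, A, or u_e):

```python
def parity(x):
    """Parity of an 8-bit value (dot product over GF(2))."""
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return x & 1

def mat_vec(m, v):
    """Apply a GF(2) matrix (list of row bitmasks) to a bit-vector v."""
    out = 0
    for i, row in enumerate(m):
        out |= parity(row & v) << i
    return out

def mat_mul(a, b):
    """Compose two maps: (a o b)(v) = a(b(v)), as one GF(2) matrix."""
    n = len(a)
    cols = [mat_vec(a, mat_vec(b, 1 << j)) for j in range(n)]  # images of basis vectors
    return [sum(((cols[j] >> i) & 1) << j for j in range(n)) for i in range(n)]

# Stand-ins for an isomorphism and an affine linear part (hypothetical values):
delta = [1, 2, 4, 8, 16, 32, 64, 129]
aff = [3, 6, 12, 24, 48, 96, 192, 128]
merged = mat_mul(aff, delta)  # one matrix replaces the two-step chain
for v in range(256):
    assert mat_vec(merged, v) == mat_vec(aff, mat_vec(delta, v))
```

Merging in hardware amounts to synthesizing this single precomputed matrix, so only one linear layer sits on the critical path.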
On the other hand, the corresponding equation for AES decryption with
tower-field arithmetic is given by
d_{i,j}^{(r-1)} = \sum_{e=0}^{3} Δ(v_{e-i}(Δ′((Δ(A′(Δ′(d_{e,j-i}^{(r)}))) + Δ(c′) + Δ(k_{e,j-i}^{(r)}))^{-1}))),   (5)
where A′ indicates the linear mapping of the inverse affine transformation. The
coefficients v_0, v_1, v_2, and v_3 are respectively given by β^3 + β^2 + β, β^3 + β + 1,
β^3 + β^2 + 1, and β^3 + 1, and c′ (= β^2 + 1) is a constant. Here, the linear mappings cannot
be merged into one because they are performed both before and after the inver-
sion operation. In addition, if we construct an encryption/decryption datapath
based on Eqs. (4) and (5), the inversion circuit cannot be shared by encryption
and decryption without a selector because the timings of the inversion operations
are different from each other. Therefore, we consider a register retiming to store
the intermediate value s_{i,j}^{(r)} given after the inverse affine transformation over the
Fig. 6. Proposed (i) encryption and (ii) decryption flows (a) before and (b) after
reordering and register-retiming.
tower field. Here, s_{i,j}^{(r)} is given by s_{i,j}^{(r)} = Δ(A′(Δ′(d_{i,j}^{(r)}))) + Δ(c′). In the decryption,
the round operation is then rewritten as

s_{i,j}^{(r-1)} = \sum_{e=0}^{3} V_{e-i}((s_{e,j-i}^{(r)})^{-1} + Δ(k_{e,j-i}^{(r)})) + Δ(c′),   (6)
The datapath in [14] employs seven 128-bit multiplexers¹. Fewer selectors can reduce
the critical path delay and circuit area and solve the false critical path problem.
Unified affine and Unified affine−1 in Fig. 7 perform the unified linear mappings
(i.e., U0 , . . . , U3 and V0 , . . . , V3 ) and constant addition. The number of linear
mappings on the critical path is at most one in our architecture, whereas the
conventional architectures place more than one on it. We can also suppress the overhead of
constant multiplication over the tower field by the unification. Adder arrays in
Fig. 7 consist of four 4-input 8-bit adders in MixColumns or InvMixColumns.
In the encryption, the factoring technique for MixColumns and AddRoundKey
[21] is available for Unified affine, which makes the circuit area smaller without
a delay overhead. As a result, the data width between Unified affine and Adder
array in Encryption path is reduced from 512 to 256 bits because the calculations
of U1 and U3 are not performed in Encryption path. In addition, Adder array
and AddRoundKey are unified in Encryption path because both of them are
composed of 8-bit adders². On the other hand, since there is no factoring tech-
nique for InvMixColumns without delay overheads, the data width from Unified
affine−1 to Adder array in Decryption path is 512 bits. Finally, an inactive path
can be disabled using a demultiplexer since our datapath is fully parallel after
the inversion circuit. Thanks to the disabling, a multiplexer and AddRoundKey
¹ The selectors in SubBytes/InvSubBytes are included in the seven multiplexers.
² Some architectures such as [14,29] unify AddInitialKey and AddRoundKeys. We
did not unify them to avoid increasing the number of selectors.
are unified as Bit-parallel XOR. (The addition of Δ(c) in Unified affine should
be active only during encryption.) In addition, the demultiplexer would suppress
power consumption caused by dynamic hazards. Although tower-field inversion circuits
are known to be power-consuming owing to dynamic hazards [19], these
hazards can be terminated at the input of the inactive path.
Our datapath employs the inversion circuit presented in [31] because it has
the highest area-time efficiency among inversion circuits including one using a
logic minimization technique [6]. We can merge the isomorphic mappings in
order to reduce the linear function on the round datapath to only one, even if
the inversion circuit has different GF representations at the input and output.
Since the output is given by an RRB, the data width from Inversion to Uni-
fied affine (or Unified affine−1 ) is given by 160 bits. However, AddRoundKey
in the decryption path and Bit-parallel XOR in the post-round datapath are
implemented respectively by only 128 XOR gates because the NB used as the
input is equal to the reduced version of the RRB. In addition, a 1:2 DeMUX is
implemented with NOR gates thanks to the redundancy, whereas nonredundant
representations require AND gates.
The on-the-fly key scheduling part is shared by the encryption and decryption
processes. For the encryption, the key scheduling part first stores the initial key
in the initial key register (in Fig. 5) and then generates the round keys during
the following clock cycles. For the decryption, the final round key should be
calculated from the initial key and stored in the initial key register in advance.
The key scheduling part then generates the round keys in the reverse order
by the round key generator (in Fig. 5). However, conventional key scheduling
datapaths such as those in [14,29] are not applicable to our round datapath
because they have a loop with a false path and/or a longer true critical path
than our datapath.
To address the above issue, we introduce a new architecture for the key
scheduling datapath. For on-the-fly implementation, the subkeys are calculated
for each of the four subkeys (i.e., 128 bits) in a clock cycle. Therefore, the on-
the-fly key scheduling for the encryption is expressed as
k_0^{(r+1)} = k_0^{(r)} + KeyEx(k_3^{(r)}),
k_1^{(r+1)} = k_0^{(r)} + k_1^{(r)} + KeyEx(k_3^{(r)}),
k_2^{(r+1)} = k_0^{(r)} + k_1^{(r)} + k_2^{(r)} + KeyEx(k_3^{(r)}),
k_3^{(r+1)} = k_0^{(r)} + k_1^{(r)} + k_2^{(r)} + k_3^{(r)} + KeyEx(k_3^{(r)}),   (7)
where k_0^{(r)}, k_1^{(r)}, k_2^{(r)}, and k_3^{(r)} are the 32-bit subkeys at the r-th round and KeyEx is
the key expansion function that consists of a round constant addition, RotWord,
and SubWord. The inverse key scheduling for the decryption is represented by

k_0^{(r-1)} = k_0^{(r)} + KeyEx(k_2^{(r)} + k_3^{(r)}),
k_1^{(r-1)} = k_0^{(r)} + k_1^{(r)},
k_2^{(r-1)} = k_1^{(r)} + k_2^{(r)},
k_3^{(r-1)} = k_2^{(r)} + k_3^{(r)}.   (8)
Figure 8 shows the proposed key scheduling datapath architecture, where the
KeyEx components are unified for encryption and decryption. Note here that most
of the adders (i.e., XOR gates) for computing k_1^{(r+1)}, k_2^{(r+1)}, and k_3^{(r+1)} should be non-integrated
to make the critical path shorter than that of the round function part.
The input key is initially mapped to the tower field, and all of the computations
(including AddRoundKey) are performed over the tower field. The ENC/DEC sig-
nal controls the input to RotWord and SubWord using a 32-bit AND gate. The
upper 2-in-1 multiplexer selects an initial key or a final round key as the input to
Initial key register, the middle 2-in-1 multiplexer selects a key stored in Initial key
register or a round key as the input to Round key generator, and the lower 2-in-1
multiplexers select the encryption or decryption path. The round constant addition
is performed separately from RotWord and SubWord to reduce the critical path
delay. As a result, the critical path delay of the key scheduling part becomes shorter
than that of the round function part.
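The prefix-XOR structure of Eq. (7) and its inverse for on-the-fly decryption keys can be sketched as follows; key_ex below is only a placeholder standing in for the real KeyEx (round constant, RotWord, and SubWord):

```python
MASK32 = 0xFFFFFFFF

def key_ex(w):
    """Placeholder for KeyEx; NOT the real Rcon/RotWord/SubWord function."""
    return (((w << 8) | (w >> 24)) & MASK32) ^ 0x01000000

def next_keys(k):
    """One forward round in the prefix-XOR form of Eq. (7)."""
    t = key_ex(k[3])
    return [k[0] ^ t,
            k[0] ^ k[1] ^ t,
            k[0] ^ k[1] ^ k[2] ^ t,
            k[0] ^ k[1] ^ k[2] ^ k[3] ^ t]

def prev_keys(k):
    """Inverse round, generating keys in reverse order for decryption."""
    k3 = k[2] ^ k[3]
    k2 = k[1] ^ k[2]
    k1 = k[0] ^ k[1]
    k0 = k[0] ^ key_ex(k3)
    return [k0, k1, k2, k3]

k = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]  # sample 128-bit key
assert prev_keys(next_keys(k)) == k  # inverse schedule recovers the round keys
```

Whatever KeyEx is plugged in, the inverse round cancels the forward round, which is what lets the same datapath walk the key schedule in either direction.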
4 Performance Evaluation
Tables 1 and 2 summarize the synthesis results of the proposed AES encryp-
tion/decryption architecture by Synopsys Design Compiler (Version D2010-3)
with the TSMC 65-nm and NanGate 45- and 15-nm standard-cell libraries [2,3]
under the worst-case conditions, where Area indicates the circuit area estimated
on the basis of a two-way NAND equivalent gate size (i.e., gate equivalents
(GEs)); Latency indicates the latency for encryption, which is estimated by the
circuit path delay of the datapath under the worst-case condition; Max. freq. indicates
the maximum operation frequency obtained from the critical path delay;
Throughput indicates the throughput at the maximum operation frequency; and
Efficiency indicates the throughput per area, which is inversely proportional to the product
of the area and latency in this nonpipelined design³. To perform a practical
performance comparison, an area optimization (which maximizes the effort of
minimizing the number of gates without flattening the description) was applied
in Table 1, and an area-speed optimization (where an asymptotical search with a
set of timing constraints was performed after the area optimization) was applied
in Table 2.
In these tables, the conventional representative datapaths [14,15,17,29] were
also synthesized using the same optimization conditions. The source codes for
these syntheses were written by the authors with reference to [14,15,17,29], except
for the source codes of Satoh's and Canright's S-boxes in [7,29], which can be
obtained from their websites [1,8]. For a fair comparison, the datapaths of [15,17]
were adjusted to the round-based nonpipelined architecture corresponding to
the proposed datapath. Note that only the inversion circuit over a PB-based
GF((2^4)^2) in [17] could not be described faithfully according to the paper⁴. Latency
and Throughput were calculated assuming that the datapath of [15] requires 10
clock cycles to perform each encryption or decryption and the others require 11
³ Design Compiler generated a static power consumption report for each architecture.
However, the report does not consider the effect of glitches, while tower-field inversion
circuits are known to include non-trivial glitches [19]. Therefore, we do not cite
the power consumption reports to avoid drawing misleading conclusions.
⁴ According to [17], the GF(2^4) inversion in the circuit can be implemented with a
T_XOR + 3T_NAND delay, where T_XOR and T_NAND are the delays of the XOR and
NAND gates, respectively. However, there is no detailed description of how to realize such
a circuit. Therefore, to the best of our knowledge, we described the circuit by a
direct mapping based on the PPRM expansion, which is an algebraic normal form
frequently used for designing GF arithmetic circuits [19,28].
Table 1. Synthesis results for proposed and conventional AES hardware architectures
with area optimization
Design | Area (GE) | Latency (ns) | Max. freq. (MHz) | Throughput (Gbps) | Efficiency (Kbps/GE)
TSMC 65-nm
Satoh et al. [29] | 13,671.75 | 78.10 | 140.85 | 1.64 | 119.88
Lutz et al. [15] | 20,380.50 | 68.50 | 145.99 | 1.87 | 91.69
Liu et al. [14] | 12,538.75 | 85.25 | 129.03 | 1.50 | 119.75
Mathew et al. [17] | 20,639.50 | 97.68 | 112.61 | 1.31 | 63.49
This work | 15,242.75 | 46.97 | 234.19 | 2.73 | 178.78
NanGate 45-nm
Satoh et al. [29] | 12,560.99 | 31.57 | 348.43 | 4.05 | 322.78
Lutz et al. [15] | 20,000.66 | 20.30 | 492.61 | 6.31 | 315.26
Liu et al. [14] | 11,829.34 | 34.43 | 319.49 | 3.72 | 314.28
Mathew et al. [17] | 17,573.33 | 41.80 | 263.16 | 3.06 | 174.25
This work | 13,814.69 | 16.94 | 649.35 | 7.56 | 546.96
NanGate 15-nm
Satoh et al. [29] | 14,526.01 | 4.36 | 2,524.17 | 29.37 | 2,022.04
Lutz et al. [15] | 23,391.49 | 4.57 | 2,185.84 | 25.44 | 1,087.37
Liu et al. [14] | 13,847.25 | 4.74 | 2,321.05 | 27.01 | 1,950.46
Mathew et al. [17] | 21,361.00 | 5.32 | 2,066.93 | 24.05 | 1,125.95
This work | 15,468.97 | 2.65 | 4,144.22 | 48.22 | 3,117.44
clock cycles. This is because the initial key addition and the first-round computation
are performed in one clock cycle in [15]. Area was calculated without the
initial key, round key, and data registers to compare the datapaths more clearly.
Note also that the key scheduling parts of [15,17] were implemented with the
one presented in this paper because those papers give no description of their key
scheduling parts. (For [15], the isomorphic mapping from GF(2^8) to GF((2^4)^2) was removed
for application to the round function part.)
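The Latency and Throughput columns follow directly from the maximum frequency and the cycle counts above; as a quick check of Table 1's arithmetic:

```python
def throughput_gbps(freq_mhz, cycles, block_bits=128):
    """One 128-bit block completes every `cycles` clocks at `freq_mhz`."""
    return freq_mhz * 1e6 * block_bits / cycles / 1e9

tp = throughput_gbps(234.19, 11)   # proposed datapath on TSMC 65-nm, 11 cycles
eff = tp * 1e6 / 15242.75          # Kbps per gate equivalent
assert round(tp, 2) == 2.73 and round(eff, 2) == 178.78
```

The same formula with 10 cycles reproduces the entries for [15].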
The results in Table 1 show that our datapath achieves the lowest latency
(i.e., highest throughput) compared with the conventional ones with tower-field
inversion circuits owing to the lower critical path delay. Moreover, the circuit
area is not the largest owing to fewer selectors. Note that the latency is con-
sistent with the throughput because these circuits are not pipelined. Although
all operations are translated to the tower field in our architecture, the area and
delay overheads of MixColumns and InvMixColumns are suppressed by the uni-
fication technique. In addition, even with a tower-field S-box, our architecture
has a latency advantage over Lutz's architecture with table-lookup-based
inversion, as indicated in Table 2. As a result, our architecture is more
efficient in terms of the throughput per area than any conventional architecture.
More precisely, the proposed datapath is approximately 53–72 % more efficient
than any conventional architecture under the conditions of the three CMOS
processes. The results also suggest that the proposed architecture would per-
form an AES encryption or decryption with the smallest energy. Moreover, the
cutoff of an inactive path by a demultiplexer would further reduce the power consumption.
Table 2. Synthesis results for proposed and conventional AES hardware architectures
with area-speed optimization
Design | Area (GE) | Latency (ns) | Max. freq. (MHz) | Throughput (Gbps) | Efficiency (Kbps/GE)
TSMC 65-nm
Satoh et al. [29] | 14,516.50 | 56.87 | 193.42 | 2.25 | 155.05
Lutz et al. [15] | 22,883.25 | 33.90 | 294.99 | 3.78 | 165.00
Liu et al. [14] | 13,970.50 | 60.17 | 182.82 | 2.13 | 152.27
Mathew et al. [17] | 23,298.49 | 65.45 | 168.07 | 1.96 | 83.94
This work | 15,807.00 | 34.10 | 322.58 | 3.75 | 237.47
NanGate 45-nm
Satoh et al. [29] | 13,386.67 | 24.42 | 450.45 | 5.24 | 391.55
Lutz et al. [15] | 22,417.01 | 14.40 | 694.44 | 8.89 | 396.52
Liu et al. [14] | 12,443.66 | 28.27 | 389.11 | 4.53 | 363.86
Mathew et al. [17] | 19,243.67 | 31.90 | 344.83 | 4.01 | 208.51
This work | 14,582.99 | 13.53 | 813.01 | 9.46 | 648.73
NanGate 15-nm
Satoh et al. [29] | 16,924.74 | 3.31 | 3,322.26 | 38.66 | 2,284.17
Lutz et al. [15] | 25,692.49 | 2.08 | 4,799.85 | 61.44 | 2,391.28
Liu et al. [14] | 15,768.43 | 3.65 | 3,014.14 | 35.07 | 2,224.29
Mathew et al. [17] | 23,789.48 | 4.03 | 2,729.18 | 31.76 | 1,334.95
This work | 17,232.00 | 1.80 | 6,117.70 | 71.19 | 4,131.14
5 Discussion
The proposed design employs a round-based architecture without block-wise
parallelism such as pipelining. The modes of operation with block-wise parallelism
(e.g., the ECB and CTR modes) are nevertheless available, exploiting the area-throughput
trade-off of pipelining [11]. A simple way to obtain
a pipelined version of the proposed architecture is to unroll the rounds and
insert pipeline registers between them. The datapath can be further pipelined
by inserting registers into the round datapath. The proposed datapath can be
efficiently pipelined by placing the pipeline register at the output of the inversion
with a good delay balance between the inversion and the following circuit. For
example, the synthesis results for the proposed datapath using the area-speed
optimization with the NanGate 45-nm standard-cell library indicated that the
inversion circuit had a delay of 0.63 ns, and the remainder had a delay of 0.67 ns.
As a result, pipelining would achieve a throughput of 17.37 Gbps, which is nearly
twice that without pipelining. Thus, the proposed datapath is also suitable for
such a pipelined implementation.
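With the stated 45-nm stage delays, the quoted pipelined throughput follows from clocking at the slower stage; as a back-of-envelope check:

```python
# Two-stage round pipeline: the clock is set by the slower stage (0.67 ns vs.
# 0.63 ns), two blocks are interleaved, so one block still completes every
# 11 cycles.
clock_ns = max(0.63, 0.67)
throughput_gbps = 128 / (11 * clock_ns)
assert round(throughput_gbps, 2) == 17.37
```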
Another discussion point is how the proposed architecture can be resistant
to side-channel attacks. A masking countermeasure would be based on a masked
tower-field inversion circuit [9,25] such as that in [24]. The major features of
the countermeasure are to replace the inversion with a masked inversion and
to duplicate other linear operations. Such a countermeasure can also be applied
to the proposed datapath. In addition, hiding countermeasures, such as WDDL
[30], which replaces the logic gates with a complementary logic style, would also
be applicable, and the hardware efficiency would be proportionally lower with
respect to the results in Tables 1 and 2.
More sophisticated countermeasures such as threshold implementation (TI)
and generalized masking schemes (GMSs) [4,5,18,22,26] would also be applicable
to the proposed datapath in principle in the same manner as other conventional
ones. On the other hand, such countermeasures, especially against higher-order
DPAs, require a considerable area overhead and more random bits compared with
the aforementioned countermeasures. When applying such countermeasures, the
area overhead would be critical for some applications. In addition, TI- and GMS-
based inversion circuits should be pipelined to reduce the resulting circuit area
(i.e., the number of shares). To divide the circuit delay equally, it would be better
to insert the pipeline register in the middle of the Encryption and Decryption paths in
Fig. 7.
6 Conclusion
This paper presented a new efficient round-based AES architecture that supports
both encryption and decryption. An efficient AES datapath with a lower latency
(or higher throughput per gate) is suitable for some practical modes of opera-
tion, such as CBC and CCM, because pipelined parallelism cannot be applied
to such modes. The proposed datapath utilizes new operation-reordering and
register-retiming techniques to unify critical components (i.e., inversion and lin-
ear matrix operations) with fewer additional selectors. As a result, our datapath
has the lowest critical path delay compared to conventional ones with tower-
field S-boxes. The proposed and conventional AES hardware were designed on
the basis of compatible round-based architectures and evaluated using logic syn-
thesis with TSMC 65-nm and NanGate 45- and 15-nm CMOS standard-cell
libraries under the worst-case conditions. The synthesis results suggested that
the proposed architecture was approximately 53–72 % more efficient than the
best conventional architecture in terms of the throughput per area, which would
also indicate that the proposed architecture can perform encryption/decryption
with the lowest energy.
The performance evaluation was performed at the design stage of the logic
synthesis; therefore, the power consumption and latency considering place and
route were not evaluated. A detailed evaluation after the place and route is
planned as future work. However, the post-synthesis results would be propor-
tional to the presented synthesis results because the proposed and conventional
architectures employ the same or similar hardware algorithms (e.g., tower-field
inversion) and do not have any extra global wires that have an impact on the
critical path. The design of efficient and side-channel-resistant AES hardware
based on the proposed datapath is also planned for future work.
Acknowledgment. This work has been supported by JSPS KAKENHI Grant No.
25240006.
This appendix provides an example set of matrices for linear operations, i.e.,
an isomorphic mapping, an inverse isomorphic mapping, an affine transforma-
tion over the tower field, inverse affine transformation over the tower field,
U0 , U1 , U2 , U3 , V0 , V1 , V2 , and V3 . In this study, we employ the tower-field inver-
sion circuit in [31]. In the following formulae, the least-significant bits are in the
upper-left corner.
The conversion matrices of the isomorphic mapping and its inverse mapping
(denoted by δ and δ′, respectively) are given by
δ = \begin{pmatrix}
0&1&0&1&1&1&0&0\\
1&0&1&0&0&0&1&1\\
1&0&0&1&0&0&0&1\\
0&0&0&0&0&1&0&0\\
0&1&1&0&1&1&0&0\\
1&0&1&0&1&0&0&0\\
1&1&1&0&0&0&0&1\\
0&0&1&1&0&0&0&1
\end{pmatrix},  δ′ = \begin{pmatrix}
1&1&0&1&1&0&0&1&1&0\\
0&1&0&1&0&0&1&0&1&0\\
0&1&0&0&1&1&0&1&1&1\\
1&0&0&0&1&0&1&1&1&1\\
1&0&0&1&0&0&0&1&0&1\\
1&0&0&0&1&0&0&0&0&0\\
1&1&1&1&0&1&1&0&0&0\\
1&1&0&0&0&0&1&0&0&1
\end{pmatrix}.   (9)
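For the field conversion to be lossless, δ must be invertible over GF(2); a quick rank check on its rows (transcribed from Eq. (9), leftmost column in the most significant bit):

```python
def gf2_rank(rows):
    """Row rank over GF(2), rows given as integer bitmasks."""
    rank = 0
    rows = list(rows)
    for i in range(len(rows)):
        pivot = max(rows[i:])        # row with the highest set bit remaining
        if pivot == 0:
            break
        j = rows.index(pivot, i)
        rows[i], rows[j] = rows[j], rows[i]
        rank += 1
        msb = pivot.bit_length() - 1
        for k in range(len(rows)):   # clear that bit from every other row
            if k != i and (rows[k] >> msb) & 1:
                rows[k] ^= pivot
    return rank

DELTA = [int(r, 2) for r in
         ["01011100", "10100011", "10010001", "00000100",
          "01101100", "10101000", "11100001", "00110001"]]
assert gf2_rank(DELTA) == 8  # full rank: delta is invertible over GF(2)
```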
The isomorphic mapping using δ performs conversion from the AES field to the
tower field used in [31] (i.e., an NB-based GF((2^4)^2)). The inverse isomorphic
mapping using δ′ performs conversion from the RRB-based GF((2^4)^2) to the
AES field. The affine and inverse affine matrices over the tower field (denoted
by φ and φ′, respectively) are given by
φ = \begin{pmatrix}
1&1&1&0&1&0&0&1&1&0\\
1&0&0&0&1&0&0&1&1&0\\
1&1&0&1&1&1&0&1&0&0\\
1&0&0&0&1&1&0&1&1&1\\
1&0&0&1&0&1&0&0&0&1\\
1&1&0&1&1&0&1&0&0&1\\
1&0&0&1&0&1&1&1&1&0\\
1&1&0&1&1&0&1&1&0&0
\end{pmatrix},  φ′ = \begin{pmatrix}
0&0&0&1&0&1&1&0\\
1&1&0&1&0&1&1&0\\
0&1&0&1&1&0&0&0\\
0&0&1&1&1&0&1&1\\
0&0&1&0&0&0&0&1\\
0&1&0&1&0&1&0&1\\
0&0&1&0&1&1&1&0\\
0&1&0&1&0&0&0&0
\end{pmatrix}.   (10)
The input and output of the linear mapping represented by φ are given by the
RRB- and NB-based GF((2^4)^2), respectively. The input and output of the linear
mapping represented by φ′ are both given by the NB-based GF((2^4)^2). The constants
Δ(c) and Δ(c′) are given by β^5 + β^3 + β^2 and β^7 + β^4 + β^2, respectively. Let ψ_e
and ψ′_e be the matrices representing U_e and V_e, respectively (0 ≤ e ≤ 3). The
matrices ψ_0, ψ_1, ψ_2, and ψ_3 are given by
ψ_0 = \begin{pmatrix}
1&1&1&1&0&0&1&1&1&1\\
0&0&1&1&0&1&0&1&0&0\\
1&1&0&1&1&0&1&1&1&1\\
1&1&0&1&1&1&0&0&0&1\\
1&0&0&1&0&0&0&0&1&1\\
1&0&1&1&1&0&0&0&0&0\\
1&1&1&0&1&0&1&0&1&0\\
0&1&0&0&1&0&1&0&0&1
\end{pmatrix},  ψ_1 = \begin{pmatrix}
0&0&0&1&1&0&1&0&0&1\\
1&0&1&1&1&1&0&0&1&0\\
0&0&0&0&0&1&1&0&1&1\\
0&1&0&1&0&0&0&1&1&0\\
0&0&0&0&0&1&0&0&1&0\\
0&1&1&0&0&0&1&0&0&1\\
0&1&1&1&1&1&0&1&0&0\\
1&0&0&1&0&0&0&1&0&1
\end{pmatrix},   (11)

ψ_2 = ψ_3 = φ.   (12)
The matrices ψ′_0, ψ′_1, ψ′_2, and ψ′_3 are given by
ψ′_0 = \begin{pmatrix}
0&0&0&0&0&0&1&1&0&0\\
0&0&1&0&1&0&0&1&0&1\\
0&1&0&0&1&1&1&0&1&1\\
1&0&0&0&1&1&1&0&1&1\\
1&1&0&0&0&0&0&1&0&1\\
0&0&1&0&1&0&0&0&1&1\\
1&1&0&1&1&0&0&0&1&1\\
1&1&0&1&1&1&1&1&1&0
\end{pmatrix},  ψ′_1 = \begin{pmatrix}
0&0&0&0&0&1&1&0&1&1\\
0&0&0&1&1&0&0&0&1&1\\
1&1&0&0&0&0&1&0&0&1\\
0&1&1&1&1&0&1&0&0&1\\
1&0&1&1&1&0&0&0&1&1\\
0&0&0&1&1&1&1&1&1&0\\
0&1&0&0&1&1&1&1&1&0\\
0&1&0&0&1&0&1&0&1&0
\end{pmatrix},   (13)

ψ′_2 = \begin{pmatrix}
1&0&1&1&1&1&1&1&1&0\\
0&1&0&1&0&1&1&1&1&0\\
1&0&0&0&1&1&1&0&1&1\\
0&1&1&1&1&1&0&1&0&0\\
1&1&0&0&0&1&0&1&1&1\\
1&0&0&0&1&1&0&0&0&1\\
1&1&0&0&0&0&0&1&0&1\\
1&0&0&0&1&0&0&0&0&0
\end{pmatrix},  ψ′_3 = \begin{pmatrix}
0&0&1&1&0&0&1&1&1&1\\
1&0&0&0&1&1&1&1&1&0\\
0&0&1&0&1&1&0&0&0&1\\
1&0&0&1&0&1&1&1&0&1\\
0&0&1&0&1&0&0&0&0&0\\
1&0&0&1&0&0&1&0&0&1\\
1&1&0&0&0&0&0&1&1&0\\
0&0&1&1&0&1&0&1&0&0
\end{pmatrix}.   (14)
References
1. Cryptographic hardware project. http://www.aoki.ecei.tohoku.ac.jp/crypto/
2. NanGate FreePDK15 open cell library, January 2016. http://www.nangate.com/?
page id=2328
3. NanGate FreePDK45 open cell library, January 2016. http://www.nangate.com/?
page id=2325
4. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Higher-order threshold
implementations. In: Sarkar, P., Iwata, T. (eds.) ASIACRYPT 2014, Part II. LNCS,
vol. 8874, pp. 326–343. Springer, Heidelberg (2014)
5. Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., Rijmen, V.: Trade-offs for threshold
implementations illustrated on AES. IEEE Trans. Comput. Aided Des. Integr.
Syst. 34(7), 1188–1200 (2015)
6. Boyar, J., Matthews, P., Peralta, R.: Logic minimization techniques with applications
to cryptology. J. Cryptology 26(2), 280–312 (2013)
7. Canright, D.: A very compact S-box for AES. In: Rao, J.R., Sunar, B. (eds.) CHES
2005. LNCS, vol. 3659, pp. 441–455. Springer, Heidelberg (2005)
8. Canright, D.: http://faculty.nps.edu/drcanrig/
9. Canright, D., Batina, L.: A very compact “Perfectly Masked” S-Box for AES. In:
Bellovin, S.M., Gennaro, R., Keromytis, A.D., Yung, M. (eds.) ACNS 2008. LNCS,
vol. 5037, pp. 446–459. Springer, Heidelberg (2008)
10. Hammad, I., El-Sankary, K., El-Masry, E.: High-speed AES encryptor with efficient
merging techniques. IEEE Embed. Syst. Lett. 2, 67–71 (2010)
11. Hodjat, A., Verbauwhede, I.: Area-throughput trade-offs for fully pipelined 30 to
70 Gbits/s AES processors. IEEE Trans. Comput. 55(4), 366–372 (2006)
12. Jeon, Y., Kim, Y., Lee, D.: A compact memory-free architecture for the AES
algorithm using resource sharing methods. J. Circ. Syst. Comput. 19(5), 1109–
1130 (2010)
13. Lin, S.Y., Huang, C.T.: A high-throughput low-power AES cipher for network
applications. In: The 12th Asia and South Pacific Design Automation Conference
(ASP-DAC 2007), pp. 595–600. IEEE (2007)
14. Liu, P.C., Chang, H.C., Lee, C.Y.: A 1.69 Gb/s area-efficient AES crypto core
with compact on-the-fly key expansion unit. In: 41st European Solid-State Circuits
Conference (ESSCIRC 2009), pp. 404–407. IEEE (2009)
15. Lutz, A., Treichler, J., Gürkaynak, F., Kaeslin, H., Basler, G., Erni, A., Reichmuth,
S., Rommens, P., Oetiker, P., Fichtner, W.: 2Gbit/s hardware realizations of RIJN-
DAEL and SERPENT: a comparative analysis. In: Kaliski, B.S., Koç, Ç.K., Paar,
C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 144–158. Springer, Heidelberg (2002)
16. Mathew, S., Satpathy, S., Suresh, V., Anders, M., Himanshu, K., Amit, A., Hsu, S.,
Chen, G., Krishnamurthy, R.K.: 340 mV-1.1V, 289 Gbps/W, 2090-gate nanoAES
hardware accelerator with area-optimized encrypt/decrypt GF (24 )2 polynomials
in 22 nm tri-gate CMOS. IEEE J. Solid-State Circ. 50, 1048–1058 (2015)
17. Mathew, S.K., Sheikh, F., Kounavis, M.E., Gueron, S., Agarwal, A., Hsu, S.K.,
Himanshu, K., Anders, M.A., Krishnamurthy, R.K.: 53 Gbps native GF (24 )2
composite-field AES-encrypt/decrypt accelerator for content-protection in 45 nm
high-performance microprocessors. IEEE J. Solid-State Circ. 46, 767–776 (2011)
18. Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the limits: a
very compact and a threshold implementation of AES. In: Paterson, K.G. (ed.)
EUROCRYPT 2011. LNCS, vol. 6632, pp. 69–88. Springer, Heidelberg (2011)
19. Morioka, S., Satoh, A.: An optimized S-Box circuit architecture for low power AES
design. In: Kaliski, B.S., Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523,
pp. 172–186. Springer, Heidelberg (2002)
20. Morioka, S., Satoh, A.: A 10 Gbps full-AES crypto design with a twisted-BDD S-
box architecture. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 12, 686–691
(2004)
21. Nekado, K., Nogami, Y., Iokibe, K.: Very short critical path implementation of
AES with direct logic gates. In: Hanaoka, G., Yamauchi, T. (eds.) IWSEC 2012.
LNCS, vol. 7631, pp. 51–68. Springer, Heidelberg (2012)
22. Nikova, S., Rijmen, V., Schläffer, M.: Secure hardware implementation of nonlinear
functions in the presence of glithces. J. Cryptology 24, 292–321 (2011)
23. Nogami, Y., Nekado, K., Toyota, T., Hongo, N., Morikawa, Y.: Mixed bases for
efficient inversion in F((22 )2 )2 and conversion matrices of SubBytes of AES. In:
Mangard, S., Standaert, F.-X. (eds.) CHES 2010. LNCS, vol. 6225, pp. 234–247.
Springer, Heidelberg (2010)
24. Okamoto, K., Homma, N., Aoki, T., Morioka, S.: A hierarchical formal approach
to verifying side-channel resistant cryptographic processors. In: Hardware-Oriented
Security and Trust (HOST), pp. 76–79. IEEE (2014)
25. Oswald, E., Mangard, S., Pramstaller, N., Rijmen, V.: A side-channel analysis
resistant description of the AES S-Box. In: Gilbert, H., Handschuh, H. (eds.) FSE
2005. LNCS, vol. 3557, pp. 413–423. Springer, Heidelberg (2005)
26. Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., Verbauwhede, I.: Consolidating
masking schemes. In: Gennaro, R., Robshaw, M. (eds.) CRYPTO 2015. LNCS, vol.
9215, pp. 764–783. Springer, Heidelberg (2015)
27. Rudra, A., Dubey, P.K., Jutla, C.S., Kumar, V., Rao, J.R., Rohatgi, P.: Efficient
Rijndael encryption implementation with composite field arithmetic. In: Koç, Ç.K.,
Naccache, D., Paar, C. (eds.) CHES 2001. LNCS, vol. 2162, pp. 171–184. Springer,
Heidelberg (2001)
558 R. Ueno et al.
28. Sasao, T.: AND-EXOR expressions and their optimization. In: Sasao, T. (ed.) Logic
Synthesis and Optimization. The Kluwer International Series in Engineering and
Computer Science, vol. 212, pp. 287–312. Kluwer Academic Publishers (1993)
29. Satoh, A., Morioka, S., Takano, K., Munetoh, S.: A compact Rijndael hardware
architecture with S-Box optimization. In: Boyd, C. (ed.) ASIACRYPT 2001. LNCS,
vol. 2248, pp. 239–254. Springer, Heidelberg (2001)
30. Tiri, K., Verbauwhede, I.: A logic level design methodology for a secure DPA
resistant ASIC or FPGA implementation. In: Design, Automation and Test in
Europe Conference and Exhibition (DATE), vol. 1, pp. 246–251 (2004)
31. Ueno, R., Homma, N., Sugawara, Y., Nogami, Y., Aoki, T.: Highly efficient GF (28 )
inversion circuit based on redundant GF arithmetic and its application to AES
design. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293, pp.
63–80. Springer, Heidelberg (2015)
32. Verbauwhede, I., Schaumont, P., Kuo, H.: Design and performance testing of a
2.29-GB/s Rijndael processor. IEEE J. Solid-State Circ. 38, 569–572 (2003)
Efficient High-Speed WPA2 Brute Force Attacks
Using Scalable Low-Cost FPGA Clustering
1 Introduction
Today’s Wi-Fi networks are commonly protected with the well known WPA2
protocol defined in the IEEE 802.11 standard documents [6]. The WPA2-
Personal variant is designed for smaller networks and uses a pre-shared key
(i.e., a Wi-Fi password) to derive the necessary key material for authentication,
encryption and integrity protection. The Wi-Fi password needs to be at least 8
characters long and the key material is mainly derived through the salted key
© International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 559–577, 2016.
DOI: 10.1007/978-3-662-53140-2_27
560 M. Kammerstetter et al.
derivation function PBKDF2 [8] in combination with the SHA1 hashing algo-
rithm [1] in HMAC configuration [2]. Due to the computational complexity of the
key derivation function and the use of the Wi-Fi’s SSID as cryptographic salt,
brute force attacks are very hard to conduct against sufficiently long random
passwords. Professional attackers can turn to commercial high-end FPGA-based
cluster solutions achieving WPA2 password guessing speeds of 1 million guesses
per second and more [10], albeit at costs well outside of what amateurs can
afford. In this paper, we focus on the WPA2-Personal key derivation function
and on low-cost FPGA cluster based attacks affordable by amateurs. In particular,
second-hand FPGA boards previously used for cryptocurrency mining are now
available at low cost and can be repurposed to mount attacks on cryptographic
systems. In the first part, we use
a top-down approach to present WPA2-Personal security at a high level and we
subsequently break it down to low-level SHA1 computations. In the second part,
we use a bottom-up approach to show how these computations can be addressed
in hardware with FPGAs and we present how our solution can be integrated
into a scalable low-cost system to conduct WPA-2 Personal brute force attacks.
We evaluate our system with respect to performance and power usage and we
compare it to results we obtained from GPUs. The extended version of our
paper [9] also includes a real-world case study highlighting the practical impact.
Specifically, the contributions presented in this paper are as follows:
– We present a highly optimized design of a scalable and fully pipelined FPGA
implementation for efficient WPA2 brute force attacks that brings the perfor-
mance of today’s highly expensive professional systems to the low-cost FPGA
boards affordable by amateurs.
– Our implementation on Kintex-7 devices indicates that, on the same hardware,
it is more than 5 times as fast as what is currently marketed as the
world's fastest FPGA-based WPA2 password recovery system [4,10].
– We implemented and evaluated our approach on three different low-cost FPGA
architectures including an actual FPGA cluster with 36 Spartan 6 LX150T
devices located on repurposed cryptocurrency mining boards.
– We evaluate our system with respect to the power consumption and per-
formance in comparison to GPU clusters, showing that FPGAs can achieve
comparable or higher performance with considerably less power and space
requirements.
2 Related Work
Since WPA2 is commonly used, there are several publications and projects deal-
ing with WPA2 security and brute force attacks in particular. For instance
in [11], Visan covers typical CPU and GPU accelerated password recovery
approaches with state-of-the-art tools like aircrack-ng1 or Pyrit2. He considers a
1 http://www.aircrack-ng.org.
2 https://code.google.com/p/pyrit.
time-memory tradeoff usable for frequent Wi-Fi SSIDs and provides a perfor-
mance overview of common GPUs and GPU cluster configurations. In that
respect, oclHashcat3 and the commercial Wireless Security Auditor software4
need to be mentioned; both are password recovery frameworks with GPU
acceleration and WPA2 support. Unlike these GPU-based approaches, our system
comprises a highly optimized and scalable FPGA implementation that allows
higher performance at lower cost and power consumption.
In [7], Johnson et al. present an FPGA architecture for the recovery of WPA and
WPA2 keys. Although WPA support is mentioned, their implementation seems
to support WPA2 only, which is comparable to our system. However, while our
implementation features multiple fully pipelined and heavily optimized cores
for maximum performance, Johnson et al. present a straightforward and mostly
sequential design resulting in significantly lower performance. In [5],
Güneysu et al. present the RIVYERA and COPACOBANA high-performance
FPGA cluster systems for cryptanalysis. They provide details on exhaustive key
search attacks for cryptographic algorithms such as DES, Hitag2 or Keeloq and
have a larger cluster configuration than we had available for our tests. Yet, in
contrast to our work, they do not cover WPA2 or exhaustive key search attacks
on WPA2 in their work. As a result, it would be highly interesting to evaluate
our FPGA implementation on their machines. Finally, Elcomsoft’s commercial
Distributed Password Recovery5 software needs to be mentioned due to its sup-
port for WPA2 key recovery attacks on FPGA clusters [4,10] and its claim to
be the world's fastest FPGA-based password cracking solution [3]. Although there
is practically no publicly available information on the internals of their WPA2
implementation, performance data are provided in [10]. In contrast to their work,
we not only disclose the design, architecture and optimizations of our FPGA
implementation, but we also claim that on the same professional FPGA hard-
ware our implementation would be more than 5 times as fast. In comparison to
the professional system, ours can achieve similar speeds on the low-cost
repurposed cryptocurrency mining hardware available to amateurs.
Confirmation Key - KCK) to compute a Message Integrity Code (MIC) over the
packet data. At this point, the AP can compare the received MIC with the com-
puted one to validate that the Station is authentic and has knowledge of the
password. To prove to the Station that the AP knows the password, the Station
sends a message including ANonce and the corresponding MIC code. Since the
Station can only compute the correct MIC code if it knows the PTK, the AP
can use this information for authentication. On success, the Station completes
the handshake by sending a usually empty, but signed (MIC) message back to
the AP.
During key derivation (Fig. 1b), PMK is computed from the password and
the SSID as cryptographic salt through the PBKDF2 [8] key derivation function
with HMAC-SHA1 at its core. The PTK and its truncated variant denoted KCK are
computed through the HMAC-SHA1 based pseudo random function PRF-128.
Likewise, also the computation of the MIC integrity code relies on HMAC-SHA1.
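This derivation chain can be sketched with Python's standard library. The sketch is an illustrative model, not the authors' FPGA implementation: the function names are ours, while the 4,096-iteration PBKDF2 call and the PRF-128 input layout (label string, zero bytes, byte-wise sorted address and nonce tuples) follow the description in the text.

```python
import hashlib
import hmac

def derive_pmk(password: str, ssid: str) -> bytes:
    # PBKDF2-HMAC-SHA1 with 4,096 iterations and a 256-bit output,
    # using the SSID as the cryptographic salt.
    return hashlib.pbkdf2_hmac("sha1", password.encode(), ssid.encode(), 4096, 32)

def derive_kck(pmk: bytes, ap_mac: bytes, sta_mac: bytes,
               anonce: bytes, snonce: bytes) -> bytes:
    # PRF-128: HMAC-SHA1 keyed with the PMK over the label
    # "Pairwise key expansion", a terminating zero byte, the sorted
    # address tuple, the sorted nonce tuple, and a final zero byte.
    data = (b"Pairwise key expansion\x00"
            + min(ap_mac, sta_mac) + max(ap_mac, sta_mac)
            + min(anonce, snonce) + max(anonce, snonce)
            + b"\x00")
    # The KCK is the PTK truncated to its first 128 bits.
    return hmac.new(pmk, data, hashlib.sha1).digest()[:16]
```

For PRF-128 a single HMAC-SHA1 invocation suffices, since its 160-bit output already covers the 128 requested bits.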
required to compute the PMK. In the first PBKDF2 round, the xor-transformation
is applied to the password and the inner pad ipad. The result is a 512 bit block
serving as input to the SHA1 hash function in initial state. The output is the
HMAC inner state. Since the SSID may be no longer than 32 bytes, the hashing
of the SSID and the PBKDF2 round counter can be done together with the
SHA1 finalization so that only one SHA1 iteration is necessary.
In the next step, the outer HMAC state is computed by hashing the xor
of the password and the outer pad opad. Afterwards, the previously finalized
160 bit digest is hashed and finalized with the outer state. At this point the
MAC is ready. The second PBKDF2 iteration is computed in the same way with
the difference that the round counter value is set to two instead of one. Since
the password does not change during PBKDF2 iterations, the inner and outer
HMAC states stay the same allowing us to use cached states instead of having
to compute the states again. Even with this optimization, it is still required
to compute at least 2 + 4,096 * 2 SHA1 iterations for the first PBKDF2 round
and 4,096 * 2 SHA1 iterations for the second round (i.e., 16,386 SHA1 iterations
in total) to obtain the PMK. This computational effort, the use of the SSID as salt
for key derivation and the security of the innermost SHA1 cryptographic hash
function are the main reasons why WPA2-Personal key derivation is very strong
against typical exhaustive key search attacks. Once the PMK is available, the KCK is
derived by applying a 128 bit Pseudo Random Function (PRF). Internally, it just
uses HMAC-SHA1 again with the PMK as key. The hashed message is made up of
the string “Pairwise key expansion”, a terminating zero byte, an arithmetically
sorted tuple of the AP and Station addresses as well as another sorted tuple of
their nonces (i.e., ANonce and SNonce) including a finalizing zero byte. The PTK
is the resulting MAC and it is truncated to the first 128 bits to obtain the KCK.
If the PMK is available, the computation of the KCK takes 5 SHA1 iterations,
since due to the length of the PMK the finalization of the inner HMAC state
cannot be combined with the hashing of the PMK. Whenever the AP or Station wants to
compute a MIC, they can do so by utilizing HMAC-SHA1 on the message with
KCK as key. The result of the computation truncated to the first 128 bits is the
MIC. The computational effort depends on the length of the message. However,
4 FPGA Implementation
We assume familiarity with FPGA design in general. SHA1 [12] is especially
well suited for FPGA implementation for the following reasons:
The most expensive operations are SHA1's additions, due to the long carry chain
between the adders. To implement the algorithm, a surrounding state machine
is required to control which inputs should be supplied to the logic in different
rounds. Considering that SHA1 has 80 rounds and we would like to achieve
maximum performance, there are two design options: Either the SHA1 algorithm
is implemented sequentially or in a fully pipelined way. The advantage of a
sequential implementation is that the FPGA can be completely filled up with
relatively small SHA1 cores. However, the disadvantage is that each of those
cores would require its own state machine which takes up a significant amount
of space. In comparison, a fully pipelined implementation does not require an
internal state machine as each of the SHA1 rounds is implemented in its own
logic block. While this is a significant advantage enabling parallel processing,
the drawback is that a fully pipelined implementation has much higher space
and routing requirements. When using multiple cores (each containing a full
pipeline), only an integer number of cores can be placed so that a significant
amount of unused space might be left on the FPGA. In our implementation, we
also experimented with filling up this space with sequential cores but refrained
from it due to the negative effect on the overall design complexity and the
lower achievable clock speeds. Due to the typically higher performance that can
be achieved through pipelining and the property that we get one full SHA1
computation output per clock cycle per core, we targeted a heavily optimized
and fully-pipelined approach. However, while pipelining alone has a considerable
performance impact in comparison to a sequential approach, the key to obtaining
maximum design performance lies in the optimizations. Our overall FPGA design
is illustrated in Fig. 3 and has the following components: A global brute force
search state machine, a shared password generator and an FPGA device specific
number of brute force cores, each comprising a WPA2-Personal state machine
with password verifier and a SHA1 pipeline.
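As a concrete reference for the pipeline discussion, the round and message-expansion logic of standard SHA1 can be modeled in software. This is a Python sketch of the algorithm itself, not the authors' HDL; it makes visible why each round's chained 32-bit additions dominate the critical path while the expansion is cheap XOR/rotate logic.

```python
import hashlib  # used only to sanity-check the model
import struct

MASK = 0xFFFFFFFF

def rol(x, n):
    return ((x << n) | (x >> (32 - n))) & MASK

def expand(w):
    # Message expansion: cheap XOR/rotate logic, which is why several
    # expansion steps can share a single pipeline stage.
    for t in range(16, 80):
        w.append(rol(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

def sha1_round(state, w_t, t):
    # One of the 80 rounds; the chained additions feeding the new 'a'
    # value form the long carry chain that limits the clock speed.
    a, b, c, d, e = state
    if t < 20:
        f, k = (b & c) | (~b & d), 0x5A827999
    elif t < 40:
        f, k = b ^ c ^ d, 0x6ED9EBA1
    elif t < 60:
        f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
    else:
        f, k = b ^ c ^ d, 0xCA62C1D6
    return ((rol(a, 5) + f + e + k + w_t) & MASK, a, rol(b, 30), c, d)

def sha1_one_block(block, iv=(0x67452301, 0xEFCDAB89, 0x98BADCFE,
                              0x10325476, 0xC3D2E1F0)):
    # One 512-bit block iteration: 80 rounds, then the final addition of
    # the previous digest (the separate 'Add' pipeline stage below).
    w = expand(list(struct.unpack(">16I", block)))
    state = iv
    for t in range(80):
        state = sha1_round(state, w[t], t)
    return tuple((s + i) & MASK for s, i in zip(state, iv))
```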
Global Brute Force State Machine. The task of the global brute force
state machine is to constantly supply all brute force cores with new password
candidates and check whether one of them found the correct password. Due to
the insignificant speed impact and the advantage of lower design complexity we
chose an iterative approach. Since our SHA1 pipeline comprises 83 stages,
we can concurrently test 83 passwords per brute force core. With our iterative
approach, we enable the password generator and consecutively fill all brute force
cores with passwords. Once all cores have been filled, the password generator
is paused and we iteratively wait until all cores have completed. At that point,
the password filling process is restarted. If a core finds the correct password
or the password generator has reached the last password, the state machine
jumps into the idle state and can accept the next working block. The penalty
of this iterative approach is 83 clock cycles per core, since a brute force core
that has finished could otherwise be refilled with new passwords immediately. However, in
comparison to the long run time of each core the impact is insignificant.
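The fill-and-drain scheduling can be modeled as follows. This is a hypothetical software sketch with our own names, not the actual state machine: each pass fills every core's 83-stage pipeline with candidates, after which the generator pauses until all cores have drained.

```python
from itertools import islice

def working_batches(passwords, cores, depth=83):
    # Yield one refill pass at a time: `cores` lists of up to `depth`
    # password candidates each. The generator stops once the password
    # source is exhausted.
    it = iter(passwords)
    while True:
        batch = [list(islice(it, depth)) for _ in range(cores)]
        if not batch[0]:
            return
        yield batch
```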
(Figure: SHA-1 pipeline within a brute force core — FIFO, buffer and 'Initiate' stages, 80 round stages, and a final 'Add' stage, controlled by a state machine.)
can utilize its fast clock to drain the FIFO buffer which would in turn enable
the password generator to refill the corresponding buffer at its slower clock.
The advantages of this approach would be the following: First, the complexity
of the password generator design can be further increased without negatively
impacting the critical path. Second, the big advantage is the routing of the
bus signals from the password generator to all the cores. Considering that the
password generator is located at the center of the design and the passwords
need to be distributed across the entire FPGA to all brute force cores, there is
a significant impact on the time-driven routing complexity and the interconnect
delays that negatively impact the maximum clock speed of the overall design. By
leveraging a slower clock, the passwords would be already located in the FIFO
buffers next to the SHA1 pipelines of each core but they could still be read
with the fast clock the SHA1 pipelines are operating on. However, since with
our previously mentioned password generator optimization the critical path was
no longer within the password generator domain, we did not implement this
approach. It will be covered in future work.
PBKDF2 round, the SSID and the PBKDF2 round counter with value 1 are used
as salt. After that, there are 4,095 more iterations in which the digest output is
used as input. At that point, the second PBKDF2 round is computed by first
computing the salt with an increased round counter value (2) and subsequently
performing 4,095 iterations to obtain the PMK.
SHA1 Pipeline. In each brute force core, the SHA1 pipeline occupies a large
amount of space due to the high number of pipeline stages. While SHA1 has 80
rounds and a fully pipelined implementation would thus have an equal number
of pipeline stages, we heavily optimized our pipeline to allow higher clock fre-
quencies and consequently achieve more performance. The SHA1 pipeline is the
key limiting factor of how fast our password guessing attacks can be conducted.
Within the brute force cores, each of our SHA1 pipelines has 83 stages due to
the optimizations we performed. Each core can thus compute 83 password can-
didates in parallel. The optimization approaches we applied are described in the
following.
The first stage of the SHA1 pipeline is a buffer stage so that the delays of the
different input logic blocks within the WPA-2 Personal state machine are not
added to the pipeline’s input logic and thereby do not increase the overall time
delay of the critical path. The second stage denoted ‘Initiate’ is an optimization
of the 4 required (expensive) additions in each SHA1 round. Instead of having
all 4 additions in one stage, the structure of the SHA1 algorithm allows us to
split up the required 4 sequential additions into two rounds with 2 additions
each, thereby significantly improving the maximum clock speed. Since the SHA1
expansion steps require only a small amount of logic, another optimization is to
do multiple message expansion steps in a single pipeline stage so that it is not
needed in the following few stages. As a result, the source data is not accessed in
each stage and shift register inference is boosted, causing lower flip-flop fan-out
as well as lower power usage and area requirements. Another approach we
took is the pipeline stage denoted ‘Add’ after the SHA1 rounds. After the last
SHA1 round, the resulting digest is added either to the constant initialization
vector (first iteration) or to the previous digest for subsequent iterations. Due to
these expensive additions, the design performance can be improved if they are
carried out in a separate pipeline stage. Instead of forwarding the initial digest
through all stages to the final addition stage, we leverage a FIFO-based delay line
utilizing the FPGA's Block-RAM resources. This avoids excessive interconnect
routing through all stages and thus makes the design smaller, reduces the number
of critical paths and allows us to achieve higher clock frequencies more easily.
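The behavior of such a FIFO delay line can be modeled in a few lines. This is a software analogy (names and depths are ours): a value pushed in emerges exactly `depth` clock calls later, so the initial digest need not be routed through every pipeline stage.

```python
from collections import deque

class DelayLine:
    # Software model of a Block-RAM delay line (illustrative, not HDL).
    def __init__(self, depth, fill=0):
        self.q = deque([fill] * depth, maxlen=depth)

    def clock(self, value):
        out = self.q[0]       # value inserted `depth` cycles ago
        self.q.append(value)  # maxlen drops the oldest entry automatically
        return out
```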
for the password in later stages. A similar approach is used to avoid excessive
interconnects and design density. Instead of having large buses, we either use
Block-RAMs directly or form RAM-based delay lines to keep the IState and
OState states as well as the computed PMKs and PTKs in memory. Instead of one
large WPA2-Personal state multiplexer directly controlling all SHA1 pipeline
inputs and outputs, we make use of several smaller and less complex multiplex-
ers. Once again, this reduces overall design complexity and allows us to achieve
higher clock speeds more easily. The top-level design needs to communicate with
the outside world. Each time a new working block is added, all necessary Wi-Fi
and WPA2-Personal data needs to be transferred and subsequently forwarded
to all brute force cores. The result is a very broad bus spreading all over the
FPGA design and causing severe design congestion. Since in our design only the
password candidates and the SSID are required early within the WPA2-Personal
state machine, we transfer the rest of the data over a small 16 bit bus lever-
aging inferred shift registers. This significantly reduces the complexity of the
interconnects between the shared global state machine and the brute force cores
across the FPGA. To lower the amount of input and output data exchanged with
the outside world, we use a minimized Wi-Fi and WPA2-Personal data set that
only includes the variable data fields from the captured handshake. All other
data is not only fixed within the FPGA, but also kept locally in the cores. In
addition, the FPGA does not output the correct password, but a numeric offset
from the start password instead. To avoid design congestion and to push the
design to the highest clock speed possible, we make use of custom parameters
within the Xilinx design tools for synthesis, mapping and routing such as the
minimum inferred shift register size, register balancing or the number of cost
tables. In addition, we use floor planning to support the mapper, placer and
router in achieving higher clock rates. Floor planning is important to place crit-
ical components requiring a fast interconnect in between next to each other. In
general, we were able to obtain the highest speed improvements by utilizing a
star-like topology: the password generator is distributed over the very center
of the FPGA and the brute force cores surround it. In addition we also
used floor planning to avoid the placement of time critical components in FPGA
areas that are hard to reach through interconnects. Consequently, we carefully
placed critical components like the SHA1 pipelines in a way that those regions
do not negatively impact the routing delay. In our FPGA implementations, we
use a slow clock for communication with the outside world and a fast clock for
computation at the same time. In our Spartan-6 implementation, the speed of
the fast clock can be adjusted dynamically during runtime by programming the
clock multiplier. In contrast, our Artix-7 implementation includes an automatic
clock scaling mechanism to adjust the fast clock frequency with the device core
temperature. Both approaches allow the FPGA design to run at high speeds
without the danger of overheating.
The system comprises a PC running host software and several FPGA boards
connected via the USB 2.0 high-speed interface. Each FPGA board has a fast EZ-
USB FX2 micro-controller with custom firmware to interface with the FPGAs.
Our custom host software utilizes the ZTEX SDK to allow easy communication
with the micro-controller and the FPGAs. The host software accepts a config-
uration file that includes all necessary Wi-Fi and WPA-2 Personal handshake
data. At startup, it enumerates all connected FPGA boards, uploads the micro-
controller firmware if necessary and configures the FPGAs with our bit stream.
The software makes use of several threads. Apart from the main program, there
is a thread to generate password working blocks for the FPGAs and additional
threads for each FPGA board. The password working blocks are kept in a pool
with constant size. The device threads can supply working blocks to FPGAs and
mark them as being processed. If an FPGA has finished a block, it is removed
from the pool and the generator automatically creates a new working block.
If for some reason an FPGA fails, the block sent to the FPGA is still in the
pool and just needs to be unmarked so that the next free FPGA can process
it instead. The micro-controller firmware is responsible for USB communication
with the host and communication with the FPGAs.
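The pool logic described above can be sketched as follows. This is a simplified model with hypothetical names, not the actual host software: blocks are handed to device threads, marked in-flight, removed only on completion, and unmarked again on failure so another FPGA can retry them.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class WorkPool:
    # Constant-size pool of password working blocks shared by the
    # per-FPGA device threads.
    blocks: list = field(default_factory=list)
    in_flight: set = field(default_factory=set)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def take(self):
        # Hand out the next block that no FPGA is currently processing.
        with self.lock:
            for b in self.blocks:
                if b not in self.in_flight:
                    self.in_flight.add(b)
                    return b
        return None

    def finish(self, b, next_block):
        # FPGA finished: drop the block and let the generator refill.
        with self.lock:
            self.blocks.remove(b)
            self.in_flight.discard(b)
            self.blocks.append(next_block)

    def fail(self, b):
        # FPGA failed: unmark so the next free device can retry the block.
        with self.lock:
            self.in_flight.discard(b)
```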
6 http://www.ztex.de.
5 Evaluation
We evaluated the performance and the power usage of our design on multiple
FPGAs and FPGA boards and compared the results to GPUs.
The first FPGA we targeted was a Spartan-6 XC6SLX150T-3 device. Four of
these FPGAs can be found on the Ztex 1.15y board visible on the left of Fig. 6.
The second FPGA we used for our evaluation was an Artix-7 XC7A200T-2 device
on the Ztex 2.16 board visible on the right of the picture. For both FPGAs, we
created an optimized implementation and a configuration bit stream that can
be uploaded to the device. The main difference between the bit streams is the
FPGA type, the maximum clock frequency and most importantly the number
of brute force cores we were able to fit onto the device.
Table 1. Performance and power results of our implementations for different FPGA
devices and systems/boards
System         FPGAs  Type          Cost     Cores  Tool W     Tool MHz  Meas. W  Act. MHz  calc pwd/s  pwd/s    pwd/s W
Ztex 1.15y     1      XC6SLX150T-3  175      2      4.281      187       6.99*    180       21,956      21,871   3,128*
Ztex 1.15y     4      XC6SLX150T-3  700      8      17.124     187       27.96    180       87,826      87,461   3,128
9x Ztex 1.15y  36     XC6SLX150T-3  2,400    72     154.116    187       254      180       790,436     741,200  2,918
Ztex 2.16      1      XC7A200T-2    213      8      10.458     180       11.04    180       87,826      87,737   7,947
N/A            1      XC7K410T-3    2,248    16     25.634     216       N/A      N/A       210,783     N/A      N/A
N/A            48     XC7K410T-3    107,904  768    1,230.432  216       N/A      N/A       10,117,584  N/A      N/A
7 http://hashcat.net/oclhashcat.
such as Digi-key8. However, while the cost for 9 new Ztex 1.15y boards would be
approximately 6,300 US$, we considered our 9 second-hand Ztex 1.15y boards previously
used for cryptocurrency mining instead. We were able to obtain these boards for
2,400 US$, which we believe is what amateurs could do as well, depending on
how many boards they would like to acquire and how much they are willing to
spend. The Cores column shows how many cores we were able to fit onto the
device to achieve maximum performance. While more cores per device gener-
ally increase the performance, it can also cause the maximum clocking speed
to drop significantly due to mapping, placement and routing issues. The table
presents the implementations allowing us to achieve the maximum performance
per device. The Tool W and Tool MHz columns present the design tool’s power
and timing analysis results. For the Spartan-6 FPGAs, we used the Xilinx ISE
Suite 14.7 whereas for the newer 7-series devices Artix-7 and Kintex-7, we used
Vivado Design Suite 2015.1. In general, the newer Vivado tools appeared to
produce better results, but since Vivado does not support the older 6-series devices,
we were unable to use it for our Spartan-6 implementations. The Meas. W and
Act. MHz columns present the results for the power measurements we conducted
on the FPGA boards/systems and the actual clock speed we used to run the
devices. The calc pwd/s and pwd/s columns provide the WPA2-Personal per-
formance in passwords per second: the former indicates the calculated theoretical
maximum performance of our implementation, while the latter shows
the actual measured average performance per board and/or system.
In the last column pwd/s W, we use our actual power and performance measure-
ments to determine how much brute force speed can be achieved per Watt which
is especially important when scaling up our implementation to larger FPGA clus-
ter systems. In the following, we discuss the results of our implementations on
a per-device basis.
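The pwd/s W column of Table 1 follows directly from the two measured columns, e.g. 87,737 guesses/s at 11.04 W for the Artix-7 row; the helper name below is ours.

```python
def guesses_per_watt(pwd_per_s: float, watts: float) -> int:
    # Measured guesses per second divided by measured power draw,
    # truncated to whole guesses, reproducing the table's last column.
    return int(pwd_per_s / watts)
```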
areas indicate that there would be sufficient space for an additional core, our
experiments showed that this would lead to lower performance due to design
congestion. The first 3 rows in Table 1 present the results we obtained through this
implementation. Due to cooling requirements, we ran the design with a reduced
clock speed of 180 MHz. Our measurements indicate that in this configuration,
our implementation requires a total of 27.96 W for all 4 FPGAs on the Ztex
1.15y board. The power measurements per Spartan-6 FPGA are marked with
an asterisk to indicate that we were unable to measure them directly, but rather
derived the measurement results from our power measurements for the entire
Ztex 1.15y board with its 4 FPGAs. Our results show that our approach scales
well and can be easily run in a cluster configuration producing a performance
of 790,436 password guesses per second on our cluster. The difference between
the calculated maximum performance and the measured performance is mainly
due to the I/O times between the PC, the microcontroller and the FPGAs. In
addition, our Spartan-6 implementation includes a dynamic frequency scaling
mechanism slowing down the FPGAs in case of device temperatures getting too
high. With better cooling inside the cluster, we believe that the gap between the
theoretic performance and the measured performance could be made smaller.
Artix-7 Results. Starting from our already highly optimized Spartan-6 design
we ported our implementation to the newer 7-series Artix-7 XC7A200T-2 FPGA.
Since device internals such as the clocks or PLLs are different from the Spartan-
6 architecture, we had to adapt our implementation accordingly. The ability to
read the device’s core temperature from within the FPGA implementation was
especially interesting. It allowed us to implement frequency scaling mechanisms
directly on the FPGA not only preventing possible damage due to overheating,
but also ensuring that each device always runs at the maximum performance
possible. Our placed-and-routed, ready-to-upload design is shown in Fig. 7b.
Through floorplanning, all cores have a short path to the center, where the
small block with the global state machine and the password generator is
located. The implementation can be run at up to 180 MHz to achieve a theoretical
maximum of 87,826 password guesses per second. With a measured performance
of 87,737 password guesses per second, our results show that a single XC7A200T-
2 device not only achieves more performance than four of the older Spartan-6
XC6SLX150T-3 FPGAs combined, but also requires just 11.04 W during
operation.
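A quick sanity check of the efficiency metric used throughout (password guesses per second per watt, as also tabulated for the GPUs in Table 2), using the Artix-7 figures quoted above:

```python
artix7_pwd_s = 87_737   # measured password guesses per second (XC7A200T-2)
artix7_w = 11.04        # measured power draw in watts

efficiency = artix7_pwd_s / artix7_w
print(round(efficiency))   # ~7947 pwd/s per watt
```

This per-watt figure is the basis for the energy-cost comparison against the GPU systems in the next section.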
The results of our GPU evaluation (Sect. 5.1) are shown in Table 2. We performed
the performance measurements by running cudaHashcat v1.36 on different sys-
tems and measuring the power consumption as the difference between idle and
busy WPA2 computations, to obtain results independent of other components in
the system. The table shows the different GPU configurations (System) we used
for our tests. The pwd/s column shows the performance in passwords per second,
and the W column indicates the power consumed by the GPU during runtime in
watts. The performance per watt is given in the pwd/s/W column. In addition
to running GPU measurements on our own machines, we also conducted mea-
surements on dedicated Amazon Elastic Compute Cloud (EC2) GPU machines.
While we could measure the performance on the machines just the same, we
References
1. Eastlake 3rd, D., Jones, P.: US Secure Hash Algorithm 1 (SHA1). RFC 3174 (Infor-
mational), September 2001. Updated by RFCs 4634, 6234
2. Eastlake 3rd, D., Hansen, T.: US Secure Hash Algorithms (SHA and SHA-based
HMAC and HKDF). RFC 6234 (Informational), May 2011
3. Elcomsoft: ElcomSoft and Pico Computing Demonstrate World’s Fastest Password
Cracking Solution. https://www.elcomsoft.com/PR/Pico_120717_en.pdf. Accessed
13 Nov 2015
4. Elcomsoft Blog: Accelerating Password Recovery: The Addi-
tion of FPGA (2012). http://blog.elcomsoft.com/2012/07/
accelerating-password-recovery-the-addition-of-fpga. Accessed 13 Nov 2015
5. Güneysu, T., Kasper, T., Novotný, M., Paar, C., Wienbrandt, L., Zim-
mermann, R.: High-performance cryptanalysis on RIVYERA and COPA-
COBANA computing systems. In: Vanderbauwhede, W., Benkrid, K. (eds.) High-
Performance Computing Using FPGAs, pp. 335–366. Springer, New York (2013).
http://dx.doi.org/10.1007/978-1-4614-1791-0_11
6. IEEE-Inst.: 802.11-2012 - IEEE Standard for Information technology-
Telecommunications and information exchange between systems Local and
metropolitan area networks-Specific requirements Part 11: wireless LAN Medium
Access Control (MAC) and Physical Layer (PHY) Specifications. Technical report,
IEEE Std 802.11TM -2012, IEEE-Inst (2012). http://ieeexplore.ieee.org/servlet/
opac?punumber=6178209
7. Johnson, T., Roggow, D., Jones, P.H., Zambreno, J.: An FPGA architecture for
the recovery of WPA/WPA2 keys. J. Circ. Syst. Comput. 24(7) (2015). http://dx.
doi.org/10.1142/S0218126615501054
8. Kaliski, B.: PKCS #5: Password-Based Cryptography Specification Version 2.0.
RFC 2898 (Informational), September 2000. http://www.ietf.org/rfc/rfc2898.txt
9. Kammerstetter, M., Muellner, M., Burian, D., Kudera, C., Kastner, W.: Efficient
high-speed WPA2 brute force attacks using scalable low-cost FPGA clustering
(extended version). http://arxiv.org/pdf/1605.07819v1.pdf. Accessed 25 May 2016
10. PicoComputing Inc.: SC5-4U Overview. http://picocomputing.com/brochures/
SC5-4U.pdf. Accessed 13 Nov 2015
11. Visan, S.: WPA/WPA2 password security testing using graphics processing units.
J. Mob. Embed. Distrib. Syst. 5(4), 167–174 (2013)
12. U.S. Department of Commerce National Institute of Standards, Technology: FIPS
PUB 180–2, Secure Hash Standard (SHS), U.S. Department of Commerce/National
Institute of Standards and Technology (2002)
Fault Attacks
ENCOUNTER: On Breaking the Nonce Barrier
in Differential Fault Analysis with a
Case-Study on PAEQ
1 Introduction
The popularity of cryptanalyzing a cipher by observing its behavior under the
influence of faults is mainly attributed to the ease of such fault induction and
the overhead of incorporating countermeasures. Among different types of fault-
based cryptanalysis, Differential Fault Analysis (DFA) [1–7] has garnered par-
ticular attention in the side-channel research community, since it has been one
of the most effective side-channel attacks on symmetric-key constructions. DFA
puts an interesting ability in the hands of an attacker: the possibility of per-
forming a differential analysis starting from an intermediate state of the cipher.
This ability could be fatal in case of iterated symmetric-key designs since it is
equivalent to cryptanalyzing a round-reduced version of the cipher. However,
classical DFA has a specific requirement known as the replaying criterion which
states that the attacker must be able to induce faults while replaying a previ-
ous fault-free run of the algorithm. In this scenario, the introduction of a nonce
constraint comes in as a direct contradiction to the ability to replay.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 581–601, 2016.
DOI: 10.1007/978-3-662-53140-2_28
582 D. Saha and D.R. Chowdhury
the output here is truncated, giving an attacker access to only partial information
about the state. Thus the current work provides an instance of a fault-based
analysis of partially specified states.
Our Results
– Introduce the concept of internal differential fault analysis (IDFA) in the con-
text of parallelizable ciphers in the counter mode.
– Showcase that IDFA requires only one run of the algorithm, thereby overcoming
the nonce barrier of DFA.
– Present a 4-round internal differential distinguisher for PAEQ.
– Use the distinguisher to develop the EnCounter attack on full-round PAEQ
using only two faults in the same instance of PAEQ.
– Reduce the average key-space of the primary PAEQ variants to practical limits,
viz., paeq-64: 2^64 to 2^16, paeq-80: 2^80 to 2^16, paeq-128: 2^128 to 2^50.
– Present instances of fault analysis of an AES-based design with various types
of partially specified internal states.
The rest of the paper is organized as follows. Section 2 provides a brief descrip-
tion of PAEQ. The notations used in this work are given in Sect. 3. The concept
of internal differential fault analysis is introduced in Sect. 4. A 4-round distin-
guisher of PAEQ is showcased in Sect. 5. Section 6 introduces the notion of fault
quartets. The EnCounter attack on PAEQ is devised in Sect. 7 and its complexity
analysis is furnished in Subsect. 7.5. The experimental results are presented in
Sect. 8 while the concluding remarks are given in Sect. 9.
AESQ operates on a 512-bit internal state which can be subdivided into 128-bit
substates. Before going into details we introduce some definitions.
A word is just a byte redefined to account for the field arithmetic. In this work,
we will come across partially specified states/substates where certain words might
have unknown values. To capture this scenario, we use the symbol ‘X’ to represent
unknown words. Thus, to be precise a word is an element of T ∪ {‘X’}.
s^m = [s^m_{i,j}], where s = (s^1, s^2, s^3, s^4); 0 ≤ i, j < 4; m ∈ {1, 2, 3, 4}; and s^m_{i,j} ∈ T ∪ {‘X’}.

The column shuffle across the four substates (s^1, s^2, s^3, s^4) maps:

From: s^1_{*,0} s^1_{*,1} s^1_{*,2} s^1_{*,3} | s^2_{*,0} s^2_{*,1} s^2_{*,2} s^2_{*,3} | s^3_{*,0} s^3_{*,1} s^3_{*,2} s^3_{*,3} | s^4_{*,0} s^4_{*,1} s^4_{*,2} s^4_{*,3}
To:   s^1_{*,3} s^4_{*,3} s^3_{*,2} s^2_{*,2} | s^1_{*,1} s^4_{*,1} s^3_{*,0} s^2_{*,0} | s^1_{*,2} s^4_{*,2} s^3_{*,3} s^2_{*,3} | s^1_{*,0} s^4_{*,0} s^3_{*,1} s^2_{*,1}
denotes the AES substitution box and M_μ denotes the MixColumns matrix. ρ^m_r
does not rely on the values of s^m and just shifts the positions of unknown values.

α: s^m_{i,j} ↦ SBOX(s^m_{i,j}) if s^m_{i,j} ≠ ‘X’; ‘X’ otherwise.

μ: s^m_{*,j} ↦ M_μ × s^m_{*,j} if s^m_{i,j} ≠ ‘X’ for all i; (X, X, X, X)^T otherwise.

β_r: s^m_{i,j} ↦ s^m_{i,j} if i ≠ 1; s^m_{i,j} ⊕ rc^m_r if i = 1 and s^m_{i,j} ≠ ‘X’; ‘X’ otherwise.
3 Notations
Definition 3 (Diagonal). A diagonal of a substate (sm = [sm i,j ]) is the set of
words which map to the same column under the Shift-Row operation.
d^m_k = {s^m_{i,j} : ρ(s^m_{i,j}) ∈ s^m_{*,k}}, where k = (j − σ(i)) mod 4; σ = {0, 1, 2, 3}   (1)
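The diagonal relation of Definition 3 is easy to enumerate; a small sketch, assuming the shift vector σ(i) = i (row i rotated by i positions, as in AES ShiftRows):

```python
def diagonal_index(i: int, j: int) -> int:
    """Diagonal k to which word (i, j) belongs: k = (j - sigma(i)) mod 4."""
    return (j - i) % 4

def diagonal(k: int):
    """All word positions of diagonal k in a 4x4 substate."""
    return [(i, (i + k) % 4) for i in range(4)]

print(diagonal(0))  # [(0, 0), (1, 1), (2, 2), (3, 3)] -- the main diagonal
```

Every word of diagonal k lands in column k after ShiftRows, which is exactly the property the definition captures.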
Fig. 2. Generic model for using internal differentials in fault analysis of parallelizable
ciphers in the counter mode.
However, in this work we introduce the concept of Fault Quartets (Refer Sect. 6)
which can use a round-reduced distinguisher of the underlying cipher to locate
the fault-free branch corresponding to a faulty branch due to the first fault under
reasonable assumptions. Building upon these ideas a practical IDFA attack is
mounted on PAEQ. Though the specifics rely on the underlying construction, the
overall notion of IDFA can be adapted to other ciphers which meet the properties
discussed earlier. In the next section we develop a four-round distinguisher of
PAEQ based on internal differentials arising from counter values.
Observation 1. Two parallel branches of PAEQ with the same domain separator
differ only in the counter value.
– R1 : The difference spreads to an entire column, and the bytes are related by
factors governed by the MixColumns matrix.
– R2 : The entire substate is affected; columns of the substate are related by factors.
– S : One column of each substate is affected; factor relations are unaffected.
– R3 : All substates are affected; all columns exhibit byte inter-relations due to μ3 .
– R4 : All substates remain affected, with all relations destroyed due to β4 .
– S : Columns permute.
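The column factors that appear after R1 can be reproduced with a short GF(2^8) computation, assuming the standard AES MixColumns matrix (its first column of coefficients is (2, 1, 1, 3)):

```python
def xtime(a: int) -> int:
    """Multiply by 2 in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) & 0xFF if a & 0x100 else a

def mixcolumn_single_diff(f: int):
    """Output difference of MixColumns for an input column difference (f, 0, 0, 0)."""
    return (xtime(f), f, f, xtime(f) ^ f)   # (2f, 1f, 1f, 3f)

print([hex(x) for x in mixcolumn_single_diff(0x01)])  # ['0x2', '0x1', '0x1', '0x3']
```

A single-byte difference thus always yields a column of differences related by the fixed factors {2, 1, 1, 3}, whatever the value f of the injected difference.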
This process is illustrated in Fig. 3a. After every round the corresponding byte
inter-relations are given. Recall that we are dealing with a differential state
(Definition 4); the numbers in a particular column indicate the factor by
² Recall, the counter is of size c = n − k − 16 bits.
³ For instance, i = 5 and j = 8 differ only in the least significant byte.
which the byte-wise differences in that column are related. In the first substate,
{2, 1, 1, 3} implies that the differences are of the form {2 × f, 1 × f, 1 × f, 3 × f}
where f ∈ T \ {0}. The byte inter-relations after R3 translate into the following
observation:
case the internal difference in R1 input might span multiple substates. Finally,
the Distinguisher constitutes a general 3-round differential characteristic which
will hold for any 3 consecutive rounds of AESQ as long as the starting states
have an internal difference of one byte. This is what is exploited to develop an
internal differential fault attack on PAEQ.
From the above remark we get the impression that if somehow we could get
a one-byte internal difference in R17 then the 4-round distinguishing property
could be verified from R20 i.e. the full AESQ permutation. In order to achieve
this scenario we introduce the concept of Fault Quartets in the next section.
where Constraint 1: s ⊕ t = 0, and
Constraint 2: s# and t# have an internal difference of one byte.
of Qi,j which states that the input states must have no internal difference while
the differentiator induces a one-byte internal difference between the outputs
of R16 which constitutes the second constraint. Figure 4 demonstrates the fault
injection. The following observation accounts for the choice of a 255-block mes-
sage above.
The complete block at the end ensures that all inputs to AESQ have the same
domain separator. The choice of 255 implies that all of these differ only in the
last byte of the counter. Due to the equalizer fault, the counter value of branch
i changes to j, which is equal to the counter value of any⁷ one of the remaining
254 fault-free branches. Thus, this outlines the condition that guarantees that a
fault quartet is generated. But how would one find such a quartet, since neither
the input states nor the output of AESQ16 is visible to an adversary? This is
addressed next, where we give an algorithm to find such a quartet.
Finding Qi,j : Finding a fault quartet translates into finding the branch-index
ordered pair (i, j), where i corresponds to the branch of PAEQ in which the faults
have been introduced and j corresponds to the fault-free branch. This is
done by Algorithm 3, using the distinguisher developed earlier as a subroutine.
One can recall that, due to the differentiator, there will be a one-byte inter-
nal difference between branches i and j at the input of R17 . Thus the distinguishing
property can be verified from R20 , i.e., the full AESQ permutation.
Qi,j gives us the opportunity to exploit the distinguishing property of PAEQ
in the last four rounds of AESQ. With these concepts we are in a position to
finally introduce the EnCounter attack which exploits the property further to
recover the entire internal state of AESQ thereby revealing the key.
⁶ With a probability of 255/256.
⁷ Except for the case when j = 0, when it matches none of the remaining branches.
The EnCounter attack proceeds in two phases: InBound and OutBound. The
InBound phase is common to all PAEQ variants while the OutBound phase varies.
In this work we focus on paeq-64, paeq-80 and paeq-128, which constitute the
primary set of the PAEQ family as specified by the designers. We first provide a
high-level description of the attack and then delve into the details.
High-level Description of EnCounter
Firstly, as EnCounter is an IDFA attack it needs only a single run of PAEQ. Sec-
ondly, it is based on a random byte fault model requiring 2 faults: equalizer
and differentiator, induced in any branch of AESQ while encrypting the plain-
text. equalizer targets the last byte of the counter while differentiator tar-
gets any byte at the input of R17 (Fig. 4). While classical DFA generally deals
with single block messages, IDFA uses a single multi-block message as it targets
parallelizable ciphers. In the context of EnCounter the plaintext itself is a 255-
block message. After fault injection the attacker uses faulty and fault-free blocks
of the same ciphertext and corresponding plaintext blocks to mount the attack.
Assuming the faulty branch to be i, faulty ciphertext block Ci is identified below.
EnCounter Input:
P = P1 || P2 || · · · || Pi || · · · || Pj || · · · || P255
C = C1 || C2 || · · · || Ci || · · · || Cj || · · · || C255 || Tag
Fig. 5. The InBound phase. Returns candidates for columns after β19 .
The attacker inverts the reconstructed state⁸ up to the input of ρ19 . Again, due
to fault diffusion, every substate of this state has exactly one column with non-
zero related differences (Refer Fig. 3a). However, the relations differ based on the
location of the differentiator fault. By virtue of the Diagonal Principle⁹ [7], which
⁸ Computed using the XOR of plaintext and ciphertext blocks.
⁹ Faults injected in the same diagonal of an AES state in round r input lead to the
same byte inter-relations at the end of round (r + 1).
is a well-known result of DFA on AES, we know that for a particular substate there
can be four kinds of relations based on the diagonal where the differentiator
was injected. Figure 7 shows all these possible byte inter-relations based on the
source of the byte-fault at the input of R17 .
The attacker already knows which quadrant to look at in Fig. 7, since he
knows the substate location of the differentiator. However, he has no idea about
the source diagonal and has to resort to guessing. So, as in classical DFA, using
the input and output differences of β19 the attacker solves differential equations
to generate candidates for the four columns, which are stored in column vectors.
At the end of the InBound phase the attacker has a set of four column vectors for
the particular guess of the fault diagonal.
Fig. 8. The classification of substates observed in the internal state after S^{-1}(Pj ⊕ Cj),
based on the number of unknown bytes.
the substate in the state determined by S^{-1}(Pj ⊕ Cj). This is captured by Fig. 8
for the PAEQ variants analyzed in this work.
There are four types of substates with varying number of unknown bytes.
The primary aim of this phase is to reduce the search space for these sub-
states. A Type-1 substate is left unaltered since it has no unknown byte. Type-2
and Type-3 are equivalent in the sense that both have three completely known
columns while Type-4 can be converted to the same form by guessing 2 bytes of
the first column. Thus for the rest of the analysis we assume that we are deal-
ing with a substate with only one unknown column. We now describe how the
attacker produces candidates for such a substate. Again Fig. 9 visually illustrates
this process for easy reference. Let us consider the mth substate.
Fig. 9. The OutBound phase. Returns candidates for substates after R20 .
4. The process is repeated for all columns in the column vector, and all computed
substates are stored in the corresponding substate vectors.
At the end of the OutBound phase we have a set of substate vectors for all
substates of the state after R20 .
Remark 2. Unlike a Type-3 substate, a Type-2 substate has two extra known
bytes in the fourth column which can be exploited. Thus, the substate vector is
reduced by comparing candidates with respect to these two bytes, eliminating
those that do not match. This should lead to a large-scale reduction of candidates.
As regards a Type-4 substate, 2 bytes need to be guessed to make it like a Type-3;
thus we have to repeat the process of candidate generation 2^16 times.
|S| ≤ Σ_d |([s]^v)_d| = Σ_d ∏_{x=1}^{4} |([s^x]^v)_d|   (3)
Algorithm 4. EnCounter(P, C, i)
Input:  P, C ← one known plaintext–ciphertext pair with 255 complete blocks
        i ← index of the faulty branch
Output: K ← the master key
1: (i, j) ← FindQ(P, C, i)                    ▷ Locate fault quartet Qi,j
2: S ← ∅
3: for each guessed fault diagonal d do       ▷ Location of differentiator
4:     Four column vectors ← InBound(Pi ⊕ Ci, Pj ⊕ Cj, d)
5:     Four substate vectors ← OutBound(Pj ⊕ Cj, column vectors)
6:     if any substate vector = ∅ then go to 3
7:     State vector ← cross-product of the four substate vectors
8:     S ← S ∪ State vector                   ▷ Reduced state-space
9: for all e ∈ S do (Dx || jx || Nx || K) ← AESQ^{-1}(S(e))
10:    if (Dx || jx || Nx) == (D0 || j || N) then return K
This implies that it suffices to study the sizes of the substate vectors. It was seen
in the OutBound phase that Type-2 and Type-4 substates are related to Type-3.
Accordingly, the sizes of the corresponding substate vectors can be expressed in
terms of the size of a Type-3 substate vector. This is furnished in Table 2, where
q denotes the size of a Type-3 substate vector, while p and r denote the sizes of
Type-2 and Type-4 substate vectors, respectively. Table 3 enumerates the
theoretical upper bounds of the complexities, individually identifying the sizes
of the substate vectors.
8 Experimental Results
Computer simulations of EnCounter were performed over 1000 randomly chosen
nonces and keys. The results for paeq-64/80 are shown in the form of bar diagrams
in Figs. 10 and 11, respectively. The bars segregate the substate vectors in terms
of their sizes (the value at the base) with the frequency of occurrence given at the
top. The figures mainly show that in the average case q is concentrated around
2^8. It was mentioned in Remark 2 that in the presence of additional information
p could be further reduced such that p ≪ q. This is confirmed by the results,
which show that p = 1 with a few exceptions where p = 2. Table 3 summarizes
the results, while the details are given below:
– paeq-64: By Remark 3 we know that the diagonal of the differentiator can
be recovered, thereby avoiding the guessing step in Algorithm 4 and reducing
the complexity by a factor of four. So the final experimentally verified size of
S for paeq-64 stands at 72292 ≈ 2^16.14.
– paeq-80: During simulation it was found that for Type-2 substates the OutBound
phase returned empty substate vectors for a wrong guess of the faulty diagonal.
This made Step 6 of Algorithm 4 evaluate to TRUE, reducing |S| by four times. So
the final verified size, 72578 ≈ 2^16.14, is very close to that of paeq-64.
– paeq-128: It has three Type-3 substates contributing around 2^24, while a
Type-4 substate is supposed to contribute over 2^16 × 2^8. Finally, the complex-
ity is increased four times due to the diagonal guess. Thus the estimated value of
|S| is around 2^50.
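The paeq-128 estimate follows directly from the bound of Eq. (3); a back-of-the-envelope sketch, taking q ≈ 2^8 as observed experimentally:

```python
import math

q = 2 ** 8                 # typical Type-3 substate-vector size (experimental mean)
type3_part = q ** 3        # three Type-3 substates
type4_part = 2 ** 16 * q   # Type-4: 2^16 guesses of two bytes, each Type-3-like
S_bound = 4 * type3_part * type4_part   # x4 for the guessed fault diagonal

print(math.log2(S_bound))  # 50.0
```

The product over the four substates, summed over the four diagonal guesses, gives the reduced key-space of roughly 2^50 quoted above.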
[Bar charts omitted: distributions of substate-vector sizes over the 1000 runs, e.g. |[s^3]^v|: μ = 267.20, σ = 69.38; |[s^4]^v|: μ = 1.00, σ = 0.06]
Fig. 10. Bar diagram for sizes of substate vectors and reduced state-space for 1000
experiments on paeq-64 with mean (μ) and standard-deviation (σ) indicated.
Fig. 11. Bar diagram for sizes of substate vectors and reduced state-space for 1000
experiments on paeq-80 with mean (μ) and standard-deviation (σ) indicated.
9 Conclusion
This work introduces the notion of fault analysis using internal differentials.
Parallelizable ciphers using the counter mode are found to be good targets for
this kind of analysis, though the actual attack relies on the underlying construc-
tion. A 4-round distinguisher for the authenticated cipher PAEQ is demonstrated.
Using this, the idea of fault quartets is proposed, which can locate the fault-free
branch corresponding to a faulty branch. Finally, an internal differential fault
attack, EnCounter, is devised against PAEQ using just two random byte faults
with only a single faulty ciphertext and the corresponding plaintext. The attack
reduces the key-space of paeq-64, paeq-80 and paeq-128 to around 2^16, 2^16
and 2^50, respectively. The ability to mount an attack using a single faulty run of
the cipher makes IDFA independent of the effect of the nonce, thereby breaking the
nonce barrier of DFA. Moreover, the fault analysis presented here is of particu-
lar interest since it deals with internal states that are partially specified, which
sets it apart from classical DFA. Finally, this work constitutes the first analysis
of the CAESAR candidate PAEQ.
Acknowledgement. We would like to thank the anonymous reviewers for their invalu-
able comments and Orr Dunkelman for helping us in preparing the final version of the
paper.
References
1. Biham, E., Shamir, A.: Differential fault analysis of secret key cryptosystems. In:
Kaliski Jr., B.S. (ed.) CRYPTO 1997. LNCS, vol. 1294, pp. 513–525. Springer,
Heidelberg (1997)
2. Giraud, C.: DFA on AES. In: Dobbertin, H., Rijmen, V., Sowa, A. (eds.) AES
2005. LNCS, vol. 3373, pp. 27–41. Springer, Heidelberg (2005)
3. Dusart, P., Letourneux, G., Vivolo, O.: Differential fault analysis on A.E.S. IACR
Cryptology ePrint Archive, 2003:10 (2003). http://eprint.iacr.org/2003/010
4. Piret, G., Quisquater, J.-J.: A differential fault attack technique against SPN struc-
tures, with application to the AES and KHAZAD. In: Walter, C.D., Koç, Ç.K.,
Paar, C. (eds.) CHES 2003. LNCS, vol. 2779, pp. 77–88. Springer, Heidelberg
(2003)
5. Moradi, A., Shalmani, M.T.M., Salmasizadeh, M.: A generalized method of differ-
ential fault attack against AES cryptosystem. In: Goubin, L., Matsui, M. (eds.)
CHES 2006. LNCS, vol. 4249, pp. 91–100. Springer, Heidelberg (2006)
6. Mukhopadhyay, D.: An improved fault based attack of the advanced encryption stan-
dard. In: Preneel, B. (ed.) AFRICACRYPT 2009. LNCS, vol. 5580, pp. 421–434.
Springer, Heidelberg (2009)
7. Saha, D., Mukhopadhyay, D., Chowdhury, D.R.: A diagonal fault attack on the
advanced encryption standard. IACR Cryptology ePrint Archive, 2009:581 (2009).
http://eprint.iacr.org/2009/581
8. Rogaway, P.: Nonce-based symmetric encryption. In: Roy, B., Meier, W. (eds.) FSE
2004. LNCS, vol. 3017, pp. 348–359. Springer, Heidelberg (2004)
9. Boneh, D., DeMillo, R.A., Lipton, R.J.: On the importance of eliminating errors
in cryptographic computations. J. Cryptol. 14(2), 101–119 (2001)
10. Joye, M., Lenstra, A.K., Quisquater, J.-J.: Chinese remaindering based cryptosys-
tems in the presence of faults. J. Cryptol. 12(4), 241–245 (1999)
11. Coron, J.-S., Joux, A., Kizhvatov, I., Naccache, D., Paillier, P.: Fault attacks on
RSA signatures with partially unknown messages. In: Clavier, C., Gaj, K. (eds.)
CHES 2009. LNCS, vol. 5747, pp. 444–456. Springer, Heidelberg (2009)
12. Saha, D., Kuila, S., Chowdhury, D.R.: EscApe: diagonal fault analysis of APE. In:
Progress in Cryptology - INDOCRYPT 2014 - 15th International Conference on
Cryptology in India, New Delhi, India, December 14–17, 2014, pp. 197–216 (2014)
13. Andreeva, E., Bilgin, B., Bogdanov, A., Luykx, A., Mennink, B., Mouha, N.,
Wang, Q., Yasuda, K.: PRIMATEs v1.02. Submission to the CAESAR Compe-
tition (2014). http://competitions.cr.yp.to/round2/primatesv102.pdf
14. Peyrin, T.: Improved differential attacks for ECHO and Grøstl. In: Rabin, T. (ed.)
CRYPTO 2010. LNCS, vol. 6223, pp. 370–392. Springer, Heidelberg (2010)
15. Dinur, I., Dunkelman, O., Shamir, A.: Collision attacks on up to 5 rounds of SHA-3
using generalized internal differentials. In: Moriai, S. (ed.) FSE 2013. LNCS, vol.
8424, pp. 219–240. Springer, Heidelberg (2014)
16. CAESAR: Competition for Authenticated Encryption: Security, Applicability, and
Robustness. http://competitions.cr.yp.to/caesar.html
17. Daemen, J., Rijmen, V.: The Design of Rijndael: AES - The Advanced Encryption
Standard. Information Security and Cryptography. Springer, Heidelberg (2002)
18. Biryukov, A., Khovratovich, D.: PAEQ: parallelizable permutation-based authen-
ticated encryption. In: Chow, S.S.M., Camenisch, J., Hui, L.C.K., Yiu, S.M. (eds.)
ISC 2014. LNCS, vol. 8783, pp. 72–89. Springer, Heidelberg (2014)
19. Khovratovich, D., Biryukov, A.: PAEQ v1. Submission to the CAESAR Competi-
tion (2014). http://competitions.cr.yp.to/round1/paeqv1.pdf
Curious Case of Rowhammer:
Flipping Secret Exponent Bits Using Timing
Analysis
1 Introduction
Rowhammer is a term coined for disturbances observed in recent DRAM devices
in which repeated row activation causes the DRAM cells to interact electrically
among themselves [1–4]. This results in bit flips [2] in DRAM due to discharging
of the cells in the adjacent rows. DRAM cells are composed of an access transistor
and a capacitor which stores charge to represent a bit. Since capacitors lose their
charge over time, DRAM cells need to be refreshed within a fixed interval of
time, referred to as the refresh interval. DRAM comprises a two-dimensional
array of cells, where each row of cells has its own wordline and, to access
a row, its respective wordline needs to be activated. Whenever some data
is requested, the cells in the corresponding row are copied to a direct-mapped
cache termed the row-buffer. If the same row is accessed again, the contents of
the row-buffer are read without row activation. However, repeatedly activating
a row causes cells in adjacent rows to discharge themselves, resulting in bit flips.
c International Association for Cryptologic Research 2016
B. Gierlichs and A.Y. Poschmann (Eds.): CHES 2016, LNCS 9813, pp. 602–624, 2016.
DOI: 10.1007/978-3-662-53140-2_29
2 Preliminaries
In this section, we provide background on some key concepts, including some
DRAM details, the rowhammer bug, and details of the cache architecture that
is targeted in the attack.
[Figure 1: DRAM organization — the processor is connected over a memory channel to ranks (Rank 0, Rank 1), each comprising multiple banks]
Code-hammer:
{
    mov (X), %eax     // read from address X
    mov (Y), %ebx     // read from address Y
    clflush (X)       // flush cache for address X
    clflush (Y)       // flush cache for address Y
    jmp Code-hammer
}
[Figures 2 and 3: the cache hierarchy with a unified 6 MB L3, and the address decomposition — the MMU maps the virtual page number to a physical frame number; the physical address splits into tag, set index (derived via a hash for slice selection), line offset, and page offset]
increases. The L3 or Last Level Cache (LLC) is shared across processor cores,
takes longer to access, and is further divided into slices so that it can be accessed
by multiple cores concurrently. Figure 2 illustrates the architectural specification
of a typical Intel Ivy-Bridge architecture [11]. In Intel architectures, the data
residing in any of the lower levels of cache is included in the higher levels as
well; the caches are thus inclusive. On an LLC cache miss, the requested element
is brought into the cache from the main memory, and the cache miss penalty is
much higher compared to lower-level cache hits.
Requested data is brought from the main memory into the cache in chunks of
cache lines, typically 64 bytes in recent processors. The data requested
by the processor is associated with a virtual address from the virtual address
space allocated to the running process by the operating system. The virtual
address can be partitioned into two parts: the lower bits in Fig. 3 are the offset
within the page, typically represented by log2(page size) bits, while the remain-
ing upper bits form the page number. The page number is an index into the page
table and translates to a physical frame number. The frame number together with
the offset bits constitutes the physical address of the element. The translation of
virtual to physical addresses is performed at run time; thus the physical address
of an element is likely to change from one execution to another.
The physical address bits decide the cache set and slice in which a datum will
reside. If the cache line size is b bytes, then the least significant log2(b)
bits of the physical address are used as the index within the cache line. If the
target system has k processor cores, the LLC is partitioned into k slices,
each having c cache sets, where each set is m-way associative.
The log2(c) bits following the log2(b) cache-line bits determine the cache
set in which the element will reside. Because of associativity, m such
cache lines having identical set-index bits reside in the same set. The slice
addressing in modern processors is implemented by computing a complex hash
function. Recently, there have been works that reverse-engineered [12,13] the
LLC slice-addressing function. Reverse engineering on Intel architectures has
been attempted in [13] using timing analysis. The functions differ across
architectures, and each of these functions has been documented via successful
reverse engineering in [12,13].
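The address decomposition described above can be made concrete with a small sketch. The parameters are illustrative (4 KB pages, 64-byte lines, 8192 sets per slice); real slice selection additionally applies the reverse-engineered hash of [12,13], which is omitted here:

```python
PAGE_BITS = 12      # log2(4096): page-offset bits
LINE_BITS = 6       # log2(64): index within a cache line
SET_COUNT = 8192    # c sets per slice (illustrative value)

def split_virtual(va: int):
    """Virtual address -> (virtual page number, page offset)."""
    return va >> PAGE_BITS, va & ((1 << PAGE_BITS) - 1)

def cache_index(pa: int):
    """Physical address -> (line offset, set index); slice hash omitted."""
    line_off = pa & ((1 << LINE_BITS) - 1)
    set_idx = (pa >> LINE_BITS) % SET_COUNT
    return line_off, set_idx

print(split_virtual(0x7F3A1234))   # page number and offset from a virtual address
print(cache_index(0x123456780))    # where a physical address lands in one slice
```

Because only the page-offset bits survive the virtual-to-physical translation unchanged, a user-level attacker controls the low LINE_BITS and part of the set index, but not the frame-number bits that select the DRAM bank and row.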
608 S. Bhattacharya and D. Mukhopadhyay
In this paper, we aim to induce bit faults in the secret exponent of a public-key
exponentiation algorithm using the rowhammer vulnerability of DRAM, with
increased controllability. The secret resides in some location in the cache memory
and also in some unknown location in the main memory. The attacker, having
only user-level privileges on the system, does not know these locations in the
LLC and DRAM, since they are decided by the mapping of the physical
address bits. The threat model assumed throughout the paper allows the adversary
to send known ciphertexts to the algorithm and observe its decrypted output.
Let us assume that the adversary sends an input plaintext to the encryption process and observes the output of the encryption. The adversary thus gets hold of a valid plaintext-ciphertext pair, which will be used while checking for bit flips. The adversary has a handle to send ciphertexts to the decryption oracle, which decrypts the input and sends back the plaintext. The decryption process constantly polls for its input ciphertexts and sends the plaintext to the requesting process. The adversary aims to reproduce a bit flip in the exponent and thus first needs to identify the DRAM bank in which the secret exponent resides. Let us assume that the secret exponent resides in some bank, say bank A. Though the decryption process constantly performs exponentiations with accesses to the secret exponent, such access requests are usually served from the cache memory itself, since they result in a cache hit. In this scenario it is difficult for the adversary to determine the bank in which the secret resides, because the access requests from the decryption process rarely result in a main memory access.
According to the DRAM architecture, the channel, rank, bank and row addressing of data elements depend on their physical addresses. In order to perform rowhammering on the secret exponent, precise knowledge of these parameters needs to be acquired, which is impossible for the adversary directly, since the adversary does not have the privilege to obtain the physical addresses corresponding to the secret. This motivates the adversary to incorporate a spy process which probes the execution of the decryption algorithm and uses timing analysis to successfully identify the channel, rank and even the bank to which the secret gets mapped.
The adversary introduces a spy process which runs every time before a decryption is requested. The spy process issues accesses to data elements of the eviction set, which eventually flushes the existing cache lines with its own
Curious Case of Rowhammer: Flipping Secret Exponent Bits 609
data requests and fills the cache. Thus, during the next request to the decryption process, the access to the secret exponent results in a cache miss, and the corresponding access request is served from bank A of the main memory. Effectively, a spy process running alternately with the decryption process makes arbitrary data accesses to ensure that every access request from the decryption process is served from the corresponding bank of the main memory.
1. Step 1: The adversary starts the spy process, which initially allocates a set of data elements and consults its own pagemap to obtain the corresponding physical address of each element. The kernel allows userspace programs to access their own pagemap (/proc/self/pagemap) to examine the page tables and related information by reading files in /proc. The virtual pages are
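The pagemap lookup of Step 1 can be sketched as follows. The entry layout (one little-endian 8-byte entry per virtual page, PFN in bits 0-54, present flag in bit 63) is the documented Linux format; the helper names and the fixed 4 KiB page size are our assumptions:

```python
import struct

PAGE_SIZE = 4096  # assumed; query os.sysconf("SC_PAGE_SIZE") in practice

def pagemap_offset(vaddr):
    """Byte offset of the pagemap entry for the page containing vaddr."""
    return (vaddr // PAGE_SIZE) * 8    # one 8-byte entry per virtual page

def parse_entry(entry):
    """Split a raw 64-bit pagemap entry into (present flag, frame number)."""
    present = (entry >> 63) & 1        # bit 63: page present in RAM
    pfn = entry & ((1 << 55) - 1)      # bits 0-54: physical frame number
    return present, pfn

def virt_to_phys(vaddr):
    """Translate one of this process's virtual addresses to physical.
    Recent kernels zero the PFN field unless the caller has CAP_SYS_ADMIN."""
    with open("/proc/self/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr))
        (entry,) = struct.unpack("<Q", f.read(8))
    present, pfn = parse_entry(entry)
    if not present or pfn == 0:
        raise RuntimeError("page not present or PFN hidden")
    return pfn * PAGE_SIZE + (vaddr % PAGE_SIZE)
```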
5. Probing LLC: On getting the decrypted output, the adversary signals the spy to start probing, and timing measurements are noted. In this probing step, the spy process accesses each of the m selected elements (from the Prime phase) of eviction set t for all slices, and the time to access each of these elements is observed.

The timing measurements show a variation when the decryption algorithm shares the same cache set as the target set t. This is because, after the priming step, the adversary allows the decryption process to run. If a cache set used by the decryption is the same as one used by the spy, then some of the cache lines previously primed by the spy process get evicted during the decryption. Thus, when the spy is again allowed to access the same elements, if it takes longer to access them, it is concluded that the cache set has been accessed by the decryption as well. On the other hand, if the cache set has not been used by the decryption, then the time observed during the probe phase is lower, since no elements primed by the spy have been evicted during the decryption phase.
Determining the LLC Slice Where the Secret Maps. The Prime + Probe timing analysis elaborated in the previous discussion successfully identifies the LLC set in which the cache line containing the secret exponent resides. Thus, at the end of the previous step we obtain an eviction set of m ∗ k elements which map to the same set as the secret in all of the k slices. Now the adversary can easily identify the desired LLC slice by iteratively running the same Prime + Probe protocol separately for each of the k slices with the m elements selected for that particular slice. The timing observations while probing show a significant variation for the set of m elements which corresponds to the slice where the secret maps. Thus we further refine the size of the eviction set from m ∗ k to m elements.
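Operationally, this refinement amounts to picking the slice whose probe times are elevated. A minimal sketch with synthetic timings (the argmax over mean probe time is our illustrative decision rule; the text only requires that the colliding slice show significantly higher probe times):

```python
def identify_slice(probe_times_per_slice):
    """probe_times_per_slice maps slice id -> probe times (cycles) collected
    over several Prime + Probe iterations for that slice's m elements.
    Returns the slice with the highest mean probe time, i.e. the one whose
    primed lines were evicted by the decryption's accesses to the secret."""
    mean = {s: sum(t) / len(t) for s, t in probe_times_per_slice.items()}
    return max(mean, key=mean.get)

# Synthetic example: slice 2 shows elevated probe times (cache misses).
timings = {0: [80, 82, 79], 1: [81, 80, 83], 2: [210, 190, 205], 3: [79, 84, 80]}
print(identify_slice(timings))  # -> 2
```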
(Figure: timeline of the spy and the decryption engine; in the final step the adversary receives the decrypted message from the decryption engine while the spy flushes the accessed element from the cache using clflush.)
performed by accessing elements from each bank. After each access request by the spy, the elements are deliberately flushed from the cache using clflush.
The adversary sends an input to the decryption engine and waits for the output to be received. While it waits for the output, the spy process targets one particular bank, selects a data element which maps to that bank, and accesses the data element. This triggers concurrent accesses from the spy and the decryption process to the DRAM banks. Repeated timing measurements are observed for each of the DRAM bank accesses by the spy, and this process is iterated for elements from each DRAM bank in turn.
In the previous subsections, we have discussed how the adversary performs timing analysis to determine cache set collisions and subsequently uses them to determine DRAM bank collisions, identifying where the secret data resides. In this section, we aim to induce a fault in the secret by repeatedly toggling DRAM rows belonging to the same DRAM bank as the secret.
Inside the DRAM bank, the cells are aligned two-dimensionally in rows and columns. The row index to which any physical address maps is determined by the most significant bits of the physical address. Thus it is practically impossible for an adversary to determine the row index of the secret exponent directly. Rowhammering of the secret exponent therefore has to be performed with elements which map to the same DRAM bank as the secret, but to different row indices, until the secret exponent suffers a bit flip.
The original algorithm for rowhammer in [2] can be modified to achieve this targeted bit flip. The algorithm works in the following steps:
– A set of addresses is chosen which map to different rows of the same bank of DRAM.
– The row indices, being a function of the physical address bits, are simulated during execution. Elements with random row indices are selected and accessed repeatedly by the adversary to induce bit flips in adjacent rows.
– A bit flip in the secret can be detected easily, if and only if the output of the decryption differs.
The experiments were performed on RSA; the 1024-bit exponent resides in consecutive 1024 bit locations in memory. Considering the cache line size of 64 bytes, the 1024 bits of the secret map to 2 cache lines. As described in Sect. 2.3, the 11 physical address bits b6 , b7 , · · · , b16 refer to the Last Level Cache set. Moreover, the papers [12,13] both address the reverse engineering of the cache slice selection functions. The authors of [13] used a Prime + Probe methodology to learn the cache slice function, while the authors of [12] monitored events from the performance counters to build the cache slice functions. It has, however, been observed that the LLC slice functions reported in these two papers are not the same.
In our paper, we devised a Prime + Probe based timing observation setup and wished to identify the target cache set and slice which collide with the secret. Thus we followed the lines of [13] and used the function from [13] in our experiments for Prime + Probe based timing observations. As illustrated in the following section, the timing observations using the function from [13] can correctly identify the target cache slice where the secret maps. The reverse engineering of the Last Level Cache (LLC) slice for the Intel Ivy Bridge micro-architecture in [13] uses the following function:
b17 ⊕ b18 ⊕ b20 ⊕ b22 ⊕ b24 ⊕ b25 ⊕ b26 ⊕ b27 ⊕ b28 ⊕ b30 ⊕ b32 .
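This function is a parity over the listed physical-address bits; a direct transcription (the helper name is ours):

```python
# Ivy Bridge LLC slice-selection function from [13]: the slice bit is the
# XOR (parity) of these physical-address bits.
SLICE_BITS = [17, 18, 20, 22, 24, 25, 26, 27, 28, 30, 32]

def ivb_slice_bit(phys_addr):
    parity = 0
    for b in SLICE_BITS:
        parity ^= (phys_addr >> b) & 1
    return parity

print(ivb_slice_bit(1 << 17))                # one listed bit set  -> 1
print(ivb_slice_bit((1 << 17) | (1 << 18)))  # two listed bits set -> 0
```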
(Figures: plots of cache access time versus iterations, contrasting the collision and no-collision cases, and the probe times for slice 0 versus slice 2.)
Once the cache set is identified, the variation in the timing observations for different LLC slices leaks the information of which LLC slice the secret maps to. In the same experimental setup as in the previous section, we identify the slice in which the actual secret resides, using timing analysis with the slice selection function. Since we have already identified the LLC cache set with which the
(Figure: cache access time in clock cycles versus iterations; each panel contrasts the colliding and non-colliding cases.)
secret collides, 12 data elements belonging to each slice of the particular set are selected. Prime + Probe timing observations are noted for the set of 12 elements of each slice. The slice observing a collision with the secret exponent suffers cache misses in the probe phase and thus shows a higher access time than the other slices.
We illustrate the timing observations for two scenarios in Fig. 7a and b. In Fig. 7a, the secret is mapped to LLC slice 0, while in Fig. 7b, the secret is mapped to LLC slice 2. In both figures, the access time for probing elements of the cache slice with which the secret access collides is observed to be higher than for the other cache slice, which belongs to the same set but does not observe a cache collision. Thus, because of the collision of the accesses of both processes in the same slice, the spy observes a higher probe time for slice 0 than for slice 2 in Fig. 7a. On the contrary, in a different run, the secret exponent got mapped to LLC slice 2, which in Fig. 7b shows a higher probe time than slice 0. Thus we can easily figure out the cache slice of the particular set on which the accesses of both the decryption and the spy process actually collide.
We also extended our experiment with the reverse engineered cache slice functions from [12]. Figure 8b shows the timing observations when we use the slice selection functions for a 4-core processor. The functions from [12] are:
o0 = b6 ⊕b10 ⊕b12 ⊕b14 ⊕b16 ⊕b17 ⊕b18 ⊕b20 ⊕b22 ⊕b24 ⊕b25 ⊕b26 ⊕b27 ⊕b28 ⊕b30 ⊕b32 ⊕b33
o1 = b7 ⊕b11 ⊕b13 ⊕b15 ⊕b17 ⊕b19 ⊕b20 ⊕b21 ⊕b22 ⊕b23 ⊕b24 ⊕b26 ⊕b28 ⊕b29 ⊕b31 ⊕b33 ⊕b34
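On a 4-core part the two parity bits select one of four slices. A transcription of the two functions (combining o1 and o0 into a two-bit slice index is our illustrative encoding):

```python
# Bit positions entering each parity, from the two functions of [12].
O0_BITS = [6, 10, 12, 14, 16, 17, 18, 20, 22, 24, 25, 26, 27, 28, 30, 32, 33]
O1_BITS = [7, 11, 13, 15, 17, 19, 20, 21, 22, 23, 24, 26, 28, 29, 31, 33, 34]

def parity(addr, bits):
    p = 0
    for b in bits:
        p ^= (addr >> b) & 1
    return p

def slice_index_4core(phys_addr):
    """Two parity bits (o1, o0) give a slice index in {0, 1, 2, 3}."""
    return (parity(phys_addr, O1_BITS) << 1) | parity(phys_addr, O0_BITS)

print(slice_index_4core(1 << 6))  # only b6 set: o0 = 1, o1 = 0 -> slice 1
print(slice_index_4core(1 << 7))  # only b7 set: o0 = 0, o1 = 1 -> slice 2
```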
Similar to our previous observations, Fig. 8b shows that we were able to identify the target cache slice from the timing observations using the cache slice reverse engineering functions from [12].
Determining the LLC set and slice to which the secret maps gives the adversary the ability to flush the existing cache lines at these locations, so that the decryption process has to access the main memory every time. In simple words, the accesses made by the adversary to this particular LLC set and slice act as an alternative to a clflush instruction being added to the decryption process.
– The DRAM bank bits for Ivy Bridge [16] are decided by the physical address bits: ba0 = b14 ⊕ b18 , ba1 = b15 ⊕ b19 , ba2 = b17 ⊕ b21 .
– The rank is decided by r = b16 ⊕ b20 .
– The channel is decided by C = b7 ⊕ b8 ⊕ b9 ⊕ b12 ⊕ b13 ⊕ b18 ⊕ b19 .
– The DRAM row index is decided by physical address bits b18 , · · · , b31 .
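These equations transcribe directly into code; packing the three bank bits into one index and returning a tuple are our illustrative choices:

```python
def bit(a, i):
    return (a >> i) & 1

def ivb_dram_map(pa):
    """(channel, rank, bank, row) of physical address pa on Ivy Bridge,
    following the equations from [16]."""
    ba0 = bit(pa, 14) ^ bit(pa, 18)
    ba1 = bit(pa, 15) ^ bit(pa, 19)
    ba2 = bit(pa, 17) ^ bit(pa, 21)
    bank = (ba2 << 2) | (ba1 << 1) | ba0
    rank = bit(pa, 16) ^ bit(pa, 20)
    chan = (bit(pa, 7) ^ bit(pa, 8) ^ bit(pa, 9) ^ bit(pa, 12)
            ^ bit(pa, 13) ^ bit(pa, 18) ^ bit(pa, 19))
    row = (pa >> 18) & ((1 << 14) - 1)  # bits b18..b31 -> 2^14 rows
    return chan, rank, bank, row

# Addresses differing only in, say, b14 land in different banks (ba0 flips);
# the spy uses such pairs to provoke row-buffer conflicts in a chosen bank.
```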
Fig. 9. Timing observations for Row-buffer collision during DRAM bank accesses: (a) timing observations in clock cycles for DRAM bank collision; (b) timing observations in clock cycles of separate DRAM bank access. (Both panels are histograms of access-time frequencies.)
In the same experimental setup as before, the adversary targets one bank at a time and selects elements from the memory map whose physical addresses map to that particular bank. The following process is repeated to obtain significant timing observations:
1. The spy primes the LLC and requests a decryption by sending a ciphertext.
2. While the spy waits for the decrypted message, it selects an element for the target bank from the memory map, clflush'es it from the cache, and accesses the element. The clflush instruction removes the element from all levels of cache, so the following access to the element is served from the respective bank of the DRAM.
3. The time of the DRAM bank access is also noted.
In the previous section, we have illustrated how the adversary is able to identify the bank in which the secret exponent resides. The software-induced bit flip is produced by repeated accesses to elements of the same bank. The following pseudo-code is used to hammer rows in a specific bank. After each access to an element, the adversary deliberately flushes it from the cache using the clflush instruction.
Code-hammer-specific-bank
{
Select set of 10 data elements mapping to specific bank
Repeat
{
Access all elements of the set
Clflush each element of the set
}
jmp Code-hammer-specific-bank
}
observe for respective banks of a single Dual In-line Memory Module (DIMM).
The bit faults that we have observed in our experiments are bit-reset faults.
Fig. 10. Number of bit flips observed in all banks of a single DIMM
The row index of the location of the secret in the DRAM bank is determined by the physical address bits of the secret. This implies that the secret exponent can sit in any of the rows of the target bank. Accordingly, we restricted our hammering attempts to the target bank and selected random accesses to the target bank, which eventually resulted in bit flips. We slightly modified our setup such that the code runs iteratively until the decryption output changes, which signifies that secret exponent bits have been successfully flipped.
The fault attack in [9] requires a single faulty signature to retrieve the secret. Thus, a bit flip introduced in the secret exponent by rowhammer in a specific bank can successfully reveal the secret by applying the fault analysis techniques of [9]. The probability that an attempt flips a bit in the secret's row is 1/2^14, since there are 2^14 rows in a particular bank. Interestingly, the size of the secret key has an effect on the probability of a bit flip in the secret exponent: the probability is higher if the secret exponent is larger.
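As a rough back-of-the-envelope check (the uniform-row and independence assumptions are ours, not claims from the text):

```python
# If each hammering attempt disturbs a uniformly random row of the target
# bank, the chance of hitting the secret's row is p = 1 / 2**14. Treating
# attempts as independent, the attempt count is geometric with mean 1/p.
ROWS_PER_BANK = 2 ** 14
p = 1 / ROWS_PER_BANK
expected_attempts = 1 / p
print(expected_attempts)  # -> 16384.0

# A larger exponent occupies more memory (a 2048-bit exponent spans 4 cache
# lines instead of 2), so more locations hold secret bits and the per-attempt
# chance of flipping a secret bit grows accordingly.
```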
5 Possible Countermeasures
Various countermeasures against rowhammer attacks have been proposed in the literature. In [2], seven potential system-level mitigation techniques were proposed, ranging from designing secure DRAM chips, enforcing ECC protection on them, and increasing the refresh rate, to identifying victim cells and retiring them, and refreshing vulnerable rows (those whose adjacent rows are getting accessed frequently). As mentioned in [2], each of these solutions suffers from trade-offs between feasibility, cost, performance, power and reliability. In particular, the solution named Probabilistic Adjacent Row Activation (PARA) has the least overhead among the solutions proposed in [2]. The memory controller in PARA is modeled such that every time a row closes, the controller decides to refresh its adjacent rows with probability p (typically 1/2). Because of its probabilistic nature, the approach has low overhead, as it does not require any complex data structure for counting the number of row activations.
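The protection PARA offers can be quantified with a one-line estimate; modelling row closes as independent Bernoulli trials is our simplification:

```python
# With PARA, each close of an aggressor row refreshes its neighbours with
# probability p. The chance that N consecutive aggressor activations all
# escape a refresh is (1 - p)**N, which vanishes long before the hundreds of
# thousands of activations a rowhammer attack needs.
p = 0.5   # per-close refresh probability (typical value cited above)
N = 64    # consecutive aggressor activations considered
escape = (1 - p) ** N
print(escape)  # about 5.4e-20
```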
Another hardware-level mitigation is reported in [5], where it is mentioned that the LPDDR4 standard for DRAM incorporates two features for
hardware-level mitigation: Targeted Row Refresh (TRR) and Maximum Activate Count (MAC). Of these, it is reported that the TRR technique is being deployed in next-generation DDR4 memory units [18,19]. TRR incorporates a special module which tracks frequently made row activations and selectively refreshes the rows adjacent to these aggressor rows. All of the protections discussed above have to be incorporated in hardware, but this does not eliminate the threat from rowhammer attacks, since many manufacturers treat these as optional modules.
There are a few attempts which provide software-level protection against rowhammer attacks. The clflush instruction is responsible for removing the target element from the cache, which results in DRAM accesses. In order to stop the security breaches of NaCl sandbox escape and privilege escalation [5], the Google NaCl sandbox was recently patched to disallow applications from using the clflush instruction. The other software-level mitigation is to double the refresh rate, reducing the refresh interval from 64 ms to 32 ms, by changing the BIOS settings or alternatively upgrading one's BIOS with a patched one. It has been reported in [20] that system manufacturers such as HP, Lenovo and Apple have already considered updating the refresh rate. However, both techniques, doubling the refresh rate and removing access to the clflush instruction, have been shown to be ineffective as prevention techniques in [20]. That paper illustrates a case study of introducing bit flips in spite of a refresh interval as low as 16 ms, with a method that does not use the clflush instruction. The paper also proposes an effective, low-cost software-based protection mechanism named ANVIL. ANVIL [20] is a two-step approach which observes the LLC cache miss statistics from the performance counters over a time interval and examines whether the number of cache misses crosses a predetermined threshold. If there is a significantly high number of cache misses, then the second phase of evaluation starts, which samples the DRAM accesses of the suspect process and identifies whether rows in a particular DRAM bank are being accessed repeatedly. If repeated row activations in the same bank are detected, then ANVIL performs a selective refresh on the rows which are vulnerable.
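The two-phase logic of ANVIL can be sketched abstractly as follows; the threshold values and the shape of the sampled data are placeholders of ours, not ANVIL's actual parameters:

```python
# Placeholder thresholds; real ANVIL derives its triggers from hardware
# performance counters and DRAM timing, not from these constants.
MISS_THRESHOLD = 100_000        # LLC misses per sampling interval (phase 1)
ACTIVATION_THRESHOLD = 50_000   # same-row activations in the sample (phase 2)

def phase1_suspicious(llc_misses_in_interval):
    """Phase 1: cheap check of the LLC-miss performance counter."""
    return llc_misses_in_interval > MISS_THRESHOLD

def phase2_rows_to_refresh(sampled_accesses):
    """Phase 2: sampled_accesses is a list of (bank, row) pairs observed for
    the suspect process. Rows activated beyond the threshold are treated as
    aggressors, and their neighbours are selected for refresh."""
    counts = {}
    for bank, row in sampled_accesses:
        counts[(bank, row)] = counts.get((bank, row), 0) + 1
    victims = set()
    for (bank, row), n in counts.items():
        if n > ACTIVATION_THRESHOLD:
            victims.update({(bank, row - 1), (bank, row + 1)})
    return victims
```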
6 Further Discussion
The main focus of the present paper is to show that targeted faults can be inflicted by rowhammer. As a consequence, we have used the example of a fault analysis on RSA that is not protected by countermeasures. One of the objectives of this paper is to show that fault attacks are serious threats even when triggered by software means, which makes the threat more probable than fault injection by hardware means, like voltage fluctuations. This further emphasizes the need for countermeasures at the software level.
Having said that, even standard libraries like OpenSSL use fault countermeasures, yet they are not fully protected against this class of attacks. For example, at Black Hat 2012 [21], a hardware-based fault injection was shown to be a threat to OpenSSL-based RSA signature schemes. It was reported that
the initial signature is verified with the public key exponent; however, in case of a fault, another signature is generated, and this time it is not verified [21]. The final signature is not verified because it is widely assumed that creating a controlled fault on a PC is impractical. Moreover, faults are believed to be accidental computational errors rather than malicious, and hence the probability of inflicting two successive faults is rather low in normal computations. However, in the case of rowhammer, as the fault is created in the key, repeating the process would again result in a wrong signature, which thus gets released.
Hence, the objective of the current paper is to highlight that inflicting controlled faults is more feasible through software techniques than popularly believed, and hence that verification should be a compulsory step before releasing signatures.
7 Conclusion
In this paper, we illustrate step by step a combined timing and fault analysis attack exploiting the disturbance errors of recent DRAMs to
induce a bit flip in the memory bank shared with the secret. This is a practical fault model, and it uses the Prime + Probe cache access attack methodology to narrow down the search space in which the adversary has to induce the flip. The experimental results illustrate that the timing analysis shows significant variation and leads to the identification of the LLC set and slice. In addition, row-buffer collisions have been exploited to identify the DRAM bank which holds the secret. The worst-case complexity of inducing a fault by repeated hammering of rows in the specific memory bank is typically the same as the number of rows in the bank. The proposed attack finds most relevance in a cross-VM setup, where the co-located VMs share the same underlying hardware and root privileges are usually granted within the attack instance.
Acknowledgements. We would like to thank the anonymous reviewers for their valu-
able comments and suggestions. We would also like to thank Prof. Berk Sunar for his
insightful feedback and immense support. This research was supported in part by the
TCS Research Scholarship Program in collaboration with IIT Kharagpur. This work
was also supported in part by the Challenge Grant from IIT Kharagpur, India and
Information Security Education Awareness (ISEA), Deity, India.
References
1. Wikipedia: Rowhammer Wikipedia page (2016). https://en.wikipedia.org/wiki/Rowhammer
2. Kim, Y., Daly, R., Kim, J., Fallin, C., Lee, J.-H., Lee, D., Wilkerson, C., Lai,
K., Mutlu, O.: Flipping bits in memory without accessing them: an experimental
study of DRAM disturbance errors. In: ACM/IEEE 41st International Symposium
on Computer Architecture, ISCA 2014, Minneapolis, MN, USA, June 14–18, 2014,
pp. 361–372. IEEE Computer Society (2014)
3. Huang, R.-F., Yang, H.-Y., Chao, M.C.-T., Lin, S.-C.: Alternate hammering test
for application-specific drams and an industrial case study. In: Groeneveld, P.,
Sciuto, D., Hassoun, S. (eds.) The 49th Annual Design Automation Conference
2012, DAC 2012, San Francisco, CA, USA, June 3–7, 2012, pp. 1012–1017. ACM
(2012)
4. Kim, D.-H., Nair, P.J., Qureshi, M.K.: Architectural support for mitigating row
hammering in DRAM memories. Comput. Archit. Lett. 14(1), 9–12 (2015)
5. Seaborn, M., Dullien, T.: Exploiting the DRAM rowhammer bug to gain kernel privileges (2015). http://googleprojectzero.blogspot.in/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
6. Qiao, R., Seaborn, M.: A new approach for rowhammer attacks. In: HOST 2016
(2016)
7. Gruss, D., Maurice, C., Mangard, S.: Rowhammer.js: a remote software-induced fault attack in JavaScript. CoRR, abs/1507.06955 (2015)
8. Seaborn, M., Dullien, T.: Test DRAM for bit flips caused by the rowhammer problem (2015). https://github.com/google/rowhammer-test
9. Boneh, D., DeMillo, R.A., Lipton, R.J.: On the importance of checking crypto-
graphic protocols for faults. In: Fumy, W. (ed.) EUROCRYPT 1997. LNCS, vol.
1233, pp. 37–51. Springer, Heidelberg (1997)
10. JEDEC. Standard No. 79-3F. DDR3 SDRAM Specification (2012)
11. Yarom, Y., Falkner, K.: FLUSH+RELOAD: A high resolution, low noise, L3 cache
side-channel attack. In: Fu K., Jung, J. (eds.) Proceedings of the 23rd USENIX
Security Symposium, San Diego, CA, USA, August 20–22, 2014, pp. 719–732.
USENIX Association (2014)
12. Maurice, C., Le Scouarnec, N., Neumann, C., Heen, O., Francillon, A.: Reverse
engineering intel last-level cache complex addressing using performance coun-
ters. In: Bos, H., et al. (eds.) RAID 2015. LNCS, vol. 9404, pp. 48–65. Springer,
Heidelberg (2015). doi:10.1007/978-3-319-26362-5 3
13. Irazoqui, G., Eisenbarth, T., Sunar, B.: Systematic reverse engineering of cache
slice selection in intel processors. In: 2015 Euromicro Conference on Digital System
Design, DSD 2015, Madeira, Portugal, 26–28 August 2015, pp. 629–636. IEEE
Computer Society (2015)
14. Osvik, D.A., Shamir, A., Tromer, E.: Cache attacks and countermeasures: the
case of AES. In: Pointcheval, D. (ed.) CT-RSA 2006. LNCS, vol. 3860, pp. 1–20.
Springer, Heidelberg (2006)
15. Liu, F., Yarom, Y., Ge, Q., Heiser, G., Lee, R.B.: Last-level cache side-channel
attacks are practical. In: 2015 IEEE Symposium on Security and Privacy, SP 2015,
San Jose, CA, USA, 17–21 May 2015, pp. 605–622. IEEE Computer Society (2015)
16. Pessl, P., Gruss, D., Maurice, C., Mangard, S.: Reverse engineering intel DRAM
addressing and exploitation. CoRR, abs/1511.08756 (2015)
17. Hund, R., Willems, C., Holz, T.: Practical timing side channel attacks against
kernel space ASLR. In: 2013 IEEE Symposium on Security and Privacy, SP 2013,
Berkeley, CA, USA, 19–22 May 2013, pp. 191–205. IEEE Computer Society (2013)
18. JEDEC Solid State Technology Association: Low Power Double Data Rate 4
(LPDDR4) (2015)
19. Micron Inc.: DDR4 SDRAM MT40A2G4, MT40A1G8, MT40A512M16 data sheet (2015)
20. Aweke, Z.B., Yitbarek, S.F., Qiao, R., Das, R., Hicks, M., Oren, Y., Austin, T.M.:
ANVIL: software-based protection against next-generation rowhammer attacks.
In: Conte, T., Zhou, Y. (eds.) Proceedings of the Twenty-First International Con-
ference on Architectural Support for Programming Languages and Operating Sys-
tems, ASPLOS 2016, Atlanta, GA, USA, 2–6 April 2016, pp. 743–755. ACM (2016)
21. Bertacco, V., Alaghi, A., Arthur, W., Tandon, P.: Torturing OpenSSL. In: Black Hat 2012 (2012)
22. Inci, M.S., Gülmezoglu, B., Irazoqui, G., Eisenbarth, T., Sunar, B.: Cache attacks
enable bulk key recovery on the cloud. IACR Cryptology ePrint Archive: 2016/596
A Design Methodology for Stealthy Parametric
Trojans and Its Application to Bug Attacks
Abstract. Over the last decade, hardware Trojans have gained increas-
ing attention in academia, industry and by government agencies. In order
to design reliable countermeasures, it is crucial to understand how hard-
ware Trojans can be built in practice. This is an area that has received
relatively scant treatment in the literature. In this contribution, we exam-
ine how particularly stealthy Trojans can be introduced to a given target
circuit. The Trojans are triggered by violating the delays of very rare
combinational logic paths. These are parametric Trojans, i.e., they do
not require any additional logic and are purely based on subtle manipu-
lations on the sub-transistor level to modify the parameters of the tran-
sistors. The Trojan insertion is based on a two-phase approach. In the
first phase, a SAT-based algorithm identifies rarely sensitized paths in a
combinational circuit. In the second phase, a genetic algorithm smartly
distributes delays for each gate to minimize the number of faults caused
by random vectors.
As a case study, we apply our method to a 32-bit multiplier cir-
cuit resulting in a stealthy Trojan multiplier. This Trojan multiplier
only computes faulty outputs if specific combinations of input pairs are
applied to the circuit. The multiplier can be used to realize bug attacks,
introduced by Biham et al. In addition to the bug attacks proposed pre-
viously, we extend this concept for the specific fault model of the path
delay Trojan multiplier and show how it can be used to attack ECDH
key agreement protocols.
Our method is a general approach to path delay faults. It is a ver-
satile tool for designing stealthy Trojans for a given circuit and is not
restricted to multipliers and the bug attack.
1 Introduction
On the other hand, there is scant treatment in the literature of how to design Trojans. Nevertheless, Trojan detection and design are closely related: in order to
design effective detection mechanisms and countermeasures, we need an under-
standing of how Hardware Trojans can be built. This holds in particular with
respect to Trojans that are specifically designed to avoid detection. The situation
is akin to the interplay of cryptography and cryptanalysis.
There are several different ways that hardware Trojans can be inserted into an IC [14]. The insertion scenario that has drawn the most attention in the past is a hardware Trojan introduced during manufacturing by an untrusted semiconductor foundry. One of the main motivations behind this is the fact that the vast majority of ICs worldwide are fabricated abroad, and a foundry can possibly be pressured by a government agency to maliciously manipulate the design. However, we note that a similar situation can exist in which the original
IC designer is pressured by her own government to manipulate all or some of
the ICs, e.g., those that are used in overseas products. Similarly, 3rd party IP
cores are another possible insertion point.
The primary setting we consider is modification during manufacturing, but
the method also carries over to the other scenarios mentioned above. The Trojan
will be inserted by modifying a few gates at the sub-transistor level during
manufacturing, so that their delay values increase. The goal is to select and
chose the delays such that only for extremely rare input combinations these
delays add up to a path delay fault. There are many possible ways to increase
the delays in practice in stealthy ways. Since not a single transistor is removed
or added to the design and the changes to the individual gates are minor, the
Trojan is very difficult to detect post-manufacturing using reverse-engineering,
visual inspection, side-channel profiling or most other known detection methods.
Due to the extremely rare trigger conditions, it is also highly unlikely that the
Trojan will be detected during functional testing. Even full reverse-engineering
of the IC will not reveal the presence of the backdoor. Similarly, since the actual
Trojan will be inserted in the last step of the design flow, the Trojan will not
be present at higher abstraction levels such as the netlist. Accordingly, this
type of Trojan is also very interesting for the scenario of stealthy, government-
mandated backdoors. The number of engineers that are aware of the Trojan
would be reduced to a minimum since even the designers of the Trojan-infested
IP core would not be aware that such a backdoor has been inserted into the
product. This can be crucial to eliminate the risk of whistle blowers revealing
the backdoor. In summary, our method overcomes two major problems a Trojan designer faces, namely making the Trojan resistant to detection and providing a very rare trigger condition.
Lin et al. presented a Hardware Trojan that stealthily leaks out the crypto-
graphic key using a power side-channel [18]. This Hardware Trojan was also
inserted at the netlist or HDL level, similarly to the Hardware Trojans that
were designed as part of a student Hardware Trojan challenge at ICCD 2011 [19].
How to build stealthy Trojans at the layout level was demonstrated in 2013 by Becker et al., who showed how a Hardware Trojan can be inserted into a cryptographically secure PRNG or a side-channel resistant SBox only by manipulating the dopant polarity of a few registers [4]. Another idea proposed in the literature is building Hardware Trojans that are triggered by aging [23].
Such Trojans are inactive after manufacturing and only become active after the
IC has been in operation for a long time. Kumar et al. proposed a parametric
Trojan [17] that triggers probabilistically with a probability that increases under
reduced supply voltage.
Compared to research concerned with the design of Hardware Trojans, con-
siderably more results exist related to different Hardware Trojan detection mech-
anisms and countermeasures. Most research focuses on detecting Hardware Tro-
jans inserted during manufacturing. In many cases, a golden model is used that
is supposed to be Trojan free to serve as a reference. One important question is
how to get to a Trojan-free golden model. One approach proposed is to use visual
reverse-engineering of a few chips to ensure that these chips were not manipu-
lated. For this, the layout is compared to SEM images of the chip. In [3], methods
to automate this comparison are discussed. Note that not all Hardware
Trojans are directly visible in black-and-white SEM images. For example,
detecting dopant-level Hardware Trojans requires additional steps, e.g., the
method presented by Sugawara et al. [24]. One motivation of our work is that we
might achieve an even higher degree of stealthiness by only slowing down transis-
tors as opposed to completely changing transistors as has been done in [4]. Such
parametric changes can be done cleverly to make visual reverse-engineering very
difficult, as discussed in Sect. 3. Another approach to Trojan detection compares
side-channel measurements of the chip-under-test with previously recorded
measurements of a golden chip. The most popular variant uses power side
channels, as first proposed by Agrawal et al. [2]. The idea of building
dedicated Trojan detection circuitry has also been proposed, e.g., in [20]. How-
ever, these approaches usually suffer from the problem that a Trojan can also be
inserted into the detection circuitry itself. Preventing Hardware Trojans inserted
at the HDL level by third-party IP cores has been discussed, e.g., in [13] and [26].
Efficient generation of test patterns for Hardware Trojans triggered by rare
input signals is the focus of work by Chakraborty et al. in [8] and Saha et al.
in [21].
and thus difficult to detect with most standard methods and (ii) have very
rare trigger conditions.
– We present an automation flow for inserting the proposed style of Trojan.
We propose an efficient, SAT solver-based path selection algorithm, which
identifies suitably rare paths within a given target circuit. We also propose a
second algorithm, based on genetic algorithms, for distributing the necessary
delay along the rare path. The key requirement is to minimize the effect of
the added delay on the remaining circuit.
– As a case study for the effectiveness of the proposed method, a Trojan mul-
tiplier is designed. We were able to identify a rare path and perform specific
delay modification in a 32-bit multiplier circuit model in such a way that the
faulty behavior only occurs for very few combinations of two consecutive input
values. We note that the input space of the multiplier is (2^32)^2 = 2^64, so that
any specific input pair occurs very rarely during regular operation.
– We show how the Trojan multiplier can be used to realize the bug attack by
Biham et al. [5,6] and propose a related attack on the ECDH key agreement
protocol. We provide probabilities for this new bug attack variant. A precom-
putation phase reduces the attack complexity and makes the attacks practical
for real-world scenarios. We show that the attacker can engineer the failure
probability to the desired level by increasing the introduced propagation delay
of the Trojan.
This work implements Trojan functionality in a given target circuit by using path
delay faults (PDFs), without modifying the logic circuit, to induce inaccurate
results for extremely rare inputs. Before describing the details of our method,
we first define the notion of a viable delay-based Trojan, which leaves the HDL
of the circuit unmodified, as follows. A viable delay-based Trojan must possess
the following two properties.
Triggerability. For secret inputs known to the attacker, the circuit causes an
error with certainty or with relatively high probability.
Stealthiness. For randomly chosen inputs, the circuit causes an error only with
extremely low probability.
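For illustration, these two properties can be captured in a toy behavioral model (a sketch with hypothetical trigger values, not the actual Trojan): the fault is keyed to a specific two-vector input sequence, so it fires deterministically for the attacker's inputs yet essentially never for random ones.

```python
# Toy behavioral model of a delay-based Trojan multiplier. The trigger
# pair is hypothetical; in the real design it is determined by which
# consecutive input vectors sensitize the slowed path.
MASK = (1 << 64) - 1

class TrojanMultiplier:
    TRIGGER = ((0xDEADBEEF, 0x12345678), (0xCAFEBABE, 0x0BADF00D))

    def __init__(self):
        self.prev = None  # path delay faults depend on the previous vector

    def mul(self, a, b):
        faulty = (self.prev, (a, b)) == self.TRIGGER
        self.prev = (a, b)
        y = (a * b) & MASK
        return y ^ 1 if faulty else y  # flip one output bit on the trigger

m = TrojanMultiplier()
m.mul(*TrojanMultiplier.TRIGGER[0])        # first vector of the sequence
bad = m.mul(*TrojanMultiplier.TRIGGER[1])  # second vector: faulty output
good = TrojanMultiplier().mul(*TrojanMultiplier.TRIGGER[1])  # no history: correct
```

Note that the model is stateful: the same input pair yields a correct result unless it is immediately preceded by the first trigger vector, mirroring the two-vector nature of path delay faults.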
Fig. 1. Flowchart of the proposed method for creating a stealthy PDF (path delay
fault) Trojan.
3 Delay Insertion
Delay faults occur when the total propagation delay along a sensitized circuit
path exceeds the clock period. Our algorithm causes delay faults by increasing
the delay of gates on a chosen path. While the approach is compatible with any
mechanism for controlling gate delays, in this section we provide background on
practical methods that a Trojan designer might use to implement slow gates. In
static CMOS logic, a path delay fault is not triggered by a single input vector,
but instead is triggered by a sequence of two input vectors applied on consecutive
cycles. The physical reason for delay being caused by a pair of inputs is that delay
depends on the charging or discharging of capacitances, and the initial states of
these capacitances in the second vector are determined as final states from the
first vector. Assuming the capacitances need to be charged or discharged along
a path, as is the case in delay faults, the delay of each gate depends on how
630 S. Ghandali et al.
[Figure: (a) Annotated NMOS transistor (channel length L, width W); (b) switching
event, showing input/output voltage and current over time (0–100 ps).]
more slowly. Both changing dopant concentrations and body biasing are
difficult to detect, even with invasive methods.
Increase Gate Length. The delay of chosen gates can be increased by gate length
biasing. Lengthening the gate of a transistor reduces its drive current and
therefore increases its delay [11]. Again, the likelihood of detection depends on
the degree of the alteration.
We note that the methods sketched above (and other slow-down alterations)
can be combined such that each manipulation is relatively minor and, thus, more
difficult to detect.
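As a rough sanity check of the mechanism (first-order arithmetic only, reusing the nominal figures reported later in Sect. 5; the uniform per-gate slowdown factor is an illustrative assumption), mild slowdowns accumulated along a long path suffice to exceed the clock period, but only when that path is sensitized:

```python
# First-order illustration: accumulate per-gate slowdowns along a path and
# compare the total against the clock period. Numbers follow the 32x32
# multiplier case study (128-gate path, 2520 ps nominal, 2800 ps clock);
# the uniform 1.25x slowdown is an illustrative assumption.
def path_delay_ps(gate_delays_ps, slowdown_factors):
    return sum(d * s for d, s in zip(gate_delays_ps, slowdown_factors))

gates = [2520.0 / 128] * 128                 # ~19.7 ps per gate
nominal = path_delay_ps(gates, [1.0] * 128)  # 2520 ps
slowed = path_delay_ps(gates, [1.25] * 128)  # 3150 ps on the rare path
clock_period_ps = 2800.0
# A fault occurs only when the slowed path is actually sensitized:
fault_possible = slowed > clock_period_ps        # True
normal_paths_ok = nominal < clock_period_ps      # True
```

In the actual flow the added delay is not uniform; Sect. 4 distributes it with a genetic algorithm precisely so that side effects on other paths stay small.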
In this phase we select a path, among the huge number of paths in the
netlist of the multiplier circuit, such that random inputs will very rarely
sensitize it. This rareness is a first step towards ensuring the stealthiness of
the Trojan.
tions to justify the transition. See Table 3 in the appendix for the formula used
to compute diff_j for each transition on each gate type. Note that our difficulty
metric is weighted to always prefer robust sensitization first, and to resort to
non-robust sensitization only when there are no robustly sensitizable nodes in the list
of candidates. Whenever a node is prepended to π to create a candidate path π′
(line 5), the sensitizability of π′ is checked by calling the check-sensitizability func-
tion. This function uses SAT-based techniques [9] to check the sensitizability
of a path and to find a vector pair that justifies and propagates a transition
along the path (line 6). If the SAT solver returns SAT, then path π′ is known
to be a subpath of a sensitizable path from PIs to POs. Because the candidates
are visited in order of preference, there is no need to check other candidates
after finding the first candidate that produces a sensitizable path. At this point,
the algorithm updates π to be π′ and exits the for loop, having
extended the path by one node. If the newly added tail node is not a PI, then
the algorithm will again try to extend the path backwards.
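The backward extension loop can be sketched as follows (a simplified sketch: `predecessors`, `difficulty`, and `check_sensitizability` are hypothetical callbacks; in the actual flow the check is the SAT-based ATPG query described above):

```python
# Sketch of backward rare-path selection: repeatedly prepend the most
# "difficult" predecessor whose extended path the SAT check confirms to
# be a subpath of a sensitizable PI-to-PO path.
def select_rare_path(output_node, predecessors, difficulty, check_sensitizability):
    path = [output_node]
    while predecessors(path[0]):              # stop once the tail is a PI
        extended = False
        # Visit candidates in order of preference: hardest to sensitize
        # first (the difficulty metric favors robust sensitization).
        for node in sorted(predecessors(path[0]), key=difficulty, reverse=True):
            candidate = [node] + path         # prepend to create pi'
            if check_sensitizability(candidate):  # SAT => keep pi := pi'
                path, extended = candidate, True
                break                         # first success wins
        if not extended:
            break                             # no sensitizable extension
    return path

# Usage on a toy 3-gate chain a -> b -> c (a is the primary input):
preds = {"c": ["b"], "b": ["a"], "a": []}
rare = select_rare_path("c", preds.__getitem__, lambda n: 0, lambda p: True)
```

The early `break` mirrors the text: once the first (most preferred) candidate yields a sensitizable path, no further candidates are checked.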
Fitness Function. Simply stated, the cost function that we want to minimize
is the probability of causing an error when random input vectors are applied
to the circuit. Because there is no simple closed-form expression for this, we
use random simulation to evaluate the cost of any delay assignment. When the
genetic algorithm in Matlab needs to evaluate the cost of a particular delay
assignment, it does so by executing a timing simulator. The timing simulator, in
our case ModelSim, applies random vectors to the circuit-under-evaluation and
a golden copy of the circuit and compares the respective outputs to count the
number of errors that occur. These errors are caused by the delay assignments
in the circuit-under-evaluation. The cost that is returned from the simulator
is the percentage of inputs that caused an error for this delay assignment. As
the genetic algorithm proceeds through more and more generations of solutions,
the quality of the solutions improves. Matlab’s genetic algorithm implementation
comes with a stopping criterion, so we simply allow the algorithm to run until
completion.
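The fitness evaluation can be sketched as follows (a sketch: `timing_sim` stands in for the ModelSim timing simulation, and the helper names are ours):

```python
import random

# Fitness of a delay assignment = fraction of random input vectors whose
# output differs from a golden model. `timing_sim` is a hypothetical
# callback standing in for the ModelSim run described in the text.
def fitness(delay_assignment, timing_sim, golden, n_vectors=10_000, seed=0):
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_vectors):
        a, b = rng.getrandbits(32), rng.getrandbits(32)
        if timing_sim(delay_assignment, a, b) != golden(a, b):
            errors += 1
    return errors / n_vectors  # error rate; the genetic algorithm minimizes this
```

A GA library would call this once per candidate delay assignment per generation; the choice of Matlab's `ga` in the text is incidental, and any GA implementation could be substituted.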
5 Experimental Results
We now evaluate the effectiveness of our method of designing Trojans, using a
32 × 32 Wallace tree multiplier as a test case. The circuit has a nominal critical
path of length 128, and the delay of this path is 2520 ps.
of Fig. 4 shows the result; the x-axis represents error rates, and the y-axis shows
how many of the paths have each error rate. The result shows that a majority
of paths would cause frequent errors if their delay is increased, and these paths
are thus unsuitable for stealthy Trojans. The rare path (RP) selected by our
algorithm caused an error for only 4 of 10,000 vectors. By comparison, the best
of the random paths caused an error in 174 of 10,000 vectors. In this experiment,
the path chosen by the path selection algorithm is 43x less likely to cause an error
than the best of 750 random paths. Note that this experiment is conservative
in that the amount of additional delay added is very large, and the delay is not
smartly distributed along the path to minimize detection.
Fig. 4. Fault simulation of rare path and 750 random paths of 32 × 32 Wallace tree
multiplier.
A Design Methodology for Stealthy Parametric Trojans 637
Fig. 5. Error probability of circuit before and after optimizing delay assignment of rare
path and 9 other paths in a 32 × 32 Wallace tree multiplier.
distribution. This result shows that, for a given total path delay, optimizing the
delay assignment along the path can reduce the probability of having an error
when random vectors are applied. It is important to note that this improvement
in stealthiness comes from minimizing the side effects of the added delay, and
does not impact triggerability when vectors are applied that actually sensitize
the entire chosen path.
Delay distribution                   Uniform     GA
Num. of times exceeding 2520 ps      57          0
Num. of random vectors applied       200,000     260M
Prob. of exceeding 2520 ps           0.0003      < 2^-26
Fig. 6. Increasing the rare path delay increases the probability of causing an error
when random vectors are applied. This delay is allocated to gates according to the
delay distribution algorithm. The results are shown for different clock periods.
When the amount of delay added to the rare path is increased, and the prob-
ability of error grows above 2^-26, the error probability can feasibly be estimated
with random simulation. In this regime, we can evaluate the tradeoff between delay
and trigger probability. For example, when the chosen path is given a total delay of
3150 ps, allocated using the genetic algorithm for delay distribution, and the circuit
is operated at a clock period of 2800 ps (as might be reasonable for a nominal
critical path of 2520 ps), an erroneous output occurs with a probability of roughly
2^-24 (once every 16 million multiplications) when random inputs are applied.
The overall tradeoff is shown in Fig. 6 for different clock periods. One can exploit
this tradeoff to create a desired error probability by increasing or decreasing the
total amount of delay added to the chosen path.
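The quoted figure is easy to verify: an error probability of 2^-24 corresponds to one fault in about 16.8 million random multiplications, and the chance of observing at least one fault in a longer run follows from the geometric distribution (a back-of-the-envelope check, not a measurement):

```python
# Back-of-the-envelope check of the quoted trigger rate.
p = 2 ** -24                       # per-multiplication error probability
mults_per_error = 1 / p            # mean of the geometric distribution
# Chance of at least one faulty output within 100 million random inputs:
n = 100_000_000
p_at_least_one = 1 - (1 - p) ** n
```

So the fault is rare enough to survive functional testing, yet frequent enough that an attacker feeding chosen inputs can rely on it.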
The Trojan Multiplier introduced in the previous section has a different fault
model than the one assumed in [5]. In particular, the output of the Trojan
Multiplier does not only depend on the current input but also on the previous
inputs, i.e., it has a state. We define the multiplication of two 32-bit numbers
a_1, b_1 with our Trojan Multiplier as ỹ = MUL_{a_0,b_0}(a_1, b_1), where a_0, b_0 is the
previous input pair to the multiplier. The list F of quadruples (a_0, b_0, a_1, b_1)
contains all input sequences for which the Trojan Multiplier computes a faulty response:
Outputs computed with the Trojan Multiplier are always represented with a
tilde. An ECC scalar multiplication of a point Q ∈ E with an integer k is denoted
as R = k · Q. An elliptic curve scalar multiplication using the Trojan Multiplier
is correspondingly denoted with a tilde, i.e., R̃ = k · Q computed with the Trojan
Multiplier. In the following we assume that an attacker has knowledge of the
Trojan Multiplier, or access to a chip containing it, such that the attacker knows
for which inputs R̃ ≠ R.
The attack complexity strongly depends on the probability that a multiplica-
tion results in a faulty response. In order to compute this probability
we make the following definitions:
1. P_{M(a_1,b_1)}: Probability that for two random 32-bit integers a_1, b_1 there exists
at least one pair of 32-bit integers a_0, b_0 such that ỹ = MUL_{a_0,b_0}(a_1, b_1)
computes a faulty response.
2. P_{M(a_1)}: Probability that for a random 32-bit integer a_1 there exists at least
one triplet of 32-bit integers a_0, b_0, b_1 such that ỹ = MUL_{a_0,b_0}(a_1, b_1)
computes a faulty response. The probability P_{M(b_1)} is defined in the same fashion.
3. P_{M(a_0,b_0|a_1,b_1)}: Probability that for two random 32-bit integers a_0, b_0 and
two given integers a_1, b_1 the multiplication ỹ = MUL_{a_0,b_0}(a_1, b_1) computes
a faulty response, given that there exists at least one input pair a_0′, b_0′ for which
ỹ = MUL_{a_0′,b_0′}(a_1, b_1) computes a faulty response.
4. P_{M(a_0|a_1,b_1=b_0)}: Probability that for a random 32-bit integer a_0 and two
given integers a_1, b_1 the multiplication ỹ = MUL_{a_0,b_0}(a_1, b_1) with b_0 = b_1
computes a faulty response, given that there exists at least one input pair a_0′, b_0′
for which ỹ = MUL_{a_0′,b_0′}(a_1, b_1) computes a faulty response.
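Given the fault list F, the first of these probabilities can be estimated by straightforward Monte Carlo sampling (a sketch; F here is a toy list, and the helper name is ours):

```python
import random

# Monte Carlo estimate of P_M(a1,b1): the chance that a random pair
# (a1, b1) admits at least one predecessor pair (a0, b0) in the fault
# list F of quadruples (a0, b0, a1, b1).
def estimate_p_m_a1b1(F, n_samples=100_000, seed=1):
    vulnerable = {(a1, b1) for (_a0, _b0, a1, b1) in F}
    rng = random.Random(seed)
    hits = sum(
        (rng.getrandbits(32), rng.getrandbits(32)) in vulnerable
        for _ in range(n_samples)
    )
    return hits / n_samples

empty_estimate = estimate_p_m_a1b1([])   # no faults -> probability 0
```

For probabilities far below 2^-26 such sampling becomes infeasible, which is why the experimental section bounds rather than directly measures the rarest rates.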
Let us assume that the attacker tries to find a point Q for key bit i. Since the
attacker searches for a fault in the last Montgomery Ladder step, for every point
Q the attacker needs to compute i − 2 Montgomery Ladder steps (for the first
key bit no step is needed) and then two Montgomery Ladder steps for key bits 1
and 0, respectively, to check if the multiplication fails. (See Appendix B of the
IACR ePrint version for the Montgomery Ladder algorithm.)

Table 2. Attack complexity of the proposed improved bug attack using the Trojan
multiplier assuming a 256-bit curve.

Hence, in total the attacker
needs an average of A_M Montgomery Ladder steps to recover a 255-bit key:

A_M = Σ_{i=2}^{255} (i · A_Q) = ((255^2 + 255)/2) · A_Q ≈ 2^15 · A_Q
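The closed form of this attack-complexity sum is easy to check numerically (the extra i = 1 term in the closed form is negligible):

```python
import math

# Check the attack-complexity sum: sum_{i=2}^{255} i versus the closed
# form (255^2 + 255)/2, which differs only by the negligible i = 1 term
# and is approximately 2^15.
s = sum(range(2, 256))
closed_form = (255 ** 2 + 255) // 2
assert s == closed_form - 1          # closed form includes i = 1
assert abs(math.log2(closed_form) - 15) < 0.01
```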
7 Conclusion
This paper introduced a new type of parametric hardware Trojans based on
rarely-sensitized path delay faults. While hardware Trojans using parametric
changes (i.e. that only modify the performance/parameters of gates) have been
proposed before, the previously proposed parametric hardware Trojans cannot
be triggered deterministically. They are instead either triggered after time by
aging [23], triggered randomly under reduced voltage [17] or are always on and
can leak keys using a power side-channel [4]. In contrast, the proposed paramet-
ric hardware Trojan in this paper can be triggered by applying specific input
sequences to the circuit. Hence, this paper introduces the first trigger-based
hardware Trojan that is realized solely by small and stealthy parametric changes.
To achieve this, a SAT-based algorithm is presented which efficiently searches a
combinational circuit for paths that are extremely rarely sensitized. A genetic
algorithm is then used to distribute delays over all the gates on the path so that
a path delay fault occurs when trigger inputs are applied, while for other inputs
the timing criteria are met. In this way, a faulty response is computed only for
a very small subset of input combinations.
To demonstrate the usefulness of the proposed technique, a 32-bit multiplier
is modified so that, for some multiplications, faulty responses are computed.
These faults can be so rare that they do not interfere with normal operations
but can still be used by the Trojan designer for a bug attack against public
key algorithms. As a motivating example, we showed how this can be achieved
for ECDH implementations. Please note that while we used a multiplier as our
case study, the general idea of path delay Trojans and the tool-flow and algo-
rithms presented in this paper are not restricted to multipliers. Hence, this work
shows that by only making extremely stealthy parametric changes to a design,
a malicious factory could insert backdoors to leak out secret keys.
Table 3. Computation of diff_j for different gate types. In the case of 2-input gates,
we assume without loss of generality that input A is the on-path input and B is the off-
path input. The first two columns show the output transition and the input transition
that we are trying to justify for this output transition. Columns 3–6 show the values
that the on-path input (A) and off-path input (B) must take in the first and second
cycles to justify the desired transition. The final column shows the formula to compute
diff_j in terms of the controllability of the inputs.
Table 4. Computation of diff_p for different gate types. In the case of 2-input gates,
we assume without loss of generality that input A is the on-path input and B is the off-
path input. The first two columns show the output transition and the input transition
that we are trying to propagate for this on-path input transition. Columns 3–6 show
the values that the output (X) and off-path input (B) must take in the first and second
cycles to propagate the desired transition. The final column shows the formula to
compute diff_p in terms of the controllability of the off-path input and the observability
of the output.
B Montgomery Ladder
To be able to compute the exact attack complexity the details of the Montgomery
Ladder are important to determine how many manipulations are performed in
each step. Algorithms 3 and 4 describe the details of the assumed Montgomery
Ladder implementation.
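Algorithms 3 and 4 are not reproduced here, but the control flow of a standard Montgomery ladder can be sketched generically (`add`, `double`, and `identity` stand in for the curve operations; this is the textbook ladder, not necessarily the exact variant assumed in the attack):

```python
# Generic Montgomery ladder: scan scalar bits MSB-first, maintaining the
# invariant R1 = R0 + Q. Each step performs one add and one double,
# which is what makes the per-step multiplication count predictable.
def montgomery_ladder(k, Q, add, double, identity):
    R0, R1 = identity, Q
    for bit in bin(k)[2:]:
        if bit == "1":
            R0, R1 = add(R0, R1), double(R1)
        else:
            # Tuple assignment evaluates the right side with the old R0/R1.
            R0, R1 = double(R0), add(R0, R1)
    return R0

# Sanity check over the additive group of integers (Q = 1, so k*Q = k):
result = montgomery_ladder(0b10110101, 1, lambda x, y: x + y, lambda x: 2 * x, 0)
```

Because every bit triggers the same add-and-double pattern, the number of field multiplications per ladder step is fixed, which is what the failure-probability analysis below relies on.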
Computing the Failure Probability of a Scalar Multiplication. In this
subsection we describe how the failure probability of a Montgomery Ladder
based scalar multiplication can be computed.
The probability that no failure occurs during one Montgomery Ladder step is
therefore:

(1 − P_{M(a_1,b_1)})^42 · (1 − P_{M(a_0,b_0|a_1,b_1)})^7

A 255-bit scalar multiplication requires 254 Montgomery Ladder steps. Hence
the probability that a failure occurs is given by:

1 − ((1 − P_{M(a_1,b_1)})^42 · (1 − P_{M(a_0,b_0|a_1,b_1)})^7)^254
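Numerically, the per-step expression composes over the 254 steps as follows (a sketch; the probabilities plugged in are illustrative placeholders, not measured values, while the 42 and 7 multiplication counts are taken from the text):

```python
# Failure probability of a full scalar multiplication: each of the 254
# ladder steps succeeds with probability (1 - p1)**42 * (1 - p2)**7,
# where p1, p2 correspond to P_M(a1,b1) and P_M(a0,b0|a1,b1) in the text.
def scalar_mult_failure_prob(p1, p2, steps=254):
    step_ok = (1 - p1) ** 42 * (1 - p2) ** 7
    return 1 - step_ok ** steps

prob = scalar_mult_failure_prob(p1=2 ** -24, p2=2 ** -10)  # illustrative inputs
```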
References
1. Genetic Algorithm. http://www.mathworks.com/discovery/genetic-algorithm.
html. Accessed 01 Feb 2016
2. Agrawal, D., Baktir, S., Karakoyunlu, D., Rohatgi, P., Sunar, B.: Trojan detection
using IC fingerprinting. In: IEEE Symposium on Security and Privacy (SP 2007),
pp. 296–310 (2007)
3. Bao, C., Forte, D., Srivastava, A.: On reverse engineering-based hardware Trojan
detection. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 35(1), 49–57 (2016)
4. Becker, G.T., Regazzoni, F., Paar, C., Burleson, W.P.: Stealthy dopant-level hard-
ware Trojans. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086,
pp. 197–214. Springer, Heidelberg (2013)
5. Biham, E., Carmeli, Y., Shamir, A.: Bug attacks. In: Wagner, D. (ed.) CRYPTO
2008. LNCS, vol. 5157, pp. 221–240. Springer, Heidelberg (2008)
6. Biham, E., Carmeli, Y., Shamir, A.: Bug attacks. J. Cryptology 1–31 (2015).
http://dx.doi.org/10.1007/s00145-015-9209-1
7. Brumley, B.B., Barbosa, M., Page, D., Vercauteren, F.: Practical realisation and
elimination of an ECC-related software bug attack. In: Dunkelman, O. (ed.) CT-
RSA 2012. LNCS, vol. 7178, pp. 171–186. Springer, Heidelberg (2012)
8. Chakraborty, R.S., Wolff, F., Paul, S., Papachristou, C., Bhunia, S.: MERO: a
statistical approach for hardware Trojan detection. In: Clavier, C., Gaj, K. (eds.)
CHES 2009. LNCS, vol. 5747, pp. 396–410. Springer, Heidelberg (2009)
9. Eggersglüß, S., Wille, R., Drechsler, R.: Improved SAT-based ATPG: more
constraints, better compaction. In: IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), pp. 85–90 (2013)
10. Ghandali, S., Alizadeh, B., Navabi, Z.: Low power scheduling in high-level synthesis
using dual-Vth library. In: 16th International Symposium on Quality Electronic
Design (ISQED), pp. 507–511 (2015)
11. Gupta, P., Kahng, A.B., Sharma, P., Sylvester, D.: Gate-length biasing for runtime-
leakage control. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 25(8), 1475–1485
(2006)
12. Heragu, K., Agrawal, V., Bushnell, M.: FACTS: fault coverage estimation by test
vector sampling. In: Proceedings of IEEE VLSI Test Symposium, pp. 266–271
(1994)
13. Hicks, M., Finnicum, M., King, S.T., Martin, M.M., Smith, J.M.: Overcoming an
untrusted computing base: detecting and removing malicious hardware automati-
cally. In: IEEE Symposium on Security and Privacy (SP 2010), pp. 159–172 (2010)
14. Karri, R., Rajendran, J., Rosenfeld, K., Tehranipoor, M.: Trustworthy hardware:
identifying and classifying hardware Trojans. Computer 43(10), 39–46 (2010)
15. King, S.T., Tucek, J., Cozzie, A., Grier, C., Jiang, W., Zhou, Y.: Designing and
implementing malicious hardware. In: Proceedings of the 1st USENIX Workshop
on Large-scale Exploits and Emergent Threats (LEET 08), pp. 1–8 (2008)
16. Kulkarni, S.H., Sylvester, D.M., Blaauw, D.T.: Design-time optimization of post-
silicon tuned circuits using adaptive body bias. IEEE Trans. Comput. Aided Des.
Integr. Circ. Syst. 27(3), 481–494 (2008)
17. Kumar, R., Jovanovic, P., Burleson, W., Polian, I.: Parametric Trojans for fault-
injection attacks on cryptographic hardware. In: 2014 Workshop on Fault Diagnosis
and Tolerance in Cryptography (FDTC), pp. 18–28. IEEE (2014)
18. Lin, L., Kasper, M., Güneysu, T., Paar, C., Burleson, W.: Trojan side-channels:
lightweight hardware Trojans through side-channel engineering. In: Clavier, C.,
Gaj, K. (eds.) CHES 2009. LNCS, vol. 5747, pp. 382–395. Springer, Heidelberg
(2009)
19. Rajendran, J., Jyothi, V., Karri, R.: Blue team red team approach to hardware
trust assessment. In: IEEE 29th International Conference on Computer Design
(ICCD 2011), pp. 285–288, October 2011
20. Rajendran, J., Jyothi, V., Sinanoglu, O., Karri, R.: Design and analysis of ring
oscillator based design-for-trust technique. In: 29th IEEE VLSI Test Symposium
(VTS 2011), pp. 105–110 (2011)
21. Saha, S., Chakraborty, R.S., Nuthakki, S.S., Mukhopadhyay, D.: Improved test pat-
tern generation for hardware Trojan detection using genetic algorithm and Boolean
satisfiability. In: Güneysu, T., Handschuh, H. (eds.) CHES 2015. LNCS, vol. 9293,
pp. 577–596. Springer, Heidelberg (2015)
22. Sasdrich, P., Güneysu, T.: Implementing Curve25519 for side-channel-protected
elliptic curve cryptography. ACM Trans. Reconfigurable Technol. Syst. (TRETS)
9(1), 3 (2015)
23. Shiyanovskii, Y., Wolff, F., Rajendran, A., Papachristou, C., Weyer, D., Clay, W.:
Process reliability based Trojans through NBTI and HCI effects. In: NASA/ESA
Conference on Adaptive Hardware and Systems (AHS 2010), pp. 215–222 (2010)
24. Sugawara, T., Suzuki, D., Fujii, R., Tawa, S., Hori, R., Shiozaki, M., Fujino, T.:
Reversing stealthy dopant-level circuits. In: Batina, L., Robshaw, M. (eds.) CHES
2014. LNCS, vol. 8731, pp. 112–126. Springer, Heidelberg (2014)
25. Tang, X., Zhou, H., Banerjee, P.: Leakage power optimization with dual-Vth library
in high-level synthesis. In: 42nd Annual Design Automation Conference (DAC
2005), pp. 202–207 (2005)
26. Waksman, A., Sethumadhavan, S.: Silencing hardware backdoors. In: IEEE Sym-
posium on Security and Privacy (SP 2011), pp. 49–63 (2011)