Efficient Non-Profiled Side Channel Attack Using Multi-Output Classification Neural Network
Abstract—Differential Deep Learning Analysis (DDLA) is the first deep learning based non-profiled side-channel attack (SCA) on embedded systems. However, DDLA requires many training processes to distinguish the correct key. In this letter, we introduce a non-profiled SCA technique using multi-output classification to mitigate this issue. Specifically, a multi-output multi-layer perceptron and a multi-output convolutional neural network are introduced and evaluated against various SCA protection schemes, such as masking, noise generation, and trace de-synchronization countermeasures. The experimental results on different power side-channel datasets show that our model performs the attack up to 9 and 30 times faster than DDLA against the masking and de-synchronization countermeasures, respectively. In addition, against the combined masking and noise generation countermeasure, our proposed model achieves a success rate at least 20% higher for noise standard deviations of 1.0 and 1.5.

Index Terms—Side channel attacks, embedded systems, deep learning, multi-output, multi-loss.

Manuscript received August 06, 2022; accepted September 30, 2022. Date of publication October 07, 2022; date of current version October 07, 2022. This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.02-2020.14. This manuscript was recommended for publication by Ozgur Sinanoglu. (Corresponding author: Van-Phuc Hoang.) The authors are with the Institute of System Integration, Le Quy Don Technical University, Ha Noi, Vietnam, and also with the Faculty of Communications and Radar, Vietnam Naval Academy, Nha Trang, Vietnam. DOI: 10.1109/LES.2022.0176

I. INTRODUCTION

Side channel attacks (SCA) have become a serious threat to cryptographic implementations on embedded systems. This has raised the awareness of the security research community, prompting the search for new techniques that can detect vulnerabilities [1] or counteract SCA attacks [2]. At the same time, researching new SCA attacks is critical for pointing out potential threats. In this letter, we introduce a new SCA attack method using multi-output neural networks, which can reveal the secret key quickly in a non-profiled context.

Our work is motivated by the previous work presented by Timon et al. [3]. Based on deep learning (DL) techniques, their proposal, called DDLA, can reveal the secret key without any reference device. However, DDLA requires the attacker to repeatedly perform the training process to observe the training metrics, which are then used to determine the correct subkey byte. Recently, Kwon et al. [4] investigated this drawback of the DDLA technique and mitigated it by using a parallel neural network architecture. Kwon's work can be considered a multi-label SCA approach as in [5]. Based on the binary cross entropy loss function, their models are optimized by minimizing a scalar loss value during training. Therefore, the accuracy metric of each key guess is calculated by a custom function at each epoch, which separates the outputs and matches them against the hypothesis values. The results of this function are then used to determine the correct key. Despite being a very fast attack technique, the parallel architecture requires high memory usage. To mitigate this disadvantage of the parallel architecture, the authors introduced a shared-layer-based model that is reconstructed from the same DDLA model, except for the output layer. To the best of our knowledge, this is the first model that can predict 256 key hypotheses in only one training process.

An alternative and often more effective approach in the DL domain is to develop a single neural network model that can learn multiple related tasks (i.e., outputs) at the same time, called multiple-output learning (MOL) [6]. From the point of view of the SCA domain, MOL is a promising technique that could increase the performance of the SCA evaluation process. In this letter, we propose a novel SCA attack based on multi-output classification, which can predict 256 values of the key hypothesis in a single training without any reference device. Specifically, a multi-output multi-layer perceptron (MLPMO) and a multi-output convolutional neural network (CNNMO) are introduced. MLPMO is used for breaking Boolean masking [7] and reducing the effect of noise-generation countermeasures [8], whereas CNNMO is exploited to reveal the secret key under the de-synchronization countermeasure [9]. Our approach exploits multi-loss instead of using only the binary cross entropy loss as in [4], [5]. Accordingly, a separate loss corresponding to each output is calculated in the training process. Therefore, the training metrics (loss and accuracy) of each key hypothesis can be obtained easily without any extra calculation. As a result, our proposal can perform attacks faster than a parallel network.

II. PROPOSED MULTI-OUTPUT DEEP LEARNING MODEL FOR NON-PROFILED SCA

A. Data preparation

To apply a multi-output model in the SCA domain, the input data (power traces) should be labeled with the values corresponding to the model outputs. We aim to predict all key hypotheses in one training process. Therefore, the number of network outputs is 256, which corresponds to the 256 key guesses (0 to 255). To benchmark our proposed architecture, we use the same LSB labeling technique as in previous works [3], [4], which is calculated by formula (1).
Fig. 1. Structure of reconstructed dataset and proposed multi-output models. a) Multi-output dataset; b) MLPMO model; c) CNNMO model.
As a result, the multi-output datasets used in this letter are constructed as depicted in Fig. 1.a.

$$ l_{i,j} = \mathrm{LSB}\big(\mathrm{Sbox}(p_i \oplus k_j)\big) \quad (1) $$

where $p_i$ ($i = 1, \ldots, n$) denotes the $i$-th plaintext encrypted by the AES-128 algorithm, $n$ is the number of plaintexts, and $k_j$ ($j = 0, \ldots, 255$) is key guess number $j$.

In order to evaluate the efficiency of the proposed models, we consider two SCA datasets, the same as in [3]: the ASCAD data [7] and the data captured from the ChipWhisperer-lite (CW) board [10]. Regarding the ASCAD data, the fixed-key dataset is selected to perform attacks on the first-order masking countermeasure. The leakage model of the ASCAD data is the output of the third Sbox with unknown mask values, as described in [7]. In the case of the CW data, we select 10,000 power traces with a size of 480 samples/trace, which correspond to the power consumption of the first Sbox output process. In addition, we simulate the de-synchronization countermeasure using the same method as introduced in [4]. The structure of the reconstructed datasets is shown in Table I.

TABLE I
STRUCTURE OF RECONSTRUCTED DATASETS.

Dataset  | ASCAD traces | ASCAD samples | CW traces | CW samples | Label
---------+--------------+---------------+-----------+------------+-------------
Dataset1 | 20000        | 700           | -         | -          | LSB
Dataset2 | 20000        | 700           | -         | -          | LSB (vector)
Dataset3 | 50000        | 700           | -         | -          | LSB (vector)
Dataset4 | -            | -             | 10000     | 480        | LSB
Dataset5 | -            | -             | 10000     | 480        | LSB (vector)
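The labeling of Eq. (1) is straightforward to vectorize. The following sketch is an assumed NumPy implementation (the letter does not publish code); it builds the n × 256 label matrix used for the multi-output datasets:

```python
# Sketch of the labeling step in Eq. (1): for plaintext byte p_i and key
# guess k_j, the label is LSB(Sbox(p_i XOR k_j)). SBOX is the standard
# AES S-box lookup table.
import numpy as np

SBOX = np.array([
    0x63,0x7c,0x77,0x7b,0xf2,0x6b,0x6f,0xc5,0x30,0x01,0x67,0x2b,0xfe,0xd7,0xab,0x76,
    0xca,0x82,0xc9,0x7d,0xfa,0x59,0x47,0xf0,0xad,0xd4,0xa2,0xaf,0x9c,0xa4,0x72,0xc0,
    0xb7,0xfd,0x93,0x26,0x36,0x3f,0xf7,0xcc,0x34,0xa5,0xe5,0xf1,0x71,0xd8,0x31,0x15,
    0x04,0xc7,0x23,0xc3,0x18,0x96,0x05,0x9a,0x07,0x12,0x80,0xe2,0xeb,0x27,0xb2,0x75,
    0x09,0x83,0x2c,0x1a,0x1b,0x6e,0x5a,0xa0,0x52,0x3b,0xd6,0xb3,0x29,0xe3,0x2f,0x84,
    0x53,0xd1,0x00,0xed,0x20,0xfc,0xb1,0x5b,0x6a,0xcb,0xbe,0x39,0x4a,0x4c,0x58,0xcf,
    0xd0,0xef,0xaa,0xfb,0x43,0x4d,0x33,0x85,0x45,0xf9,0x02,0x7f,0x50,0x3c,0x9f,0xa8,
    0x51,0xa3,0x40,0x8f,0x92,0x9d,0x38,0xf5,0xbc,0xb6,0xda,0x21,0x10,0xff,0xf3,0xd2,
    0xcd,0x0c,0x13,0xec,0x5f,0x97,0x44,0x17,0xc4,0xa7,0x7e,0x3d,0x64,0x5d,0x19,0x73,
    0x60,0x81,0x4f,0xdc,0x22,0x2a,0x90,0x88,0x46,0xee,0xb8,0x14,0xde,0x5e,0x0b,0xdb,
    0xe0,0x32,0x3a,0x0a,0x49,0x06,0x24,0x5c,0xc2,0xd3,0xac,0x62,0x91,0x95,0xe4,0x79,
    0xe7,0xc8,0x37,0x6d,0x8d,0xd5,0x4e,0xa9,0x6c,0x56,0xf4,0xea,0x65,0x7a,0xae,0x08,
    0xba,0x78,0x25,0x2e,0x1c,0xa6,0xb4,0xc6,0xe8,0xdd,0x74,0x1f,0x4b,0xbd,0x8b,0x8a,
    0x70,0x3e,0xb5,0x66,0x48,0x03,0xf6,0x0e,0x61,0x35,0x57,0xb9,0x86,0xc1,0x1d,0x9e,
    0xe1,0xf8,0x98,0x11,0x69,0xd9,0x8e,0x94,0x9b,0x1e,0x87,0xe9,0xce,0x55,0x28,0xdf,
    0x8c,0xa1,0x89,0x0d,0xbf,0xe6,0x42,0x68,0x41,0x99,0x2d,0x0f,0xb0,0x54,0xbb,0x16,
], dtype=np.uint8)

def lsb_labels(plaintexts):
    """plaintexts: (n,) uint8 array of the targeted plaintext byte.
    Returns an (n, 256) uint8 matrix; column j holds labels for key guess j."""
    guesses = np.arange(256, dtype=np.uint8)        # k_j, j = 0..255
    sbox_out = SBOX[plaintexts[:, None] ^ guesses]  # Sbox(p_i XOR k_j)
    return sbox_out & 1                             # least significant bit
```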
B. MLPMO

To solve the problem of DDLA, a feed-forward multi-output (MO) model based on the MLP architecture is proposed. As depicted in Fig. 1.b, the overall architecture of our proposed network consists of an input layer and a shared layer, followed by k branches corresponding to k key hypotheses (k = 256). Each branch contains the same MLP architecture as MLPDDLA (except the input layer) [3]. Following [11], we keep the number of layers and the number of nodes in each layer the same as in the original MLPDDLA model (hidden layers: 20 × 10-Relu, output layer: 2-Softmax). The input layer of the proposed model has the same size as the number of samples in the power trace. The hyper-parameters of both proposed models are summarized in Table II.

TABLE II
DEEP LEARNING HYPER-PARAMETERS OF PROPOSED MODELS.

Parameter           | MLP-MO            | CNN-MO
--------------------+-------------------+--------------------------------------------
Input size          | 700               | 480
Shared layer        | 0/50/200/400-Relu | conv1d_1 (4 filters, 32 × 1), pool_1 (2 × 1), norm, relu;
                    |                   | conv1d_2 (4 filters, 16 × 1), pool_2 (4 × 1), norm, relu
Branches            | 256               | 256
Hidden layer/branch | 20 × 10-Relu      | 0
Output layer/branch | 2-Softmax         | 2-Softmax
Batch size          | 1000              | 50
Initialization      | He uniform        | He uniform

The shared layer plays an important role in the proposed architecture. One could use the max-shared layer as in MLPmax-shared [4]. However, since that model uses the same architecture as MLPDDLA except for the output layer, it only decreases the execution time without enhancing the success rate, especially in the case of noisy data. In contrast, our proposal aims to decrease the computation time as well as enhance the success rate. Therefore, we do not compare against MLPmax-shared in this work. In the case of the model without a shared layer, the first hidden layer of each branch is fully connected to the input layer. Unlike MLPDDLA, the network parameters of our model are updated for all key hypotheses in each iteration instead of only for a single key guess as in the MLPDDLA architecture.

Since the same structure is applied for all branches, the weights used for each branch are equivalent. Consequently, the loss function of the whole network is calculated as follows:

$$ L_{\mathrm{total}} = \sum_{k=1}^{256} \gamma_k \, L^{[k]}(\theta) \quad (2) $$

where $\theta$ represents the set of all parameters of the model, $\gamma_k$ is the weighting factor of branch $k$ and is set to 1 for all branches (the weights of the branches are equivalent), and $L^{[k]}$ denotes the loss calculated for the $k$-th branch. Note that the same loss function is used for all branches, which can be generally defined as follows:

$$ L^{[k]}(\theta) = -\frac{1}{N_s} \sum_{j=1}^{2} y_{\mathrm{true},j} \ln(z_j) \quad (3) $$

where $y_{\mathrm{true}}$ and $z$ are the ground-truth and predicted values, respectively, and $N_s$ denotes the number of training samples. For successful training, the deep learning algorithm needs to find the parameter values that minimize the loss function $L_{\mathrm{total}}$. The network is trained in a series of iterations. In each iteration, the gradient of the loss function $\nabla L_{\mathrm{total}}$ is computed and used to update all network parameters.
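For concreteness, the following is a minimal sketch of how MLPMO could be assembled, assuming a TensorFlow/Keras setup (the letter does not name its framework, and the Adam optimizer is our assumption). Layer sizes and the batch size follow Table II; each branch contributes its own categorical cross-entropy term with γk = 1, so the compiled total loss matches Eqs. (2) and (3):

```python
# Hedged sketch of MLP_MO: one shared dense layer (SoSL) followed by 256
# identical branches, one per key hypothesis. Assumes TensorFlow/Keras.
from tensorflow.keras import layers, Model

def build_mlp_mo(n_samples=700, n_branches=256, shared_units=200):
    inp = layers.Input(shape=(n_samples,))
    # Shared layer (e.g., SoSL-200); set shared_units=0 for the Non-SoSL
    # variant, where every branch connects directly to the input.
    x = (layers.Dense(shared_units, activation="relu",
                      kernel_initializer="he_uniform")(inp)
         if shared_units else inp)
    outputs = []
    for k in range(n_branches):
        b = layers.Dense(20, activation="relu",
                         kernel_initializer="he_uniform")(x)
        b = layers.Dense(10, activation="relu",
                         kernel_initializer="he_uniform")(b)
        outputs.append(layers.Dense(2, activation="softmax",
                                    name=f"key_{k:03d}")(b))
    model = Model(inp, outputs)
    # One categorical cross-entropy per branch, equally weighted
    # (gamma_k = 1), matching Eq. (2); per-branch loss and accuracy are
    # then reported by Keras without any extra calculation.
    model.compile(optimizer="adam",
                  loss=["categorical_crossentropy"] * n_branches,
                  loss_weights=[1.0] * n_branches,
                  metrics=["accuracy"])
    return model

# Training: y is a list of 256 one-hot (n, 2) label arrays derived from
# lsb_labels(); e.g., model.fit(traces, y, batch_size=1000, epochs=...)
```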
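A companion sketch of CNNMO under the same Keras assumptions: the shared part consists of two Conv1D/pooling/batch-norm stages with the parameters of Table II (the pooling type is not specified there, so average pooling is assumed), and each branch is a single 2-unit softmax layer:

```python
# Hedged sketch of CNN_MO: a shared convolutional feature extractor and
# 256 softmax branches. Parameters follow Table II; pooling type assumed.
from tensorflow.keras import layers, Model

def build_cnn_mo(n_samples=480, n_branches=256):
    inp = layers.Input(shape=(n_samples, 1))
    x = inp
    for kernel, pool in ((32, 2), (16, 4)):              # conv1d_1, conv1d_2
        x = layers.Conv1D(4, kernel, padding="same")(x)  # 4 filters, kernel x 1
        x = layers.AveragePooling1D(pool)(x)             # pool (2x1) / (4x1)
        x = layers.BatchNormalization()(x)               # norm
        x = layers.Activation("relu")(x)                 # relu
    x = layers.Flatten()(x)
    outs = [layers.Dense(2, activation="softmax", name=f"key_{k:03d}")(x)
            for k in range(n_branches)]
    model = Model(inp, outs)
    model.compile(optimizer="adam",
                  loss=["categorical_crossentropy"] * n_branches,
                  metrics=["accuracy"])
    return model  # trained with batch_size=50 per Table II
```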
Fig. 2. The experimental results on masking countermeasure. a, b) Accuracy of proposed model with and without shared layer, respectively; c) Comparison of attack time.

Fig. 3. The experimental results on combined masking and noise generation. a) Success rate of MLPDDLA and MLPMO models on different levels of noise using 20,000 power traces; b) SoSL-200 on 20,000 power traces, σ = 1.5; c) SoSL-200 on 50,000 power traces, σ = 1.5.
III. EXPERIMENTAL RESULTS

The noisy datasets are reconstructed using the same technique as DatasetX, where X = 1, 2, 3. In this case, we consider MLPDDLA to be more reliable than MLPmax-shared on noisy data. Therefore, only the comparison between Non-SoSL, SoSL-200, and MLPDDLA is performed. By repeating the attacks using Non-SoSL, SoSL-200, and MLPDDLA 50 times, we calculate the percentage of successful attacks over the total number of attacks. The comparison of the success rate between MLPDDLA, Non-SoSL, and SoSL-200 is shown in Fig. 3.a. Evidently, all models achieve good performance (100%) in the presence of a small level of additive noise (σ = 0.5). However, in the case of higher noise (σ = 1.0), the number of successful attacks drops from 100% to 80% and 90% for the MLPDDLA and Non-SoSL models, respectively. Interestingly, the success rate of SoSL-200 decreases only slightly, from 100% to 96%. A similar trend can be seen at a higher level of Gaussian noise (σ = 1.5). The success rate goes down significantly because the models provide poor discrimination, as illustrated in Fig. 3.b for the case of SoSL-200 (0.681 and 0.667). However, our network still achieves better results than MLPDDLA (44% and 36% compared to 30%). We perform further attacks using SoSL-200 on a larger dataset (Dataset3-N3). A clear gap between the correct and incorrect keys (0.648 and 0.624) can be seen in Fig. 3.c. More interestingly, the success rate is 100%. This indicates that, by using a reasonable value of SoSL, the proposed network can better mitigate the effect of the additive noise. In addition, the attacker can perform DDLA attacks with a reasonable number of epochs, a larger number of traces, or more hyperparameters, which will, in turn, improve the success rate.
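The readout behind these success rates can be summarized in a few lines. The sketch below is illustrative only: the helper names and the choice of final-epoch accuracy as the distinguishing metric are our assumptions, and a per-branch loss metric can be used in exactly the same way (as in the de-synchronization experiment below):

```python
# Illustrative readout of the correct key from per-branch training metrics,
# and the success-rate computation over repeated attacks (e.g., 50 runs).
import numpy as np

def rank_key(acc_history):
    """acc_history: (256, n_epochs) per-branch accuracy; returns best guess."""
    return int(np.argmax(acc_history[:, -1]))   # highest final accuracy

def success_rate(histories, true_key):
    """histories: list of (256, n_epochs) arrays, one per repeated attack."""
    hits = sum(rank_key(h) == true_key for h in histories)
    return 100.0 * hits / len(histories)
```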
C. De-synchronized traces

Finally, we consider other protected datasets containing de-synchronized power traces. To simulate this countermeasure, we randomly shift each power trace of Dataset4 and Dataset5 by a maximum of 20 samples. Consequently, two new datasets, called Dataset4-sh20 and Dataset5-sh20, are used for training CNNDDLA and CNNMO, respectively. In this experiment, we use the loss metric to reveal the correct key. Firstly, a CPA attack is performed on Dataset4 to validate the efficiency of this countermeasure. As depicted in Fig. 4.a, the secret key cannot be revealed. In contrast, a good result in detecting the correct key can be seen in Fig. 4.b and Fig. 4.c. These results demonstrate that the CNN models can break the de-synchronization countermeasure thanks to their translation-invariance property. However, the attack time of CNNMO is approximately 30 times shorter than that of CNNDDLA (703.65 seconds compared to 20792.43 seconds). In addition, CNNMO provides a clear distinction at a very early epoch compared to CNNDDLA. These results clarify that the proposed model outperforms the previous work in terms of attack time.

Fig. 4. Attack results on de-synchronized power traces using CPA, CNNDDLA, and CNNMO. a) CPA; b) CNNDDLA; c) CNNMO.
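The shift-based de-synchronization can be reproduced in a few lines. The sketch below assumes a circular shift via np.roll; the letter follows the random-delay method of [4], which may treat the trace edges differently:

```python
# Assumed simulation of trace de-synchronization: shift every trace by a
# random offset of at most max_shift samples (circular shift assumed).
import numpy as np

def desynchronize(traces, max_shift=20, seed=0):
    rng = np.random.default_rng(seed)
    out = np.empty_like(traces)
    for i, trace in enumerate(traces):
        out[i] = np.roll(trace, int(rng.integers(0, max_shift + 1)))
    return out
```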
IV. CONCLUSION

In this letter, two multi-output models, called MLPMO and CNNMO, were introduced, which can perform SCA attacks better than the single-output approach on different protected schemes. Specifically, MLPMO reduces the execution time by up to nine times compared to MLPDDLA in the case of the masking countermeasure. In addition, the Non-SoSL model outperforms the state-of-the-art MLPPL in terms of execution time. Regarding hiding countermeasures, two common schemes, noise generation and de-synchronization, were used to evaluate the proposed models. The experimental results have shown that MLPMO can reduce the effect of noise and achieves a success rate at least 20% higher than MLPDDLA in the cases of σ = 1.0 and 1.5. The experimental results have also clarified that CNNMO can break the de-synchronization countermeasure. More interestingly, our proposed model performs SCA attacks up to 30 times faster than CNNDDLA. However, by using a fixed number of epochs, the attack results are not optimized. This problem will be investigated in future work.

REFERENCES

[1] P. Chakraborty, J. Cruz, C. Posada, S. Ray, and S. Bhunia, "HASTE: Software security analysis for timing attacks on clear hardware assumption," IEEE Embedded Systems Letters, vol. 14, no. 2, pp. 71-74, Jun. 2022.
[2] I. M. Delgado-Lozano, E. Tena-Sanchez, J. Nunez, and A. J. Acosta, "Gate-level design methodology for side-channel resistant logic styles using TFETs," IEEE Embedded Systems Letters, vol. 14, no. 2, pp. 99-102, Jun. 2022.
[3] B. Timon, "Non-profiled deep learning-based side-channel attacks with sensitivity analysis," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 2, pp. 107-131, Feb. 2019. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/7387
[4] D. Kwon, S. Hong, and H. Kim, "Optimizing implementations of non-profiled deep learning-based side-channel attacks," IEEE Access, vol. 10, pp. 5957-5967, 2022.
[5] L. Zhang, X. Xing, J. Fan, Z. Wang, and S. Wang, "Multi-label deep learning based side channel attack," in 2019 Asian Hardware Oriented Security and Trust Symposium (AsianHOST). IEEE, Dec. 2019.
[6] D. Xu, Y. Shi, I. W. Tsang, Y.-S. Ong, C. Gong, and X. Shen, "Survey on multi-output learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2409-2429, 2020.
[7] E. Prouff, R. Strullu, R. Benadjila, E. Cagli, and C. Dumas, "Study of deep learning techniques for side-channel analysis and introduction to ASCAD database," CoRR, pp. 1-46, 2018.
[8] N. Kamoun, L. Bossuet, and A. Ghazel, "Correlated power noise generator as a low cost DPA countermeasure to secure hardware AES cipher," in 2009 3rd International Conference on Signals, Circuits and Systems (SCS). IEEE, Nov. 2009.
[9] J.-S. Coron and I. Kizhvatov, "An efficient method for random delay generation in embedded software," in Cryptographic Hardware and Embedded Systems - CHES 2009, C. Clavier and K. Gaj, Eds. Berlin, Heidelberg: Springer, 2009, pp. 156-170.
[10] C. O'Flynn and Z. Chen, "ChipWhisperer: An open-source platform for hardware embedded security research," in Constructive Side-Channel Analysis and Secure Design. Springer International Publishing, 2014, pp. 243-260.
[11] K. Kuroda, Y. Fukuda, K. Yoshida, and T. Fujino, "Practical aspects on non-profiled deep-learning side-channel attacks against AES software implementation with two types of masking countermeasures including RSM," pp. 29-40, 2021.