Design of Convolutional Neural Networks Architecture For Non Profiled Side Channel Attack Detection
ELEKTRONIKA IR ELEKTROTECHNIKA, ISSN 1392-1215, VOL. 29, NO. 4, 2023
The greater use of cross-domain technology for SCA is the primary feature of this stage. In particular, deep learning methods such as the multi-layer perceptron (MLP) [12] and the CNN are becoming more popular. CNNs have been shown to defeat jitter-based countermeasures, power-trace misalignment, and masked Advanced Encryption Standard (AES) implementations. For these reasons, this research uses CNNs.

II. LITERATURE REVIEW

Maghrebi, Rioul, Guilley, and Danger [3] were the main investigators in exploiting CNNs for side-channel attacks (SCA), although CNNs were not the only learning methods they deployed: they also used deep learning approaches such as the MLP and long short-term memory (LSTM) networks [4], as well as classical techniques including the random forest and the support vector machine (SVM) [5]. The findings of their study show that deep learning is superior to the more conventional machine learning approaches and, as a result, produces good outcomes. The authors demonstrate this on two data sets, one from an implementation without any protection and one from an implementation that utilises a masking countermeasure. In addition, the database used, sometimes referred to as the side-channel analysis data set, is presented in [6]; it was first introduced by its authors and has since been used in the investigations of various researchers. After introducing the data set, they investigate the effect of the hyperparameters to find the most effective CNN and MLP architectures [6]. Masure, Canovas, and Prouff [7] reveal that increasing the size of the CNN kernel results in better behaviour when the network is confronted with misaligned traces. However, they do not explain why increasing the kernel makes the attack more effective, which is strange. In our view, this discovery is fascinating and certainly deserves more discussion.

Since both studies reveal that CNNs perform successfully in various scenarios, further study was conducted on CNN behaviour. Picek, Samiotis, Kim, Heuser, Bhasin, and Legay [9] compared the performance of CNNs against machine learning methods such as Random Forest, XGBoost, and Naive Bayes. Their main objective was to investigate the circumstances under which CNNs perform better than the other techniques. According to their findings, CNNs improve performance only in the aggregate: they are most effective when the traces are not pre-processed, when noise levels are low, and when the data dimensions are high (i.e., many features and many traces). Otherwise, machine learning (ML) schemes can achieve performance that is almost on par with that of CNNs. The discovery that ML methods need noticeably fewer processing resources than CNNs is a significant result; consequently, the researchers express severe reservations about the usefulness of CNNs.

After further research, CNNs were shown to have the potential to surpass state-of-the-art solutions on specific data sets whose measurements originate from implementations protected by a hiding countermeasure. The authors in [10] performed tests showing that CNNs can synchronise non-aligned traces by identifying the properties of the most significant trace, enabling grouping to be carried out by applying the chosen characteristics. The findings of these experiments are presented in the article. In addition, the authors explain that this attack is carried out on raw trace data without any pre-processing, in contrast to a template attack, in which the adversary generally realigns the traces and selects the points of interest manually. The findings therefore show that CNNs are beneficial even when the traces are misaligned. On the other hand, overfitting is possible due to the size and complexity of the underlying CNN architecture. To generate more training data, they provide two data augmentation algorithms for misaligned traces, and experiments were carried out to illustrate the efficacy of these augmentation options.

The findings of Kim, Picek, Heuser, Bhasin, and Hanjalic [11] show that their CNN framework performs at the leading edge on the random delays (RD) data set, which gives more credence to the findings in [12]. In particular, compared to the DPAv4 data set, considered a baseline, an ideal network needs fewer attack traces to recover the key of the RD data set [13]. In [14], the researchers experimented with a wide variety of topologies and data sets. The results showed that no single design succeeds on all data sets; hence, it remains very necessary to select a structure appropriate for the problem at hand. In addition, the authors provide evidence that adding noise in the first layers of the network helps performance by reducing the amount of overfitting that occurs: smaller data sets call for higher noise levels, whereas more extensive data sets require a lower noise level to get the best results.

These studies suggest that CNNs possess two essential qualities that make them suitable for side-channel analysis. First, they can determine the most critical features independently and without any guidance, so no prior processing of the traces is needed to obtain good behaviour. Compared to more conventional approaches, we consider this a considerable advantage: according to the authors in [15], pre-processing is prone to errors, and a poor selection of Points of Interest (PoI) leads to lower performance. Second, because CNNs are spatially invariant, they can identify characteristics regardless of their position within the feature vectors. As a result of this quality, CNNs can perform at the cutting edge on data sets originating from implementations that use a hiding countermeasure. The methodologies used in the studies discussed up to this point are standard practice in deep learning. Further research has recommended new, innovative tactics designed explicitly for the side-channel attack, aiming to take advantage of a few of its qualities.

The researchers in [16] suggested a completely new CNN framework that uses additional domain information obtained through a side-channel attack. The data provided for creating the neural networks can be plaintext or ciphertext, and this distinction is determined by the leakage model.
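The domain-information idea of [16] — appending knowledge such as the plaintext or ciphertext byte to the feature vector fed to the classification block — can be illustrated with a short sketch. This is our own illustrative Python rendering of the general idea, not the actual architecture of [16]; the one-hot encoding and the concatenation point are assumptions.

```python
def one_hot(value: int, size: int = 256) -> list:
    """One-hot encode a byte value into a length-256 vector."""
    v = [0.0] * size
    v[value] = 1.0
    return v

def with_domain_info(trace_features, plaintext_byte: int) -> list:
    """Concatenate a one-hot encoding of the plaintext byte onto the
    flattened trace features, forming the input of the classification
    block (illustrative; the real framework wires this differently)."""
    return list(trace_features) + one_hot(plaintext_byte)
```

A network trained on such vectors receives both the measured leakage and the public domain information for each trace.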
The classification block of the CNN architecture is the component that receives the domain information as a new feature vector. In their work, the authors compare several architectural concepts offered by various works of literature, with and without the architecture that they propose. They show that a design that uses domain knowledge can improve performance for both protected and unprotected information. However, if the profiling traces are generated using a fixed key, this method is not applicable.

Zaid, Bossuet, Dassance, Habrard, and Venelli [17] strongly emphasise the need for fine-tuning the architecture and the hyperparameters; models do not operate correctly without an appropriate configuration. They point out that we cannot realise the full potential of an architecture if we do not understand the influence of each hyperparameter, and they explain why this is the case. To solve this problem, the authors provide three visualisation methods: weight visualisation, gradient visualisation, and heatmaps. These methods are utilised to improve the readability and interpretability of each hyperparameter, and they make it simpler to set the hyperparameters by allowing an adversary to determine the influence of each one individually, which in turn makes tuning easier. Using these three visualisation approaches, they also propose implementation options for protected and unprotected environments. In particular, for data sets that include a hiding countermeasure, one of the guidelines provided by their method is that the CNN kernel size should be set to 50 % of the maximum randomised delay.

In contrast to the advice found in articles produced by the deep learning communities, increasing the depth of the network is recommended over increasing the number of neurones contained within each layer [18]. The authors improved the state of the art on entire data sets by developing architectures and conducting tests with all publicly available data sets using the methodologies described, which led to an increase in overall performance. On the other hand, the choice of hyperparameters is occasionally made without enough rationale, even though their method offers cutting-edge performance on all publicly available data sets. For example, the authors do not explain how certain learning rates were determined for a few specific data sets or why they were used at all.

Pfeifer and Haddad [19] propose a deep learning layer known as the spread layer, the first layer to be explicitly designed for side-channel attacks. As demonstrated in their study, such a layer is needed for better outcomes; furthermore, the profiling phase needs fewer traces, which speeds up the learning procedure. These findings are intriguing to the side-channel analysis community because they indicate a motivation to create layers specially made to take advantage of the side-channel properties of traces. On the other hand, the authors do not provide much information about how to establish the hyperparameters of the layer or why this layer can offer the results it does. These questions will be addressed in Section IV, at which point we will investigate the spread layer in great detail and address some of its faults.

According to Jin, Kim, Kim, and Hong [20], deep CNN frameworks work admirably for SCAs. Despite this, some issues remain concerning the training process for deep neural networks. The primary issue is that training deep neural networks can be complicated, since gradients can either vanish or explode as the training progresses. In the following, we summarise the latest advancements in the initialisation of deep neural networks that address these issues.

Much work has been done on parameter initialisation; the variables would often be picked randomly from a Gaussian distribution. This was significantly reworked by Glorot and Bengio [21], who introduced an initialisation technique called "Xavier initialisation". This method considers the number of inputs and outputs associated with a parameter while deriving the parameter values from a Gaussian distribution. The method is currently considered standard practice and is used to initialise the parameters of several extensive deep-learning libraries. When academics began looking into the architectures of deep neural networks, they found that several works ran into problems with the convergence of their designs. Convergence problems were experienced, e.g., by the well-known visual geometry group (VGG) architecture, which is trained in four phases: the network is enlarged with additional layers, and training is performed at each stage to ensure that it converges correctly [22].

A novel strategy for deep CNN initialisation is presented in [23]. According to those findings, even though the Xavier initialisation was designed to work with linear activations, it is not appropriate for use with the rectified linear unit (ReLU). In addition, the authors argue that deeper networks have a more difficult time reaching a point of convergence. Their solution to such issues is the "He" initialisation, which was developed specifically for CNNs that use ReLU and which, compared to other initialisation methods, improves the degree to which deep neural networks converge. Layer-sequential unit variance (LSUV) initialisation is an alternative method proposed in [24]. Rather than being developed explicitly for designs that use ReLU as the activation function, this approach exhibits a more generic character and is appropriate for various architectural kinds. The authors provide evidence of the viability of their approach by conducting validation experiments. Both sets of research have shown how important the accurate initialisation of the network parameters is for deep neural networks to converge.

Until recently, the published research conducted its investigations in SCA circumstances where the attack traces and the profiling traces were obtained from identical devices, and it was not unusual to use the same key for both the attack set and the profiling set. As a direct consequence, the results of these studies can provide an inaccurate image of the effectiveness of several techniques, including the template attack (TA), ML, and DL.
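The initialisation schemes discussed above follow simple closed-form rules for the weight standard deviation. The sketch below is our illustrative Python rendering of the published formulas (Var(w) = 2/(fan_in + fan_out) for Xavier, Var(w) = 2/fan_in for He), not code taken from any of the cited works.

```python
import math
import random

def xavier_std(fan_in: int, fan_out: int) -> float:
    # Glorot and Bengio: Var(w) = 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in: int) -> float:
    # He initialisation: Var(w) = 2 / fan_in, suited to ReLU layers
    return math.sqrt(2.0 / fan_in)

def init_layer(fan_in: int, fan_out: int, scheme: str = "he", rng=random):
    """Draw a fan_in x fan_out weight matrix from a zero-mean Gaussian
    whose standard deviation follows the chosen scheme."""
    std = he_std(fan_in) if scheme == "he" else xavier_std(fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]
```

For a ReLU network such as the one used later in this paper, the He rule keeps the activation variance roughly constant from layer to layer, which is why it converges more reliably than the Xavier rule.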
Consequently, the SCA community has begun to construct a more realistic environment in which different devices are used to acquire the attack and the profiling traces [25]. A comparison of various research methods on SCA is shown in Table I.

TABLE I. COMPARISON OF VARIOUS RESEARCH METHODS ON SCA.

Research works | Attacked Network | Physical Measurement | Limitations
Wu et al. [13], 2023 | MLP, CNN | EM | Minimal (black box)
Maji, Banerjee, Fuller, and Chandrakasan [14], 2022 | CNN, BNN | SPA | Methodology specific to µC
Shimada, Kuroda, Fukuda, Yoshida, and Fujino [15], 2022 | MLP | EM | Intention paper
Sako, Kuroda, Fukuda, Yoshida, and Fujino [22], 2022 | Systolic array | CPA | Only the simulation systolic array is implemented
Shi, Sun, Wang, and Hu [24], 2020 | BNN | Power | Specific to the line buffer
Yang, Xiang, Huang, Fu, and Yang [25], 2023 | CNN | Power | Uses non-fine-tuned models once trained

III. PROPOSED METHODOLOGY

One of the most popular uses of CNNs is image recognition [16]; they are also effective in classifying time series [17]. CNNs are good models for the extraction of features and the categorisation of complex data because they are invariant to translation. As a result, our attack on side-channel data benefits from the use of CNNs. A drawback of the CNN is that it must be trained for each key hypothesis separately: our guesses for an 8-bit key require 256 training runs.

CNNs use layers of computation known as convolutional and pooling layers; in our network, batch normalisation layers complement these operations. The batch normalisation of Ioffe and Szegedy [18] reduces the internal covariate shift of neural networks, which the authors claim leads to more efficient learning. To evaluate the CNN, we use a series of aligned power traces as input. One power trace would include too many samples to be used directly as CNN input features; therefore, we use the correlation coefficient in the first phase of power-trace processing.

There are typically three parts to a CNN data set: the training set, which is used to teach the network; the validation set, which is used to test the accuracy of the network on unseen data; and the test set, which is used to assess the quality of the final prediction or classification. Details of the network architecture will be covered in a subsequent section.

− Experiment and Equipment Details

To evaluate our neural network models, we employed MATLAB. Three convolutional layers and three pooling layers precede the fully connected layer and the classification layer in the network. The first convolution layer has 16 filters, each of size [11] by [12], and produces an output of the same size as its input. The subsequent two convolutional layers are the same size as the first but have 24 and 32 filters, respectively. By sliding filters along the layer below them, convolutional layers perform convolution on incoming data. Because the CNN minimises the loss function over the filter weights, it can learn translation-invariant features, so shifts in the power traces do not hinder the SCA filters. The max-pooling method with kernel size [12] and stride [12] is used for the first two pooling layers, while average pooling is used for the third and final pooling layer. Maximum and average pooling are non-linear layers that reduce the data dimensions; when comparing the two, note that the former takes a maximum over the kernel window, while the latter computes an average. All convolutional layers in our model use the ReLU activation, a piecewise-linear function whose output equals the input when the input is positive and is zero otherwise. Softmax is used for the categorisation in the output layer. Table II shows the parameters of the proposed CNN model, and Fig. 1 details the structure of our convolutional neural network.

TABLE II. SIMULATION PARAMETERS OF THE PROPOSED CNN MODEL.

Layer | Weight Shape | Stride | Activation
Convolutional (1) | 1 × 3 × 16 | - | -
Batch Normalisation (1) | - | - | ReLU
Max-Pooling (1) | - | [12] | -
Convolutional (2) | 1 × 3 × 24 | - | -
Batch Normalisation (2) | - | - | ReLU
Max-Pooling (2) | - | [12] | -
Convolutional (3) | 1 × 3 × 24 | - | -
Batch Normalisation (3) | - | - | ReLU
Average-pooling (1) | - | [12] | -
FC-output | - | - | Softmax

Fig. 1. Proposed model.

IV. RESULTS

The CW1173 ChipWhisperer board was used for our testing [19]. This SCA platform has a target board equipped with an 8-bit Atmel AVR Xmega128 microcontroller capable of executing AES-128. The internal analog-to-digital converter (ADC) of the ChipWhisperer (CW) Lite captures the signal. With this setup, we can deliver the software, the plaintext, and the key to the Xmega board while recording the traces on a laptop. For the tests, we have 5000 power traces available; each power trace includes 10,000 samples covering AES Round 1 and Round 2. The non-profiled attack keeps the same key throughout and chooses 5000 plaintexts at random. We focus on attacking only the first round of the AES.
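The 256-hypothesis procedure mentioned above can be sketched as follows. In our method, one CNN is trained per key guess; in this illustrative Python sketch the per-guess scoring function is left abstract, and `match_score` is only a toy stand-in for such a trained model.

```python
# Hamming-weight lookup table for all byte values
HW = [bin(v).count("1") for v in range(256)]

def rank_key_byte(traces, plaintexts, sbox, score):
    """Score all 256 guesses for one key byte; return them best-first.
    `score(traces, hyp)` rates how well the hypothetical leakage `hyp`
    explains the measurements (in the paper, a CNN plays this role)."""
    scored = []
    for guess in range(256):
        # Hypothetical leakage under this guess: HW(Sbox(pt XOR guess))
        hyp = [HW[sbox[pt ^ guess]] for pt in plaintexts]
        scored.append((score(traces, hyp), guess))
    scored.sort(reverse=True)
    return [g for _, g in scored]

def match_score(traces, hyp):
    """Toy score: how often the first trace sample equals the
    hypothetical leakage value (a real attack would train a model)."""
    return sum(1 for t, h in zip(traces, hyp) if t[0] == h)
```

The correct guess is the one whose hypothetical leakage best matches the measurements, which is why 256 separate evaluations are unavoidable for an 8-bit key byte.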
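The correlation-coefficient step of the first processing phase can be sketched as follows. This is an illustrative Python sketch of Pearson-correlation-based point selection; the exact selection rule used in our MATLAB implementation may differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_points(traces, target, keep):
    """Rank each sample index by |corr| with `target` (e.g., a leakage
    hypothesis) and keep the `keep` strongest points of interest."""
    n_samples = len(traces[0])
    scored = []
    for i in range(n_samples):
        column = [t[i] for t in traces]
        scored.append((abs(pearson(column, target)), i))
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:keep])
```

Keeping only the strongest-correlating sample points shrinks each trace to a size the CNN input layer can handle.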
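The three-way partition described above can be sketched as follows; the 70/15/15 split ratio is an assumption chosen for illustration, not a value taken from our experiments.

```python
import random

def split_dataset(traces, labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a labelled trace set and partition it into training,
    validation, and test subsets."""
    idx = list(range(len(traces)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_frac)
    n_val = int(len(idx) * val_frac)
    test_i = idx[:n_test]
    val_i = idx[n_test:n_test + n_val]
    train_i = idx[n_test + n_val:]
    pick = lambda ix: ([traces[i] for i in ix], [labels[i] for i in ix])
    return pick(train_i), pick(val_i), pick(test_i)
```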
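The layer stack of Table II can be sanity-checked by tracing the sequence length through the network. The sketch below assumes 'same'-padded 1-D convolutions (so that, as stated above, a convolution's output has the same size as its input) and an illustrative pooling kernel and stride of 2, since the bracketed pooling sizes in the table are ambiguous in the source.

```python
def conv1d_same_len(length: int) -> int:
    # A 'same'-padded convolution keeps the temporal length unchanged
    return length

def pool_len(length: int, kernel: int, stride: int) -> int:
    # Output length of a pooling layer without padding
    return (length - kernel) // stride + 1

def model_output_lengths(n_samples, pool_kernel=2, pool_stride=2):
    """Trace the per-block sequence length of the conv/pool stack of
    Table II (three conv + batch-norm blocks, each followed by a pool:
    two max-pools, then one average-pool)."""
    lengths = []
    length = n_samples
    for _ in range(3):
        length = conv1d_same_len(length)
        length = pool_len(length, pool_kernel, pool_stride)
        lengths.append(length)
    return lengths
```

Under these assumptions, a 10,000-sample trace is halved by each pooling stage before reaching the fully connected softmax output.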
V. CONCLUSIONS

According to the findings of this research, CNNs create difficulties for SCA when the aligned power traces include a large number of samples. After preparing the CNN training data, we evaluated the power traces with the original data and those with added Gaussian noise. Our non-profiled SCA data preparation method is based on a CNN, which allows the key properties to be extracted. Our method requires fewer power traces for attacks because the power traces are organised into three distinct groups. These findings indicate that our technique can effectively recover an increased number of bytes from SCA compared to previous methods based on CPA and on DL-SCA without regularisation. The consistent findings that our CNN architecture produces for non-profiled attacks highlight the considerable challenge posed by Gaussian noise in power traces. To improve the performance of neural networks faced with non-profiled attacks, we will investigate several pre-processing strategies that aim to decrease the power-trace noise.

CONFLICTS OF INTEREST

The authors declare that they have no conflicts of interest.

REFERENCES

[1] G. Yang, H. Li, J. Ming, and Y. Zhou, "Convolutional neural network based side-channel attacks in time-frequency representations", in Smart Card Research and Advanced Applications. CARDIS 2018. Lecture Notes in Computer Science, vol. 11389. Springer, Cham, 2019, pp. 1–17. DOI: 10.1007/978-3-030-15462-2_1.
[2] B. Timon, "Non-profiled deep learning-based side-channel attacks with sensitivity analysis", IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 2, pp. 107–131, 2019. DOI: 10.13154/tches.v2019.i2.107-131.
[3] H. Maghrebi, O. Rioul, S. Guilley, and J.-L. Danger, "Comparison between side-channel analysis distinguishers", in Information and Communications Security. ICICS 2012. Lecture Notes in Computer Science, vol. 7618. Springer, Berlin, Heidelberg, 2012, pp. 331–340. DOI: 10.1007/978-3-642-34129-8_30.
[4] H. Maghrebi, "Deep learning based side channel attacks in practice", IACR Cryptology ePrint Archive, vol. 2019, p. 578, 2019.
[5] B. Timon, "Non-profiled deep learning-based side-channel attacks", IACR Cryptology ePrint Archive, vol. 2018, p. 196, 2019. DOI: 10.46586/tches.v2019.i2.107-131.
[6] D. Das, A. Golder, J. Danial, S. Ghosh, A. Raychowdhury, and S. Sen, "X-DeepSCA: Cross-device deep learning side channel attack", in Proc. of 2019 56th ACM/IEEE Design Automation Conference (DAC), 2019, pp. 1–6. DOI: 10.1145/3316781.3317934.
[7] L. Masure, C. Canovas, and E. Prouff, "A comprehensive study of deep learning for side-channel analysis", IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2020, no. 1, pp. 348–375, 2019. DOI: 10.13154/tches.v2020.i1.348-375.
[8] J. J. Quisquater, "A new tool for non-intrusive analysis of smart cards based on electro-magnetic emissions. The SEMA and DEMA methods", Eurocrypt Rump Session, 2000.
[9] S. Picek, I. P. Samiotis, J. Kim, A. Heuser, S. Bhasin, and A. Legay, "On the performance of convolutional neural networks for side-channel analysis", in Security, Privacy, and Applied Cryptography Engineering. SPACE 2018. Lecture Notes in Computer Science, vol. 11348. Springer, Cham, 2018, pp. 157–176. DOI: 10.1007/978-3-030-05072-6_10.
[10] H. Wang, S. Forsmark, M. Brisfors, and E. Dubrova, "Multi-source training deep-learning side-channel attacks", in Proc. of 2020 IEEE 50th International Symposium on Multiple-Valued Logic (ISMVL), 2020, pp. 58–63. DOI: 10.1109/ISMVL49045.2020.00-29.
[11] J. Kim, S. Picek, A. Heuser, S. Bhasin, and A. Hanjalic, "Make some noise. Unleashing the power of convolutional neural networks for profiled side-channel analysis", IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2019, no. 3, pp. 148–179, 2019. DOI: 10.13154/tches.v2019.i3.148-179.
[12] Y.-S. Won, D.-G. Han, D. Jap, S. Bhasin, and J.-Y. Park, "Non-profiled side-channel attack based on deep learning using picture trace", IEEE Access, vol. 9, pp. 22480–22492, 2021. DOI: 10.1109/ACCESS.2021.3055833.
[13] L. Wu et al., "Label correlation in deep learning-based side-channel analysis", IEEE Transactions on Information Forensics and Security, vol. 18, pp. 3849–3861, 2023. DOI: 10.1109/TIFS.2023.3287728.
[14] S. Maji, U. Banerjee, S. H. Fuller, and A. P. Chandrakasan, "A threshold-implementation-based neural-network accelerator securing model parameters and inputs against power side-channel attacks", in Proc. of 2022 IEEE International Solid-State Circuits Conference (ISSCC), 2022, pp. 518–520. DOI: 10.1109/ISSCC42614.2022.9731598.
[15] S. Shimada, K. Kuroda, Y. Fukuda, K. Yoshida, and T. Fujino, "Deep learning-based side-channel attacks against software-implemented RSA using binary exponentiation with dummy multiplication", IEICE Technical Report, vol. 122, no. 11, pp. 13–18, 2022.
[16] B. Sönmez, A. A. Sarıkaya, and Ş. Bahtiyar, "Machine learning based side channel selection for time-driven cache attacks on AES", in Proc. of 2019 4th International Conference on Computer Science and Engineering (UBMK), 2019, pp. 1–5. DOI: 10.1109/UBMK.2019.8907211.
[17] G. Zaid, L. Bossuet, F. Dassance, A. Habrard, and A. Venelli, "Ranking loss: Maximizing the success rate in deep learning side-channel analysis", IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2021, no. 1, pp. 25–55, 2021. DOI: 10.46586/tches.v2021.i1.25-55.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", in Proc. of the 32nd International Conference on Machine Learning, 2015, pp. 1–9.
[19] C. Pfeifer and P. Haddad, "Spread: A new layer for profiled deep-learning side-channel attacks", IACR Cryptology ePrint Archive, vol. 2018, p. 880, 2018.
[20] S. Jin, S. Kim, H. Kim, and S. Hong, "Recent advances in deep learning-based side-channel analysis", ETRI Journal, vol. 42, no. 2, pp. 292–304, 2020. DOI: 10.4218/etrij.2019-0163.
[21] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks", in Proc. of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[22] M. Sako, K. Kuroda, Y. Fukuda, K. Yoshida, and T. Fujino, "Deep learning side-channel attacks against hardware-implemented lightweight cipher Midori 64", IEICE Technical Report, vol. 122, no. 11, pp. 7–12, 2022.
[23] F.-X. Standaert, "Introduction to side-channel attacks", in Secure Integrated Circuits and Systems. Integrated Circuits and Systems. Springer, Boston, MA, 2010, pp. 27–42. DOI: 10.1007/978-0-387-71829-3_2.
[24] M. Wei, D. Shi, S. Sun, P. Wang, and L. Hu, "Convolutional neural network based side-channel attacks with customized filters", in Information and Communications Security. ICICS 2019. Lecture Notes in Computer Science, vol. 11999. Springer, Cham, 2020, pp. 799–813. DOI: 10.1007/978-3-030-41579-2_46.
[25] W. Yang, X. Xiang, C. Huang, A. Fu, and Y. Yang, "MCA-based multi-channel fusion attacks against cryptographic implementations", IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 2, pp. 476–488, 2023. DOI: 10.1109/JETCAS.2023.3252085.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0
(CC BY 4.0) license (http://creativecommons.org/licenses/by/4.0/).