
Anomaly Detection of Industrial Control Systems Based On Transfer Learning

This document discusses using transfer learning to detect anomalies in industrial control system traffic. It proposes using a deep residual convolutional neural network trained on labeled data from one domain, then applying transfer learning to detect anomalies in an unlabeled target domain. The model is first trained on the KDDCUP99 dataset, then applied to a gas pipeline dataset where it achieves good anomaly detection results. Transfer learning allows the model to be quickly trained and detect unknown or differently distributed abnormal data in the target industrial control system domain.

Uploaded by

Nabeel Ahammed

TSINGHUA SCIENCE AND TECHNOLOGY

ISSN 1007-0214 04/13 pp821–832
DOI: 10.26599/TST.2020.9010041
Volume 26, Number 6, December 2021

Anomaly Detection of Industrial Control Systems Based on Transfer Learning

Weiping Wang*, Zhaorong Wang, Zhanfan Zhou, Haixia Deng, Weiliang Zhao, Chunyang Wang,
and Yongzhen Guo

Abstract: Industrial Control Systems (ICSs) are the lifeline of a country. Therefore, the anomaly detection of ICS
traffic is an important endeavor. This paper proposes a model based on a deep residual Convolution Neural Network
(CNN) to prevent gradient explosion or gradient disappearance and guarantee accuracy. The developed methodology
addresses two limitations: most traditional machine learning methods can only detect known network attacks and
deep learning algorithms require a long time to train. The utilization of transfer learning under the modification of the
existing residual CNN structure guarantees the detection of unknown attacks. One-dimensional ICS flow data are
converted into two-dimensional grayscale images to take full advantage of the features of CNN. Results show that the
proposed method achieves a high score and solves the time problem associated with deep learning model training.
The model can give reliable predictions for unknown or differently distributed abnormal data through short-term
training. Thus, the proposed model ensures the safety of ICSs and verifies the feasibility of transfer learning for ICS
anomaly detection.

Key words: anomaly detection; transfer learning; deep learning; Industrial Control System (ICS)

• Weiping Wang and Chunyang Wang are with School of Computer and Communication Engineering, the Beijing Key Laboratory of Knowledge Engineering for Materials Science, and the Institute of Artificial Intelligence, University of Science and Technology Beijing, Beijing 100083, China, and with Shunde Graduate School, University of Science and Technology Beijing, Guangzhou 528399, China. E-mail: [email protected]; [email protected].
• Zhaorong Wang is with School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected].
• Zhanfan Zhou and Weiliang Zhao are with School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected]; [email protected].
• Haixia Deng is with the Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected].
• Yongzhen Guo is with School of Automation, Beijing Institute of Technology, Beijing 100081, and also with China Software Testing Center, Beijing 100048, China. E-mail: [email protected].
* To whom correspondence should be addressed.
Manuscript received: 2020-09-02; accepted: 2020-09-23
© The author(s) 2021. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

1 Introduction

Modern Industrial Control Systems (ICSs) have higher production efficiency than traditional industrial systems and can well process big data.

However, increases in the type and frequency of network attacks and hacking incidents threaten the security of ICSs based on data transmission. The National Institute of Standards and Technology has proposed the main sources of security issues for modern ICSs[1], which include nonsecure communication protocols, poor network isolation and access controls[2], and the lack of an ICS anomaly detection system[3]. Intrusion detection technology is an important research direction in the field of network security. The original flows of network equipment and servers have been comprehensively analyzed[4]. When industrial control networks are invaded or traffic data are abnormal, intrusion detection technology can effectively predict and take active defensive measures in a timely manner. Deep learning has shown great research significance
in intrusion detection technology. Feature values are extracted by training on large amounts of data, parameters are adjusted continuously, and a system that can identify abnormal traffic data is constructed.

Deep learning and traditional machine learning show certain similarities. The core aim of traditional machine learning is to map features to the target space. In traditional machine learning algorithms, the recognition rate increases with increasing data size; however, because a bottleneck is often encountered during processing, these models cannot handle massive amounts of data. Machine learning performs well in intrusion detection in closed environments. However, its weaknesses are exposed in open-world scenarios with various random traffic or noise, which could adversely affect its availability[4]. Therefore, traditional machine learning algorithms are unsuitable for detecting abnormal traffic in ICSs, and finding abnormal data quickly and implementing active measures with high accuracy are quite challenging.

Compared with traditional machine learning, deep learning generalizes well when extracting features from high-dimensional data. Deep learning uses back-propagation algorithms to adjust parameters continuously to achieve optimal results. This learning method can handle large amounts of data; indeed, the larger the data size, the better the resulting effect. Unfortunately, although deep learning has good generalizability in processing images, it relies on labeled data and cannot handle unknown abnormal data types[5]. In this article, we solve some of the problems of traditional machine learning by using a residual Convolution Neural Network (CNN) structure to model the source dataset and modify the relevant parameters by transfer learning. We then apply the transfer learning algorithm, using the relevant information of the source domain to predict the target domain[6]. Transfer learning is finally employed to train the model quickly and detect differently distributed or unknown datasets.

ICS flow data can usually be processed as one-dimensional data sequences through preprocessing; in this work, however, we use mapping to convert ICS flow data to an image format suitable for CNNs to take full advantage of the features of the latter. Fine-tuning is utilized during transfer learning to ensure timeliness. After building an eight-layer residual neural network, only the three deepest layers of the neural network are fine-tuned.

The residual structure, which effectively prevents gradient explosion or gradient disappearance while preserving the benefit of model depth, is introduced. The detection ability of the model in unknown domain datasets is excellent. After training on the source domain, that is, the KDDCUP99 dataset, the model is used on a gas pipeline dataset through transfer learning. The model shows good anomaly detection effects on the gas pipeline dataset[7], and its precision, recall, and F1-score are fairly high.

This article is organized as follows. Section 2 introduces the background of this study and the related work. Section 3 describes the method. Section 4 introduces the evaluation index. Section 5 describes the experimental process and results. Section 6 provides the conclusions and directions for future work.

2 Background and Related Work

2.1 Research status of industrial control system anomaly detection

Given rapid developments in informatization and industrialization, ICS has been widely used in national infrastructures. However, platform hardware and software vulnerabilities and the openness of the network environment render ICSs vulnerable to security attacks. Therefore, ICS anomaly detection is very important. Anomaly detection has a significant effect on the active defense process of ICSs[8]. The anomaly detection approaches of ICSs mainly include three types, namely, knowledge-based, statistics-based, and machine learning-based. Reference [9] proposed an anomaly detection method based on state recognition to detect attacks in ICSs by using a data-driven clustering method to identify the normal and critical states of a system. A statistical model for traffic detection in the time domain has also been introduced to detect network anomalies and evaluate the performance of the method in different scenarios[10]. Results show that the model can detect network anomalies in all scenarios faster than other methods. Considering their consequences, network attacks aimed at ICSs are very serious. More importantly, they are difficult to detect[11]. In the context of industrial control environments, anomaly detection based on machine learning does well in improving the accuracy of finding abnormal behavior and is of great importance in the establishment of efficient and intelligent intrusion detection models[8].

2.2 Anomaly detection based on machine learning

2.2.1 Traditional machine learning algorithms

Decision tree, random forest, Support Vector Machine (SVM), and logistic regression are traditional machine learning methods. An earlier study used an intelligent Markov model based on statistical learning to establish a multimodel intrusion detection system for industrial process automation that could effectively detect actual attack operations[12]. Other researchers used SVM to establish a data detection model utilized in an industrial control communication protocol[13].

Decision tree is a machine learning method with a tree structure and high efficiency. It is easy to understand and highly effective for processing discrete data. SVM is based on the principle of structural risk minimization. In this method, the optimal classification is found by learning the classification model of data samples in the feature space.

2.2.2 Deep learning model

Traditional machine learning methods have a number of disadvantages, such as low efficiency in processing large-scale data and inability to solve samples with uneven distributions. Compared with traditional machine learning methods, deep learning models have more complex architectures and multiple layers. The most important advantages of deep learning over traditional machine learning are that it can learn features directly and automatically from the original data and has good performance[14].

Deep learning models are quite effective in the field of detecting industrial process anomalies. Almalawi et al.[15] proposed two novel techniques that are an automatic identification of inconsistent states of SCADA data and an automatic extraction of proximity detection rules from identified states. Gao[16] developed an anomaly-based intrusion detection system for the SCADA network and found a combined Intrusion Detection System (IDS) which includes signature-based IDS and anomaly-based IDS. These studies show that deep learning methods have good performance in anomaly detection and attack classification.

CNNs are suitable for image classification[17]. Compared with other image recognition algorithms, CNN uses not only deep learning methods but also some special structures for feedforward neural networks and has relatively little data to preprocess.

2.2.3 Transfer learning based on model fine-tuning

Transfer learning is an effective approach to exploit deep neural networks on small datasets. The essence of transfer learning is to transfer and reuse knowledge in other fields. Mathematically, transfer learning includes two concepts, namely, domain and learning task. Model-based transfer learning methods are usually combined with deep learning models to transfer the structure and parameters of models that have been trained on large-scale datasets (e.g., AlexNet, VGGNet, and ResNet) to new tasks and use the weights trained on the large dataset as the initial weights for the new task[18]. In contrast to deep learning, transfer learning can detect unknown information. Fine-tuning pretrained CNNs on images is an effective strategy to achieve transfer learning. This technique is widely used in the field of image recognition and provides new insights into the recognition of small-scale datasets. In transfer learning, the deep network structure is trained on a large natural image dataset, after which the model is transferred to a small dataset by fine-tuning its parameters. The features extracted from the pretrained deep neural network are universal and applicable to other datasets. Figure 1 shows a schematic of the model.

[Figure labels: Source, Model, Copy, Target, Fine-tuned model]
Fig. 1 Transfer learning by model fine-tuning.

2.2.4 Fine-tune based on residual neural network

Kaiming He, a researcher at Microsoft Research Asia, designed residual neural networks with a deeper network structure and a simpler network structure[19]. The residual network consists of multiple residual blocks, and each residual block comprises a convolutional layer and a pooling layer. The blocks of the convolutional layer are skipped by using shortcut connections. The use of identity shortcuts requires the same input and output sizes[20]. In this case, the problem of model attenuation caused by the disappearance of gradients is avoided by the superposition of gradients. This deep residual neural network won five championships in two major technical competitions, namely, ImageNet and MS COCO. A unique feature of this network is that it includes a network depth greater than 152 layers, which
had never been achieved before. In previous experiments, the gradient disappeared as the number of network layers increased and the error rate of such a network was higher than that of a neural network with a lower number of layers. The emergence of the residual neural network solves these issues well. Experiments showed that higher accuracy could be obtained as the number of network layers trained by the residual network increased, which means the residual network can allow deeper network layer training, and the performance of the model was greatly enhanced[21].

Traditional machine learning algorithms have poor generalizability, whereas deep learning algorithms incur greater time costs and may be prone to gradient disappearance or explosion. Although fine-tuning technology based on deep neural networks can solve the problem of high time cost and detect unknown attacks, model pretraining remains subject to gradient disappearance or explosion during deep learning. Fine-tuning based on a deep residual neural network can solve these problems simultaneously.

3 Method

In this section, we will describe the processing flow in detail. Because of the correlations among the features of industrial control flow data, we convert the one-dimensional data stream into a two-dimensional matrix and then convert this matrix into a Mahalanobis Distance (MD) matrix. The obtained matrix is converted into one-dimensional data, saved, normalized, and then mapped into a black-and-white image. After building an eight-layer CNN with a residual structure, we use the KDDCUP99 dataset for pretraining and then train the obtained model on the gas pipeline dataset by fine-tuning through the transfer learning method.

3.1 Data preprocessing

This experiment uses a deep migration learning model and the KDDCUP99 dataset as the source domain to perform migration learning on the gas pipeline dataset. Because the data may be disturbed by noise, missing values, and inconsistent data, the presence of low-quality data is inevitable. We improve the credibility of the data by preprocessing them and then improve the performance of model recognition. A preliminary exploration of the data reveals the presence of attributes with exactly the same feature values in the training data; these attributes are not beneficial to the establishment of the model and affect its construction[8]. Therefore, the redundant features are removed. Some normalization methods are used to process the training data, improve the convergence speed of deep learning, and complete the task of anomaly detection.

3.1.1 Target domain: Gas pipeline dataset

The steps are as follows:

(1) Data cleaning: A large amount of abnormal data will greatly affect the normalized results by affecting the data distribution. Data cleaning is used to clean duplicates, erroneous data, and useless features, thus improving the reliability and integrity of the data as well as the accuracy of the analysis results. The gas pipeline dataset includes some negative and unreasonably large values of key measurement data, which means cleaning is necessary. Next, the attributes “commandlength”, “commwritefun”, “reset”, “gain”, “deadband”, “cycletime”, “rate”, and “crcrate” are redundant attributes in the dataset[22]. We delete these attributes from the dataset because they interfere with data classification.

(2) Feature mapping: This experiment uses MD to perform feature mapping on the data. MD was proposed by Mahalanobis[23] as a distance measurement method and refers to the covariance distance of the data. In contrast to the Euclidean distance, MD ignores differences in measurement units and considers the relationship between features, thus aligning the relationship between features with the actual situation[24]. Therefore, MD is not affected by the measurement scale and the interference of correlations between variables can be eliminated. Figure 2 shows the pseudo code for the feature mapping of data using MD.

Main code of feature mapping method based on MD
Input: industrial control network data stream
Output: transformed feature matrix
Do while Xi:
    Diag = convert_to_diag(Xi)
    # The function is responsible for converting the data stream into a diagonal matrix
    Map_Matrix = map_matrix(Diag)
    # The function is responsible for converting the diagonal matrix into the mapped matrix
    Save_matrix(Map_Matrix)
    # The function saves the transformed matrix
    Xi = Xi+1
End while
Fig. 2 Pseudo code for feature mapping of data using the MD.

3.1.2 Source domain: KDDCUP99 dataset

We standardize the source domain, that is, the KDDCUP99 dataset. To standardize the data, the mean is subtracted from each value and the result is divided by the standard deviation. When this data standardization
method is completed, the data are converted into a standard normal distribution. In general, the standard deviation is 1 and the mean is 0. The conversion function is as follows:

\[
x^{*} = \frac{x - \mu}{\sigma} \qquad (1)
\]

where $x$ is the data, $\mu$ is the mean, and $\sigma$ is the standard deviation.

Standardizing the dataset can accelerate the search for optimal solutions. Standardization is conducive to process initialization, avoids numerical problems when updating the gradient value, and helps adjust the learning rate. It can also ensure that small values in the output data are utilized.

3.2 Data visualization

3.2.1 Gas pipeline dataset

One-dimensional industrial flow data are transformed into a two-dimensional matrix via the feature-mapping method. This section introduces the feature matrix visualization method employed in this article. In this paper, every element in the MapMatrix matrix is regarded as a pixel, and the element value corresponds to the gray value of the pixel.

3.2.2 KDDCUP99 dataset

The 41-dimensional feature samples are converted into 8-bit-depth grayscale images measuring 7 pixel × 7 pixel in size. The pixel values range from 0 to 255, and each feature corresponds to a pixel.

3.3 Eight-layer residual convolution neural network

The existing residual CNN (layers ≤ 34) is shown in Fig. 3.

[Figure: 7×7 conv, 64, 2 → Basicblock×n, (3×3 conv, 64, 1)×2 → Basicblock×n, (3×3 conv, 128, 2)×2 → Basicblock×n, (3×3 conv, 256, 2)×2 → Basicblock×n, (3×3 conv, 512, 2)×2 → Full connection layer]
Fig. 3 Original residual convolutional neural network model structure.

Although the residual neural network increases the accuracy of predicting labels as the network deepens[18], it also leads to a longer training time, which is unfavorable for anomaly detection in ICS. Compared with colored pictures, the data flow has fewer features and does not require a deep network structure for feature extraction. Thus, we constructed an eight-layer CNN with a residual structure.

3.3.1 Input layer

After data visualization, the data stream from the KDDCUP99 dataset is processed into a 7 × 7 grayscale image, and the data stream from the gas pipeline dataset is processed into an 18 × 18 grayscale image. These two input sizes are relatively small. The stride of the input layer is set to 1, the kernel size is set to 3 × 3, and the number of input channels is set to 3 (the algorithm automatically converts the grayscale image into the RGB model) to utilize the data completely.

3.3.2 Residual blocks

Three residual blocks are utilized in the model. Each residual module is composed of two weight layers and two ReLU activation functions. The weight layer is composed of a convolutional layer and a batch-normalization layer.

The batch-normalization layer transforms the input value distribution of any neuron in each layer of the neural network into a standard normal distribution via a certain normalization method. Therefore, the batch-normalization layer prevents the model from gradient vanishing and greatly accelerates its training speed[25].

The ReLU function performs a nonlinear transformation on the input. With this nonlinearity, the output is no longer a linear combination of the outputs of the previous layer, so the network can approximate arbitrary functions, thus ensuring the significance of the deep neural network[26].

The size of the kernel of the convolutional layer is 3 × 3, and the stride is 1. As the network deepens, the number of kernels varies from 64 to 256. A schematic of each residual module is shown in Fig. 4. The equation of the module is as follows:

\[
y = F(x) + x \qquad (2)
\]

where $x$ is the input matrix and $F(x)$ is the output after the two-layer convolution operation. $y$ is the input of the next residual module.

Fig. 4 Residual block structure.
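
The residual module of Section 3.3.2 and Eq. (2) can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the class name `ResidualBlock`, the padding of 1, the constant channel count, and the placement of the second ReLU after the addition are assumptions that follow common ResNet practice. A block that changes the number of kernels (64 to 256 in the paper) would additionally need a projection on the shortcut, which is omitted here.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two weight layers (3x3 conv + batch norm) with an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        # Weight layer 1: convolution followed by batch normalization.
        # padding=1 (an assumption) keeps the spatial size unchanged.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        # Weight layer 2: same shape, so the identity shortcut needs no projection.
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))  # first weight layer + ReLU
        out = self.bn2(self.conv2(out))           # F(x)
        return self.relu(out + x)                 # y = F(x) + x, then second ReLU


block = ResidualBlock(channels=64)
block.eval()                        # use running statistics in batch norm
x = torch.randn(2, 64, 18, 18)      # two hypothetical 18 x 18 feature maps
with torch.no_grad():
    y = block(x)
```

Because the shortcut adds `x` unchanged, the output keeps the shape of the input, which is the condition for identity shortcuts mentioned in Section 2.2.4.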

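As a small worked example of the standardization in Eq. (1) of Section 3.1.2, the sketch below computes z-scores for one feature column using only the Python standard library. The function name `standardize`, the sample values, and the handling of constant columns are illustrative assumptions; the paper does not state how constant columns are treated beyond removing redundant attributes.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return the z-scores of one feature column: (x - mean) / std."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        # Assumption: a constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


duration = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature values
z = standardize(duration)
```

For these values the mean is 5 and the standard deviation is 2, so the first entry maps to (2 - 5) / 2 = -1.5, and the standardized column has mean 0 and standard deviation 1.
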
3.3.3 Pooling layer

Commonly used pooling operations include maximum down-sampling, average down-sampling, and spatial down-sampling. Down-sampling is used in CNN to reduce model parameters. Among these pooling operations, maximum down-sampling has been proven to have the best information retention capability. Because the number of data features in ICSs is small, we only add a maximum pooling layer prior to the fully connected layer to reduce information loss. The stride is 2, the kernel size is 3 × 3, and the number of in-channels is 256.

3.3.4 Fully connected layer

The fully connected layer is implemented by using a linear transformation function, which acts as a classifier for the entire neural network. Assume that the output image size of the previous layer is M × N and the number of kernels is K. Because we are studying a two-class problem, the fully connected layer transforms the M × N × K-dimensional data into two-dimensional data, that is, the predicted probability of each label. The algorithm outputs the label with the greatest predicted probability.

Figure 5 shows the structure of the entire model.

[Figure labels: cv, Batch Normalization+Relu, Max Pooling, Fully Connected; outputs: Normal, Attack]
Fig. 5 Model structure used in this article.

3.4 Fine-tuning

Fine-tuning is performed according to the neural network. As the network deepens, the extracted features become more abstract. For two similar domains, the earlier layers that extract common features can be retained after source domain training, and only the deepest several layers of the network need to be trained on the target domain.

In this study, because the KDDCUP99 and gas pipeline datasets are anomaly detection datasets with fixed-dimensional data features and certain correlations between features, we can use transfer learning on the basis of the data features of the datasets described above. The KDDCUP99 dataset has a sufficient sample number and contains over 490 000 samples, while the gas pipeline dataset is relatively small and contains only over 90 000 samples. Employing the premise initially introduced in this subsection, we use the KDDCUP99 dataset to pretrain the model completely, retain the parameters of the convolutional layer that could extract low-dimensional features, and then train and adjust the last three layers of the neural network through fine-tuning.

4 Evaluation Index

We utilize recall, precision, F1-score, False Positive Rate (FPR), and accuracy to evaluate the experimental results. Precision reflects the proportion of samples predicted to be positive that are actually positive, recall reflects the proportion of real positive samples that are predicted to be positive, F1-score combines precision and recall, and FPR reflects the proportion of negative samples that are incorrectly classified as positive[27].

\[
\begin{aligned}
\text{precision} &= \frac{TP}{TP + FP}, \\
\text{recall} &= \frac{TP}{TP + FN}, \\
\text{FPR} &= \frac{FP}{FP + TN}, \\
\text{F1-score} &= \frac{2TP}{2TP + FP + FN}, \\
\text{accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{aligned}
\qquad (3)
\]

We assume that normal samples in the actual samples are positive samples and that abnormal attack samples are negative samples. The total number of positive samples predicted correctly is True Positive (TP), and the total number predicted incorrectly is False Negative (FN). The total number of negative samples predicted correctly is True Negative (TN), and the total number predicted incorrectly is False Positive (FP).

5 Experiment

5.1 Dataset description

The datasets used in this experiment are the gas pipeline and KDDCUP99 datasets. The gas pipeline dataset is an industrial control network laboratory-scale ICS dataset based on Modbus application layer protocol published by Professor T. Morris of Mississippi State University[28]. The KDDCUP99 dataset is a network connection dataset obtained from a simulated US Air Force LAN over 9 weeks[28].

The KDDCUP99 dataset is a public dataset used
to verify network anomaly detection algorithms. This accounts for 10%, and the test set of gas pipeline dataset
dataset is employed in the present work to verify accounts for 20%.
the effectiveness of the proposed anomaly detection (2) Visualize the new data samples obtained.
algorithm. The dataset contains 41-dimensional data (3) Use KDDCUP99 to pretrain the Resnet8 model.
samples and includes 22 attack types divided into four Then, save the model and model parameters after testing
categories, namely, Denial of Service attack (DoS), the model performance.
probing, R2L, and U2R[29] . The gas pipeline dataset (4) Load the pretrained model and model parameters,
contains 26 features and a category label. The number use the gas pipeline dataset to fine-tune the last three
of attack categories in the training and test sets is layers of the neural network of Resnet8, and obtain the
equal, and no unknown attack category is present. This model test indicators.
dataset contains seven types of attacks, namely Original The experiments were performed on a computer with
Malicious Response Injection (OMRI), Malicious an i7-8550U CPU processor, 1.8 GHz frequency, and 8
Status Command Injection (MSCI), Complex Malicious GB RAM.
Response Injection (CMRI), Malicious Parameter 5.3 Experimental results and analysis
Command Injection (MPCI), DoS, Malicious Function
Command Injection (MFCI), and Reconnaissance Attack This section introduces the results of the pretraining
(RA)[27] . model, describes the effects of fine-tuning different
Because the KDDCUP99 dataset has a total of 5 numbers of layers, and discusses the effects of model
million items, which is massive, we take only 10 % of fine-tuning and random initialization parameter training
these items for experimentation. The experimental data include approximately 100 000 items, which account for approximately 20% of this dataset. The gas pipeline dataset has a total of 97 019 items, of which 61 156 are normal samples.

5.2 Experimental settings

In view of the different sample sizes of the source and target domain data, we divide the datasets randomly as follows: 90% of the KDDCUP99 dataset is used for the training set, and 10% is used for the test set. Moreover, 80% of the gas pipeline dataset is used for the training set, and 20% is used for the test set.

We use PyTorch to construct the Resnet8 model, multiply the cross-entropy loss function by 1.5–5 and use the result as a loss indicator, and apply the stochastic gradient optimizer. The learning rate of the source domain training is set to 0.001, the batch size is set to 128, and the duration is set to 4 epochs. The learning rate of the target domain is set to 0.0003, the batch size is set to 64, and the duration is set to 5 epochs.

The experimental procedure is as follows.

(1) Read the dataset samples of the target and source domains and then digitize, standardize, and normalize the source domain data. Next, digitize the target domain data, remove redundant features, delete the outliers of individual sites, perform MD calculations, and then normalize the data column by column. After processing into new samples, randomly divide the test and training set by percentage. The test set of KDDCUP99 dataset

with the target domain. After obtaining the results, we explain the benefits of using transfer learning and why the three-layer method of fine-tuning is used. We also demonstrate the superiority of the proposed algorithm by comparing the results with those of other existing algorithms.

5.3.1 Data preprocessing and visualization

By preprocessing each set of traffic data, every data stream in the target domain is processed into 324 pieces of data, and every data stream in the source domain is processed into 41 pieces of data. Figure 6 shows the results of data visualization.

Fig. 6 Data visualization results. (a) Normal-type samples in the source domain; (b) Attack-type samples in the source domain; (c) Normal-type samples in the target domain; (d) Attack-type samples in the target domain.

Each sample in the source domain is processed into a grayscale image of 7 pixel × 7 pixel by a 6-bit 0 supplement, and each sample in the target domain is processed into a grayscale image of 18 pixel × 18 pixel. Processing samples in this form to the residual CNN is clearly
828 Tsinghua Science and Technology, December 2021, 26(6): 821–832

feasible.
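The reshaping step described in Section 5.3.1 can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `to_square_image` and the constant input values are assumptions, and the function simply pads with as many zeros as the chosen grid requires.

```python
import math


def to_square_image(features, side=None, pad_value=0.0):
    """Zero-pad a flat feature vector and reshape it into a side x side grid.

    Hypothetical helper: if `side` is omitted, the smallest square grid
    that can hold all features is used.
    """
    n = len(features)
    if side is None:
        side = math.isqrt(n)
        if side * side < n:
            side += 1
    if n > side * side:
        raise ValueError("feature vector does not fit into the requested grid")
    padded = list(features) + [pad_value] * (side * side - n)
    # Cut the padded vector into rows of equal length.
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# Placeholder inputs: 41 normalized source-domain features and
# 324 normalized target-domain features (values are illustrative only).
source_image = to_square_image([0.5] * 41)    # 7 x 7 grid
target_image = to_square_image([0.5] * 324)   # 18 x 18 grid
print(len(source_image), len(target_image))
```

Each row of the returned grid corresponds to one row of pixels; mapping the normalized values to gray levels is left out of this sketch.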
5.3.2 Model pretraining

Figure 7 shows the change curve of the evaluation index on the training and test sets during model pretraining. The graph shows that the loss and FPR of the source domain continuously decrease during model pretraining until the values stabilize. F1-score, recall, precision, and accuracy steadily rise until values of 100% are obtained. This result means the model converges well on the source domain, and the evaluation index indicates that the model can be used for training in the target domain.
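For reference, the indicators tracked in these curves can be computed from confusion-matrix counts. The sketch below assumes the standard definitions of recall, precision, F1-score, FPR, and accuracy; the function name and the counts are hypothetical and are not taken from the experiments, and which class is treated as positive is a labeling choice.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the five indicators from confusion-matrix counts.

    tp/fn count actual positives, fp/tn count actual negatives.
    """
    recall = tp / (tp + fn)          # share of actual positives that are found
    precision = tp / (tp + fp)       # share of predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)             # share of actual negatives flagged as positive
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "fpr": fpr,
        "accuracy": accuracy,
    }


# Hypothetical counts, chosen only to illustrate the formulas.
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two, which is a quick consistency check when reading tables of reported indicators.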
5.3.3 Model fine-tuning and deep learning

In this experiment, all eight layers of the model are fine-tuned by utilizing transfer learning. The deep learning method used in this article initializes the model randomly and then optimizes all model parameters without transfer learning. The result in Fig. 8 shows that the pretrained model using fine-tuning converges faster and has a smaller loss and higher accuracy than the model using deep learning when the gas pipeline dataset is used for training. This finding indicates that transfer learning is significant in this environment. Both training methods are completed in approximately 81 minutes. The comparison shown in Table 1 reveals that the scores of the method of fine-tuning three layers on the ICS flow anomaly detection indices are close to those of the first two methods, and that this method also greatly reduces the training time of the model.

Fig. 8 Fine-tuning and training of the model using deep learning convergence comparison curves. (a) Accuracy, precision, recall, and F1-score of fine-tuning and deep learning observed during pretraining process; (b) Loss and FPR of fine-tuning and deep learning observed during pretraining process. Axes: epoch quantity and indicator.
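As a rough check on the training-time claim, the durations reported in Table 1 can be converted to seconds and compared. The helper names below are illustrative assumptions; only the durations themselves come from Table 1.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min s' duration into seconds."""
    return 60 * minutes + seconds


def relative_saving(baseline_s, candidate_s):
    """Fraction of the baseline training time saved by the candidate."""
    return (baseline_s - candidate_s) / baseline_s


# Training times as reported in Table 1.
from_scratch = to_seconds(80, 44)       # deep learning without transfer
full_fine_tuning = to_seconds(81, 13)   # fine-tuning the whole model
three_layers = to_seconds(51, 58)       # fine-tuning three layers

saving_vs_full = relative_saving(full_fine_tuning, three_layers)
saving_vs_scratch = relative_saving(from_scratch, three_layers)
print(f"saving vs. full fine-tuning: {saving_vs_full:.1%}")
print(f"saving vs. training from scratch: {saving_vs_scratch:.1%}")
```

Under these figures, fine-tuning three layers takes roughly 36% less time than either alternative, while the reported F1-scores differ by about 0.001.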
Fig. 7 Pretraining process index change curves. (a) Accuracy, precision, recall, and F1-score observed during pretraining process; (b) Loss and FPR observed during pretraining process. Axes: epoch quantity and indicator.

Table 1 Transfer learning effect verification form.
Method                    Recall  Precision  F1-score  FPR     Accuracy  Training time
Deep learning             0.9915  0.9955     0.9935    0.0085  0.9915    80 min 44 s
Fine-tuning the model     0.9929  0.9953     0.9941    0.0088  0.9923    81 min 13 s
Fine-tuning three layers  0.9906  0.9955     0.9931    0.0085  0.9909    51 min 58 s

5.3.4 Fine-tuning of the different layers of the model

Figure 9 shows the changes in loss and accuracy
Weiping Wang et al.: Anomaly Detection of Industrial Control Systems Based on Transfer Learning 829

observed during training. Other indicators are listed in Table 2. Figure 10 shows a visualization of the prediction results.

Fig. 9 Index change curves obtained during the fine-tuning of different numbers of layers. (a) Accuracy observed during the fine-tuning of different numbers of layers; (b) Loss observed during the fine-tuning of different numbers of layers. Axes: epoch quantity versus accuracy in (a) and loss in (b).

The results show that the indicators improve as the number of fine-tuning layers increases. In terms of model convergence, the effects of fine-tuning three and five layers are not much different, but the effect of fine-tuning one layer is relatively unsatisfactory. In the anomaly detection of ICS flow data, the model training time and FPR are subject to stringent requirements. Given comprehensive consideration, the method of fine-tuning three layers appears to be the most appropriate. Such fine-tuning results in the precision and other indicators exceeding 99% and FPR decreasing to 0.85%, which indicates that the abnormal attack types OMRI, MSCI, CMRI, MPCI, DoS, MFCI, and RA have been effectively detected.

Table 2 Effect of fine-tuning of different numbers of layers.
Number of layers  Recall  Precision  F1-score  FPR     Training time
1                 0.9902  0.9940     0.9921    0.0113  37 min 5 s
3                 0.9906  0.9949     0.9929    0.0087  51 min 58 s
5                 0.9915  0.9954     0.9934    0.0087  65 min 56 s

Fig. 10 Visualization of the prediction results. Panels (a) and (b).

5.3.5 Algorithm comparison

The comparison results in Table 3 show that the precision indicators of reciprocal data and AutoEncoder (AE)+One Class Support Vector Machine (OCSVM) are higher than those of other machine learning algorithms. However, the recall indicator of these algorithms is low, which means that not all actual positive samples are detected. The recall index of Generative Adversarial Networks (GAN) is high but its other indices are low, thus indicating that its comprehensive performance is poor. Although the F1-score of AE+OCSVM is high, which indicates that it has good overall performance, its recall value is quite low, which means that a considerable share of the actual positive samples is missed. Such indicator values are unsuitable for anomaly detection in ICSs. The algorithms proposed in this paper are clearly superior to those algorithms in terms of the indicators of interest.

Table 3 Performance comparison of different algorithms.
Algorithm          Recall  Precision  F1-score
GAN                0.9973  0.7498     0.8621
AE+OCSVM           0.8747  0.9907     0.9284
DEC                0.8821  0.8893     0.8909
RDA                0.7301  0.9913     0.8411
Fine-tune+Resnet8  0.9906  0.9955     0.9931
Note: DEC means deep embedded clustering.

6 Conclusion and Future Work

Network security is a popular and important topic. The network security of ICSs is of great importance for a country. This paper uses data visualization to convert flow data into images. Specifically, we build an eight-layer residual neural network and use fine-tuning technology for transfer learning to detect abnormal data in ICS datasets.

Experimental results show that transfer learning for residual CNNs is effective in this field. The depth of the model also ensures that it has a certain generalizability. The residual structure effectively prevents gradient
explosion or gradient disappearance. The model can provide reliable predictions for unknown or differently distributed abnormal data through short-term training by transfer learning. Compared with other anomaly detection algorithms, the algorithm proposed in this paper results in superior indicators. The method we proposed not only solves the problem associated with training time for deep learning models by transfer learning, but also meets the requirements of ICSs in terms of evaluation indicators.

At present, the model we constructed solves the two-classification problem, but a refined classification of abnormal traffic data is still desirable. In the future work, we will perform multiclassification of abnormal traffic data, track the characteristics of different abnormal data types, and then reliably classify them to further ensure network security in ICSs.

Acknowledgment

This work was supported in part by 2018 industrial Internet innovation and development project “Construction of Industrial Internet Security Standard System and Test and Verification Environment”, in part by the National Industrial Internet Security Public Service Platform, in part by the Fundamental Research Funds for the Central Universities (Nos. FRF-BD-19-012A and FRF-TP-19-005A3), in part by the National Natural Science Foundation of China (Nos. 81961138010, U1736117, and U1836106), and in part by the Technological Innovation Foundation of Shunde Graduate School, University of Science and Technology Beijing (No. BK19BF006).

Weiping Wang received the PhD degree in telecommunications physics electronics from Beijing University of Posts and Telecommunications, Beijing, China in 2015. She is currently an associate professor at the School of Computer and Communication Engineering, University of Science and Technology Beijing. She received support from the National Key Research and Development Program of China, the State Scholarship Fund of China Scholarship Council, the National Natural Science Foundation of China, the postdoctoral fund, and other basic scientific research projects. Her current research interests include auto-driving vehicle formation control, brain-like computing, memristive neural network, associative memory awareness simulation, complex network, and network security and image encryption.

Zhaorong Wang is currently an undergraduate student at the School of Automation and Electrical Engineering, University of Science and Technology Beijing. His current research interests include auto-driving vehicle formation control, brain-like computing, intelligent control, machine learning, and anomaly detection.

Zhanfan Zhou is currently an undergraduate student at the School of Mechanical Engineering, University of Science and Technology Beijing. His current research interests include auto-driving vehicle formation control, brain-like computing, intelligent control, machine learning, and anomaly detection.

Haixia Deng is currently an undergraduate student at the Donlinks School of Economics and Management, University of Science and Technology Beijing. Her current research interests include auto-driving vehicle formation control, brain-like computing, intelligent control, machine learning, and anomaly detection.

Weiliang Zhao is currently an undergraduate student at the School of Mechanical Engineering, University of Science and Technology Beijing. His current research interests include auto-driving vehicle formation control, brain-like computing, intelligent control, machine learning, and anomaly detection.

Yongzhen Guo received the master's degree in control theory and control engineering from Tianjin University, Tianjin, China in 2010. He is now a PhD candidate at the School of Automation, Beijing Institute of Technology (BIT). He is also the general manager of the Industrial Control System Evaluation and Certification Department of China Software Testing Center. He received the National Science and Technology Major Projects and National Key Research and Development Programs. His research interests include security and cryptography, safety and reliability, and system evaluation and certification. As a member of SAC/TC124/SC10, SAC/TC196, ISO/TC 199/G8, and IEC/TC65/SC65C/WG18, he is participating in the setting and revising of a number of international and national standards.

Chunyang Wang received the BS degree from Shandong Agricultural University, China in 2019. He is currently a master's student at the University of Science and Technology Beijing. His current research interests include auto-driving vehicle formation control, brain-like computing, intelligent control, machine learning, and anomaly detection.
