Anomaly Detection of Industrial Control Systems Based On Transfer Learning
Weiping Wang , Zhaorong Wang, Zhanfan Zhou, Haixia Deng, Weiliang Zhao, Chunyang Wang,
and Yongzhen Guo
Abstract: Industrial Control Systems (ICSs) are the lifeline of a country. Therefore, the anomaly detection of ICS
traffic is an important endeavor. This paper proposes a model based on a deep residual Convolution Neural Network
(CNN) to prevent gradient explosion or gradient disappearance and guarantee accuracy. The developed methodology
addresses two limitations: most traditional machine learning methods can only detect known network attacks and
deep learning algorithms require a long time to train. The utilization of transfer learning under the modification of the
existing residual CNN structure guarantees the detection of unknown attacks. One-dimensional ICS flow data are
converted into two-dimensional grayscale images to take full advantage of the features of CNN. Results show that the
proposed method achieves a high score and solves the time problem associated with deep learning model training.
The model can give reliable predictions for unknown or differently distributed abnormal data through short-term
training. Thus, the proposed model ensures the safety of ICSs and verifies the feasibility of transfer learning for ICS
anomaly detection.
Key words: anomaly detection; transfer learning; deep learning; Industrial Control System (ICS)
Weiping Wang and Chunyang Wang are with School of Computer and Communication Engineering, the Beijing Key Laboratory of Knowledge Engineering for Materials Science, and the Institute of Artificial Intelligence, University of Science and Technology Beijing, Beijing 100083, China, and with Shunde Graduate School, University of Science and Technology Beijing, Guangzhou 528399, China. E-mail: [email protected]; [email protected].
Zhaorong Wang is with School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected].
Zhanfan Zhou and Weiliang Zhao are with School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected]; [email protected].
Haixia Deng is with the Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China. E-mail: [email protected].
Yongzhen Guo is with School of Automation, Beijing Institute of Technology, Beijing 100081, and also with China Software Testing Center, Beijing 100048, China. E-mail: [email protected].
To whom correspondence should be addressed.
Manuscript received: 2020-09-02; accepted: 2020-09-23
© The author(s) 2021. The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Tsinghua Science and Technology, December 2021, 26(6): 821–832
1 Introduction
Modern Industrial Control Systems (ICSs) have higher production efficiency than traditional industrial systems and can well process big data.
However, increases in the type and frequency of network attacks and hacking incidents threaten the security of ICSs based on data transmission. The National Institute of Standards and Technology has proposed the main sources of security issues for modern ICSs[1], which include nonsecure communication protocols, poor network isolation and access controls[2], and the lack of an ICS anomaly detection system[3].
Intrusion detection technology is an important research direction in the field of network security. The original flows of network equipment and servers have been comprehensively analyzed[4]. When industrial control networks are invaded or traffic data are abnormal, intrusion detection technology can effectively predict and take active defensive measures in a timely manner.
Deep learning has shown great research significance in intrusion detection technology. Feature values are extracted through a great amount of data training, parameters are constantly changed, and a system that can identify abnormal traffic data is constructed.
Deep learning and traditional machine learning show certain similarities. The core aim of traditional machine learning is to map features to the target space. In traditional machine learning algorithms, the recognition rate increases with increasing data size; however, because a bottleneck period is often encountered during processing, these models cannot handle massive amounts of data. Machine learning performs well in intrusion detection in closed environments. However, machine learning will be exposed when entering an open-world scenario with various random traffic or noise, which could adversely affect its availability[4]. Therefore, traditional machine learning algorithms are unsuitable for detecting abnormal traffic in ICSs, and finding abnormal data quickly and implementing active measures with high accuracy are quite challenging.
Compared with traditional machine learning, deep learning has a strong generalizability for extracting high-dimensional data. Deep learning uses back-propagation algorithms to change and adjust parameters continuously to achieve optimal results. This learning method can handle large amounts of data; indeed, the larger the data size, the better the resulting effect. Unfortunately, although deep learning has good generalizability in processing images, it relies on labeled data and cannot handle unknown abnormal data types[5]. In this article, we solve some of the problems of traditional machine learning by using a residual Convolution Neural Network (CNN) structure to model the source dataset and modify the relevant parameters by transfer learning. We then apply the transfer learning algorithm using the relevant information of the source domain and predicting the target domain[6]. Transfer learning is finally employed to train the model quickly and detect differently distributed or unknown datasets.
ICS flow data can usually be processed with one-dimensional data sequences through preprocessing; in this work, however, we use mapping to convert ICS flow data to an image format suitable for CNNs to take full advantage of the features of the latter. Fine-tuning is utilized during transfer learning to ensure timeliness. After building an eight-layer residual neural network, only the three deepest layers of the neural network are fine-tuned.
The residual structure, which effectively prevents gradient explosion or gradient disappearance while ensuring the depth of the model effect, is introduced. The detection ability of the model in unknown domain datasets is excellent. After training the source domain, that is the KDDCUP99 dataset, the model is used on a gas pipeline dataset through transfer learning. The model shows good anomaly detection effects on the gas pipeline dataset[7], and its precision, recall, and F1-score are fairly high.
This article is organized as follows. Section 2 introduces the background of this study and the related work. Section 3 describes the method. Section 4 introduces the evaluation index. Section 5 describes the experimental process and results. Section 6 provides the conclusions and directions for future work.
2 Background and Related Work
2.1 Research status of industrial control system anomaly detection
Given rapid developments in informatization and industrialization, ICS has been widely used in national infrastructures. However, platform hardware and software vulnerabilities and the openness of the network environment render ICSs vulnerable to security attacks. Therefore, ICS anomaly detection is very important. Anomaly detection has a significant effect on the active defense process of ICSs[8]. The anomaly detection approaches of ICSs mainly include three types, namely, knowledge-based, statistics-based, and machine learning-based. Reference [9] proposed an anomaly detection method based on state recognition to detect attacks in ICSs by using a data-driven clustering method to identify the normal and critical states of a system. A statistical model for traffic detection in the time domain has also been introduced to detect network anomalies and evaluate the performance of the method in different scenarios[10]. Results show that the model can detect network anomalies in all scenarios faster than other methods. Considering their consequences, network attacks aimed at ICSs are very serious. More importantly, they are difficult to detect[11]. In the context of industrial control environments, anomaly detection based on machine learning does well in improving the accuracy of finding abnormal behavior and is of great importance in the establishment of efficient and intelligent intrusion detection models[8].
2.2 Anomaly detection based on machine learning
2.2.1 Traditional machine learning algorithms
Decision tree, random forest, Support Vector Machine (SVM), and logistic regression are traditional machine learning methods. An earlier study used an intelligent Markov model based on statistical learning to establish a multimodel intrusion detection system for industrial process automation that could effectively detect actual attack operations[12]. Other researchers used SVM to establish a data detection model utilized in an industrial control communication protocol[13].
Decision tree is a machine learning method with a tree structure and high efficiency. It is easy to understand and highly effective for processing discrete data. SVM is based on the principle of structural risk minimization. In this method, the optimal classification is found by learning the classification model of data samples in the feature space.
2.2.2 Deep learning model
Traditional machine learning methods have a number of disadvantages, such as low efficiency in processing large-scale data and inability to solve samples with uneven distributions. Compared with traditional machine learning methods, deep learning models have more complex architectures and multiple layers. The most important advantages of deep learning over traditional machine learning are that it can learn features directly and automatically from the original data and has good performance[14].
Deep learning models are quite effective in the field of detecting industrial process anomalies. Almalawi et al.[15] proposed two novel techniques that are an automatic identification of inconsistent states of SCADA data and an automatic extraction of proximity detection rules from identified states. Gao[16] developed an anomaly-based intrusion detection system for the SCADA network and found a combined Intrusion Detection System (IDS) which includes signature-based IDS and anomaly-based IDS. These studies show that deep learning methods have good performance in anomaly detection and attack classification.
CNNs are suitable for image classification[17]. Compared with other image recognition algorithms, CNN uses not only deep learning methods but also some special structures for feedforward neural networks and has relatively little data to preprocess.
2.2.3 Transfer learning based on model fine-tuning
Transfer learning is an effective approach to exploit deep neural networks on small datasets. The essence of transfer learning is to transfer and reuse knowledge in other fields. Mathematically, transfer learning includes two concepts, namely, domain and learning task. Model-based transfer learning methods are usually combined with deep learning models to transfer the structure and parameters of models that have been trained on large-scale datasets (e.g., AlexNet, VGGNet, and ResNet) to new tasks and use the weights trained on the large dataset as the initial weights for the new task[18]. In contrast to deep learning, transfer learning can detect unknown information. Fine-tuning pretrained CNNs on images is an effective strategy to achieve transfer learning. This technique is widely used in the field of image recognition and provides new insights into the recognition of small-scale datasets. In transfer learning, the deep network structure is trained on a large natural image dataset, after which the model is transferred to a small dataset by fine-tuning its parameters. The features extracted from the pretrained deep neural network are universal and applicable to other datasets. Figure 1 shows a schematic of the model.
Fig. 1 Transfer learning by model fine-tuning. (Diagram labels: Source, Model, Copy, Target, Fine-tuned model.)
2.2.4 Fine-tune based on residual neural network
Kaiming He, a researcher at Microsoft Research Asia, designed residual neural networks with a deeper network structure and a simpler network structure[19]. The residual network consists of multiple residual blocks, and each residual block comprises a convolutional layer and a pooling layer. The blocks of the convolutional layer are skipped by using shortcut connections. The use of identity shortcuts requires the same input and output sizes[20]. In this case, the problem of model attenuation caused by the disappearance of gradients is avoided by the superposition of gradients. This deep residual neural network won five championships in two major technical competitions, namely, ImageNet and MS COCO. A unique feature of this network is that it includes a network depth greater than 152 layers, which
had never been achieved before. In previous experiments, the gradient disappeared as the number of network layers increased and the error rate of such a network was higher than that of a neural network with a lower number of layers. The emergence of the residual neural network solves these issues well. Experiments showed that higher accuracy could be obtained as the number of network layers trained by the residual network increased, which means the residual network can allow deeper network layer training, and the performance of the model was greatly enhanced[21].
Traditional machine learning algorithms have poor generalizability, whereas deep learning algorithms present greater time costs and may be prone to gradient disappearance or explosion. Although fine-tuning technology based on deep neural networks can solve the problem of high time cost and detect unknown attacks, model pretraining remains subject to gradient disappearance or explosion during deep learning. Fine-tuning based on a deep residual neural network can solve these problems simultaneously.
3 Method
In this section, we will describe the processing flow in detail. Because of the correlations among the features of industrial control flow data, we convert the one-dimensional data stream into a two-dimensional matrix and then convert this matrix into a Mahalanobis Distance (MD) matrix. The obtained matrix is converted into one-dimensional data, saved, normalized, and then mapped into a black-and-white image. After building an eight-layer CNN with a residual structure, we use the KDDCUP99 dataset for pretraining and then train the obtained model on the gas pipeline dataset by fine-tuning through the transfer learning method.
3.1 Data preprocessing
This experiment uses a deep migration learning model and the KDDCUP99 dataset as the source domain to perform migration learning on the gas pipeline dataset. Because the data may be disturbed by noise, missing values, and inconsistent data, the presence of low-quality data is inevitable. We improve the credibility of the data by preprocessing them and then improve the performance of model recognition. A preliminary exploration of the data reveals the presence of attributes with exactly the same feature values in the training data; these attributes are not beneficial to the establishment of the model and affect its construction[8]. Therefore, the redundant features are removed. Some normalization methods are used to process the training data, improve the convergence speed of deep learning, and complete the task of anomaly detection.
3.1.1 Target domain: Gas pipeline dataset
The steps are as follows:
(1) Data cleaning: A large amount of abnormal data will greatly affect the normalized results by affecting the data distribution. Data cleaning is used to clean duplicates, erroneous data, and useless features, thus improving the reliability and integrity of the data as well as the accuracy of the analysis results. The gas pipeline dataset includes some negative and unreasonably large values of key measurement data, which means cleaning is necessary. Next, the attributes “commandlength”, “commwritefun”, “reset”, “gain”, “deadband”, “cycletime”, “rate”, and “crcrate” are redundant attributes in the dataset[22]. We delete these attributes from the dataset because they interfere with data classification.
(2) Feature mapping: This experiment uses MD to perform feature mapping on the data. MD was proposed by Mahalanobis[23] as a distance measurement method and refers to the covariance distance of the data. In contrast to the Euclidean distance, MD ignores differences in measurement units and considers the relationship between features, thus aligning the relationship between features with the actual situation[24]. Therefore, MD is not affected by the measurement scale and the interference of correlations between variables can be eliminated. Figure 2 shows the pseudo code for the feature mapping of data using MD.
Main code of feature mapping method based on MD
Input: industrial control network data stream
Output: transformed feature matrix
Do while Xi:
    Diag = convert_to_diag(Xi)
    # The function is responsible for converting the data stream into a diagonal matrix
    Map_Matrix = map_matrix(Diag)
    # The function is responsible for converting the diagonal matrix into the mapped matrix
    Save_matrix(Map_Matrix)
    # The function saves the transformed matrix
    Xi = Xi+1
End while
Fig. 2 Pseudo code for feature mapping of data using the MD.
3.1.2 Source domain: KDDCUP99 dataset
We standardize the source domain, that is the KDDCUP99 dataset. The data are standardized by subtracting the mean and then dividing by the standard deviation. When this data standardization
method is completed, the data are converted into a standard normal distribution. In general, the standard deviation is 1 and the mean is 0. The conversion function is as follows:
$$x' = \frac{x - \mu}{\sigma} \tag{1}$$
where x is the data, μ is the mean, and σ is the standard deviation.
Standardizing the dataset can accelerate the search for optimal solutions. Standardization is conducive to process initialization, avoids numerical problems when updating the gradient value, and helps adjust the learning rate. It can also ensure that small values in the output data are utilized.
3.2 Data visualization
3.2.1 Gas pipeline dataset
One-dimensional industrial flow data are transformed into a two-dimensional matrix via the feature-mapping method. This section introduces the feature matrix visualization method employed in this article. In this paper, every element in the MapMatrix matrix is regarded as a pixel, and the element value corresponds to the gray value of the pixel.
3.2.2 KDDCUP99 dataset
The 41-dimensional feature samples are converted into 8-bit-depth grayscale images measuring 7 pixel × 7 pixel in size, the pixel values range from 0 to 255, and each feature corresponds to a pixel.
3.3 Eight-layer residual convolution neural network
The existing residual CNN (layers ≤ 34) is shown in Fig. 3.
[Fig. 3: existing residual CNN. Recoverable labels: Basicblock×n, (3×3 conv, 256, 2)×2; Basicblock×n, (3×3 conv, 512, 2)×2; full connection layer.]
Although the residual neural network increases the accuracy of predicting labels as the network deepens[18], it also leads to a longer training time, which is unfavorable for anomaly detection in ICS. Compared with colored pictures, the data flow has fewer features and does not require a deep network structure for feature extraction. Thus, we constructed an eight-layer CNN with a residual structure.
3.3.1 Input layer
After data visualization, the data stream from the KDDCUP99 dataset is processed into a 7 × 7 grayscale image, and the data stream from the gas pipeline dataset is processed into an 18 × 18 grayscale image. These two input sizes are relatively small. The stride of the input layer is set to 1, the kernel size is set to 3 × 3, and the number of input channels is set to 3 (the algorithm automatically converts the grayscale image into the RGB model) to utilize the data completely.
3.3.2 Residual blocks
Three residual blocks are utilized in the model. Each residual module is composed of two weight layers and two Relu activation functions. The weight layer is composed of a convolutional layer and a batch-normalization layer.
The batch-normalization layer transforms the input value distribution of any neuron in each layer of the neural network into a standard normal distribution via a certain normalization method. Therefore, the batch-normalization layer protects the model from gradient vanishing and greatly accelerates its training speed[25].
The Relu function performs a nonlinear transformation on the input. As a result, the output of a layer is no longer a linear combination of the outputs of the previous layer, and the network can approximate arbitrary functions, which is what makes a deep neural network meaningful[26].
The size of the kernel of the convolutional layer is 3 × 3, and the stride is 1. As the network deepens, the number of kernels varies from 64 to 256. A schematic of each residual module is shown in Fig. 4. The equation of the module is as follows:
$$y = F(x) + x \tag{2}$$
where x is the input matrix and F(x) is the output after the two-layer convolution operation. y is the input of the next residual module.
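As a rough illustration of Eq. (2), the following framework-free Python sketch wraps two weight layers in an identity shortcut. It is a simplified assumption rather than the Resnet8 implementation: it operates on one-dimensional lists instead of images, omits batch normalization and the second Relu, and the helper names (`conv1d_same`, `two_layer_f`, `residual_block`) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise Relu."""
    return [max(0.0, a) for a in v]


def conv1d_same(v: Vector, kernel: Vector) -> Vector:
    """Stride-1 convolution with zero padding so the output keeps the input length.

    Assumes an odd kernel length (e.g., 3), mirroring the 3 x 3, stride-1 setting.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(v) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(v))
    ]


def two_layer_f(k1: Vector, k2: Vector) -> Callable[[Vector], Vector]:
    """Build F(x): two convolutional weight layers with a Relu in between."""
    def f(x: Vector) -> Vector:
        return conv1d_same(relu(conv1d_same(x, k1)), k2)
    return f


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Compute y = F(x) + x using an identity shortcut."""
    fx = f(x)
    if len(fx) != len(x):
        # Identity shortcuts require equal input and output sizes.
        raise ValueError("F(x) and x must have the same size")
    return [a + b for a, b in zip(fx, x)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    # With all-zero kernels F(x) vanishes and the block passes x through unchanged.
    print(residual_block(x, two_layer_f([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])))
```

When F(x) contributes nothing, the block reduces to the identity mapping, so the input still reaches the next module through the shortcut; the same additive path carries gradients backward during training.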
3.3.3 Pooling layer
Commonly used pooling operations include maximum down-sampling, average down-sampling, and spatial down-sampling. Down-sampling is used in CNN to reduce model parameters. Among these pooling operations, maximum down-sampling has been proven to have the best information retention capability. Because the number of data features in ICSs is small, we only add a maximum pooling layer prior to the fully connected layer to reduce information loss. The stride is 2, the kernel size is 3 × 3, and the number of in-channels is 256.
3.3.4 Fully connected layer
The fully connected layer is implemented by using a linear transformation function, which acts as a classifier for the entire neural network. Assume that the output image size of the previous layer is M × N and the number of kernels is K. Because we are studying a two-class problem, the fully connected layer transforms the M × N × K-dimensional data into two-dimensional data, that is the predicted probability of each label. The algorithm outputs the label with the greatest predicted probability.
Figure 5 shows the structure of the entire model.
Fig. 5 Model structure used in this article. (Diagram labels: Batch Normalization+Relu, Max Pooling, Fully Connected, Normal, Attack.)
3.4 Fine-tuning
Fine-tuning is performed according to the neural network. As the network deepens, the extracted features become more abstract. For two similar domains, the earlier layers that extract common features can be retained after source domain training, and the target domain only needs to train the deepest several layers of the network.
In this study, because the KDDCUP99 and gas pipeline datasets are anomaly detection datasets with fixed-dimensional data features and certain correlations between features, we can use transfer learning on the basis of the data features of the datasets described above. The KDDCUP99 dataset has a sufficient sample number and contains over 490 000 samples, while the gas pipeline dataset is relatively small and contains only over 90 000 samples. Employing the premise initially introduced in this subsection, we use the KDDCUP99 dataset to pretrain the model completely, retain the parameters of the convolutional layer that could extract low-dimensional features, and then train and adjust the last three layers of the neural network through fine-tuning.
4 Evaluation Index
We utilize recall, precision, F1-score, False Positive Rate (FPR), and accuracy to evaluate the experimental results. Precision reflects the percentage of samples predicted by the model to be positive that are actually positive, recall reflects the proportion of real positive samples that are predicted to be positive, F1-score combines precision and recall, and FPR reflects the proportion of negative samples that are incorrectly classified as positive[27].
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN},$$
$$\text{F1-score} = \frac{2TP}{2TP + FP + FN}, \qquad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
We assume that normal samples in the actual samples are positive samples and that abnormal attack samples are negative samples. The total number of positive samples predicted correctly is True Positive (TP), and the total number of positive samples predicted incorrectly is False Negative (FN). The total number of negative samples predicted correctly is True Negative (TN), and the total number of negative samples predicted incorrectly is False Positive (FP).
5 Experiment
5.1 Dataset description
The datasets used in this experiment are the gas pipeline and KDDCUP99 datasets. The gas pipeline dataset is an industrial control network laboratory-scale ICS dataset based on Modbus application layer protocol published by Professor T. Morris of Mississippi State University[28]. The KDDCUP99 dataset is a network connection dataset obtained from a simulated US Air Force LAN over 9 weeks[28].
The KDDCUP99 dataset is a public dataset used
to verify network anomaly detection algorithms. This dataset is employed in the present work to verify the effectiveness of the proposed anomaly detection algorithm. The dataset contains 41-dimensional data samples and includes 22 attack types divided into four categories, namely, Denial of Service attack (DoS), probing, R2L, and U2R[29]. The gas pipeline dataset contains 26 features and a category label. The number of attack categories in the training and test sets is equal, and no unknown attack category is present. This dataset contains seven types of attacks, namely Original Malicious Response Injection (OMRI), Malicious Status Command Injection (MSCI), Complex Malicious Response Injection (CMRI), Malicious Parameter Command Injection (MPCI), DoS, Malicious Function Command Injection (MFCI), and Reconnaissance Attack (RA)[27].
Because the KDDCUP99 dataset has a total of 5 million items, which is massive, we take only 10% of these items for experimentation. The experimental data include approximately 100 000 items, which accounts for approximately 20% of this subset. The gas pipeline dataset has a total of 97 019 items, of which 61 156 are normal samples.
5.2 Experimental settings
In view of the different sample sizes of the source and target domain data, we divide the datasets randomly as follows: 90% of the KDDCUP99 dataset is used for the training set, and 10% is used for the test set. Moreover, 80% of the gas pipeline dataset is used for the training set, and 20% is used for the test set.
We use PyTorch to construct the Resnet8 model, multiply the cross-entropy loss function by 1.5–5 and use the result as a loss indicator, and apply the stochastic gradient optimizer. The learning rate of the source domain training is set to 0.001, the batch size is set to 128, and the duration is set to 4 epochs. The learning rate of the target domain is set to 0.0003, the batch size is set to 64, and the duration is set to 5 epochs.
The experimental procedure is as follows.
(1) Divide each dataset into a training set and a test set by percentage. The test set of KDDCUP99 dataset accounts for 10%, and the test set of gas pipeline dataset accounts for 20%.
(2) Visualize the new data samples obtained.
(3) Use KDDCUP99 to pretrain the Resnet8 model. Then, save the model and model parameters after testing the model performance.
(4) Load the pretrained model and model parameters, use the gas pipeline dataset to fine-tune the last three layers of the neural network of Resnet8, and obtain the model test indicators.
The experiments were performed on a computer with an i7-8550U CPU processor, 1.8 GHz frequency, and 8 GB RAM.
5.3 Experimental results and analysis
This section introduces the results of the pretraining model, describes the effects of fine-tuning different numbers of layers, and discusses the effects of model fine-tuning and random initialization parameter training with the target domain. After obtaining the results, we explain the benefits of using transfer learning and why the three-layer method of fine-tuning is used. We also demonstrate the superiority of the proposed algorithm by comparing the results with those of other existing algorithms.
5.3.1 Data preprocessing and visualization
By preprocessing each set of traffic data, every data stream in the target domain is processed into 324 pieces of data and every data stream in the source domain is processed into 41 pieces of data. Figure 6 shows the results of data visualization. Each sample in the source domain is processed into a grayscale image of 7 pixel × 7 pixel by a 6-bit 0 supplement, and each sample in the target domain is processed into a grayscale image of 18 pixel × 18 pixel. Processing samples in this form to the residual CNN is clearly feasible.
Fig. 6 Data visualization results. (a) Normal-type samples in the source domain; (b) Attack-type samples in the source domain.
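The standardization of Eq. (1) and the mapping of a feature vector to a square grayscale image can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' preprocessing code: the function names are hypothetical, the min-max scaling to the 0 to 255 range is an assumed choice, and unused pixels are simply filled with zeros up to the image size.

```python
import math
from typing import List


def standardize_column(values: List[float]) -> List[float]:
    """Z-score one feature column: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def to_gray_image(features: List[float], side: int) -> List[List[int]]:
    """Scale features to 0..255, zero-pad to side*side pixels, and reshape row by row."""
    if len(features) > side * side:
        raise ValueError("too many features for the requested image size")
    lo, hi = min(features), max(features)
    span = hi - lo
    # Assumed min-max scaling so that every pixel fits an 8-bit gray value.
    pixels = [0 if span == 0 else round(255 * (f - lo) / span) for f in features]
    pixels += [0] * (side * side - len(pixels))  # zero supplement
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


if __name__ == "__main__":
    # A 41-dimensional sample (source domain) becomes a 7 x 7 image.
    source_image = to_gray_image([float(i) for i in range(41)], 7)
    # A 324-value sample (target domain) becomes an 18 x 18 image.
    target_image = to_gray_image([float(i % 50) for i in range(324)], 18)
    print(len(source_image), len(target_image))
```

Each feature occupies one pixel, so the spatial layout of the image is determined only by the feature order; any reordering of features would change the image seen by the CNN.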
5.3.2 Model pretraining
Figure 7 shows the change curve of the evaluation index on the training and test sets during model pretraining. The graph shows that the loss and FPR of the source domain continuously decrease during model pretraining until the values stabilize. F1-score, recall, precision, and accuracy steadily rise until values of 100% are
[Figure residue: Indicator versus Epoch quantity; curves: Accuracy, Precision, Recall, F1-score; panel (b) Loss and FPR. Surviving caption fragment: "of fine-tuning and deep learning observed during pretraining process".]
explosion or gradient disappearance. The model can provide reliable predictions for unknown or differently distributed abnormal data through short-term training by transfer learning. Compared with other anomaly detection algorithms, the algorithm proposed in this paper achieves superior evaluation indicators. The proposed method not only solves the training-time problem of deep learning models through transfer learning, but also meets the requirements of ICSs in terms of evaluation indicators.
At present, the model we constructed solves the binary classification problem, but a refined classification of abnormal traffic data is still desirable. In future work, we will perform multiclass classification of abnormal traffic data, track the characteristics of different abnormal data types, and then reliably classify them to further ensure network security in ICSs.
Acknowledgment
This work was supported in part by the 2018 industrial Internet innovation and development project “Construction of Industrial Internet Security Standard System and Test and Verification Environment”, in part by the National Industrial Internet Security Public Service Platform, in part by the Fundamental Research Funds for the Central Universities (Nos. FRF-BD-19-012A and FRF-TP-19-005A3), in part by the National Natural Science Foundation of China (Nos. 81961138010, U1736117, and U1836106), and in part by the Technological Innovation Foundation of Shunde Graduate School, University of Science and Technology Beijing (No. BK19BF006).
Yongzhen Guo received the master's degree in control theory and control engineering from Tianjin University, Tianjin, China in 2010. He is now a PhD candidate at the School of Automation, Beijing Institute of Technology (BIT). He is also the general manager of the Industrial Control System Evaluation and Certification Department of China Software Testing Center. He received the National Science and Technology Major Projects and National Key Research and Development Programs. His research interests include security and cryptography, safety and reliability, and system evaluation and certification. As a member of SAC/TC124/SC10, SAC/TC196, ISO/TC 199/G8, and IEC/TC65/SC65C/WG18, he participates in setting and revising a number of international and national standards.
Chunyang Wang received the BS degree from Shandong Agricultural University, China in 2019. He is currently a master's student at the University of Science and Technology Beijing. His current research interests include auto-driving vehicle formation control, brain-like computing, intelligent control, machine learning, and anomaly detection.