CSSE - Anomaly Detection in ICS Datasets With Machine Learning Algorithms
DOI:10.32604/csse.2021.014384
Article
1
International Islamic University Malaysia, Jalan Gombak, 53100, Malaysia
2
Sunway University, Selangor, 47500, Malaysia
Corresponding Author: Mohamed Hadi Habaebi. Email: [email protected]
Received: 17 September 2020; Accepted: 14 December 2020
1 Introduction
A control system is defined as the hardware and software components of an Industrial Automation and
Control System (IACS). The key components of an industrial control system (ICS) include Supervisory
Control and Data Acquisition (SCADA), Human Machine Interface (HMI), Programmable Logic
Controllers (PLC), Remote Terminal Units (RTU), and Distributed Control Systems (DCS). A SCADA
system collects data from field sensors, which enables operators to control the system through
human-machine interface (HMI) software.
Cybersecurity solutions for Information Technology (IT) are well established, but less work
has been done on cybersecurity for operational technology (OT) (Sentryo, 2019). In the IT sector, the
confidentiality of information has the highest priority, whereas in OT-ICS the highest priority is the
availability of information. OT systems lack a cybersecurity culture, and with increased digitization,
more cyber-attacks surface. IT risks are sources of fraud, financial losses, privacy breaches, and data
leaks, whereas OT risks are sources of health, safety, and environmental casualties. At present, OT
networks have an inconsistent deployment of security policies and standards, whereas IT networks have
strong security policies [1]. Applications and protocols in the OT domain are customized for SCADA, HMI,
and DCS, whereas in the IT domain they are already standardized for email, internet, video, etc.
This work is licensed under a Creative Commons Attribution 4.0 International License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original
work is properly cited.
CSSE, 2021, vol.37, no.1
Intrusion detection systems (IDS) have proved to be a reliable security mechanism for anomaly detection in
traditional IT. An IDS inspects all inbound and outbound network traffic for security breaches and checks
the traffic against known signatures. A signature-based IDS raises an alarm when traffic matches a known
attack signature, whereas an anomaly-based IDS raises an alarm when traffic does not match the normal
baseline. A network-based IDS (NIDS) scans entire networks and detects malicious traffic activity, whereas
a host-based IDS (HIDS) scans a specific host and monitors each system event.
Intrusion detection systems can work conjointly with IT security systems, but unfortunately, IT systems
do not meet industrial requirements. Meanwhile, ICS cyber threats against industrial automation
applications are growing at an alarming rate. Continuity of service with safe operation is of great
importance, since many ICSs are in a position where a failure can result in a threat to human lives,
environmental safety, or production output.
Some of the main challenges faced by OT ICS are [2]: the lack of asset visibility for brownfield control
systems; ongoing modifications and upgrades in process plants; multiple Original Equipment
Manufacturers (OEM) in a single plant operation using different communication protocols; ICS vendors
that are not familiar with IT cybersecurity protocols or technology; and a shortage of experienced
cybersecurity personnel with hands-on experience of ICS devices.
Furthermore, due to financial constraints, many universities find it difficult to build their own OT ICS
Cyber Range lab facilities dedicated to industrial use-case scenarios for research activities.
Currently, many researchers use publicly available ICS datasets to analyze detection techniques
with machine learning algorithms, as industrial entities are reluctant to disclose operational
datasets to the public due to the sensitivity and criticality of industrial assets.
In recent years, cyber-attacks on industrial control systems have increased many-fold due to the
digitization of the industrial sector. Notable recent industrial control system cyber-attack incidents
include the Stuxnet attack on an Iranian nuclear facility, the Duqu and Flame attacks on an Iranian
offshore facility, the Havex remote access trojan, the Shamoon attack on Saudi Aramco, the Petya
ransomware attack in India, and the Triton-Triconex Safety Instrumented System attack on Saudi Aramco [2].
Little research has been carried out to identify the advantages of using machine learning in ICS SCADA
systems with real network traffic data from testbed simulation and its behavior analysis for anomaly
detection. The architecture of a typical modern SCADA reference model, shown in Fig. 1, is organized in
layers [3].
The root causes of cyber vulnerabilities in ICS SCADA systems include poorly secured legacy
systems, delayed patching of software vulnerabilities, lack of cybersecurity situational awareness,
remote access for maintenance, large deployment areas, a distributed operating mode, growing
interconnectivity, and the lack of built-in security in SCADA protocols.
The contribution of this paper is to highlight machine learning techniques for attack detection with a
public SCADA dataset and to introduce data profiling with flow-based behavior analysis using
packet inspection of network traffic data. The dataset is processed and profiled, then modeled with
anomaly-based machine learning algorithms to predict abnormal traffic for intrusion detection
in ICS systems.
The rest of the paper is organized as follows. Section 2 reviews the literature on machine
learning techniques for ICS SCADA intrusion detection systems and the different types of public SCADA
datasets. It also explains the steps of ML-based SCADA IDS and the performance metrics used to
evaluate the algorithms. Section 3 provides an ML analysis of the public dataset, with a performance
comparison and validation. Section 4 presents the network traffic analysis by data profiling, in which
the baseline is determined from the traffic flow activities of the network with packet inspection. The
feature-extracted, processed dataset is modeled to classify abnormality in the network traffic data, and
different machine learning algorithms are compared. Section 5 concludes the paper and outlines future
work on behavior analysis with multiple port-based protocols and multiple anomaly criteria with hybrid
machine learning algorithms.
2 Related Work
This section discusses machine learning applications, which are widely deployed across
industries owing to advances in computing power, data collection, and storage
capabilities. An intrusion detection system (IDS) integrated with machine learning (supervised and
unsupervised techniques) can improve the detection rate of attacks on SCADA systems [4].
Machine learning algorithms are widely implemented in intrusion detection systems (IDS) to
overcome the issue of high false positives in prediction. Machine learning techniques, both
supervised and unsupervised, use statistical methods to learn, classify, and predict outcomes;
they are summarized in Tabs. 1 and 2 [5].
Supervised methods (classification, regression) require a pre-labeled dataset, whereas
unsupervised methods (dimensionality reduction, clustering) do not need pre-labeled data for analysis.
Clustering is mainly applied for forensic analysis, regression for predicting network packet parameters
and comparing them with the normal ones, and classification for identifying different classes of
network attacks such as scanning and spoofing [6]. An anomaly detection method for deception attacks in
industrial control systems has been introduced by investigating the behavior of normal, attack-free
activities [7]. Existing intrusion detection systems that use artificial neural networks (ANN) to detect
malicious network activity on different datasets have been reviewed [8]. A novel IoT-focused dataset
combining network and power features and attacks has been introduced; it uses the WEKA application to
train, test, and cross-validate classification with the Naive Bayes (NB), support vector machine (SVM),
multilayer perceptron (MLP), Random Forest (RF), and ZeroR classifiers [9].
Furthermore, a cyber-physical ICS testbed can provide a hands-on simulation platform with real-time network
data and various types of cyber-attacks, serving as a security evaluation and testing environment for
research purposes.
In the data cleaning and mining stage, missing values in the SCADA dataset are corrected, and the dataset
is split randomly into training and test sets. Data normalization follows, in which improper feature
values are replaced with the mean and the features are normalized. Features are then extracted,
and a model is built with machine learning algorithms to detect anomalies in the dataset.
Finally, IDS performance can be analyzed with metrics such as accuracy, precision, sensitivity
(recall), the receiver operating characteristic (ROC) curve, and F-score.
Alerts from common IDS tools, such as Snort and Bro, can be classified as shown in Tab. 3: True Negative
(TN) for no attack and no alert, False Positive (FP) for no attack but an alert, False Negative (FN) for
an attack but no alert, and True Positive (TP) for an attack and an alert.
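The metrics listed above follow directly from the four alert categories. The sketch below is illustrative only: the function name `ids_metrics` and the counts passed to it are hypothetical and are not taken from the paper's tables.

```python
def ids_metrics(tp, tn, fp, fn):
    """Derive common IDS metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    f_score = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "false_positive_rate": false_positive_rate,
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = ids_metrics(tp=90, tn=880, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The false positive rate is included because it is one axis of the ROC curve; recall is the other.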
The dataset is randomly split into an 80% training set for modeling and a 20% test set for ML
algorithm evaluation. The dataset includes 274,628 instances, giving a training set of 219,702 and a
test set of 54,926 observation instances.
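The split sizes quoted above can be reproduced with a simple random partition of row indices. This is a minimal sketch using only the standard library; the helper name and the seed are assumptions, and the original analysis may have used a different tool for the split.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=42):
    """Shuffle row indices and cut them into training and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seed is arbitrary (assumption)
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(274628)
print(len(train_idx), len(test_idx))  # 219702 54926
```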
S. N      1    2    3    4     5     6     7    8    9    10
Features  A    F    L    P     PID Parameters (5 to 9)    S
Value     4    16   90   115   0.2   0.5   1    0    0    0

S. N      11   12   13   14    15    16      17                18   19   20
Features  C    P    S    P     C     CRC     Time Stamp        Output Label (18 to 20)
Value     1    0    0    ?     1     17219   141862164.995592  1    1    1
A logistic regression function is fitted on the training set, and the probability for binary
classification is predicted on the test set to detect the normal/attack status. The logistic regression
(LR) confusion matrix is evaluated on the RStudio platform for the test dataset and gives a
prediction accuracy of 99.99% over 54,926 observation instances, as shown in Tab. 5.
The data analysis with the KNN ML technique, shown in Tab. 6, gives an accuracy of
83.72% over 7,500 observation instances.
The following five feature attributes are extracted, filtered, and ranked using the information gain
attribute selection method, as shown in Tab. 7.
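Information gain ranking scores each attribute by how much knowing its value reduces the entropy of the class label. The sketch below illustrates the idea for discrete attributes; the attribute names and the toy rows are hypothetical, and continuous attributes would need to be discretized first.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = {name: information_gain(values, labels)
             for name, values in columns.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): feature_a separates the classes, feature_b does not.
labels = [1, 1, 0, 0]
columns = {"feature_a": [1, 1, 0, 0], "feature_b": [1, 0, 1, 0]}
ranking = rank_features(columns, labels)
print(ranking)
```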
The ML performance metrics of the hybrid classifier algorithm, with 25,000 and 274,628 instances and
5-fold cross-validation on the five attributes, are evaluated in Tab. 8.
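In 5-fold cross-validation, the instances are divided into five folds, and each fold serves once as the test set while the remaining four are used for training. A minimal index-level sketch follows; the helper name is an assumption, and in practice the rows are usually shuffled before folding.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(25000, k=5)
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

The reported metric is then the average of the per-fold scores.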
The dataset has 86,799 instances with 6 data columns, without any null values or ‘NA’ character
values, as highlighted in Fig. 5.
Traffic flows are port-wise sets of packets, and different protocols such as TCP, UDP, and ICMP have
different flow properties. The Transmission Control Protocol (TCP) operates at the transport layer of
the OSI model and serves services/applications with dedicated destination ports: HTTP (80), FTP (20, 21),
Telnet (23), SMTP (25), DNS (53), HTTPS (443), Modbus (502), and ISO-on-TCP (RFC 1006), used by the
Siemens S7 Communication (S7comm) protocol (102). The User Datagram Protocol (UDP) serves SNMP (161)
and Syslog (514) [18].
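For port-wise profiling, the port assignments listed above can be kept in a small lookup table. The sketch below encodes only the ports named in the text; the dictionary and function names are hypothetical.

```python
# Destination ports named in the text, grouped by transport protocol.
TCP_SERVICES = {
    80: "HTTP", 20: "FTP", 21: "FTP", 23: "Telnet", 25: "SMTP",
    53: "DNS", 443: "HTTPS", 502: "Modbus", 102: "S7comm",
}
UDP_SERVICES = {161: "SNMP", 514: "Syslog"}


def service_for(protocol, dst_port):
    """Map a (protocol, destination port) pair to a service name."""
    tables = {"TCP": TCP_SERVICES, "UDP": UDP_SERVICES}
    return tables.get(protocol.upper(), {}).get(dst_port, "unknown")


print(service_for("TCP", 502))  # Modbus
print(service_for("udp", 161))  # SNMP
```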
Pre-processing is based on the communication protocol: packets with protocol “TCP” are assigned the
discrete value 1 for further analysis and classification. The dataset is then post-processed using the
column named Info, whose contents are extracted and assigned to separate columns for prediction and
classification analysis. This profiling is done with Python functions on the Spyder platform.
Feature extraction, a critical part of the data analysis, is performed by splitting the Info column of
the network traffic data. Each relevant feature in the Info column (source port, destination port, Ack,
Seq, Len Packet, and Window) is extracted by filtering out unwanted characters. The resulting features,
which are vital for training and testing, are shown in Tab. 12.
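One possible way to split an Info string into the features named above is with regular expressions. This is a sketch under assumptions: the function name and the output keys are hypothetical, the Info text is assumed to follow the `source > destination ... Seq= Ack= Win= Len=` layout quoted in this section, and the original Spyder functions are not shown in the paper.

```python
import re

# "43624 > 80" style port pair; some captures use an arrow instead of ">".
PORTS_RE = re.compile(r"(\d+)\s*(?:>|→)\s*(\d+)")
# "Seq=1", "Ack=310012", "Win=65535", "Len=0" style fields.
FIELD_RE = re.compile(r"\b(Seq|Ack|Win|Len)=(\d+)")


def extract_features(info):
    """Pull ports and TCP counters out of one Info string."""
    features = {"src_port": None, "dst_port": None,
                "Seq": None, "Ack": None, "Win": None, "Len": None}
    ports = PORTS_RE.search(info)
    if ports:
        features["src_port"] = int(ports.group(1))
        features["dst_port"] = int(ports.group(2))
    for name, value in FIELD_RE.findall(info):
        features[name] = int(value)
    return features


sample = "[TCP Dup ACK 400#3] 43624 > 80 [ACK] Seq=1 Ack=310012 Win=65535 Len=0"
row = extract_features(sample)
print(row)
```

Fields that are absent from a given Info string are left as `None`, so missing values remain visible for the later cleaning step.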
The behavior analysis of traffic data is performed with packet inspection: the data flow is analyzed, and
the classification output is identified based on the baseline of the Info column data, as shown in
Fig. 6. In the normal scenario, an output of ‘1’ is assigned for classification, whereas in the anomaly
scenario, an output of ‘0’ is assigned based on the keywords “Dup ACK” and “Previous segment not captured.”
[TCP Dup ACK 400#3] 43624 > 80 [ACK] Seq=1 Ack=310012 Win=65535 Len=0
TCP Dup ACK 86798#1 32586 502 ACK Seq=1141 Ack=2186 Win=64768 Len=0
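The keyword rule described above can be expressed as a short labeling function. The sketch uses the two keywords and the 1/0 convention stated in the text; the function name and the second sample string are hypothetical.

```python
from collections import Counter

# Keywords that mark a packet as anomalous, as stated in the text.
ANOMALY_KEYWORDS = ("Dup ACK", "Previous segment not captured")


def label_packet(info):
    """Return 0 for an anomalous Info string and 1 for a normal one."""
    return 0 if any(keyword in info for keyword in ANOMALY_KEYWORDS) else 1


infos = [
    "[TCP Dup ACK 400#3] 43624 > 80 [ACK] Seq=1 Ack=310012 Win=65535 Len=0",
    "43624 > 80 [ACK] Seq=1 Ack=1 Win=65535 Len=0",  # hypothetical normal row
]
labels = [label_packet(info) for info in infos]
print(labels)           # [0, 1]
print(Counter(labels))  # class counts, useful for checking label balance
```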
The output has 44,814 counts for result value ‘0’, whereas result value ‘1’ has 41,220 counts. The
labeled network traffic dataset with its relevant features is used to train different machine
learning algorithms, and the result classification is predicted with each model.
The logistic regression model is evaluated, and the predicted outputs, represented as probabilities,
are converted to discrete ‘0’ and ‘1’ values. The training accuracy is 65.23%, whereas the
maximum test accuracy is 65.15%.
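The conversion from a probability to a discrete class can be sketched as follows. The weights, bias, and the 0.5 threshold are assumptions for illustration; the fitted coefficients of the paper's model are not given.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return label, probability


# Hypothetical coefficients, not taken from the fitted model.
label, probability = predict_label([1.0, 2.0], weights=[0.5, -1.0], bias=0.0)
print(label, round(probability, 4))
```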
The K-Nearest Neighbor (KNN) model is applied with the nearest neighbors algorithm. The K-value, the
threshold point at which the train/test accuracy starts to decrease, is determined for the training
and test datasets, with K incremented over odd values. The training dataset has a K-value
of 9 (the 5th element), while the test dataset has a K-value of 7 (the 4th element), as identified in
Fig. 9. The training accuracy is 71.43%, whereas the maximum test accuracy is 69.75%.
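The search over odd K-values can be sketched with a small pure-Python KNN classifier. The toy points below are hypothetical, and the helper names are assumptions; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify one point by majority vote among its k nearest neighbors."""
    distances = sorted((math.dist(x, query), y)
                       for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_X, train_y, X, y, k):
    """Fraction of points in X whose predicted label matches y."""
    hits = sum(knn_predict(train_X, train_y, q, k) == t for q, t in zip(X, y))
    return hits / len(y)


def sweep_odd_k(train_X, train_y, test_X, test_y, max_k=9):
    """Evaluate accuracy for K = 1, 3, 5, ... up to max_k."""
    return {k: knn_accuracy(train_X, train_y, test_X, test_y, k)
            for k in range(1, max_k + 1, 2)}


# Toy data (hypothetical): two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(0.5, 0.5), (5.5, 5.5)]
test_y = [0, 1]
results = sweep_odd_k(train_X, train_y, test_X, test_y, max_k=5)
print(results)
```

Odd values of K avoid tied votes in a two-class problem, which is one reason for the odd increment.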
The Naïve Bayes model has two options depending on the independent variables. The Gaussian method is
more accurate for continuous variables, whereas the Multinomial method is more accurate for categorical
(discrete) independent variables. The training accuracy is 61.69%, whereas the maximum test accuracy is
61.30%.
The decision tree model has two options: the entropy method, in which the root node is identified by
information gain, and the Gini method, which measures impurity. The decision tree with the
entropy method exhibits the highest training accuracy, at 96.18%, whereas the test accuracy [...].
The Random Forest model averages multiple decision trees and provides better accuracy. The training
accuracy is 61.69%, whereas the maximum test accuracy is 61.30%.
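The two decision tree criteria mentioned above, and the way a forest combines its trees, can be illustrated in a few lines. The function names are hypothetical. Majority voting is shown as one common way to combine tree outputs for classification, which is an assumption about the aggregation used here.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a balanced binary node."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a balanced binary node."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def split_impurity(left, right, criterion):
    """Size-weighted impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * criterion(left) + len(right) / n * criterion(right)


def forest_vote(tree_predictions):
    """Combine per-tree class predictions by majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]


print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # 0.5 1.0
print(split_impurity([0, 0], [1, 1], gini))
print(forest_vote([1, 0, 1]))                     # 1
```

A tree chooses the split with the lowest weighted child impurity under whichever criterion is selected.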
The Artificial Neural Network (ANN) is a hybrid model with a black-box technique, where each layer can
have 100 networks. The training accuracy is 65.94%, whereas the maximum test accuracy is 65.45%.
The Support Vector Machine (SVM) uses a hyperplane (linear boundary) method and supports different
kernel types (rbf, poly, sigmoid), which can reduce overfitting.
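The three kernel types named above have standard closed forms, sketched below in plain Python. The parameter names (`gamma`, `degree`, `coef0`) and their default values are assumptions for illustration; the settings used in the paper's experiments are not stated.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def poly_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial: (gamma * <x, y> + coef0) ** degree."""
    return (gamma * dot(x, y) + coef0) ** degree


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """Sigmoid: tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * dot(x, y) + coef0)


a, b = (1.0, 2.0), (2.0, 0.0)
print(rbf_kernel(a, b), poly_kernel(a, b), sigmoid_kernel(a, b))
```

Each kernel replaces the inner product in the linear SVM, which lets the hyperplane act as a non-linear boundary in the original feature space.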
The performance of the machine learning algorithms is compared in Tab. 13.
References
[1] ISA, “Security for industrial automation and control systems, Part 3-3: System Security Requirements and Security
Levels,” 2013.
[2] D. McMillen, “Security attacks on industrial control systems,” 2016. [Online]. Available:
https://securityintelligence.com/attacks-targeting-industrial-control-systems-ics-up-110-percent/.
[3] L. Van, “Sequential detection and isolation of cyber-physical attacks on SCADA systems,” Ph.D. Thesis.
University of Technology of Troyes, 2015.
[4] M. Keshk, N. Moustafa, E. Sitnikova and G. Creech, “Privacy preservation intrusion detection technique for
SCADA systems,” in Military Communications and Information Systems Conf. (MilCIS), Canberra, Australia,
pp. 1–6, 2017.
[5] L. A. Maglaras and J. Jiang, “Intrusion detection in SCADA systems using machine learning techniques,” Ph.D.
Thesis. University of Huddersfield, UK, 2018.
[6] Q. Qassim, “An anomaly detection technique for deception attacks in industrial control systems,” in IEEE 5th Intl.
Conf. on Big Data Security on Cloud, Washington, DC, USA, pp. 267–272, 2019.
[7] R. L. Perez, F. Adamsky, S. Ridha and E. Thomas, “Machine learning for reliable network attack detection in
scada systems,” in 17th IEEE Int. Conf. on Trust, Security and Privacy in Computing and Communications
(IEEE TrustCom-18). New York, USA, 2018.
[8] I. Solomon, A. Jatain and B. Shalini, “Neural network-based intrusion detection: State of the art,” in
Int. Conf. on Sustainable Computing in Science, Technology and Management (SUSCOM-2019),
Amity University Rajasthan, India, 2019.
[9] J. Foley, N. Moradpoor and H. Ochen, “Employing a machine learning approach to detect combined internet of
things attacks against two objective functions using a novel dataset,” Security and Communication Networks, vol.
2020, no. 2, pp. 1–17, 2020.
[10] A. Almalawi, X. Yu, Z. Tari, A. Fahad and I. Khalil, “An unsupervised anomaly-based detection approach for
integrity attacks on scada systems,” Computers & Security, vol. 46, pp. 94–110, 2014.
[11] N. Moradpoor and A. Hall, “Insider threat detection using supervised machine learning algorithms on an extremely
imbalanced dataset,” International Journal of Cyber Warfare and Terrorism, vol. 10, no. 2, pp. 1–26, 2020.
[12] M. A. Teixeira, T. Salman, M. Zolanvari, R. Jain, N. Meskin et al., “SCADA system testbed for cybersecurity
research using machine learning approach,” Future Internet, vol. 10, no. 76, pp. 1–15, 2018.
[13] A. Robles-Durazno, N. Moradpoor, J. McWhinnie and G. Russell, “A supervised energy monitoring-based
machine learning approach for anomaly detection in a clean water supply system,” in Proceedings of the IEEE
Int. Conf. on Cyber Security and Protection of Digital Services, Glasgow, UK, 2018.
[14] I. Turnipseed, “A new SCADA dataset for intrusion detection research,” Ph.D. Thesis. Mississippi State
University, USA, 2015.
[15] A. Polyakov, “Machine learning for cybersecurity 101,” Dzone, AI Zone, 2018. [Online]. Available:
https://dzone.com/articles/machine-learning-for-cybersecurity-101.
[16] R. Vinayakumar, “Deep learning approach for intelligent intrusion detection system,” IEEE Access, vol. 7, pp.
41525–41550, 2019.
[17] P. Singh, “Machine learning with PySpark,” Apress, Springer Nature Publishing Co., NY, USA, 2018.
[18] A. Almehmadi, “SCADA networks anomaly-based intrusion detection system,” in SIN’18: Proceedings of the
11th Int. Conf. on Security of Information and Networks, Cardiff, UK, pp. 1–4, 2018.
[19] R. Malaiya, D. Kwon, C. Suh, H. Kim and J. Kim, “An empirical evaluation of deep learning for network anomaly
detection,” IEEE Access, vol. 7, pp. 140806–140817, 2019.
[20] S. D. Anton, L. Ahrens, D. Fraunholz and H. D. Schotten, “Time is of the essence: Machine learning-based
intrusion detection in industrial time series data,” IEEE Int. Conf. on Data Mining Workshops (ICDMW), pp.
1–6, 2018.