
Journal of Network and Computer Applications 198 (2022) 103294

Contents lists available at ScienceDirect

Journal of Network and Computer Applications


journal homepage: www.elsevier.com/locate/jnca

On the scalability of Big Data Cyber Security Analytics systems


Faheem Ullah *, M. Ali Babar
CREST – Centre for Research on Engineering Software Technologies, The University of Adelaide, Australia

Keywords: Big data; Cyber security; Adaptation; Scalability; Configuration parameter; Spark

ABSTRACT

Big Data Cyber Security Analytics (BDCA) systems use big data technologies (e.g., Apache Spark) to collect, store, and analyse a large volume of security event data for detecting cyber-attacks. The volume of digital data in general, and of security event data in particular, is increasing exponentially. The velocity with which security event data is generated and fed into a BDCA system is unpredictable. Therefore, a BDCA system should be highly scalable to deal with the unpredictable increase/decrease in the velocity of security event data. However, there has been little effort to investigate the scalability of BDCA systems in order to identify and exploit the sources of scalability improvement. In this paper, we first investigate the scalability of a Spark-based BDCA system with default Spark settings. We then identify Spark configuration parameters (e.g., executor memory) that can significantly impact the scalability of a BDCA system. Based on the identified parameters, we finally propose a parameter-driven adaptation approach, SCALER, for optimizing a system's scalability. We have conducted a set of experiments by implementing a Spark-based BDCA system on a large-scale OpenStack cluster, running the experiments with four security datasets. We have found that (i) a BDCA system with default settings of the Spark configuration parameters deviates from ideal scalability by 59.5%, (ii) 9 out of the 11 studied Spark configuration parameters significantly impact scalability, and (iii) SCALER improves the BDCA system's scalability by 20.8% compared to the scalability with the default Spark parameter settings. The findings of our study highlight the importance of exploring the parameter space of the underlying big data framework (e.g., Apache Spark) for scalable cyber security analytics.

1. Introduction

The volume and velocity of digital data are increasing enormously. The amount of digital data increased to 40 trillion gigabytes in 2020, which was merely 1.2 trillion gigabytes back in 2010 (Gontz and Riensel, 2012). Given that "data is the new oil of the digital economy", the proportion of data analysed has jumped from 0.5% in 2012 to 37% in 2019 (Gontz and Riensel, 2012; Economist, 2017). However, traditional software systems (e.g., relational databases and data warehouses) are unable to collect, store, and analyse such a large volume of data. Therefore, big data storage and processing technologies (e.g., Apache Hadoop, Apache Spark, and Cassandra) are increasingly leveraged in various fields to deal with the massive volume, velocity, and variety of data, as is evident from the fact that 97.2% of organizations are investing in big data (Oussous et al., 2018; Partners, 2019). For instance, Persico et al. (2018) compared two big data architectures (Lambda and Kappa) on the Microsoft Azure cloud platform for social network data analysis. In another study (Aceto et al., 2019), Aceto et al. leveraged big data technologies for implementing deep learning to classify encrypted mobile traffic. In the healthcare domain, several studies (e.g., Pramanik et al., 2020; Mehta and Pandit, 2018; Alexander and Wang, 2017) explored the incorporation of big data technologies for medical image analysis, genomic analysis, and the prediction of various diseases. Similarly, big data technologies are increasingly used in the oil and gas sector for analysing the large volume, velocity, and variety of data related to drilling, exploration, and production (Nguyen et al., 2020; Baaziz et al., 2014).

Similar to other domains such as bioinformatics (Greene et al., 2014) and healthcare (Groves et al., 2016), the role of big data technologies is on the rise in the cyber security domain too. The significance of big data technologies for cyber security was first highlighted by the Cloud Security Alliance (CSA) in 2013 (Allaince, 2013). A CSA report emphasizes the need to enable traditional cyber security systems (e.g., intrusion detection systems and malware detection systems) to deal with the massive volume, velocity, and variety of security event data such as NetFlow records, firewall logs, and packet data

* Corresponding author.
E-mail addresses: [email protected] (F. Ullah), [email protected] (M.A. Babar).

https://doi.org/10.1016/j.jnca.2021.103294
Received 12 January 2021; Received in revised form 18 November 2021; Accepted 23 November 2021
Available online 3 December 2021
1084-8045/© 2021 Elsevier Ltd. All rights reserved.

(Allaince, 2013). The merger of cyber security systems and big data technologies has given birth to a new breed of software system called the Big Data Cyber Security Analytics (BDCA) system, which is defined as "A system that leverages big data technologies for collecting, storing, and analyzing a large volume of security event data to protect organizational networks, computers, and data from unauthorized access, damage, or attack" (Ullah and Babar, 2019a). A recent study of BDCA systems indicates that 72% of organizations that employed big data technologies in their cyber security landscape reported significant improvement in their cyber agility (Obitade, 2019).

BDCA systems are primarily classified into two categories based on their attack detection capability – generic BDCA systems and specific BDCA systems (Ullah and Babar, 2019a). Generic BDCA systems (e.g., an intrusion detection system supported with big data technologies) aim to detect a variety of attacks such as SQL injection, cross-site scripting, and brute force. Specific BDCA systems (e.g., a phishing detection system built using big data technologies) focus on detecting a specific attack type such as phishing. The main characteristics of BDCA systems that distinguish them from traditional cyber security systems include (i) monitoring diverse assets of an enterprise such as data storage systems, computing machines, and end-user applications, (ii) integrating security data from multiple sources such as IDS, firewall, and anti-virus, (iii) analysing a large volume of security event data in near real-time, (iv) enabling deep and holistic security analytics for unfolding complex attacks such as Advanced Persistent Threats (APTs), and (v) analysing heterogeneous streams of security event data (Ullah and Babar, 2019a).

Like any software system, certain quality attributes (e.g., interoperability and reliability) are expected of a BDCA system. Ullah and Babar (2019a) reported the 12 most important quality attributes of a BDCA system, among which scalability is ranked third. Scalability is defined as "the system's ability to increase speed-up as the number of processors increase" (Sun and Rover, 1994). The rationale behind the need for a BDCA system to be highly scalable is twofold: (a) the volume of security event data is rapidly increasing, which requires a BDCA system to scale up (by adding more computational power) to process data without impacting the response time of the system (Lee and Lee, 2013; Cheng et al., 2016), and (b) the velocity of security event data generation fluctuates (Hong et al., 2015; Kumar and Hanumanthappa, 2013). For example, a BDCA system analysing the network traffic of a bank experiences a higher workload during working hours than during non-working hours. Therefore, a BDCA system should efficiently use commodity or third-party resources to scale up during working hours and scale down during non-working hours. In other words, a BDCA system should take maximum benefit from the additional resources.

Among the 74 studies on BDCA reviewed in (Ullah and Babar, 2019a), 40 highlight the importance of scalability for a BDCA system. However, none of these studies has investigated the factors that impact the scalability of a BDCA system or proposed any solutions for improving scalability. Several BDCA studies (e.g., Aljarah and Ludwig, 2013a; Zhao et al., 2015; Holtz et al., 2011) hint at factors, such as the machine learning algorithm employed in a system, the quality of the security event data, and the big data processing framework, that can potentially impact scalability. Among these factors, the most prominent is the underlying big data processing framework, such as Spark or Hadoop, which is an integral part of any BDCA system. One of the core features of any big data processing framework is its configuration parameters (e.g., executor memory) (Zaharia et al., 2016), which guide how the framework should process data. For example, executor memory specifies how much memory should be allocated to an executor process. The importance of parameter configuration for big data processing frameworks has been highlighted by several studies (e.g., Lee and Lee, 2013; Davidson and Or, 2013; Gounaris et al., 2017). However, none of the previous studies has investigated the impact of these parameters on the scalability of a big data system. Therefore, this paper aims to "investigate the impact of Spark configuration parameters on the scalability of a BDCA system and devise an approach for improving the scalability". Given that there exist several big data processing frameworks (e.g., Spark (Zaharia et al., 2016), Hadoop (2009), Storm (2011), Samza (2014), and Flink (ApacheFlink, 2011)), we investigate Spark as it is currently the most widely used framework in the BDCA domain. We have observed that 14 BDCA studies published in 2014 used Hadoop and only four used Spark, which changed to four studies using Hadoop and five studies using Spark in 2017 (Ullah and Babar, 2019a). A similar dominance of Spark over Hadoop is observed in industry (Spark, 2016). To achieve the aforementioned aim, this paper contributes to the state-of-the-art by answering the following Research Questions (RQs).

RQ1: How does a BDCA system scale with default Spark configuration settings?
RQ2: What is the impact of tuning Spark configuration parameters on the scalability of a BDCA system?
RQ3: How can the scalability of a BDCA system be improved?

To answer the three research questions, we developed an experimental infrastructure on a large-scale OpenStack cloud. We implemented a Spark-based BDCA system that ran on the OpenStack cloud in a fully distributed fashion. We used two evaluation metrics – the accuracy and the scalability of a BDCA system. For measuring accuracy, we leveraged commonly used measures such as F1 score, precision, and recall. For measuring scalability, we used the scalability scoring measure reported in Section 2.4.2. We used four security datasets (i.e., KDD (KDD, 1999), DARPA (MIT, 1998), CIDDS (Ring et al., 2017), and CICIDS2017 (Sharafaldin et al., 2018)) in our experimentation and evaluated the BDCA system with four learning algorithms (i.e., Naïve Bayes, Random Forest, Support Vector Machine, and Multilayer Perceptron) that are employed in the system for classifying security data into benign and malicious categories. Based on our comprehensive experimentation, we have found that:

(i) A BDCA system with default Spark configuration parameters does not scale ideally. The deviation from ideal scalability is around 59.5%, meaning the system draws only 40.5% of the potential benefit from additional resources.
(ii) Among the 11 investigated Spark parameters, changing the value of nine significantly impacts a BDCA system's scalability. The optimal value of a parameter (with respect to scalability) varies from dataset to dataset.
(iii) We proposed and evaluated a parameter-driven adaptation approach, SCALER, that automatically selects the most suitable value for each parameter at runtime. The evaluation results show that, on average, SCALER improves a BDCA system's scalability by 20.8%.

The rest of this paper is structured as follows. Section 2 reports the security datasets, our BDCA system, the instrumentation setup, and the evaluation metrics. Our adaptation approach is presented in Section 3. Section 4 presents the detailed findings of our study with respect to the three research questions. Section 5 presents our reflections on the findings. Section 6 positions the novelty of our work with respect to the related work. Finally, Section 7 concludes the paper by highlighting the implications of our study for practitioners and researchers.

2. Research methodology

This section describes the datasets, our BDCA system, the instrumentation setup, and the evaluation metrics.

2.1. Security datasets

In order to answer the three research questions (Section 1), we used four security datasets: KDD (KDD, 1999), DARPA (MIT, 1998), CIDDS (Ring et al., 2017), and CICIDS2017 (Sharafaldin et al., 2018). These datasets are briefly described in the following, with their details presented in Table 1. We selected these four datasets as they vary from each other in terms of attack types, number of training and testing instances, dataset size, publication dates, redundancy, and number of features (e.g., source IP, source port, and payload). These characteristics of the selected datasets are expected to provide rigour and

Table 1
Number of features and training and testing instances in each dataset.

Dataset      Number of Features   No. of Instances in Training Dataset   No. of Instances in Testing Dataset
KDD          41                   494,022                                292,300
DARPA        6                    2,723,496                              1,522,310
CIDDS        9                    5,634,347                              2,788,463
CICIDS2017   77                   1,311,822                              445,061

generalization to our findings. It is important to note that we used the whole of these datasets, instead of a small sample of each dataset, in our experiments.

KDD: The KDD dataset contains 494,022 records as training data and 292,300 records as testing data. Each record represents a network connection, consisting of 41 features. Each record is labelled as belonging either to the normal class or to one of four attack classes, i.e., Denial of Service, Probing, Remote to Local, and User to Root. The testing data includes attack types that are not present in the training data, which makes the evaluation more realistic. More details on the dataset are available in (KDD, 1999).

DARPA: Similar to KDD, the records in this dataset are divided into training and testing subsets. The training data consists of 2,723,496 records, while the testing data consists of 1,522,310 records. Each record represents a network connection, consisting of six features. Each record is labelled as 0 or 1, where 0 specifies a normal connection and 1 specifies an attack. The attack types present in DARPA are the same as in KDD. More details on the DARPA dataset are available in (MIT, 1998).

CIDDS: This dataset has been developed recently, as KDD and DARPA are relatively old datasets. The CIDDS dataset consists of four weeks of NetFlow data directed towards two servers, i.e., an OpenStack server and an external server. The training dataset contains 5,634,347 records and the testing dataset contains 2,788,463 records. Each record represents a network connection, consisting of nine features. The dataset contains four types of attacks: pingScan, portScan, bruteForce, and DoS. More details on the dataset are available in (Ring et al., 2017).

CICIDS2017: This is also a recently developed dataset, which contains a variety of state-of-the-art attacks. The dataset consists of five days of network traffic directed towards a network consisting of three servers, a firewall, a switch, and 10 PCs. The training dataset consists of 1,311,822 records and the testing dataset consists of 445,061 records. Each record consists of 77 features. This dataset contains seven types of attacks: bruteForce, heartBleed, botNet, DoS, Distributed DoS, webAttack, and infiltration (Sharafaldin et al., 2018).

2.2. Our BDCA system

An overview of our BDCA system is depicted in Fig. 1. The system consists of three layers – the Security Analytics Layer, the Big Data Support Layer, and the Adaptation Layer. In the following, we describe the Security Analytics Layer and the Big Data Support Layer, while the details of the Adaptation Layer are presented in Section 3.

Fig. 1. An overview of our BDCA system.

2.2.1. Security Analytics Layer
This layer processes the security event data for detecting cyber-attacks. The layer consists of three phases (i.e., data engineering, feature engineering, and data processing), which are described below.

Data Engineering: This phase pre-processes the data to handle


missing values and remove incorrect values and outliers (Ullah and Babar, 2019a). A negative value indicates that the number of features for an instance is incomplete; hence, the instance is removed. Incorrect values (e.g., standard deviation = −1) specify data points in the dataset that are unacceptable for the Machine Learning (ML) model employed in the system for the classification of security event data into normal and attack classes. Therefore, we use the filter method of DataFrame, available in the Spark package org.apache.spark.sql (Spark, 2014a), to remove the incorrect values. The existence of outliers in the training dataset affects the accuracy of the machine learning model (Batista et al., 2004); we therefore removed the values that were larger than Double.MaxValue. CICIDS2017 has missing values; we removed the instances with missing values by simply checking whether the value of the last feature is negative.

Feature Engineering: This phase generates new features and/or transforms the values of features into a new range (Ullah and Babar, 2019a). For all four datasets, we assembled the features to transform multiple columns of features into one column of feature vectors for fitting the ML model. We used the VectorAssembler method in org.apache.spark.ml.feature to implement the assembling of the features. Since some algorithms (e.g., Naïve Bayes) in the Spark ML library cannot handle non-numeric features, we used StringIndexer (from org.apache.spark.ml.feature) to transform the label features (i.e., normal and attack) in the KDD dataset from strings to indices. Given the relatively smaller number of features in the DARPA dataset, we expanded its features to a polynomial space, using the PolynomialExpansion method in org.apache.spark.ml.feature.

Data Processing: This phase leverages an ML/DL algorithm to classify the instances in the security data as either normal or attack. In our system, we separately used four ML/DL algorithms – Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Multilayer Perceptron (MLP) – for classifying the instances. These four algorithms have been selected based on (i) their widespread use in the BDCA domain (Ullah and Babar, 2019a), (ii) their popularity in Kaggle competitions, and (iii) their availability in the Spark ML library and DeepLearning4j (Johnsirani Venkatesan et al., 2019). We used the Spark package org.apache.spark.ml.classification for implementing the ML algorithms. For cross-validation of the ML models, we used the CrossValidator method available in org.apache.spark.ml.tuning.

2.2.2. Big Data Support Layer
This layer manages the distributed storage and processing of data on multiple computing nodes. The layer consists of a big data processing framework (i.e., Spark) and big data storage (i.e., HDFS). Apache Spark is an open-source big data processing framework that uses in-memory primitives to process a large amount of data. Spark is well suited to ML tasks, whose iterative processing fits the Spark architecture (Spark, 2014a). Moreover, Spark is not only much faster than Hadoop, but is also compatible with multiple file systems such as HDFS, MongoDB, and Cassandra. The Hadoop Distributed File System (HDFS) is a data storage system that enables the distributed storage of a massive amount of data (Shvachko et al., 2010). By default, HDFS replicates each block of data on three nodes, which makes it quite fault-tolerant.

2.3. Instrumentation setup

We configured Spark and Hadoop (for HDFS) on an OpenStack cluster consisting of 10 computing nodes. Each node is installed with the Ubuntu 16.04 Xenial Xerus operating system and runs Spark 2.4.0, Hadoop 2.9.2, and JDK 1.8. The 10 computing nodes are divided into master and slave nodes. There is one master node with the m1.large flavour (8 GB RAM, 80 GB hard disk, and 8 virtual CPUs) and nine worker nodes with the m1.small flavour (2 GB RAM, 10 GB hard disk, and one virtual CPU). Each node in the cluster has a floating IP for communicating with the external world and an internal IP for communicating with the other nodes in the cluster. To associate floating IPs with internal IPs, a router is created as the bridge between the external network (floating IPs) and the subnets (internal IPs). We used the Scala programming language for the various implementations on Spark.

2.4. Evaluation metrics

In this study, we assess two qualities of our BDCA system – accuracy and scalability. Accuracy measures how accurately a BDCA system classifies the instances in the datasets into normal and attack categories. Scalability measures to what extent our system takes advantage of additional hardware resources added to the system in the form of computing nodes.

2.4.1. Measuring accuracy
For assessing accuracy, we used five evaluation metrics that are commonly used in the BDCA domain (Ullah and Babar, 2019a): False Positive Rate, F-score, Recall, Accuracy, and Precision. Table 2 provides the definition and a brief description of each metric.

Table 2
Evaluation metrics for assessing accuracy and their descriptions. TP – True Positive, FP – False Positive, TN – True Negative, and FN – False Negative.

Metric                Definition                              Description
Precision             P = TP / (TP + FP)                      Proportion of instances classified as attacks that are actually attacks
Recall                R = TP / (TP + FN)                      Proportion of attack instances that are correctly classified
F-score               F = 2 × (P × R) / (P + R)               Harmonic mean of precision and recall
Accuracy              A = (TP + TN) / (TP + TN + FP + FN)     Proportion of correctly classified instances
False Positive Rate   FPR = FP / (FP + TN)                    Proportion of normal instances classified as attack instances

2.4.2. Measuring scalability
Several studies (e.g., Grama et al., 1993; Sun et al., 2005; Jogalekar and Woodside, 2000) have proposed metrics for measuring the scalability of a system. However, the previous metrics are not suitable for the scalability analysis in our study for two reasons: 1) these metrics do not quantify scalability with respect to ideal scalability, which is required for evaluating the effectiveness of our adaptation approach presented in Section 3; 2) these metrics are primarily suitable for measuring scalability in cases where a system is partly executed in parallel mode and partly in sequential mode, whereas our system (implemented using Apache Spark (Zaharia et al., 2016; Spark, 2011)) is executed fully in parallel mode. For this study, we used Eq. (1) to measure the scalability of a BDCA system. In Eq. (1), S(c) denotes the scalability score for curve 'c'. Gap denotes the quantified gap between the achieved and the ideal response time (i.e., training time or testing time), which is calculated using Eq. (2). In Eq. (2), ωn represents the user-defined weight that specifies the importance of the gap between achieved and ideal response time at 'n' worker nodes. For example, ω2 is the weight specifying the importance of the gap at two worker nodes and ω4 the importance of the gap at four worker nodes. In Eq. (1), ωn+1 denotes the weight specifying the importance of the gap at one size larger than the existing cluster size, e.g., the importance of the gap beyond eight nodes if n (the cluster size) equals eight. The sum of all weights is equal to 1 (as presented in Eq. (5)). In Eq. (2), Gn defines the ratio of the unaccomplished response time improvement to the response time improvement in the ideal case with 'n' worker nodes. Gn is calculated using Eq. (3), where ATn denotes the achieved response time with 'n' worker nodes and ITn the ideal response time with 'n' worker nodes. In Eq. (1), Trend, which is calculated using Eq. (4), denotes how the response time decreases between the last two cluster setups, such as from six to eight worker nodes in a cluster of size 8; the higher


of which indicates the probability that the response time tends to decrease with more than eight nodes.

S(c) = 1 − Gap − ωn+1 × (1 − Trend)    (1)

Gap = Σi=1..n ω2i G2i    (2)

Gn = (ATn − ITn) / (IT1 − ITn)    (3)

Trend = (ATn−1 − ATn) / (ITn−1 − ITn)    (4)

Σi=1..n ω2i = 1    (5)

Example Scenario: We illustrate the use of the scalability metric with an example, which includes eight hypothetical scalability scenarios for a software system. Table 3 presents the hypothetical response times for the eight scenarios with respect to five different cluster configurations, i.e., 1, 2, 4, 6, and 8 workers. Fig. 2 shows the eight scalability curves drawn from the response times reported in Table 3. The eight scenarios (i.e., the Ideal Scenario and Scenario-1 to Scenario-7) differ from each other with respect to two parameters – the number of worker nodes and the response time. The number of worker nodes is the independent parameter that we change to observe the impact on the dependent parameter, i.e., the response time. As shown in Fig. 2, the impact of a change in the number of worker nodes is not consistent across scenarios, which could be due to multiple reasons in a real-world setting. For example, in the Ideal Scenario the system utilizes the underlying resources such as CPU and RAM more efficiently than in Scenario-1. Therefore, the response time in the Ideal Scenario reduces more significantly with the increase in the number of worker nodes than the response time in Scenario-1.

Table 3
Response time (in seconds) and scalability score for the eight hypothetical scalability scenarios. S-1, S-2 and so on denote Scenario-1, Scenario-2 and so on.

Number of Worker Nodes   Ideal   S-1     S-2    S-3    S-4    S-5    S-6     S-7
1                        8.00    10.00   10.00  8.00   8.00   9.50   8.00    8.00
2                        4.00    11.00   6.87   5.50   7.00   5.00   7.00    8.00
4                        2.00    6.00    5.00   4.00   6.00   8.00   7.20    8.00
6                        1.33    4.00    3.75   3.00   5.80   5.50   6.00    8.00
8                        1.00    3.00    3.20   2.90   5.70   6.00   7.20    8.00
Scalability Score        1.00    0.85    0.83   0.61   0.31   0.16   −0.56   0.00

The Ideal Scenario underlines the case where each time the number of nodes is doubled, the response time is reduced to half. For calculating the scalability score, we use a value of 0.2 for all weights (i.e., ω2, ω4, ω6, ω8, ω10). For calculating Trend in this scenario, AT6 = 1.33, AT8 = 1, IT6 = 1.33, and IT8 = 1 as shown in Table 3; hence, Trend = 1 using Eq. (4). Since there is no gap between the achieved and the ideal response time, all gap values are equal to zero (i.e., G2 = 0, G4 = 0, G6 = 0, and G8 = 0), and thus the overall gap, calculated using Eq. (2), is zero (i.e., Gap = 0). Feeding these values into Eq. (1) gives S(ideal) = 1.00. For Scenario-1, AT6 = 4, AT8 = 3, IT6 = 1.66, and IT8 = 1.25; hence, Trend = 2.44, which is high – indicating a positive trend of scalability after eight worker nodes. The values of Gn calculated using Eq. (3) are G2 = 1.2, G4 = 0.46, G6 = 0.28, and G8 = 0.2, which gives Gap = 0.42. Hence, the scalability score for Scenario-1 is 0.85, which indicates poor scalability as compared to the ideal scalability. This is also observable from the comparison of the two curves, i.e., the Ideal Scenario and Scenario-1, as depicted in Fig. 2: compared to Scenario-1, there is a higher reduction trend in response time with the increase in the number of nodes in the ideal case. Thus, the scalability score of Scenario-1 is smaller than that of the Ideal Scenario.

The scalability score for Scenario-2 is 0.83, slightly lower than the score for Scenario-1. The slight difference is mainly due to the difference in Trend between the two scenarios, i.e., a reduction from 4 to 3 in Scenario-1 versus a reduction from 3.75 to 3.2 in Scenario-2. The scalability score for Scenario-4 is 0.31, which is considerably lower than the scores for Scenario-1 and Scenario-2. If we observe the scalability curve for Scenario-4 in Fig. 2, the response time reduces quite significantly as the number of worker nodes increases from 1 to 4. However, there is almost no reduction as the number of worker nodes increases from 4 to 8, which is why the scalability score is much lower than for smoother curves such as those of Scenario-1 and Scenario-2. In Scenario-5, the sudden upward jump in the curve from 2 nodes to 4 nodes impacts the scalability score of the whole curve; therefore, the scalability score is quite low, i.e., 0.16. In Scenario-6, the response time increases (contrary to expectation) at two transitions, i.e., from 2 nodes to 4 nodes and from 6 nodes to 8 nodes, and the spike in response time from 6 to 8 nodes is quite high. The negative impact on the response time at these two transitions significantly affects the scalability score, and Eq. (1) generates a much lower score (i.e., −0.56) for Scenario-6. The response time in Scenario-7 does not change with the number of nodes; therefore, the scalability score for Scenario-7 is 0.00.

3. Our adaptation approach

To optimize the scalability of a BDCA system, we present SCALER – an adaptation approach that automatically triggers the tuning process and tunes the Spark configuration parameters. By tuning, we mean selecting a combination of parameter values that generates a scalability score above a predefined threshold (Section 3.3).

Spark parameters control most of the application settings and directly impact the way an application runs (Spark, 2014b). All Spark parameters have a default configuration; however, the default configuration is not suitable for every application (Spark, 2014b). Therefore, the parameters need to be configured separately for each application. The Spark parameters investigated in this study for their impact on the scalability of a system are presented in Table 4. We selected 11 parameters based on the following criteria: (i) the parameters have a proven impact on different aspects of Spark such as scheduling, compression, and serialization, (ii) the parameters contribute to Spark running time as highlighted in (Gounaris and Torres, 2018) and (Nguyen et al., 2018), and (iii) the parameters impact multiple levels (e.g., machine level and cluster level) of a BDCA system as reported through industry practices (Spark, 2014b, 2016). Although SCALER considers 11 Spark parameters, it is worth noting that SCALER can easily be extended to incorporate more parameters if needed. In the following, we describe our adaptation approach, which automatically tunes Spark configuration parameters for improving scalability. We present our adaptation approach as per the guidelines for adaptation approaches presented by Villegas et al. (2011).

F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294

Fig. 2. Hypothetical scalability scenarios (drawn based on Table 3) to illustrate the use of the scalability metric.
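The Trend arithmetic used in the scenario calculations can be reproduced from the Table 3 values. The formula below is our reconstruction of Eq. (4) from the quoted numbers (the achieved response-time drop between six and eight nodes divided by the ideal drop); treat it as an illustration rather than the paper's definition:

```python
def trend(at6, at8, it6, it8):
    """Reconstruction of Eq. (4): the achieved response-time drop from
    six to eight worker nodes, divided by the ideal drop. A negative
    value means the response time *increased* from six to eight nodes."""
    return (at6 - at8) / (it6 - it8)

# Ideal Scenario (Table 3): AT6 = 1.33, AT8 = 1, IT6 = 1.33, IT8 = 1
ideal_trend = trend(1.33, 1.0, 1.33, 1.0)      # 1, as in the text

# Scenario-1 (Table 3): AT6 = 4, AT8 = 3, IT6 = 1.66, IT8 = 1.25
scenario1_trend = trend(4.0, 3.0, 1.66, 1.25)  # ~2.44, as in the text
```

Both values match those quoted in Section 2.4.2, and the negative trend of Scenario-6 (response time rising from 6.00 s to 7.20 s between six and eight nodes) is consistent with its low score of −0.56.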

Table 4
Spark parameters considered in this study for their scalability impact and subsequent tuning.

ID    Spark Parameters                           Default Value   Description
P1    Spark.executor.memory                      1024 MB         Amount of memory used by each executor process in Spark
P2    Spark.shuffle.sort.bypassMergeThreshold    200             Underlines the threshold of reduce partitions for avoiding merge-sorting data
P3    Spark.shuffle.compress                     TRUE            Whether (or not) to compress the map output files
P4    Spark.memory.storageFraction               0.5             Amount of memory available for task execution
P5    Spark.shuffle.file.buffer                  32 KB           Amount of memory available to buffer file output streams
P6    Spark.reducer.maxSizeInFlight              48 m            Maximum size of map output to be fetched for a reducer
P7    Spark.memory.fraction                      0.6             Proportion of heap size used for execution and storage
P8    Spark.serializer.objectStreamReset         100             To allow or stop garbage collection of objects
P9    Spark.rdd.compress                         FALSE           Whether (or not) to compress Resilient Distributed Datasets (RDDs)
P10   Spark.shuffle.memoryFraction               (Deprecated)    Proportion of Java heap size used for aggregation. Beyond this limit, contents start spilling to the disk
P11   Spark.driver.memory                        1024 MB         Amount of memory used for initializing the Spark context

3.1. Adaptation goal

The adaptation goal is stimulated by the main reason for adaptation, i.e., why a BDCA system needs to adapt (Villegas et al., 2011). Our adaptation goal is driven by the results collected for RQ1 (Section 4.1) and RQ2 (Section 4.2). These findings indicate that (i) with the default settings of Spark parameters, a BDCA system does not scale ideally and (ii) 9 out of the 11 studied Spark parameters significantly impact the scalability of a BDCA system. Therefore, our adaptation approach aims to "automatically tune Spark parameters for improving the scalability of a BDCA system".

3.2. Reference inputs and measured outputs

Reference inputs delineate the target to be achieved through adaptation (Villegas et al., 2011). For our approach, ideal scalability is the reference input. Ideal scalability implies that when the number of computing nodes is doubled, the job completion time (i.e., data processing time) is reduced by half. As illustrated in Section 2.4.2, the scalability score for ideal scalability is 1.0, which SCALER aims to achieve. Whilst reference inputs are specified by a user, the measured outputs are the actual measures collected from a running system (Villegas et al., 2011). The measured outputs are then compared with the reference inputs to assess the extent to which a system has achieved the target state (indicated by the reference input) through adaptation. For our adaptation approach, the measured outputs are the scalability scores calculated when a system is under operation. The scalability score is then compared with the scalability score for ideal scalability (i.e., 1.0) to assess the extent to which our adaptation approach has achieved its target.

3.3. Adaptation trigger

An adaptation trigger defines the condition for triggering the adaptation process (Villegas et al., 2011). In our approach, we define a threshold value for the scalability score. Our approach constantly monitors the scalability score of a system, and whenever the scalability score is less than the threshold value, the adaptation is triggered to optimize the scalability score. In Algorithm 1, line 5 specifies the adaptation trigger condition. In order to determine the threshold value for the evaluation of our approach, we first calculated the scalability score for the 108 use cases (i.e., 9 parameter configurations × 4 datasets × 4 algorithms) reported in Section 4.2 using Eq. (1) to Eq. (5). For calculating the scalability scores, we used the same value (i.e., 0.2) for all weights (e.g., ω2, ω4, ω6, ω8, ω10) in Eq. (1) to Eq. (5) to specify the same level of importance for all the gaps between ideal and achieved response time. We computed the mean scalability score of the 108 use cases, which gives a value of 0.55. Although the value 0.55 underlines the mean scalability score for the considered scenario and could itself be set as a threshold, we increased the threshold value by setting the value of incrementthreshold equal to 0.0365. This is because setting incrementthreshold to a positive value makes the evaluation of SCALER more robust, as reported in Section 4. We selected the value 0.0365 based on the fact that our threshold value should be higher than both the mean and the median. Hence, adding 0.0365 to the mean scalability score gives us a value of 0.58, which is also equal to the median value. As a result, the final threshold value for triggering adaptation is 0.58. Whilst we selected the value of incrementthreshold keeping in view the median value of the 108 cases, the main objective of having the incrementthreshold is to give users flexibility in terms of how robust they want the system to be with respect to adaptation.
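The trigger condition of Section 3.3 reduces to a single comparison against the derived threshold. A minimal sketch (the helper name is ours, not the paper's):

```python
# Threshold derivation (Section 3.3): the mean scalability score over the
# use cases (0.55) plus the user-chosen robustness margin
# (incrementthreshold = 0.0365), giving the final trigger threshold of
# 0.58 reported in the paper (which also equals the median score).
THRESHOLD = 0.58

def adaptation_needed(scalability_score, threshold=THRESHOLD):
    """Trigger condition of Algorithm 1 (line 5): adapt only when the
    monitored scalability score falls below the threshold."""
    return scalability_score < threshold
```

A larger incrementthreshold makes the system adapt more eagerly; setting it to zero falls back to triggering on the mean score alone.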


Algorithm 1. Algorithm for adapting the configuration settings of a Spark-based BDCA system.

3.4. Control actions

Table 5
Impactful Spark parameters used in the adaptation approach and their potential value options.

ID   Spark Parameters                          Default Value    Modified Value
P1   Spark.executor.memory                     1024 MB (P1-A)   1250 MB (P1-B)
P2   Spark.shuffle.sort.bypassMergeThreshold   200 (P2-A)       400 (P2-B)
P3   Spark.shuffle.compress                    TRUE (P3-A)      FALSE (P3-B)
P4   Spark.memory.storageFraction              0.5 (P4-A)       0.7 (P4-B)
P5   Spark.shuffle.file.buffer                 32 KB (P5-A)     64 KB (P5-B)
P6   Spark.reducer.maxSizeInFlight             48 m (P6-A)      96 m (P6-B)
P7   Spark.memory.fraction                     0.6 (P7-A)       0.8 (P7-B)
P8   Spark.serializer.objectStreamReset        100 (P8-A)       −1 (P8-B)
P9   Spark.rdd.compress                        FALSE (P9-A)     TRUE (P9-B)

Once adaptation is triggered, the control actions are automatically taken by our adaptation approach to adapt a system. The adaptation process is only triggered when a system's scalability score falls below the threshold scalability score, and so the control actions aim to bring the scalability score above the threshold. In our approach, the control actions execute a system with different Spark parameter configurations with the aim of finding a combination of parameters with which the system's scalability score gets above the threshold scalability score. The control actions only try changing the parameters that have a significant impact on scalability as determined in RQ2 (Section 4.2). These parameters are presented in Table 5 with the default and modified value for each parameter. With respect to value options, there are two types of parameters: Boolean parameters and numerical parameters. A Boolean parameter takes only two values (i.e.,


Table 6
Combination of Parameter Values (CPV) executed at runtime for identifying CPV with scalability score above the threshold. ‘A’ and ‘B’ specify the default and modified
value respectively.
CPV ID Combination of Parameter Values (CPV) P1 P2 P3 P4 P5 P6 P7 P8 P9

1 {A, A, A, A, A, A, A, A, A} A A A A A A A A A
2 {B, A, A, A, A, A, A, A, A} B A A A A A A A A
3 {A, B, A, A, A, A, A, A, A} A B A A A A A A A
. . . . . . . . . . .
. . . . . . . . . . .
. . . . . . . . . . .
512 {B, B, B, B, B, B, B, B, B} B B B B B B B B B
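The CPV space of Table 6 and its mapping onto concrete Spark settings can be sketched as follows. This is an illustrative fragment of ours, not the paper's code; the value strings are taken from Table 5, and the enumeration order produced by `product` differs from the order shown in Table 6:

```python
from itertools import product

# Default ("A") and modified ("B") values for P1-P9, per Table 5.
PARAM_OPTIONS = {
    "spark.executor.memory":                   {"A": "1024m", "B": "1250m"},
    "spark.shuffle.sort.bypassMergeThreshold": {"A": "200",   "B": "400"},
    "spark.shuffle.compress":                  {"A": "true",  "B": "false"},
    "spark.memory.storageFraction":            {"A": "0.5",   "B": "0.7"},
    "spark.shuffle.file.buffer":               {"A": "32k",   "B": "64k"},
    "spark.reducer.maxSizeInFlight":           {"A": "48m",   "B": "96m"},
    "spark.memory.fraction":                   {"A": "0.6",   "B": "0.8"},
    "spark.serializer.objectStreamReset":      {"A": "100",   "B": "-1"},
    "spark.rdd.compress":                      {"A": "false", "B": "true"},
}

# All 2**9 = 512 combinations of default/modified values.
ALL_CPVS = list(product("AB", repeat=len(PARAM_OPTIONS)))

def cpv_to_conf(cpv):
    """Map a CPV such as ('B', 'A', ..., 'A') to Spark conf pairs that
    could be passed, e.g., via `spark-submit --conf key=value`."""
    return {key: values[choice]
            for (key, values), choice in zip(PARAM_OPTIONS.items(), cpv)}

conf = cpv_to_conf(("B",) + ("A",) * 8)  # CPV 2 in Table 6
```

Enumerating the 512 CPVs is cheap; as the text below explains, what makes an exhaustive search infeasible is that scoring each CPV requires running the system with several cluster sizes.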

TRUE and FALSE). The value options for the numerical parameters presented in Table 5 are chosen based on academic and industrial recommendations (Spark, 2014b, 2016). For example, the executor memory can be set as 1024 m (default value) or 1250 m (modified value). In Table 5, 'A' represents the default value and 'B' represents the modified value for a parameter. For instance, P1-A and P1-B are the default and modified values for parameter P1 (Spark.executor.memory). Some sample Combinations of Parameter Values (CPV) are shown in Table 6.

Since our approach considers a total of nine parameters, each with two possible values, there are a total of 512 (2^9) CPVs. Executing this many CPVs to find a CPV with a scalability score above the threshold is a computationally expensive task. The time required to execute and search through the large search space of 512 potential CPVs would outweigh the gain expected through adaptation. It is also worth noting that state-of-the-art tuning approaches (e.g., Zhu et al., 2017; Herodotou et al., 2011) do follow the strategy of searching through the entire search space. However, this is computationally feasible for those approaches, which aim to tune for optimizing response time. Calculating response time requires a system to be executed only once, but for calculating the scalability score, as required by our approach, a system needs to be executed at least five times with a different number of computing nodes. We, therefore, employed the following optimization techniques to reduce the computational time with minimal impact on the accuracy of our approach.

Eliminating CPVs with a negative Trend: Before calculating the scalability score of a CPV for an entire curve obtained by executing the system with 1, 2, 4, 6, and 8 worker nodes, we calculate the scalability trend (Eq. (4)) from six to eight nodes. If the trend is negative, the response time increases as the cluster size changes from six to eight nodes. This implies that the CPV is not a candidate for the potential CPV with a scalability score above the threshold. The reason we include merely the transition from 6 to 8 nodes in Eq. (4) is the time overhead. To illustrate the impact of including more transitions in Eq. (4) on the time to calculate Trend, we take two sample cases: (a) using only the transition from 6 to 8 nodes and (b) including two transitions, i.e., from 4 to 6 nodes and from 6 to 8 nodes in Eq. (4). We then apply the two sample cases to the hypothetical scenarios presented in Fig. 2. On average, the time required to calculate Trend in case (a) and case (b) is 9.24 seconds and 14.9 seconds respectively. Hence, the time required to calculate the trend in case (a) is 38.12% less than that in case (b). This difference increases as more transitions (e.g., from 1 to 2 and 2 to 4 nodes) are included in Eq. (4). Furthermore, the results presented in Section 4.2 show that the region from 6 to 8 nodes is the most accurate region for determining whether the scalability curve shows any unexpected variation (see Section 4.2 for details). Consequently, the other transitions (e.g., from 4 to 6 nodes) in Eq. (4) would have minimal impact on accuracy but a far more significant impact on the time required to calculate Trend. Hence, Eq. (4) only considers the transition from 6 to 8 nodes for calculating Trend.

Keeping a change of parameter value with a positive impact: If changing the value of a parameter improves the scalability score, the changed value is kept for the next CPV. For example, CPV 2 achieves a better scalability score than CPV 1 by changing the value of spark.executor.memory from 1024 MB to 1250 MB. However, since the scalability score of CPV 2 is not above the threshold value, our algorithm will not select CPV 2; rather, it will execute CPV 3, but with the spark.executor.memory value of 1250 MB, as it has already shown a better scalability score.

Our adaptation algorithm is presented as Algorithm 1. If the scalability score of the BDCA system with the default parameter settings is below the threshold (line 5), an adaptation process is triggered. The first parameter in the default CPV is changed from its default value and then the Trend is calculated using Eq. (4) to investigate the trend of scalability from six nodes to eight nodes. If the Trend is negative (i.e., response time increases as the number of nodes increases from six to eight), the parameter value is changed back to its default value (lines 17–20). On the other hand, if the Trend is positive (response time decreases as the number of nodes increases from six to eight), the BDCA system is executed with two and four worker nodes to get the entire scalability curve (lines 21–24). After getting the scalability curve, the scalability score is

Table 7
Mean accuracy achieved by our BDCA system for the four datasets and four ML/DL algorithms.
ML Algorithm Dataset Precision Recall F-Measure False Positive Rate Accuracy

Naïve Bayes KDD 83.4% 99.2% 90.6% 6.5% 84.2%


DARPA 97.4% 55.9% 71.0% 0.1% 74.1%
CIDDS 84.6% 100.0% 91.7% 0.4% 96.6%
CICIDS2017 46.3% 27.3% 34.4% 0.5% 83.9%
Random Forest KDD 99.9% 97.1% 98.5% 2.5% 97.6%
DARPA 99.9% 75.1% 85.8% 0.2% 85.8%
CIDDS 100.0% 99.5% 99.7% 0.0% 99.9%
CICIDS2017 99.6% 97.8% 98.7% 0.7% 99.6%
Support Vector Machine KDD 97.7% 92.4% 95.4% 0.7% 92.9%
DARPA 100.0% 23.5% 38.0% 7.6% 56.6%
CIDDS 96.1% 100.0% 91.7% 0.0% 96.7%
CICIDS2017 64.5% 57.9% 61.0% 0.5% 88.5%
Multilayer Perceptron KDD 99.5% 95.7% 97.6% 1.5% 96.3%
DARPA 97.9% 61.4% 75.2% 1.5% 78.1%
CIDDS 99.9% 98.1% 99.0% 0.1% 99.6%
CICIDS2017 98.9% 86.5% 92.3% 0.2% 97.3%
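Taken together, the two optimizations reduce the search of Algorithm 1 to a single greedy pass over the impactful parameters. The sketch below is our reconstruction from the prose, not the authors' code; `measure_trend` and `measure_score` stand in for the actual cluster runs (which Algorithm 1 repeats Rt = 3 times and averages):

```python
def adapt(params, measure_trend, measure_score, threshold=0.58):
    """Greedy single-pass tuning (a sketch of Algorithm 1).

    params        -- ordered names of the nine impactful parameters
    measure_trend -- callback: run the system with 6 and 8 worker nodes
                     for a CPV and return the Trend of Eq. (4)
    measure_score -- callback: run the remaining node counts and return
                     the full scalability score for a CPV
    """
    cpv = {p: "A" for p in params}              # start from the defaults
    best_cpv, best_score = dict(cpv), measure_score(cpv)
    if best_score >= threshold:                 # trigger condition (line 5)
        return best_cpv, best_score
    for p in params:
        cpv[p] = "B"                            # flip one parameter
        if measure_trend(cpv) < 0:              # negative Trend: prune CPV
            cpv[p] = "A"                        # revert (lines 17-20)
            continue
        score = measure_score(cpv)              # full curve (lines 21-24)
        if score > best_score:                  # keep a change with a
            best_cpv, best_score = dict(cpv), score  # positive impact
        else:
            cpv[p] = "A"                        # revert a harmful change
        if best_score >= threshold:             # above threshold: stop
            break                               # (line 29)
    return best_cpv, best_score
```

With nine parameters, this pass runs the expensive full-curve measurement at most ten times, instead of once for each of the 512 CPVs.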


calculated for the CPV. If the scalability score is higher than the previous best scalability score, the optimal CPV is updated. Finally, the scalability score of the CPV is compared with the threshold scalability score (line 29). If the scalability score is above the threshold, the CPV is selected for the future operations of the system. The variable Rt in Algorithm 1 specifies the number of times each execution is repeated. Such repetition of execution is required to remove (any) experimental fluctuations. We set Rt equal to three, indicating that each execution is repeated three times. We then take the mean of the response times determined in the three executions for the subsequent calculation of the scalability score. It is important to note that Algorithm 1 only restricts the adaptation trigger based on the predefined threshold, i.e., adaptation is triggered only if the scalability score is less than the predefined threshold. Algorithm 1 ensures that it will return a CPV with a scalability score either equal to or better than that of the previously running CPV. Algorithm 1 does not guarantee that it will always return a CPV with a scalability score above the predefined threshold; however, we did not observe any such case in the results presented in Section 4.3.

4. Results

In this section, we present the results from our study aimed at answering the three research questions.

4.1. RQ1: How does a BDCA system scale with default Spark configuration settings?

This research question investigates the very premise of the work reported in this paper, i.e., to confirm whether or not a BDCA system scales ideally with the default configuration. Ideal scalability implies that a BDCA system makes full use of the additional resources provided by scaling (Williams and Smith, 2004). For instance, when the number of worker nodes is doubled, the response time of a system should reduce to half. If a BDCA system scaled ideally, there would be no value added by our work.

Classification accuracy: Before presenting the scalability findings, we first present the accuracy of our BDCA system in Table 7 for the four datasets and four ML algorithms. This is because accuracy is one of the main quality measures for a BDCA system and needs to be considered before scalability (Ullah and Babar, 2019a). According to the accuracy presented in Table 7, our system achieves a mean accuracy of 92.7% for KDD, 73.6% for DARPA, 98.2% for CIDDS, and 92.3% for CICIDS2017. With respect to the algorithms, our system achieves a mean accuracy of 84.75% for Naïve Bayes, 95.74% for Random Forest, 83.71% for Support Vector Machine, and 92.8% for Multilayer Perceptron (MLP). The mean accuracy of our system across the four datasets and four algorithms is 89.2%, which is a decent level of accuracy as compared to the accuracy of the state-of-the-art BDCA systems (Gupta and Kulariya, 2016; Kumari et al., 2016; Marchal et al., 2014; Las-Casas et al., 2016; Zhang et al., 2016; Böse et al., 2017). Table 8 shows how the accuracy of the ML/DL models varies with respect to the number of nodes in the cluster. Whilst there is no significant change in accuracy for most of the cases, the general trend shows that accuracy slightly decreases as the number of nodes in the cluster increases. This could be attributed to the way data is distributed among the nodes during the training and testing process. A comparatively larger number of nodes in the cluster requires the generation of larger data blocks and vice versa. Such a data partitioning and distribution strategy slightly impacts the accuracy, as presented in Table 8.

Table 8
Accuracy achieved by our BDCA system for the four datasets and four ML/DL algorithms in 2, 4, 6, and 8 node clusters.

ML Algorithm             Dataset       2       4       6       8
Naïve Bayes              KDD           89.6%   81.7%   82.5%   80.4%
                         DARPA         78.9%   76.6%   75.4%   73.5%
                         CIDDS         94.8%   98.4%   90.7%   89.7%
                         CICIDS2017    88.4%   84.7%   83.5%   76.8%
Random Forest            KDD           96.8%   96.4%   85.9%   84.7%
                         DARPA         88.6%   89.4%   88.4%   81.7%
                         CIDDS         99.9%   98.8%   98.9%   99.1%
                         CICIDS2017    99.9%   99.7%   99.4%   99.2%
Support Vector Machine   KDD           88.7%   87.6%   89.4%   87.4%
                         DARPA         77.6%   68.4%   49.7%   48.8%
                         CIDDS         98.1%   97.5%   96.2%   94.5%
                         CICIDS2017    91.4%   90.8%   90.4%   85.4%
Multilayer Perceptron    KDD           97.4%   96.5%   97.4%   94.3%
                         DARPA         86.4%   77.8%   79.8%   77.9%
                         CIDDS         99.9%   99.8%   99.9%   96.8%
                         CICIDS2017    98.6%   97.8%   98.7%   93.7%

Following the approach reported in (Qiu et al., 2016), we trained and evaluated the ML/DL algorithms in a distributed manner. In other words, the cluster consists of a total of 10 nodes in our case. Among these nodes, one acts as a master and nine act as workers. The master node distributes the training and testing process among the nine workers, which perform the training and testing in a distributed and parallel manner. On the contrary, the same job can be performed in a centralized manner (termed centralized learning), in which the training and testing of algorithms is performed centrally on a single node instead of a cluster of nodes. In addition to distributed and centralized learning of ML algorithms, deep learning approaches have gained tremendous attention in recent times (Pouyanfar et al., 2018). Therefore, we also incorporate a Deep Learning (DL) algorithm, Multilayer Perceptron (MLP), to assess how it performs as compared to the traditional ML algorithms. We selected MLP based on its widespread usage in the cyber security domain. The accuracy and training time of the ML and DL algorithms trained and tested in centralized and distributed manners are presented in Table 9. The mean accuracy for centralized learning, distributed

Table 9
Accuracy and training time achieved by our BDCA system with centralized learning, distributed learning, and deep learning. SVM denotes Support Vector Machine and MLP denotes Multilayer Perceptron.

Learning Type          Algorithm       KDD               DARPA             CIDDS             CICIDS2017
                                       Acc.    Time (s)  Acc.    Time (s)  Acc.    Time (s)  Acc.    Time (s)
Centralized Learning   Naïve Bayes     90.6%   395       76.4%   2855      95.7%   2141      88.4%   3377
                       Random Forest   90.7%   355       88.1%   2048      99.6%   2122      99.9%   3741
                       SVM             88.1%   230       78.9%   104       99.2%   306       99.0%   314
Distributed Learning   Naïve Bayes     80.4%   331       73.5%   1853      89.7%   968       76.8%   1521
                       Random Forest   84.7%   245       81.7%   228       99.1%   265       99.2%   1243
                       SVM             87.4%   184       48.8%   58        94.5%   120       85.4%   44
Deep Learning          MLP             96.3%   312       78.1%   2978      99.6%   2749      97.3%   3104
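The per-learning-type means reported in the text (91.2%, 83.4%, and 92.83% accuracy; 1499, 588, and 2285 seconds of training time) follow directly from Table 9. A quick transcription check (rows ordered Naïve Bayes, Random Forest, SVM across KDD, DARPA, CIDDS, CICIDS2017):

```python
# Accuracy (%) and training time (s), transcribed row-wise from Table 9.
centralized_acc  = [90.6, 76.4, 95.7, 88.4, 90.7, 88.1, 99.6, 99.9,
                    88.1, 78.9, 99.2, 99.0]
centralized_time = [395, 2855, 2141, 3377, 355, 2048, 2122, 3741,
                    230, 104, 306, 314]
distributed_acc  = [80.4, 73.5, 89.7, 76.8, 84.7, 81.7, 99.1, 99.2,
                    87.4, 48.8, 94.5, 85.4]
distributed_time = [331, 1853, 968, 1521, 245, 228, 265, 1243,
                    184, 58, 120, 44]
deep_acc  = [96.3, 78.1, 99.6, 97.3]
deep_time = [312, 2978, 2749, 3104]

def mean(values):
    return sum(values) / len(values)
```

The computed means agree with the quoted figures to the precision reported in the text.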


Fig. 3. Ideal and achieved scalability with default Spark settings for the four datasets – KDD, DARPA, CIDDS, CICIDS2017 with (A) Naïve Bayes – training phase (B)
Naïve Bayes – testing phase (C) Random Forest – training phase (D) Random Forest – testing phase (E) Support Vector Machine – training phase and (F) Support
Vector Machine – testing phase (G) MultiLayer Perceptron - training phase (H) MultiLayer Perceptron - testing phase. The number in the legend specifies the
scalability score.

The summary answer to RQ1: A BDCA system with default Spark configuration settings does not scale ideally. The deviation from ideal
scalability is around 59.5%.
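The 59.5% figure follows from the deviation formula used in Section 4.1, deviation = (1 − scalability score) × 100, applied to the per-algorithm mean scalability scores reported there (Naïve Bayes 0.24, Random Forest 0.70, Support Vector Machine 0.36, MLP 0.32). A quick check:

```python
# Mean scalability scores per algorithm with default Spark settings
# (Section 4.1).
scores = {"Naive Bayes": 0.24, "Random Forest": 0.70,
          "Support Vector Machine": 0.36, "Multilayer Perceptron": 0.32}

def deviation(score):
    """Deviation from ideal scalability (score 1.0), in percent."""
    return (1 - score) * 100

deviations = {name: deviation(s) for name, s in scores.items()}
average_deviation = sum(deviations.values()) / len(deviations)  # 59.5
```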


learning, and deep learning is 91.2%, 83.4%, and 92.83%, respectively. The difference in accuracy is due to the way the algorithms are trained, based on the partitioning of the data during the training process. On the other hand, the mean training time for centralized, distributed, and deep learning is 1499 seconds, 588 seconds, and 2285 seconds, respectively. The difference in training time is due to resource allocation. For example, centralized learning takes more time than distributed learning because of the greater computational capacity available in distributed learning.

Scalability: Fig. 3 shows the scalability curves with the default Spark settings for the training time and testing time of the four datasets and four algorithms. The dotted lines in Fig. 3 denote ideal scalability and the solid lines denote achieved scalability. In the training phase with Naïve Bayes, Random Forest, and MLP, the system scales almost ideally as the number of worker nodes increases from one to two. The impact of adding further nodes, up to six nodes, is negligible. After six nodes, the addition of nodes has a negative impact on scalability, i.e., the training time slightly increases. With Support Vector Machine, the trend is somewhat abrupt as compared to the other three algorithms. For example, unexpected spikes can be observed at three nodes for KDD and five nodes for CICIDS2017. A potential reason for such spikes with Support Vector Machine is its short training time as compared to that of the other three algorithms. We used Eq. (1) to quantify the ideal and achieved scalabilities. The ideal scalability score for each dataset is 1. The scalability scores for the achieved scalabilities, calculated using our scalability metric presented in Section 2.4.2, are shown in the training legends in Fig. 3.

The mean scalability scores with the default Spark settings for the datasets are: KDD – 0.31, DARPA – 0.39, CIDDS – 0.53, and CICIDS2017 – 0.52. This trend is largely in line with the number of instances in each dataset. For instance, CIDDS, having the largest number of instances, achieves the best scalability, and KDD, with the smallest number of instances, achieves the lowest scalability. The deviation from ideal scalability for each dataset is calculated as deviation = (1 − scalability score) × 100. With regards to the algorithms, the mean scalability scores are: Naïve Bayes – 0.24, Random Forest – 0.70, Support Vector Machine – 0.36, and MLP – 0.32. The deviation from ideal scalability for each dataset is found to be: KDD – 69%, DARPA – 61%, CIDDS – 47%, and CICIDS2017 – 48%. The deviation from ideal scalability for each algorithm is found to be: Naïve Bayes – 76%, Random Forest – 30%, Support Vector Machine – 64%, and MLP – 68%. On average, the achieved scalability of a BDCA system deviates from the ideal scalability by 59.5%. The scalability with respect to testing time is abrupt. This is because of the very quick response of the system (i.e., in milliseconds) during the testing phase. Such abrupt changes in testing time make our findings unreliable for scalability analysis. Therefore, in the rest of this paper, following the approach used in related BDCA studies (e.g., Aljarah and Ludwig, 2013b), we only report our findings with respect to the training time.

4.2. RQ2: What is the impact of tuning Spark configuration parameters on the scalability of a BDCA system?

Table 10
Scalability score with the default and the modified value for each of the 11 studied parameters. The bold numbers indicate scalability scores lower than the default scalability score. The value of only one parameter is changed from default to modified at a time, such as changing the value of P1 from 1024 to 1250. Default Spark settings: P1 = 1024, P2 = 200, P3 = TRUE, P4 = 0.5, P5 = 32k, P6 = 48m, P7 = 0.6, P8 = 100, P9 = FALSE, P10 = \, P11 = 1024. Modified values: P1 = 1250, P2 = 400, P3 = FALSE, P4 = 0.7, P5 = 64k, P6 = 96m, P7 = 0.8, P8 = −1, P9 = TRUE, P10 = 0.4, P11 = 1600.
[The per-cell scalability scores of Table 10 are not reproduced here.]

Impactful Spark parameters: Table 10 shows the default values and the modified values for the 11 configuration parameters (described in Section 3) and the scalability scores achieved with the default and modified settings. Figs. 4–7 show the scalability graphs with the default and modified values for each of the 11 parameters for the 16 use cases, i.e., 4 algorithms × 4 datasets. For instance, as shown in Table 10, changing the value of spark.rdd.compress from FALSE (default) to TRUE changes the scalability score of the Naïve Bayes based BDCA system from −0.14 to 0.92 for KDD, 0.53 to 0.38 for DARPA, 0.68 to 0.69 for CIDDS, and 0.61 to 0.64 for CICIDS2017. The same trend continues for the first nine parameters shown in Table 10, where modifying the value of the parameters leads to a significant change in the scalability score. The last two parameters (i.e., P10 - spark.shuffle.memoryFraction and P11 - spark.driver.memory) do not significantly impact the scalability. For


Fig. 4. Impact of modifying the value of parameters on the scalability score of Naïve Bayes based BDCA system for the four datasets - (A) KDD (B) DARPA (C) CIDDS
and (D) CICIDS2017. The number in the legend specifies scalability score.

Fig. 5. Impact of modifying the value of parameters on the scalability of Random Forest based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies scalability score.

Fig. 6. Impact of modifying the value of parameters on the scalability of Support Vector Machine based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies scalability score.

Fig. 7. Impact of modifying the value of parameters on the scalability of Multilayer Perceptron based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies scalability score.


example, as presented in Table 10, changing the value of spark.shuffle.memoryFraction from '\' (default) to 0.4 for the Naïve Bayes based BDCA system brings an insignificant change in the scalability score for KDD (−0.14 to −0.13), DARPA (0.53 to 0.54), CIDDS (0.68 to 0.54), and CICIDS2017 (0.61 to 0.6).

Unexpected variations: We also assess the regions for each of the 16 use cases (4 datasets × 4 algorithms) where unexpected variations happen. By an unexpected variation, we mean a variation (e.g., from 2 to 4 nodes) where the training time increases instead of decreasing. The number of unexpected variations in each of the four regions for the 16 use cases is presented in Table 11. Among the 192 scalability curves (12 parameter settings × 4 datasets × 4 algorithms), 8/192 show an unexpected variation in the region from 1 to 2 nodes, 23/192 in the region from 2 to 4 nodes, 37/192 in the region from 4 to 6 nodes, and 41/192 in the region from 6 to 8 nodes. This trend exhibits that the rate of unexpected variation increases with the increase in the number of nodes.

Positive/negative impact on scalability: We assess whether modifying a parameter value has a positive or negative impact on scalability. Table 10 shows that the positive or negative impact of modifying a parameter value varies from one dataset to another as well as from one algorithm to another. The bold values in Table 10 indicate a negative impact of changing the default value for the parameter, i.e., the scalability score decreases in comparison to the scalability score with the default settings. For example, changing the value of P3 - spark.shuffle.compress for the Naïve Bayes based BDCA system from TRUE to FALSE has a positive impact on scalability with KDD, DARPA, and CIDDS but a negative impact on scalability with CICIDS2017. Similarly, with the Random Forest based BDCA system, the default value (100) of P8 - spark.serializer.objectStreamReset achieves better scalability for CIDDS and […] It can be asserted that the Spark configuration parameters need to be configured as per the type of the dataset and algorithm. In other words, this finding invalidates the reuse of a single Spark configuration setting across multiple datasets and algorithms.

Spark parameter ranking: Table 12 presents the ranking of the studied Spark parameters, based on their impact on scalability, with respect to the four datasets and four algorithms. The impact is calculated as the difference between the scalability score with the default Spark parameter setting and that with the modified one. Such a ranking is useful in prioritizing the tuning of particular parameters, i.e., parameters with a significant impact. Overall, our findings show that with respect to scalability, Spark.reducer.maxSizeInFlight is the most impactful and Spark.shuffle.memoryFraction is the least impactful Spark parameter. It is worth noting that Spark.shuffle.compress significantly impacts the training time, as can be observed from Figs. 4–7. However, a significant impact on training time does not necessarily mean a significant impact on scalability (Section 4.1). That is why it is not ranked as the most impactful with respect to scalability. Table 12 also depicts that the ranking of parameters varies with respect to datasets and algorithms. For example, Spark.shuffle.sort.bypassMergeThreshold is ranked 1st and 3rd for the DARPA and CICIDS2017 datasets respectively; however, this parameter is respectively ranked 9th and 10th for the KDD and CIDDS datasets. As stated earlier and illustrated by the ranking, the two parameters spark.driver.memory and spark.shuffle.memoryFraction are ranked at the bottom due to their insignificant or minor impact on the scalability score.

4.3. RQ3: How to improve the scalability of a BDCA system?

We have proposed a parameter-driven adaptation approach, SCALER, for improving the scalability of a BDCA system. The adaptation approach has already been described in Section 3. Here, we evaluate the effectiveness of our approach with respect to the following research
Table 11
Number of unexpected variations (i.e., where training time increases unlike
questions.
expected decrease) in each of the four transitions – 1 to 2 nodes, 2 to 4 nodes, 4
to 6 nodes, and 6 to 8 nodes. The value in brackets specifies the percentage of 4.3.1. RQ3.1: How much scalability of a BDCA system is improved using
unexpected variations calculated as the number of unexpected variations SCALER (scalability improvement)?
divided by the number of total variations. SVM and MLP stands for Support Adaptation scenarios: We assess the scalability improvement by
Vector Machine and MultiLayer Perceptron, respectively. comparing the scalability score achieved by our system exactly before
ML Dataset 1 to 2 2 to 4 4 to 6 6 to 8 and after adaptation. In order to realize adaptation, we experimented
Algorithm nodes nodes nodes nodes with two scenarios, i.e., baseline and change in input data. In the baseline
Naïve Bayes KDD 0 (0%) 9 (75.0%) 3 (25.0%) 5 (41.6%) scenario, a BDCA system is processing a particular dataset such as KDD
DARPA 0 (0%) 5 (41.6%) 6 (50.0%) 4 (33.3%) with the optimal CPV determined for KDD based on Algorithm 1. In the
CIDDS 0 (0%) 0 (0%) 7 (58.3%) 7 (58.3%) change in input data scenario, the input to the system is changed from one
CICIDS2017 0 (0%) 0 (0%) 8 (66.6%) 3 (25.0%) dataset to another, e.g., from KDD to CIDDS. Upon the change in the
Random KDD 1 (8.3%) 2 (16.6%) 1 (8.3%) 2 (16.6%)
Forest DARPA 0 (0%) 0 (0%) 2 (16.6%) 3 (25.0%)
dataset, SCALER calculates the scalability score for the new dataset (i.e.,
CIDDS 0 (0%) 2 (16.6%) 1 (8.3%) 2 (16.6%) CIDDS), which is presented as the scalability score before adaptation in
CICIDS2017 0 (0%) 1 (8.3%) 4 (33.3%) 3 (25.0%) Table 13. If the scalability score is lower than the predefined threshold
SVM KDD 1 (8.3%) 2 (16.6%) 2 (16.6%) 3 (25.0%) (0.58), the adaptation process is triggered. Given that we have four se­
DARPA 0 (0%) 1 (8.3%) 2 (16.6%) 2 (16.6%)
curity datasets, a total of 12 (change in input data) use cases are possible
CIDDS 0 (0%) 1 (8.3%) 1 (8.3%) 3 (25.0%)
CICIDS2017 0 (0%) 0 (0%) 0 (0%) 4 (33.3%) as shown in Table 13.
MLP KDD 2 (16.6%) 1 (8.3%) 2 (16.6%) 0 (0%) Scalability improvement: Table 13 shows the scalability scores
DARPA 2 (16.6%) 0 (0%) 2 (16.6%) 2 (16.6%) before and after adaptation for each of the 12 possible use cases and the
CIDDS 0 (0%) 1 (8.3%) 2 (16.6%) 2 (16.6%) mean scalability improvement for each of the four datasets. On average,
CICIDS2017 2 (16.6%) 2 (16.6%) 1 (41.6%) 1 (8.3%)
Total Number of 8 (5.5%) 23 37 41
SCALER improves scalability by 20.8%. With respect to datasets, the
Unexpected Variations (18.7%) (30.5%) (32.0%) highest improvement is 27.83% for CIDDS followed by 25.83% for
CICIDS2017, 22.71% for KDD, and 7.86% for DARPA. Since the scal­
ability score of DARPA with Naïve Bayes is higher than the threshold
CICIDS2017 while the modified value (− 1) achieves better scalability score of 0.58 (Section 3.3), adaptation is not triggered for the associated
for KDD and DARPA. This finding underlines a correlation between the three use cases. It is important to note that the scalability score after
dataset and Spark configuration parameters. We observe a similar trend adaptation is the same for all three cases associated with each dataset.
for the algorithms where the optimal values of the parameters do not This is because SCALER selects a CPV for a dataset irrespective of the
necessarily remain the same for different algorithms. For example, the dataset previously being processed by the system. For example, in use
default value of P3 - Spark.shuffle.compress obtains better scalability with cases 1 and 2, SCALER aims to select an optimal CPV for KDD and does
Random Forest but the modified value achieves better scalability with not pay any attention to the previously processed datasets (i.e., DAPRA
Naïve Bayes and Support Vector Machine for CIDDS dataset. Hence, it and CIDDS).

14
F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294
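The parameter modifications discussed above are ordinary Spark configuration settings, so a configuration parameter vector (CPV) can be expressed directly as spark-submit flags. The sketch below is illustrative only (the helper function name is ours); the parameter names are studied Spark settings, and the values shown are modified-value examples of the kind discussed in the text:

```python
# Illustrative sketch (not the authors' tooling): a configuration parameter
# vector (CPV) binds each studied Spark parameter to one of its value options
# and can be rendered as --conf flags for spark-submit.

cpv = {
    "spark.shuffle.compress": "false",       # modified value (default: true)
    "spark.reducer.maxSizeInFlight": "96m",  # modified value (default: 48m)
    "spark.shuffle.file.buffer": "64k",      # modified value (default: 32k)
}

def to_submit_flags(cpv):
    """Render a CPV as spark-submit --conf flags (helper name is ours)."""
    return " ".join(f"--conf {key}={value}" for key, value in sorted(cpv.items()))

print(to_submit_flags(cpv))
```

The same settings could equally be supplied through SparkConf/SparkSession at application start-up; expressing them as a plain mapping is what makes automated re-tuning between runs straightforward.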

Table 12
Ranking of the studied parameters based on their impact on scalability (the number in brackets specifies the difference between the scalability score with the default settings and the modified settings). Columns KDD–CICIDS2017 give the ranking with respect to datasets; columns Naïve Bayes–MLP give the ranking with respect to ML algorithms. SVM and MLP stand for Support Vector Machine and MultiLayer Perceptron, respectively.

ID | Spark Parameter | Overall | KDD | DARPA | CIDDS | CICIDS2017 | Naïve Bayes | Random Forest | SVM | MLP
P1 | Spark.executor.memory | 9 (0.12) | 6 (0.21) | 11 (0.05) | 4 (0.14) | 9 (0.07) | 7 (0.15) | 11 (0.02) | 5 (0.19) | 11 (0.16)
P2 | Spark.shuffle.sort.bypassMergeThreshold | 6 (0.17) | 9 (0.12) | 1 (0.28) | 10 (0.07) | 3 (0.20) | 8 (0.13) | 3 (0.22) | 7 (0.15) | 4 (0.27)
P3 | Spark.shuffle.compress | 4 (0.20) | 4 (0.28) | 7 (0.13) | 6 (0.14) | 1 (0.26) | 5 (0.22) | 4 (0.19) | 4 (0.20) | 2 (0.32)
P4 | Spark.memory.storageFraction | 8 (0.15) | 5 (0.25) | 8 (0.09) | 5 (0.14) | 5 (0.14) | 4 (0.28) | 6 (0.10) | 11 (0.09) | 7 (0.24)
P5 | Spark.shuffle.file.buffer | 5 (0.16) | 7 (0.21) | 3 (0.23) | 1 (0.32) | 7 (0.09) | 6 (0.15) | 1 (0.41) | 10 (0.11) | 8 (0.23)
P6 | Spark.reducer.maxSizeInFlight | 1 (0.22) | 2 (0.50) | 2 (0.24) | 8 (0.07) | 8 (0.08) | 3 (0.29) | 7 (0.09) | 1 (0.29) | 1 (0.34)
P7 | Spark.memory.fraction | 7 (0.16) | 11 (0.09) | 6 (0.15) | 2 (0.19) | 2 (0.21) | 9 (0.08) | 2 (0.25) | 6 (0.15) | 10 (0.18)
P8 | Spark.serializer.objectStreamReset | 3 (0.20) | 3 (0.42) | 4 (0.16) | 3 (0.19) | 11 (0.04) | 1 (0.31) | 10 (0.06) | 3 (0.23) | 9 (0.22)
P9 | Spark.rdd.compress | 2 (0.21) | 1 (0.62) | 5 (0.15) | 11 (0.04) | 10 (0.06) | 2 (0.31) | 8 (0.09) | 2 (0.25) | 5 (0.24)
P10 | Spark.shuffle.memoryFraction | 11 (0.10) | 8 (0.12) | 10 (0.06) | 9 (0.07) | 6 (0.10) | 11 (0.04) | 9 (0.07) | 9 (0.12) | 6 (0.24)
P11 | Spark.driver.memory | 10 (0.10) | 10 (0.09) | 9 (0.08) | 7 (0.12) | 4 (0.16) | 10 (0.08) | 5 (0.13) | 8 (0.14) | 3 (0.30)

The summary answer to RQ2: Modifying the default value of 9 out of 11 studied Spark parameters impacts the scalability of a BDCA system.
Each security dataset and algorithm requires a separate configuration of Spark parameters for achieving optimal scalability. With respect to
scalability, Spark.reducer.maxSizeInFlight is the most impactful and Spark.shuffle.memoryFraction is the least impactful Spark parameter.
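The ranking rule behind Table 12 is mechanical: a parameter's impact is the difference between the scalability score under the default setting and under the modified setting, and parameters are ordered by that impact. A minimal sketch (the function name is ours; the numbers are four of the overall impact values reported in Table 12):

```python
# Sketch of the ranking rule behind Table 12: impact = difference between the
# scalability score with the default setting and with the modified setting,
# then sort parameters from most to least impactful.

overall_impact = {
    "Spark.reducer.maxSizeInFlight": 0.22,  # ranked 1st overall
    "Spark.rdd.compress": 0.21,             # ranked 2nd overall
    "Spark.driver.memory": 0.10,
    "Spark.shuffle.memoryFraction": 0.10,   # ranked last overall
}

def rank_parameters(impacts):
    """Order parameters by decreasing impact on the scalability score."""
    return sorted(impacts, key=impacts.get, reverse=True)

print(rank_parameters(overall_impact))
```

Because Python's sort is stable, parameters with tied impacts (here 0.10) keep their listed order, mirroring how near-equal impacts leave the bottom of the ranking essentially interchangeable.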

Table 13
Scalability score before and after adaptation. For each algorithm, the two values are the scores before and after adaptation.

Use Case ID | Use Case | Naïve Bayes (Before / After) | Random Forest (Before / After) | SVM (Before / After) | MLP (Before / After) | Mean Improvement for Dataset (%)
1 | DARPA → KDD | 0.47 / 0.63 | 0.38 / 0.61 | 0.41 / 0.59 | 0.24 / 0.61 | 26.61%
2 | CIDDS → KDD | 0.52 / 0.63 | 0.49 / 0.61 | 0.47 / 0.59 | 0.38 / 0.61 |
3 | CICIDS2017 → KDD | 0.33 / 0.63 | 0.54 / 0.61 | 0.52 / 0.59 | 0.51 / 0.61 |
4 | KDD → DARPA | 0.70 / 0.70 | 0.50 / 0.59 | 0.60 / 0.60 | 0.59 / 0.59 | 7.61%
5 | CIDDS → DARPA | 0.70 / 0.70 | 0.59 / 0.59 | 0.72 / 0.72 | 0.61 / 0.61 |
6 | CICIDS2017 → DARPA | 0.70 / 0.70 | 0.41 / 0.59 | 0.54 / 0.72 | 0.54 / 0.68 |
7 | KDD → CIDDS | 0.51 / 0.72 | 0.45 / 0.63 | 0.41 / 0.60 | 0.47 / 0.65 | 26.01%
8 | DARPA → CIDDS | 0.54 / 0.72 | 0.63 / 0.63 | 0.53 / 0.60 | 0.55 / 0.65 |
9 | CICIDS2017 → CIDDS | 0.47 / 0.72 | 0.39 / 0.63 | 0.29 / 0.60 | 0.53 / 0.65 |
10 | KDD → CICIDS2017 | 0.54 / 0.60 | 0.54 / 0.71 | 0.50 / 0.64 | 0.47 / 0.64 | 22.80%
11 | DARPA → CICIDS2017 | 0.51 / 0.60 | 0.47 / 0.71 | 0.39 / 0.64 | 0.58 / 0.58 |
12 | CIDDS → CICIDS2017 | 0.53 / 0.60 | 0.49 / 0.71 | 0.38 / 0.64 | 0.51 / 0.64 |

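The threshold-based trigger that produces the "before adaptation" column of Table 13 can be sketched as follows (threshold 0.58 as stated in the text; the function name and the two sample scores, taken from the Naïve Bayes column of Table 13, are for illustration only):

```python
# Sketch of SCALER's trigger rule: on a dataset change, the scalability score is
# recomputed, and adaptation starts only if the score falls below the threshold.

THRESHOLD = 0.58  # predefined scalability-score threshold from the paper

def adaptation_triggered(scalability_score, threshold=THRESHOLD):
    """True when the score is below the threshold, i.e., adaptation should run."""
    return scalability_score < threshold

# "Before adaptation" scores from the Naïve Bayes column of Table 13.
assert adaptation_triggered(0.47)      # use case 1 (DARPA -> KDD): adapt
assert not adaptation_triggered(0.70)  # use case 4 (KDD -> DARPA): above threshold
print("trigger logic ok")
```

This matches the DARPA rows of Table 13, where the Naïve Bayes score of 0.70 never changes because adaptation is not triggered for those use cases.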
The trend in scalability improvement largely correlates with the size of each dataset – the larger the dataset, the larger the improvement. A higher scalability improvement is recorded for the large datasets (i.e., CIDDS and CICIDS2017) and a lower scalability improvement for the small datasets, i.e., KDD and DARPA. With respect to algorithms, Support Vector Machine (SVM) benefits the most from SCALER, achieving a mean scalability improvement of 23.68%. The mean scalability improvement for Random Forest, Naïve Bayes, and MLP is 22.50%, 16.54%, and 20.3%, respectively.

Comparison with related studies: We compare the optimization potential of SCALER with the state-of-the-art approaches that also aim to improve the scalability of different software systems. For such a comparison, we collected the data (e.g., response time) as reported in those studies and then calculated the scalability scores, using our scalability metric (Section 2.4.2), before and after the applied optimization. The scalability scores and the achieved optimization in scalability for the various studies are presented in Table 14. We could not make a comparison with all studies discussed in Section 6.3 due to the lack of the required data in the reported studies. It is important to note the following points before we analyse the findings presented in Table 14: (i) since some of the studies (e.g. (Kyong et al., 2017)) only report throughput, we first calculated the response time for those studies based on the reported throughput and data size; (ii) some studies report findings for a cluster size greater than eight nodes. Given that our study considers a cluster size of a maximum of eight nodes, we only selected (and scaled where required) the response times of up to eight nodes from those studies to make a fair comparison; (iii) the studies presented in Table 14 use different datasets and different workloads. For example, Joohyun Kyong et al. (2017) use BigDataBench (Wang et al., 2014) and Chen et al. (2010) use DaCapo (Blackburn et al., 2006) in their experiments. Given that our study is focussed on security analytics, we used the datasets and algorithms used in security analytics. Therefore, owing to the usage of different datasets and algorithms in the related studies, an apples-to-apples comparison is quite challenging, and


Table 14
Comparison of the scalability improvement achieved through SCALER with the scalability improvement achieved by the state-of-the-art approaches. The scalability improvement is calculated based on our scalability metric.

Study | Workload | Scalability Score Before Optimization | Scalability Score After Optimization | Improvement (%) | Mean Improvement (%)
Joohyun Kyong et al. (Kyong et al., 2017) | Wordcount | 0.36 | 0.74 | 38.00 | 18.00%
Joohyun Kyong et al. (Kyong et al., 2017) | Naïve Bayes | 0.34 | 0.61 | 27.00 |
Joohyun Kyong et al. (Kyong et al., 2017) | Grep | 0.63 | 0.79 | 16.00 |
Joohyun Kyong et al. (Kyong et al., 2017) | K-means | 0.12 | 0.03 | −9.00 |
Hasan Jamal et al. (Jamal et al., 2009) | 16 KB | 0.83 | 0.86 | 3.00 | 1.75%
Hasan Jamal et al. (Jamal et al., 2009) | 512 KB | 0.97 | 1.0 | 3.00 |
Hasan Jamal et al. (Jamal et al., 2009) | 6000 KB | 0.11 | 0.13 | 2.00 |
Hasan Jamal et al. (Jamal et al., 2009) | 16000 KB | 0.62 | 0.61 | −1.00 |
Chen et al. (Chen et al., 2010) | Eclipse | 0.74 | 0.70 | −4.00 | 13.13%
Chen et al. (Chen et al., 2010) | Hsqldb | 0.25 | 0.11 | −14.00 |
Chen et al. (Chen et al., 2010) | Lusearch | −0.57 | 0.59 | 116 |
Chen et al. (Chen et al., 2010) | Xalan | 0.95 | 0.79 | −16 |
Chen et al. (Chen et al., 2010) | MolDyn | 0.75 | 0.78 | 3.00 |
Chen et al. (Chen et al., 2010) | MonteCarlo | 0.85 | 0.80 | −5.00 |
Chen et al. (Chen et al., 2010) | RayTracer | 0.84 | 0.85 | 1.00 |
Chen et al. (Chen et al., 2010) | SPECjbb2005 | 0.37 | 0.61 | 24.00 |
SCALER | Naïve Bayes | 0.49 | 0.65 | 16.54 | 20.81%
SCALER | Random Forest | 0.44 | 0.66 | 22.50 |
SCALER | SVM | 0.42 | 0.63 | 23.68 |
SCALER | MLP | 0.49 | 0.62 | 20.34 |

The summary answer to RQ3.1: The proposed adaptation


approach improves the scalability of the BDCA system by 20.8%.
The larger the size of the dataset, the larger is the scalability
improvement achieved via our proposed approach.

(iv) some of the studies presented in Table 14 consider various scenarios of scalability improvement. For instance, one study (Kyong et al., 2017) considers two cases (i.e., fine-grained optimization and coarse-grained optimization); for such studies, we report the mean scalability improvement in Table 14. The improvement achieved with SCALER is the highest among the approaches mentioned in Table 14. A potential reason for such improvement with SCALER is that, unlike the related studies, our approach exploits the configuration parameters of the underlying framework to improve the scalability.

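One detail of the comparison above deserves a concrete form: where a related study reports only throughput, the response time is first recovered from the reported throughput and data size before the scalability metric is applied. Under the usual definition (throughput = data size / time), this is a one-line computation; the function and figures below are illustrative, not values from the cited studies:

```python
# Illustrative only: recovering response time from a reported throughput,
# assuming throughput = data_size / response_time.

def response_time(data_size_mb, throughput_mb_per_s):
    """Response time implied by a reported throughput and workload size."""
    return data_size_mb / throughput_mb_per_s

# e.g., a 1024 MB workload processed at 64 MB/s implies a 16 s response time.
print(response_time(1024, 64))  # 16.0
```

The recovered response times per cluster size can then be fed into the scalability metric (Section 2.4.2) exactly as measured response times are.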
Fig. 8. Adaptation time of SCALER for the four datasets and four ML/DL algorithms.

Table 15
Comparison of the number of iterations required by SCALER and other state-of-the-art approaches to converge towards an optimal configuration.

 | Perez et al. (Perez et al., 2018) | Zhu et al. (Zhu et al., 2017) | Gounaris et al. (Gounaris and Torres, 2018) | Liao et al. (Liao et al., 2013) | SCALER
Number of iterations/trials to converge | 4.06 | 5 | 9 | 15.75 | 2.27

4.3.2. RQ3.2: How long does it take for SCALER to adapt a BDCA system for optimal scalability (i.e., adaptation time)?

Adaptation time: The adaptation time underlines the speed with which SCALER adapts a system. The adaptation time is calculated as the time between the point at which adaptation is triggered and the point at which the system regains a stable state, i.e., the adaptation process is terminated (Villegas et al., 2011). Fig. 7 shows the adaptation time of SCALER for the 16 use cases, i.e., 4 datasets × 4 algorithms. On average, it takes around 170 min for SCALER to adapt a system, i.e., to bring a system to a state where its scalability is above the predefined threshold. It is worth noting that, unlike the previous studies (e.g. (Ullah and Babar, 2019b; Ullah and Babar, 2019c)) that adapt for improving the response time, our approach takes more time due to the generation of a scalability curve instead of the single point required for response time optimization. The adaptation time is mainly elapsed in executing a system with different CPVs (Table 6) to identify the CPV with which the system has a scalability score above the threshold. With respect to datasets, the mean adaptation time is the longest (i.e., 374.02 minutes (min)) for CICIDS2017


The summary answer to RQ3.2: Our adaptation approach takes around 2 iterations to adapt a BDCA system to a dataset. The time taken to
adapt a system is directly proportional to the size of the dataset.

followed by CIDDS (133.51 min), KDD (88.03 min), and DARPA (77.75 min). This trend is largely in accordance with the size of each dataset. For example, our approach takes the longest time to adapt the BDCA system for CICIDS2017, which is the largest in size, and takes the shortest time to adapt for the small datasets such as KDD and DARPA. SVM is quite fast with a mean adaptation time of 29.59 min, followed by Random Forest with a mean adaptation time of 149.15 min, Naïve Bayes with a mean adaptation time of 326.25 min, and MLP with a mean adaptation time of 178 min.

Adaptation time vs. training time: The adaptation time is larger than the actual job completion time (i.e., training time). For example, the mean training time for the SVM based BDCA system with the DARPA dataset is 84.33 min, while the mean adaptation time for the same system with the same dataset is 223 min. This is because, in order to determine the training time, a system needs to be executed only once. However, to determine the scalability score, a system needs to be executed multiple times with a different number of nodes (i.e., 1, 2, 4, 6, and 8 nodes in our case). Adaptation time is elapsed in determining the scalability score for different parameter combinations; therefore, the adaptation time is larger than the training time. However, this factor does not invalidate the advantages of SCALER. Similar to most of the tuning approaches (e.g. (Gounaris and Torres, 2018; Zhu et al., 2017; Alipourfard et al., 2017)), the real advantage of SCALER lies in the execution of recurring jobs (the same job executed by a system multiple times over a period of time), which is a common phenomenon and equally applicable to security analytics (Ferguson et al., 2012; Agarwal et al., 2012). Some recent studies (Ferguson et al., 2012; Agarwal et al., 2012) reveal that around 40% of data analytics jobs are recurrent jobs. The current job is executed for the sake of tuning; therefore, it does not benefit from tuning. However, the recurring and/or subsequent jobs benefit from the already tuned system. For example, SCALER improves the scalability score of a job (i.e., training the SVM based BDCA system with the CIDDS dataset) from 0.41 to 0.60 – an improvement of around 19% – which in turn translates into a reduction of training time from 121 min to 97.2 min with an eight-node cluster. Now, since the system is tuned, when the system executes the same or a similar job, it will take 97.2 min to complete the job instead of 121 min.

Comparison with related studies: In Fig. 8, the number of iterations indicates the number of CPVs tried to identify the CPV with which a system has a scalability score above the threshold. On average, it takes 2.1 iterations/trials for SCALER to find the desired CPV from the search space. Since the related optimization approaches (e.g. (Gounaris and Torres, 2018; Zhu et al., 2017; Perez et al., 2018; Liao et al., 2013)) use different datasets and algorithms, we cannot make a direct comparison of the adaptation time of SCALER with the related approaches. However, we can make a direct comparison of the number of iterations required by each approach to converge towards a stable configuration. Such a comparison of SCALER with the other state-of-the-art approaches is presented in Table 15. On average, SCALER requires only 2.1 iterations to find the desired configuration, which is the smallest number of iterations as compared to the other state-of-the-art approaches. One of the reasons for such a small number of iterations is that, instead of searching for the most optimal CPV in the search space, SCALER only searches for a CPV that has a scalability score above the threshold. As soon as the desired CPV is found, the search process is stopped.

4.3.3. RQ3.3: Does the number of parameters and their value options impact the optimization capability and adaptation time of SCALER?

Scenarios: For this research question, we assess the impact of the number of parameters considered and their value options on the performance (i.e., scalability improvement and adaptation time) of SCALER. We considered four scenarios, as shown in Table 16. Scenario-1 is the default scenario, as presented in the rest of the paper, which considers nine parameters with each parameter having two potential values, as shown in Table 16 and previously presented in Table 5. In scenario-2, we reduced the number of parameters from nine to five, considering only the most impactful parameters as determined from the average ranking presented in Table 11. Scenario-3 considers the same nine parameters as scenario-1, but unlike scenario-1, each parameter has four value options except the binary parameters such as Spark.shuffle.compress. The value options for the parameters are selected based on academic and industrial recommendations (Spark, 2014b, 2016). Similarly, scenario-4 considers the same five parameters as scenario-2, but unlike scenario-2, each parameter has four value options.

Scalability improvement: Fig. 9 (A) presents the improvement in scalability achieved by SCALER for each of the four studied scenarios. On average, SCALER improves the scalability of a BDCA system by 20.8% in scenario-1, 8.74% in scenario-2, 28.27% in scenario-3, and 23.62% in scenario-4. The improvement in scalability increases as the number of parameters and their value options increases. For example, the scalability improvement is the highest (28.27%) in scenario-3, where a total of nine parameters are considered, each with four value options (i.e., 9 parameters – 4 value options). On the contrary, the scalability improvement is the lowest (8.74%) in scenario-2, where SCALER explores combinations of only five parameters, each with only two value options (i.e., 5 parameters – 2 value options). However, it is worth noting that the improvement in scalability is not directly proportional to the number of parameters and their value options considered in each scenario. For example, in scenario-3, SCALER explores almost twice the number of parameter combinations as in scenario-1 but achieves merely 7.36% higher improvement than in scenario-1. A potential reason for

Table 16
Spark parameters and their value options considered in the four scenarios. The value in bold denotes the default value of the parameter. ✓ = parameter considered in the scenario (with the listed value options); ✗ = parameter not considered.

ID | Spark Parameter | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4
P1 | Spark.executor.memory | ✓ {1024, 1250} | ✗ | ✓ {1024, 1250, 512, 2056} | ✗
P2 | Spark.shuffle.sort.bypassMergeThreshold | ✓ {200, 400} | ✓ {200, 400} | ✓ {200, 400, 100, 800} | ✓ {200, 400, 100, 800}
P3 | Spark.shuffle.compress | ✓ {True, False} | ✗ | ✓ {True, False} | ✗
P4 | Spark.memory.storageFraction | ✓ {0.5, 0.7} | ✓ {0.5, 0.7} | ✓ {0.5, 0.7, 0.2, 0.9} | ✓ {0.5, 0.7, 0.2, 0.9}
P5 | Spark.shuffle.file.buffer | ✓ {32k, 64k} | ✓ {32k, 64k} | ✓ {32k, 64k, 16k, 128k} | ✓ {32k, 64k, 16k, 128k}
P6 | Spark.reducer.maxSizeInFlight | ✓ {48m, 96m} | ✓ {48m, 96m} | ✓ {48m, 96m, 24m, 192m} | ✓ {48m, 96m, 24m, 192m}
P7 | Spark.memory.fraction | ✓ {0.6, 0.8} | ✗ | ✓ {0.6, 0.8, 0.3, 1.0} | ✗
P8 | Spark.serializer.objectStreamReset | ✓ {1024, 1250} | ✓ {1024, 1250} | ✓ {1024, 1250, 512, 2056} | ✓ {1024, 1250, 512, 2056}
P9 | Spark.rdd.compress | ✓ {False, True} | ✗ | ✓ {False, True} | ✗

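The search space each scenario induces is simply the cross product of the per-parameter value options. A sketch for three of the scenario-1 parameters from Table 16 (the helper code and variable names are ours):

```python
# Sketch: enumerating candidate CPVs as the cross product of value options.
# Three of the scenario-1 parameters from Table 16, two value options each.

from itertools import product

options = {
    "Spark.shuffle.compress": ["True", "False"],
    "Spark.reducer.maxSizeInFlight": ["48m", "96m"],
    "Spark.shuffle.file.buffer": ["32k", "64k"],
}

names = list(options)
candidate_cpvs = [dict(zip(names, values)) for values in product(*options.values())]
print(len(candidate_cpvs))  # 2 * 2 * 2 = 8 candidates
```

For the full scenario-1 space (nine parameters, two options each) this enumeration would yield 2^9 = 512 candidate CPVs, which is why SCALER samples parameter combinations randomly rather than testing them exhaustively.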

Fig. 9. (A) Scalability improvement achieved by SCALER in each of the four scenarios presented in Table 16 and (B) adaptation time (in minutes) and the number of iterations to converge towards the optimal configuration.

such lack of proportionality is that Algorithm 1 does not sequentially test each parameter combination but rather randomly tests parameter combinations. As a result, as soon as a parameter combination with a scalability score above the threshold is found, that parameter combination is selected for future system operation.

Adaptation time: Fig. 9 (B) presents the adaptation time of SCALER with respect to each of the four scenarios and the number of iterations to converge to a parameter combination with a scalability score above the threshold. The mean adaptation time is 168 min in scenario-1, 136 min in scenario-2, 235 min in scenario-3, and 187 min in scenario-4. As expected, the adaptation time and the number of iterations to converge towards the optimal configuration increase with an increment in the number of parameters and value options. The adaptation time is the lowest (136 min) in scenario-2, which has the smallest number of parameter combinations (i.e., 5 parameters each with two value options). On the other hand, the adaptation time is the highest (235 min) in scenario-3, where SCALER explores the highest number of parameter combinations (i.e., 9 parameters each with four value options). Although the adaptation time is the lowest in scenario-2, SCALER fails to converge to (i.e., find) a Spark parameter setting with a scalability score above the threshold in 4 out of 48 use cases (3 ML algorithms × 4 datasets × 3 changes), hence impacting the adaptation stability (Villegas et al., 2011) of SCALER. Such lack of convergence indicates that reducing the number of parameters in the search space may lead to a search space that does not contain a parameter combination with a scalability score above the predefined threshold.

Trade-offs: From the above findings, we can see the case of a trade-off between optimization capability and adaptation time. Increasing the number of parameters and their value options grows the optimization capability, but at the cost of increased adaptation time. For instance, in scenario-3, which has the largest number of parameters and value options, SCALER achieves the highest improvement in scalability (28.27%) but with the highest adaptation time (235 min). However, with an increment in the number of value options, the gain in optimization capability is not as significant as the loss in adaptation time. As an example, increasing the number of value options for the nine parameters from two to four (between scenario-1 and scenario-3) improves the scalability by 7.36% but increases the adaptation time by 28.51%. Similarly, reducing the number of parameters from nine to five (scenario-2 and scenario-4) poses a threat to the adaptation stability, i.e., SCALER not being able to find the desired parameter combination due to very few options. Therefore, we assert that the search space in scenario-1 (i.e., nine parameters each with two value options) is the most suitable choice, in terms of optimization capability and adaptation time, for the applicability of SCALER.

The summary answer to RQ3.3: The higher the number of parameters and their value options in the search space, the higher the scalability optimization with SCALER. However, a higher number of parameters and their value options in the search space also leads to a higher adaptation time (a trade-off between scalability improvement and adaptation time). The most suitable choice of search space is the one with nine parameters each having two value options.

5. Discussion

This section discusses the broader implications of the findings.

5.1. Benefits to security operators

We now turn to the question of how the findings of this study are useful for the security operators of a BDCA system. In practice, BDCA systems are deployed and operated using default Spark settings (Gounaris and Torres, 2018). The first takeaway from our findings is the realization that default Spark settings are not optimal from a scalability perspective. Furthermore, our findings indicate that, on average, the deviation from ideal scalability is around 59.5%. This finding is expected to motivate a security operator to assess the scalability of a BDCA system after it is deployed. If the system scales poorly, the security operator can manually tune the parameters to improve the system’s scalability. The identification and ranking of impactful parameters in Section 4.2 further enable a security operator to only assess the tuning of the parameters that have a significant impact on a system’s scalability. Our adaptation approach saves the security operator’s time and effort that would otherwise be invested in finding the right combination of parameters with which a system can scale in a better way.

5.2. Extending SCALER to other domains

Whilst the main aim of this study was to investigate and improve the scalability of a BDCA system, the techniques (e.g., SCALER and the parameters’ impact) presented in this paper can be extended and applied to other domains. One such domain is banking, where big data analytics is making its mark. In banking, big data analytics is used to analyse large volumes of data to understand (in near real-time) customer behaviour, promote the right product, and increase revenue (Sun et al., 2014). Similar to security data, the generation of banking data (e.g., transactions) increases and decreases with time (Srivastava and Gopalkrishnan, 2015). For example, a heavy workload is observed during working hours and at the end of a month. Accordingly, banking big data analytics systems need to scale as per the workload. Therefore, our adaptation approach (Section 4) can be applied to banking big data


systems for automatic tuning to improve the scalability of a system. Another potential domain is healthcare analytics, where big data technologies are frequently employed to deal with massive volumes of healthcare data (e.g., patient records) (Belle et al., 2015). For healthcare big data systems, the workload fluctuates frequently, as a heavy workload is to be handled in an emergency (e.g., natural disasters) (Nambiar et al., 2013). Hence, we believe that a healthcare big data system can also benefit from using our adaptation approach.

5.3. Experimental bottlenecks

During our experimentation, we faced several bottlenecks related to the hardware resources and the Spark processing framework. We discuss those issues for the benefit of interested readers (i.e., researchers) who may come across the same issues. Our experiments produced temporary data during job execution on Spark, which consumes a lot of disk space on worker nodes. Since each worker node has a limited disk space (10 GB in our case), the temporary data can exceed the disk limit of the worker node. In such a case, the worker node becomes unhealthy and the master node does not assign the worker node any further tasks. To deal with this issue, we used to delete the temporary data produced on worker nodes. However, we ensured that critical data such as the HDFS block files and the block metafiles were not deleted. Debugging becomes a serious concern in distributed data processing. We also initially faced the challenge of debugging failures (e.g., node failure) during our experiments. For instance, running a data processing job with eight worker nodes, where two worker nodes are already unhealthy, consumes time but produces useless results. This is because the experiment was designed for eight nodes, but two nodes were in an unhealthy state and we were not aware of it. To handle this issue, we designated a path through the variable yarn.nodemanager.log-dirs as the path for saving operational logs. We used to constantly check the logs to identify any issues before running an experiment.

5.4. Threats to validity

In this study, we have investigated a specific BDCA system that uses particular algorithms and a big data framework (Spark). Therefore, our findings may not generalize to all kinds of BDCA systems. However, it is important to note that the aim of this study is not to show results that generalize to all BDCA systems but to show that the parameter configuration of the underlying big data framework impacts a BDCA system’s scalability. Nonetheless, future research aimed at obtaining more generalizable results will be useful. The number of value options (i.e., two and four) we investigated for the parameters limit the […] features limits our approach and consequently poses a threat to the validity of our findings related to the KDD dataset. In the future, it will be interesting to incorporate and assess other encoding techniques, such as one-hot encoding, to discard such bias and make a comparison with the existing results.

6. Related work

In this section, we compare our study with the existing studies on BDCA systems, scalability investigation, scalability optimization, and adaptation approaches.

6.1. BDCA systems

Given the exponentially growing number of cyber-attacks and the increasing emphasis on real (or near real)-time cybersecurity data analytics, there is strong interest in the strategies and tools for engineering and operating optimal BDCA systems. However, there is relatively little literature on this topic (Ullah and Babar, 2019a). Spark-based BDCA systems are rapidly surpassing Hadoop-based BDCA systems in popularity and adoption: 70% of the BDCA systems in 2014 were Hadoop-based and only 30% were Spark-based, which changed to 50% Hadoop-based and 50% Spark-based in 2017 (Ullah and Babar, 2019a). Recently, several studies (e.g. (Gupta and Kulariya, 2016; Kumari et al., 2016; Marchal et al., 2014; Las-Casas et al., 2016; Zhang et al., 2016; Böse et al., 2017)) have proposed Spark-based BDCA systems. Gupta et al. (Gupta and Kulariya, 2016) present a Spark-based BDCA system that leverages two feature selection algorithms (i.e., correlation-based feature selection and Chi-squared feature selection) and several ML algorithms for detecting cyber intrusions; the system was evaluated with the KDD dataset. The Spark-based BDCA system presented in (Kumari et al., 2016) used K-means clustering for intrusion detection. Marchal et al. (2014) propose a Spark-based BDCA system for collecting different types of security data (e.g., HTTP, DNS, IP flow, etc.) and correlating the data to detect cyber-attacks. Las-Casas et al. (2016) present a Spark-based BDCA system that leverages Apache Pig, Apache Hive, and SparkSQL to collect emails from honeypots installed in different countries and analyse the emails to detect phishing attacks. Another Spark-based BDCA system presented in (Zhang et al., 2016) analyses abnormal network packets to unveil DoS attacks. RADISH (Böse et al., 2017) is another Spark-based BDCA system that aims to detect abnormal user and resource behaviour in an enterprise to detect insider threats. Similarly, Wang et al. (Wang and Jones, 2021) focussed on the 3 Vs (volume, variety, and veracity) of cyber security big data to explore the impact of missing values, duplicates, variable correlation, and gen­
exploration of the parameter space. Since the modification of a param­ eral data quality on the detection of cyber-attacks. The authors used R
eter value (e.g., from 1024 MB to 1250 MB) shows a significant impact language and several datasets such as KDD-Cup and MAWILab in their
on scalability, investigating other modifications (e.g., from 1024 MB to study. Like the previous studies, our study also uses Spark and ML al­
2056 MB) can only strengthen our findings but cannot contradict them. gorithms for detecting cyber intrusions. However, unlike the previous
Our adaptation approach takes around 170 min to select Spark config­ studies, our study has been evaluated with four ML/DL algorithms and
uration with a scalability score above the threshold. Although the real four different security datasets in a fully distributed mode, which en­
advantage of our approach is the reduction in the execution time of the ables us to assert that our findings are based on a more rigorous study
recurring jobs, the adaptation time can be reduced in the future by (i) and are more generalizable. Since the previous studies use different ML
reducing the number of parameters considered during tuning through algorithms and security datasets for evaluation, hence, an apple-to-apple
techniques such as Lasso linear regression (Van Aken et al., 2017) and comparison of our findings with the findings from the previous studies is
(ii) similar to (Alipourfard et al., 2017) and (Venkataraman et al., 2016), not possible.
using representative datasets of smaller size instead of using the original
datasets. Our study has only investigated a limited number of parame­ 6.2. Scalability of BDCA systems
ters for their impact on scalability. Even if other Spark parameters do not
impact scalability, our findings for the studied parameters still remain Despite the increasing importance of the scalability of BDCA systems
valid. For feature engineering, we have used StringIndexer (from org. as reported in several studies (Ullah and Babar, 2019a), there have been
apache.spark.ml.feature) to transform the label features (i.e., normal and only a few efforts (e.g. (Lee and Lee, 2013; Las-Casas et al., 2016; Aljarah
attack) in the KDD dataset from string to indices as described in Section and Ludwig, 2013b; Du et al., 2014; Xiang et al., 2014),) aimed at
2.2.1. However, this approach introduces order in features that do not investigating the scalability of BDCA systems. Lee et al. (Lee and Lee,
have a natural order, which results in affecting/biasing the results of the 2013) investigated the scalability of a Hadoop-based BDCA system on a
machine learning model. Therefore, using the ordinal encoding of string 30-node cluster and observed that the execution time improves in

19
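The notions of ideal and non-ideal speedup used throughout this section can be made concrete with a small calculation. The sketch below is illustrative only: the node counts and execution times are hypothetical, and the deviation measure is a simple mean relative gap between measured and ideal execution times, which is not necessarily the exact scalability metric used in this study.

```python
def speedup_curve(times):
    """Measured speedup relative to the smallest cluster size n0:
    S(n) = T(n0) / T(n). Under ideal (linear) scalability, S(n) = n / n0."""
    nodes = sorted(times)
    n0 = nodes[0]
    return {n: times[n0] / times[n] for n in nodes}

def deviation_from_ideal(times):
    """Mean relative gap between measured and ideal execution times,
    where the ideal time on n nodes is T(n0) * n0 / n (linear speedup)."""
    nodes = sorted(times)
    n0 = nodes[0]
    gaps = [(times[n] - times[n0] * n0 / n) / (times[n0] * n0 / n)
            for n in nodes[1:]]
    return sum(gaps) / len(gaps)

# Hypothetical execution times (seconds) on 2, 4, and 8 worker nodes.
times = {2: 100.0, 4: 60.0, 8: 40.0}
print(speedup_curve(times))         # 8 nodes give a 2.5x speedup (ideal: 4x)
print(deviation_from_ideal(times))  # ~0.4, i.e., about 40% away from ideal
```

Going from 2 to 8 nodes would ideally quarter the execution time (100 s to 25 s); the measured 40 s therefore sits 60% above the ideal time at 8 nodes, and the curve as a whole averages roughly 40% away from ideal — analogous in spirit to the 59.5% deviation reported in the conclusion, although the paper's exact metric may differ.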
F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294

Du et al. (2014) studied the scalability of a Storm-based BDCA system on a five-node cluster and observed that the system failed to achieve an ideal level of scalability due to extra task scheduling and communication overheads between the spout and bolt phases of the Storm execution environment. Aljarah et al. (Aljarah and Ludwig, 2013b) also studied the scalability of a Hadoop-based BDCA system, on an 18-node cluster, and found that the system scaled unevenly as the number of nodes in the cluster was increased. For example, an ideal speedup is observed from two to four nodes and from 14 to 16 nodes, while a non-ideal speedup is observed for the rest of the scalability curve. The non-ideal speedup is attributed to the start-up of MapReduce jobs and the storing of intermediate results in HDFS. Las-Casas et al. (2016) compared the scalability of two BDCA systems – one Hadoop-based and another Spark-based. They found that Spark scales better than Hadoop due to the efficient use of caching in Spark. Xiang et al. (2014) also explored the scalability of a BDCA system on a 30-node cluster and found that the execution time decreases, although not ideally, with an increase in the number of nodes up to 25 nodes. After 25 nodes, the execution time increases, which the authors attribute to the excessive communication among nodes and disk read/write operations during MapReduce tasks. Whilst the previous studies have investigated the scalability of a BDCA system, none of the studies have quantified the scalability; nor have they calculated the deviation from the ideal scalability. Furthermore, the previous studies have only investigated the scalability with default settings. Our study is the first study that has (i) quantified the scalability with respect to four datasets, (ii) assessed the deviation from the ideal scalability, and, most importantly, (iii) investigated the impact of Spark parameters on the scalability of a BDCA system.

6.3. Scalability improvement

Several studies (e.g., Kyong et al., 2017; Jamal et al., 2009; Chen et al., 2010; Wu et al., 2009; Senger, 2009; Canali and Lancellotti, 2014) have proposed methods for improving the scalability of software systems in different domains. Kyong et al. (2017) proposed a Docker container-based architecture for a Spark-based scale-up server, where the original scale-up server is partitioned into several smaller servers to reduce memory access overheads. Wu et al. (2009) propose a scalability improvement technique that learns from the interaction patterns among the services of a service-based application and accordingly adopts an optimized task assignment strategy to reduce the communication bandwidth and improve the scalability. Senger (2009) defined a scalability measure called input file affinity that quantifies the level of file sharing among the tasks belonging to a bag-of-tasks application (e.g., a data mining algorithm). In the same study (Senger, 2009), a scalability improvement method is proposed that leverages the input file affinity measure to increase the degree of file sharing among tasks. Chen et al. (2010) first studied the scalability of Java applications with default JVM settings and then proposed a tuning approach that alleviates JVM bottlenecks to improve the scalability of Java applications. Jamal et al. (2009) studied the scalability of virtual machine-based systems on a multicore processor setup. This study revealed that excessive communication among virtual machines impacts the scalability of multicore systems. Canali and Lancellotti (2014) present a scalability improvement approach for cloud-based systems, which leverages the resource usage patterns (e.g., CPU, storage, and network) among virtual machines and accordingly groups the virtual machines in a cloud-based infrastructure. It is important to note that some studies (Gounaris and Torres, 2018; Nguyen et al., 2018; Perez et al., 2018; Wang et al., 2016) proposed tuning techniques for Spark-based systems with the objective of reducing execution time. Such studies are largely irrelevant to ours as they are focussed on execution time (response time). Our study focusses on scalability; response time and scalability are two very different quality attributes of a software system and are treated differently in the state-of-the-art (Sun, 2002). Therefore, the approaches presented in (Gounaris and Torres, 2018; Nguyen et al., 2018; Perez et al., 2018; Wang et al., 2016) are not aimed at improving scalability. In general, the previous studies are largely orthogonal to our study. This is because (i) our study is the first of its kind that aims to improve the scalability of Spark-based BDCA systems and (ii) it employs a parameter-driven adaptation approach that, unlike the previous studies, automatically improves the scalability of a BDCA system at runtime.

6.4. Parameter-driven adaptation

Parameter-driven adaptation is one of the commonly used adaptation approaches. Several studies (e.g., Calinescu et al., 2010; Epifani et al., 2009; Tongchim and Chongstitvatana, 2002; Jiang et al., 2018) have attempted to modify the values of a system's parameters to achieve various objectives such as high accuracy and improved security. Calinescu et al. (2010) used the KAMI model, based on a Bayesian estimator, to modify model parameters at runtime for achieving reliability and quick response in a service-based medical assistance system. Another study (Epifani et al., 2009) argued that the parameters of software abstraction models such as Discrete-Time Markov Chains (DTMCs) should be constantly updated to achieve better accuracy. The authors of that study proposed an adaptation method that leverages the real-time operational data of a system to keep the parameters up to date. Similarly, parameter-based adaptation is quite common in the ML domain for adjusting a model's parameters. For example, Tongchim and Chongstitvatana (2002) proposed a parameter-driven adaptation approach for adjusting the control parameters of genetic algorithms to achieve optimal accuracy. Their approach divides the parameter space into sub-spaces, and each sub-space evolves on separate computing nodes in parallel. Jiang et al. (2018) proposed a parameter-driven adaptation approach that uses the temporal and spatial correlations among characteristics (such as the size and velocity of objects) to find the best set of configuration parameters for a convolutional neural network employed in a video analytics system. The adaptation approach proposed in (Jiang et al., 2018) aims to balance resource consumption and a system's accuracy. From the adaptation point of view, our study differs from the previous studies in two ways. First, our study is the first of its kind that applies a parameter-driven adaptation approach in the domain of Spark-based systems. Second, unlike the previous studies that aim to achieve accuracy or quick response time, our adaptation approach aims to achieve improved scalability.

7. Conclusion

Big Data Cyber Security Analytics (BDCA) systems use big data technologies (such as Apache Spark) to collect and analyse security event data (e.g., NetFlow) for detecting cyber-attacks such as SQL injection and brute force. The exponential growth in the volume and the unpredictable velocity of security event data require BDCA systems to be highly scalable. Therefore, in this paper, we have studied (i) how a Spark-based BDCA system scales with default Spark settings and (ii) how tuning the configuration parameters (e.g., execution memory) of Spark impacts the scalability of a BDCA system, and (iii) we have proposed SCALER, a parameter-driven adaptation approach to improve a BDCA system's scalability. For this study, we have developed an experimental infrastructure using a large-scale OpenStack cloud. We have implemented a Spark-based BDCA system and have used four security datasets to find out how a BDCA system scales, how Spark parameters impact scalability, and to evaluate our adaptation approach aimed at improving scalability. Based on our detailed experiments, we have found that:

• With default Spark settings, a BDCA system does not scale ideally. The deviation from ideal scalability is around 59.5%. The system scales better with large datasets (e.g., CICIDS2017) as compared to small datasets (e.g., KDD).
• 9 out of 11 studied Spark parameters impact the scalability of a BDCA system. The impact of configuration parameters on scalability varies from one security dataset to another.
• Our parameter-driven adaptation approach improves the mean scalability of a BDCA system by 20.8%.
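The parameter-driven adaptation summarised in the last bullet can be sketched as a simple search loop that keeps evaluating candidate Spark configurations until one scores above a scalability threshold. This is an illustrative stand-in rather than the actual SCALER algorithm: the three parameter names are real Spark settings, but the value grid, the threshold, and the toy scoring function are hypothetical placeholders for running the workload and measuring its scalability.

```python
import itertools

# Candidate values for a few (real) Spark parameters; the grid itself is
# hypothetical and far smaller than a realistic search space.
PARAM_GRID = {
    "spark.executor.memory": ["1g", "2g"],
    "spark.executor.cores": [1, 2],
    "spark.sql.shuffle.partitions": [100, 200],
}

def adapt(score_fn, threshold):
    """Return the first configuration whose scalability score meets the
    threshold (together with its score); otherwise return the best seen."""
    best = (None, float("-inf"))
    keys = list(PARAM_GRID)
    for values in itertools.product(*(PARAM_GRID[k] for k in keys)):
        config = dict(zip(keys, values))
        score = score_fn(config)  # in SCALER this would run the BDCA workload
        if score > best[1]:
            best = (config, score)
        if score >= threshold:
            return config, score
    return best

# Toy scoring function standing in for a real scalability measurement.
def toy_score(config):
    memory_gb = int(config["spark.executor.memory"].rstrip("g"))
    return (memory_gb + config["spark.executor.cores"]) / 4.0

config, score = adapt(toy_score, threshold=0.9)
print(config, score)  # picks 2g executors with 2 cores (score 1.0)
```

Because the evaluation of each candidate requires executing the workload, the cost of such a loop grows with the grid size — which is why the threats-to-validity discussion above suggests pruning parameters (e.g., via Lasso regression) or using smaller representative datasets to cut the roughly 170-minute adaptation time.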
From our findings, we conclude that practitioners should first tune the parameters of Spark before putting a Spark-based BDCA system into operation. Such parameter tuning can improve the scalability of the system. We also recommend that practitioners should not use someone else's pre-tuned parameter settings. The reason for this is that the best combination of Spark parameters varies from dataset to dataset. Our proposed adaptation approach is a first step towards enabling practitioners to automatically tune Spark parameters for achieving optimal scalability. More generally, we assert that the field of big data analytics should pay attention to the impact of the configuration parameters of big data frameworks on various system qualities such as reliability, response time, and scalability. Federated machine learning has recently gained tremendous attention in various domains due to its ability to perform on-device collaborative training in a privacy-preserving manner (Wahab et al., 2021). It would be worth exploring how our proposed approach performs with respect to federated machine learning.

Based on our study, we highlight the following areas for future research. Investigating the parameters' impact in other big data frameworks: Although Spark is currently the most popular big data framework, there exist several other big data frameworks (such as Hadoop (2009), Storm (2011), Samza (2014), and Flink (Carbone et al., 2015)) with different sets of configuration parameters. Therefore, future research should investigate how the configuration parameters of these frameworks impact the scalability of a BDCA system. Approximate analytics for tuning big data frameworks: Approximate analytics is an emerging concept that encourages computing over a representative sample instead of computing over the entire dataset (Quoc et al., 2017). The rationale behind approximate analytics is to make a trade-off between accuracy and computational time. In our study, we used the entire security datasets for system execution and subsequent tuning. Therefore, an interesting avenue for future research is to explore the applicability of approximate analytics for tuning big data frameworks. Investigating the parameters' impact on other system qualities: The focus of our study was only on scalability; however, there exist several other quality attributes (e.g., reliability, security, and interoperability) that are also important for a BDCA system. Hence, it is worth investigating how the configuration parameters of Spark impact other quality attributes of a BDCA system.

Author contribution

Faheem Ullah: Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. M. Ali Babar: Conceptualization, Writing – original draft, Writing – review and editing, Project administration, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to thank Anying Xiang for her help in conducting the experiments.

References

Aceto, G., Ciuonzo, D., Montieri, A., Persico, V., Pescapé, A., 2019. Know your big data trade-offs when classifying encrypted mobile traffic with deep learning. In: 2019 Network Traffic Measurement and Analysis Conference (TMA). IEEE, pp. 121–128.
Agarwal, S., Kandula, S., Bruno, N., Wu, M.-C., Stoica, I., Zhou, J., 2012. Reoptimizing data parallel computing. In: Presented as Part of the 9th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 12), pp. 281–294.
Alexander, C.A., Wang, L.J.J.N.C., 2017. Big data analytics in heart attack prediction, 6 (393), 2167-1168.
Alipourfard, O., Liu, H.H., Chen, J., Venkataraman, S., Yu, M., Zhang, M., 2017. Cherrypick: adaptively unearthing the best cloud configurations for big data analytics. In: 14th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 17), pp. 469–482.
Aljarah, I., Ludwig, S.A., 2013a. Towards a scalable intrusion detection system based on parallel PSO clustering using mapreduce. In: Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation. ACM, pp. 169–170.
Aljarah, I., Ludwig, S.A., 2013b. Mapreduce intrusion detection system based on a particle swarm optimization clustering algorithm. In: 2013 IEEE Congress on Evolutionary Computation. IEEE, pp. 955–962.
Allaince, C.S., 2013. Big Data Analytics for Security Intelligence. Big Data Working Group. Available at: https://bit.ly/211P7jj. (Accessed 11 February 2020).
Apache Flink, 2011. Apache Flink. Available at: https://bit.ly/2v7.
Baaziz, A., Quoniam, L., 2014. How to use big data technologies to optimize operations in upstream petroleum industry. Int. J. Innov. 1 (1).
Batista, G.E., Prati, R.C., Monard, M.C., 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6 (1), 20–29.
Belle, A., Thiagarajan, R., Soroushmehr, S., Navidi, F., Beard, D.A., Najarian, K., 2015. Big data analytics in healthcare. BioMed Res. Int. 2015.
Blackburn, S.M., et al., 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 169–190.
Böse, B., Avasarala, B., Tirthapura, S., Chung, Y.-Y., Steiner, D., 2017. Detecting insider threats using radish: a system for real-time anomaly detection in heterogeneous data streams. IEEE Syst. J. 11 (2), 471–482.
Calinescu, R., Grunske, L., Kwiatkowska, M., Mirandola, R., Tamburrelli, G., 2010. Dynamic QoS management and optimization in service-based systems. Trans. Softw. Eng.
Canali, C., Lancellotti, R., 2014. Improving scalability of cloud monitoring through PCA-based clustering of virtual machines. J. Comput. Sci. Technol. 29 (1), 38–52.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K., 2015. Apache flink: stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Commit. Data Eng. 36 (4).
Chen, K.-Y., Chang, J.M., Hou, T.-W., 2010. Multithreading in Java: performance and scalability on multicore systems. IEEE Trans. Comput. 60 (11), 1521–1534.
Cheng, L., Wang, Y., Ma, X., Wang, Y., 2016. GSLAC: a general scalable and low-overhead alert correlation method. In: Trustcom/BigDataSE/I SPA. IEEE.
KDD, 1999. KDDcup99 Knowledge Discovery in Databases. https://goo.gl/Jz2Un6. (Accessed 11 February 2020).
Davidson, A., Or, A., 2013. Optimizing Shuffle Performance in Spark. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Tech. Rep.
Du, Y., Liu, J., Liu, F., Chen, L., 2014. A real-time anomalies detection system based on streaming technology. In: Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2014 Sixth International Conference on, vol. 2. IEEE, pp. 275–279.
Economist, T., 2017. The World's Most Valuable Resource Is No Longer Oil, but Data. Available at: https://econ.st/2Gtfztg. (Accessed 11 February 2020).
Epifani, I., Ghezzi, C., Mirandola, R., Tamburrelli, G., 2009. Model evolution by run-time parameter adaptation. In: Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, pp. 111–121.
Ferguson, A.D., Bodik, P., Kandula, S., Boutin, E., Fonseca, R., 2012. Jockey: guaranteed job latency in data parallel clusters. In: Proceedings of the 7th ACM European Conference on Computer Systems, pp. 99–112.
Gontz, J., Riensel, D., 2012. The Digital Universe in 2020: Big Data, Bigger Digital Shadow, and Biggest Growth in the Far East. IDC Country Brief. Available at: https://bit.ly/2rqPWaw. (Accessed 11 February 2020).
Gounaris, A., Torres, J., 2018. A methodology for spark parameter tuning. Big Data Res. 11, 22–32.
Gounaris, A., Kougka, G., Tous, R., Montes, C.T., Torres, J., 2017. Dynamic configuration Senger, H., 2009. Improving scalability of Bag-of-Tasks applications running on
of partitioning in spark applications. IEEE Trans. Parallel Distr. Syst. 28 (7), master–slave platforms. Parallel Comput. 35 (2), 57–71.
1891–1904. Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A., 2018. Toward Generating a New Intrusion
Grama, A.Y., Gupta, A., Kumar, V., 1993. Isoefficiency: measuring the scalability of Detection Dataset and Intrusion Traffic Characterization. ICISSP. Available at: htt
parallel algorithms and architectures. IEEE Parallel Distr. Technol. Syst. Appl. 1 (3), ps://bit.ly/30qWkft. (Accessed 11 February 2020).
12–21. Shvachko, K., Kuang, H., Radia, S., Chansler, R., 2010. The hadoop distributed file
Greene, C.S., Tan, J., Ung, M., Moore, J.H., Cheng, C., 2014. Big data bioinformatics. system. MSST 10.
J. Cell. Physiol. 229 (12), 1896–1900. Spark, A., 2011. Spark Programming Guide. Available at: https://bit.ly/37DETeF.
Groves, P., Kayyali, B., Knott, D., Kuiken, S.V., 2016. The’big Data’revolution in Spark, 2014a. Apache Spark. Available at: https://spark.apache.org/. (Accessed 11
Healthcare: Accelerating Value and Innovation. February 2020).
Gupta, G.P., Kulariya, M., 2016. A framework for fast and efficient cyber security Spark, A., 2014b. Spark Configuration. Available at: https://bit.ly/2rXR4NK. (Accessed
network intrusion detection using Apache spark. Proc. Comput. Sci. 93, 824–831. 11 February 2020).
Hadoop, A., 2009. Apache Hadoop. https://goo.gl/GLWG9Q. (Accessed 11 February Spark, A., 2016. SparkHub: A Community Site for Apache Spark. Available at: https://bit
2020). .ly/2lS8Vs5. (Accessed 11 February 2020).
Herodotou, H., et al., 2011. Starfish: a self-tuning system for big data analytics. Cidr 11 Srivastava, U., Gopalkrishnan, S., 2015. Impact of big data analytics on banking sector:
(2011), 261–272. learning for Indian banks. Proc. Comput. Sci. 50, 643–652.
Holtz, M.D., David, B.M., de Sousa Júnior, R.T., 2011. Building scalable distributed Storm, A., 2011. Apache Storm. Available at: https://bit.ly/2tEvqox.
intrusion detection systems based on the mapreduce framework. Rev. Telecommun. Sun, X.-H., 2002. Scalability versus execution time in scalable systems. J. Parallel Distr.
13 (2), 22. Comput. 62 (2), 173–192.
Hong, K.-F., Chen, C.-C., Chiu, Y.-T., Chou, K.-S., 2015. Ctracer: uncover C&C in Sun, X.-H., Rover, D.T., 1994. Scalability of parallel algorithm-machine combinations.
advanced persistent threats based on scalable framework for enterprise log data. In: IEEE Trans. Parallel Distr. Syst. 5 (6), 599–613.
2015 IEEE International Congress on Big Data. IEEE, pp. 551–558. Sun, X.-H., Chen, Y., Wu, M., 2005. Scalability of heterogeneous computing. In: 2005
Jamal, M.H., Qadeer, A., Mahmood, W., Waheed, A., Ding, J.J., 2009. Virtual machine International Conference on Parallel Processing (ICPP’05). IEEE, pp. 557–564.
scalability on multi-core processors based servers for cloud computing workloads. In: Sun, N., Morris, J.G., Xu, J., Zhu, X., Xie, M., 2014. iCARE: a framework for big data-
2009 IEEE International Conference on Networking, Architecture, and Storage. IEEE, based banking customer analytics. IBM J. Res. Dev. 58 (5/6), 4: 1-4: 9.
pp. 90–97. MIT, 1998. DARPA Intrusion Detection Evaluation Data Set. Available at: https://goo.
Jiang, J., Ananthanarayanan, G., Bodik, P., Sen, S., Stoica, I., 2018. Chameleon: scalable gl/jYBYNe. (Accessed 11 February 2020).
adaptation of video analytics. In: Proceedings of the 2018 Conference of the ACM Tongchim, S., Chongstitvatana, P., 2002. Parallel genetic algorithm with parameter
Special Interest Group on Data Communication. ACM, pp. 253–266. adaptation. Inf. Process. Lett. 82 (1), 47–54.
Jogalekar, P., Woodside, M., 2000. Evaluating the scalability of distributed systems. IEEE Ullah, F., Babar, M.A., 2019a. Architectural tactics for big data cybersecurity analytics
Trans. Parallel Distr. Syst. 11 (6), 589–603. systems: a review. J. Syst. Software.
Johnsirani Venkatesan, N., Nam, C., Shin, R., 2019. Deep learning frameworks on Ullah, F., Babar, M.A., 2019b. An architecture-driven adaptation approach for big data
Apache spark: a review, 36 (2), 164–177. cyber security analytics. In: International Conference on Software Architecture,
Kumar, M., Hanumanthappa, M., 2013. Scalable intrusion detection systems log analysis pp. 41–50.
using cloud computing infrastructure. In: 2013 IEEE International Conference on Ullah, F., Babar, M., 2019c. QuickAdapt: scalable adaptation for big data cyber security
Computational Intelligence and Computing Research. IEEE, pp. 1–4. analytics. In: International Conference on Engineering of Complex Computer
Kumari, R., Singh, M., Jha, R., Singh, N., 2016. Anomaly Detection in Network Traffic Systems.
Using K-Mean Clustering. Recent Advances in Information Technology (RAIT). Van Aken, D., Pavlo, A., Gordon, G.J., Zhang, B., 2017. Automatic database management
Kyong, J., Jeon, J., Lim, S.-S., 2017. Improving scalability of Apache spark-based scale- system tuning through large-scale machine learning. In: Proceedings of the 2017
up server through docker container-based partitioning. In: Proceedings of the 6th ACM International Conference on Management of Data, pp. 1009–1024.
International Conference on Software and Computer Applications. ACM, Venkataraman, S., Yang, Z., Franklin, M., Recht, B., Stoica, I., 2016. Ernest: efficient
pp. 176–180. performance prediction for large-scale advanced analytics. In: 13th {USENIX}
Las-Casas, P.H., Dias, V.S., Meira, W., Guedes, D., 2016. A Big Data architecture for Symposium on Networked Systems Design and Implementation ({NSDI} 16),
security data and its application to phishing characterization. In: Big Data Security pp. 363–378.
on Cloud (BigDataSecurity). IEEE, pp. 36–41. Villegas, N.M., Müller, H.A., Tamura, G., Duchien, L., Casallas, R., 2011. A framework for
Lee, Y., Lee, Y., 2013. Toward scalable internet traffic measurement and analysis with evaluating quality-driven self-adaptive software systems. In: Symposium on Software
hadoop. Comput. Commun. Rev. 43 (1), 5–13. Engineering for Adaptive and Self-Managing Systems.
Liao, G., Datta, K., Willke, T.L., 2013. Gunther: search-based auto-tuning of mapreduce. Wahab, O.A., Mourad, A., Otrok, H., Taleb, T.J.I.C.S., Tutorials, 2021. Federated
In: European Conference on Parallel Processing. Springer, pp. 406–419. machine learning: survey, multi-level classification, desirable criteria and future
Marchal, S., Jiang, X., Engel, T., 2014. A Big Data Architecture for Large Scale Security directions in communication and networking systems, 23 (2), 1342–1397.
Monitoring. Congress on Big Data. Wang, L., Jones, S., 2021. Big data analytics in cyber security: network traffic and
Mehta, N., Pandit, A., 2018. Concurrence of big data analytics and healthcare: a attacks, 61 (5), 410–417.
systematic review, 114, 57–65. Wang, G., Xu, J., He, B., 2016. A novel method for tuning configuration parameters of
Nambiar, R., Bhardwaj, R., Sethi, A., Vargheese, R., 2013. A look at challenges and spark based on machine learning. In: 2016 IEEE 18th International Conference on
opportunities of big data analytics in healthcare. In: 2013 IEEE International High Performance Computing and Communications; IEEE 14th International
Conference on Big Data. IEEE, pp. 17–22. Conference on Smart City; IEEE 2nd International Conference on Data Science and
Nguyen, N., Khan, M.M.H., Wang, K., 2018. Towards automatic tuning of Apache spark Systems (HPCC/SmartCity/DSS). IEEE, pp. 586–593.

F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294

Faheem Ullah is a postdoctoral researcher with the School of Computer Science, The University of Adelaide, Australia. He completed his PhD, focused on the intersection of big data and cyber security, at the University of Adelaide, Australia. He is a member of CREST - Centre for Research on Engineering Software Technologies, an interdisciplinary research centre at the University of Adelaide. He has been actively involved in teaching undergraduate and master's courses in computer science and software engineering, and has supervised or co-supervised more than 20 undergraduate/master's projects. His current research primarily focuses on cyber security, big data analytics, and cloud computing.

M. Ali Babar is a Professor in the School of Computer Science, University of Adelaide. He is an honorary visiting professor at the Software Institute, Nanjing University, China. Prof Babar has established an interdisciplinary research centre, CREST - Centre for Research on Engineering Software Technologies, where he leads the research and research training of more than 30 members (15 PhD students). He also leads a theme, Platform and Architecture for Cyber Security as a Service, of the Cyber Security Cooperative Research Centre (CSCRC), one of the largest Australian initiatives for building sovereign cyber security capability through world-class applied R&D with industrial impact. Prof Babar has authored or co-authored more than 250 peer-reviewed publications in premier software technology journals and conferences. Apart from his work having industrial relevance, as evidenced by several R&D projects and a number of collaborations with industry and government agencies in Australia and Europe, his publications have been highly cited within the discipline of Software Engineering, as evidenced by his H-index of 46 with 11,028 citations per Google Scholar on December 9, 2021. Prior to joining the University of Adelaide in November 2013, he spent almost 7 years in Europe (Ireland, Denmark, and the UK) working as a senior researcher and an academic.
