A R T I C L E  I N F O

Keywords: Big data; Cyber security; Adaptation; Scalability; Configuration parameter; Spark

A B S T R A C T

Big Data Cyber Security Analytics (BDCA) systems use big data technologies (e.g., Apache Spark) to collect, store, and analyse a large volume of security event data for detecting cyber-attacks. The volume of digital data in general, and security event data in particular, is increasing exponentially. The velocity with which security event data is generated and fed into a BDCA system is unpredictable. Therefore, a BDCA system should be highly scalable to deal with the unpredictable increase/decrease in the velocity of security event data. However, there has been little effort to investigate the scalability of BDCA systems to identify and exploit the sources of scalability improvement. In this paper, we first investigate the scalability of a Spark-based BDCA system with default Spark settings. We then identify Spark configuration parameters (e.g., execution memory) that can significantly impact the scalability of a BDCA system. Based on the identified parameters, we finally propose a parameter-driven adaptation approach, SCALER, for optimizing a system's scalability. We have conducted a set of experiments by implementing a Spark-based BDCA system on a large-scale OpenStack cluster. We ran our experiments with four security datasets. We have found that (i) a BDCA system with default settings of Spark configuration parameters deviates from ideal scalability by 59.5%, (ii) 9 out of 11 studied Spark configuration parameters significantly impact scalability, and (iii) SCALER improves the BDCA system's scalability by 20.8% compared to the scalability with default Spark parameter settings. The findings of our study highlight the importance of exploring the parameter space of the underlying big data framework (e.g., Apache Spark) for scalable cyber security analytics.
* Corresponding author.
E-mail addresses: [email protected] (F. Ullah), [email protected] (M.A. Babar).
https://doi.org/10.1016/j.jnca.2021.103294
Received 12 January 2021; Received in revised form 18 November 2021; Accepted 23 November 2021
Available online 3 December 2021
1084-8045/© 2021 Elsevier Ltd. All rights reserved.
F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294
(Allaince, 2013). The merger of cyber security systems and big data technologies has given birth to a new breed of software system called the Big Data Cyber Security Analytics (BDCA) system, which is defined as "A system that leverages big data technologies for collecting, storing, and analyzing a large volume of security event data to protect organizational networks, computers, and data from unauthorized access, damage, or attack" (Ullah and Babar, 2019a). A recent study of BDCA systems indicates that 72% of organizations that employed big data technologies in their cyber security landscape reported significant improvement in their cyber agility (Obitade, 2019).

BDCA systems are primarily classified into two categories based on their attack detection capability – Generic BDCA systems and Specific BDCA systems (Ullah and Babar, 2019a). Generic BDCA systems (e.g., an intrusion detection system supported with big data technologies) aim to detect a variety of attacks such as SQL injection, cross-site scripting, and brute force. Specific BDCA systems (e.g., a phishing detection system built using big data technologies) are focused on detecting a specific attack type such as phishing. The main characteristics of BDCA systems that distinguish them from traditional cyber security systems include (i) monitoring diverse assets of an enterprise such as data storage systems, computing machines, and end-user applications, (ii) integrating security data from multiple sources such as IDS, firewall, and anti-virus, (iii) analysing a large volume of security event data in near real-time, (iv) enabling deep and holistic security analytics for unfolding complex attacks such as Advanced Persistent Threats (APT), and (v) analysing heterogeneous streams of security event data (Ullah and Babar, 2019a).

Like any software system, certain quality attributes (e.g., interoperability and reliability) are expected in a BDCA system. Ullah and Ali Babar (Ullah and Babar, 2019a) reported the 12 most important quality attributes of a BDCA system, where scalability is ranked as the third most important. Scalability is defined as "the system's ability to increase speed-up as the number of processors increase" (Sun and Rover, 1994). The rationale behind the need for a BDCA system being highly scalable is twofold – (a) the volume of security event data is rapidly increasing, which requires a BDCA system to scale up (by adding more computational power) to process data without impacting the response time of the system (Lee and Lee, 2013; Cheng et al., 2016), and (b) the velocity of security event data generation fluctuates (Hong et al., 2015; Kumar and Hanumanthappa, 2013). For example, a BDCA system analysing the network traffic of a bank experiences a higher workload during working hours as compared to non-working hours. Therefore, a BDCA system should efficiently use commodity or third-party resources to scale up during working hours and scale down during non-working hours. In other words, a BDCA system should take maximum benefit from the additional resources.

Among the 74 studies on BDCA reviewed in (Ullah and Babar, 2019a), 40 studies highlight the importance of scalability for a BDCA system. However, none of the studies has either investigated the factors that impact the scalability of a BDCA system or proposed any solutions for improving scalability. Several BDCA studies (e.g., (Aljarah and Ludwig, 2013a; Zhao et al., 2015; Holtz et al., 2011)) hint at factors such as the machine learning algorithm employed in a system, the quality of security event data, and the big data processing framework, which can potentially impact scalability. Among these factors, the most prominent is the underlying big data processing framework, such as Spark or Hadoop, which is an integral part of any BDCA system. One of the core features of any big data processing framework is its configuration parameters (e.g., executor memory) (Zaharia et al., 2016), which guide how a framework should process data. For example, executor memory specifies how much memory should be allocated to an executor process. The importance of parameter configuration for big data processing frameworks has been highlighted by several studies (e.g., (Lee and Lee, 2013; Davidson and Or, 2013; Gounaris et al., 2017)). However, none of the previous studies has investigated their impact on the scalability of a big data system. Therefore, this paper aims to "investigate the impact of Spark configuration parameters on the scalability of a BDCA system and devise an approach for improving the scalability". Given that there exist several big data processing frameworks (e.g., Spark (Zaharia et al., 2016), Hadoop (2009), Storm (2011), Samza (2014), and Flink (ApacheFlink, 2011)), we investigate Spark as it is currently the most widely used framework in the domain of BDCA. We have observed that 14 BDCA studies published in 2014 used Hadoop and only four used Spark, which changed to four studies using Hadoop and five studies using Spark in 2017 (Ullah and Babar, 2019a). A similar dominance of Spark over Hadoop is observed in industry (Spark, 2016). To achieve the aforementioned aim, this paper contributes to the state-of-the-art by answering the following Research Questions (RQ).

RQ1: How does a BDCA system scale with default Spark configuration settings?
RQ2: What is the impact of tuning Spark configuration parameters on the scalability of a BDCA system?
RQ3: How to improve the scalability of a BDCA system?

To answer the three research questions, we developed an experimental infrastructure on a large-scale OpenStack cloud. We implemented a Spark-based BDCA system that ran on an OpenStack cloud in a fully distributed fashion. We used two evaluation metrics – the accuracy and scalability of a BDCA system. For measuring accuracy, we leveraged commonly used measures such as F1 score, precision, and recall. For measuring scalability, we used the scalability scoring measure reported in Section 2.4.2. We used four security datasets (i.e., KDD (KDD, 1999), DARPA (MIT, 1998), CIDDS (Ring et al., 2017), and CICIDS2017 (Sharafaldin et al., 2018)) in our experimentation and evaluated the BDCA system with four learning algorithms (i.e., Naïve Bayes, Random Forest, Support Vector Machine, and Multilayer Perceptron) that are employed in the system for classifying security data into benign and malicious categories. Based on our comprehensive experimentation, we have found that:

(i) A BDCA system with default Spark configuration parameters does not scale ideally. The deviation from ideal scalability is around 59.5%. This means a system takes only 40.5% of the benefit available from the additional resources.
(ii) Among the 11 investigated Spark parameters, changing the value of nine parameters significantly impacts a BDCA system's scalability. The optimal value of a parameter (with respect to scalability) varies from dataset to dataset.
(iii) We proposed and evaluated a parameter-driven adaptation approach, SCALER, that automatically selects the most optimal value for each parameter at runtime. The evaluation results show that, on average, SCALER improves a BDCA system's scalability by 20.8%.

The rest of this paper is structured as follows. Section 2 reports the security datasets, our BDCA system, the instrumentation setup, and evaluation metrics. Our adaptation approach is presented in Section 3. Section 4 presents the detailed findings of our study with respect to the three research questions. Section 5 presents our reflections on the findings. Section 6 positions the novelty of our work with respect to the related work. Finally, Section 7 concludes the paper by highlighting the implications of our study for practitioners and researchers.

2. Research methodology

This section describes the datasets, our BDCA system, the instrumentation setup, and evaluation metrics.

2.1. Security datasets

In order to answer the three research questions (Section 1), we used four security datasets: KDD (KDD, 1999), DARPA (MIT, 1998), CIDDS (Ring et al., 2017), and CICIDS2017 (Sharafaldin et al., 2018). These datasets are briefly described in the following, with their details presented in Table 1. We selected these four datasets as they vary from each other in terms of attack types, the number of training and testing instances, dataset size, publication dates, redundancy, and the number of features (e.g., source IP, source port, and payload). These characteristics of the selected datasets are expected to provide rigour and
generalization to our findings. It is important to note that we used the whole of these datasets, instead of a small sample of each dataset, in our experiments.

Table 1
Number of training and testing instances in each dataset.

Dataset       No. of Features   Instances in Training Dataset   Instances in Testing Dataset
KDD           41                494,022                         292,300
DARPA         6                 2,723,496                       1,522,310
CIDDS         9                 5,634,347                       2,788,463
CICIDS2017    77                1,311,822                       445,061

KDD: The KDD dataset contains 494,022 records as training data and 292,300 records as testing data. Each record represents a network connection, consisting of 41 features. Each record is labelled as belonging to either the normal class or one of the four attack classes, i.e., Denial of Service, Probing, Remote to Local, and User to Root. The testing data includes attack types that are not present in the training data, which makes the evaluation more realistic. More details on the dataset are available in (KDD, 1999).

DARPA: Similar to KDD, the records in this dataset are divided into training and testing subsets. The training data consists of 2,723,496 records, while the testing data consists of 1,522,310 records. Each record represents a network connection, consisting of six features. Each record is labelled as 0 or 1, where 0 specifies a normal connection and 1 specifies an attack. The attack types present in DARPA are the same as in KDD. More details on the DARPA dataset are available in (MIT, 1998).

CIDDS: This dataset has been developed recently, as KDD and DARPA are relatively old datasets. The CIDDS dataset consists of four weeks of NetFlow data directed towards two servers, i.e., an OpenStack server and an external server. The training dataset contains 5,634,347 records and the testing dataset contains 2,788,463 records. Each record represents a network connection, consisting of nine features. The dataset contains four types of attacks: pingScan, portScan, bruteForce, and DoS. More details on the dataset are available in (Ring et al., 2017).

CICIDS2017: This is also a recently developed dataset, which contains a variety of state-of-the-art attacks. The dataset consists of five days of network traffic directed towards a network consisting of three servers, a firewall, a switch, and 10 PCs. The training dataset consists of 1,311,822 records and the testing dataset consists of 445,061 records. Each record consists of 77 features. This dataset contains seven types of attacks: bruteForce, heartBleed, botNet, DoS, Distributed DoS, webAttack, and infiltration attack (Sharafaldin et al., 2018).

2.2. Our BDCA system

The overview of our BDCA system is depicted in Fig. 1. The system consists of three layers – the Security Analytics Layer, the Big Data Support Layer, and the Adaptation Layer. In the following, we describe the Security Analytics Layer and the Big Data Support Layer, while the details of the Adaptation Layer are presented in Section 3.

2.2.1. Security Analytics Layer
This layer processes the security event data for detecting cyber-attacks. The layer consists of three phases (i.e., data engineering, feature engineering, and data processing), which are described below.

Data Engineering: This phase pre-processes the data to handle
missing values and remove incorrect values and outliers (Ullah and Babar, 2019a). A negative value indicates that the number of features for an instance is incomplete; hence, the instance is removed. Incorrect values (e.g., standard deviation = −1) specify data points in the dataset that are unacceptable for the Machine Learning (ML) model employed in a system for the classification of security event data into normal and attack classes. Therefore, we use the filter method of DataFrame, available in the Spark package org.apache.spark.sql (Spark, 2014a), to remove the incorrect values. The existence of outliers in the training dataset affects the accuracy of the machine learning model (Batista et al., 2004). We, therefore, removed the values that were larger than Double.MaxValue. CICIDS2017 has missing values; therefore, we removed the instances with missing values by simply investigating whether the value of the last feature is negative.

Feature Engineering: This phase generates new features and/or transforms the values of features into a new range (Ullah and Babar, 2019a). For all four datasets, we assembled the features to transform multiple columns of features into one column of feature vectors for fitting the ML model. We used the VectorAssembler method in org.apache.spark.ml.feature for the implementation of assembling the features. Since some algorithms (e.g., Naïve Bayes) in the SparkML library cannot handle non-numeric features, we used StringIndexer (from org.apache.spark.ml.feature) to transform the label features (i.e., normal and attack) in the KDD dataset from strings to indices. Given the relatively smaller number of features in the DARPA dataset, we expanded the features to a polynomial space. We used the PolynomialExpansion method in org.apache.spark.ml.feature for feature expansion.

Data Processing: This phase leverages an ML/DL algorithm to classify the instances in security data as either normal or attack. In our system, we separately used four ML/DL algorithms – Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Multilayer Perceptron (MLP) – for classifying the instances. These four algorithms have been selected based on (i) their widespread use in the domain of BDCA (Ullah and Babar, 2019a), (ii) their popularity in Kaggle competitions, and (iii) their availability in the Spark ML library and DeepLearning4j (Johnsirani Venkatesan et al., 2019). We used the Spark package org.apache.spark.ml.classification for implementing the ML algorithms. For cross-validation of ML models, we used the CrossValidator method available in org.apache.spark.ml.tuning.

2.2.2. Big Data Support Layer
This layer manages the distributed storage and processing of data on multiple computing nodes. The layer consists of a big data processing framework (i.e., Spark) and big data storage (i.e., HDFS). Apache Spark is an open-source big data processing framework that uses in-memory primitives to process a large amount of data. Spark is quite suitable for ML tasks, which require iterative processing that best suits the Spark architecture (Spark, 2014a). Moreover, Spark is not only much faster than Hadoop, but is also compatible with multiple file systems such as HDFS, MongoDB, and Cassandra. Hadoop Distributed File System (HDFS) is a data storage system that enables the distributed storage of a massive amount of data (Shvachko et al., 2010). By default, HDFS replicates each block of data on three nodes, which makes it quite fault-tolerant.

2.3. Instrumentation setup

We configured Spark and Hadoop (for HDFS) on an OpenStack cluster consisting of 10 computing nodes. Each node is installed with the Ubuntu 16.04 Xenial Xerus operating system. Each node runs Spark 2.4.0, Hadoop 2.9.2, and JDK 1.8. The 10 computing nodes are divided into master and slave nodes. There is one master node with the m1.large flavour (8 GB RAM, 80 GB hard disk, and 8 virtual CPUs) and nine worker nodes with the m1.small flavour (2 GB RAM, 10 GB hard disk, and one virtual CPU). Each node in the cluster has a floating IP for communicating with the external world and an internal IP for communicating with other nodes in the cluster. To associate floating IPs with internal IPs, a router is created as the bridge between the external network (floating IPs) and the subnets (internal IPs). We used the Scala programming language for various implementations on Spark.

2.4. Evaluation metrics

In this study, we assess two qualities of our BDCA system – accuracy and scalability. Accuracy measures how accurately a BDCA system classifies the instances in the datasets into normal and attack categories. Scalability measures to what extent our system takes advantage of the additional hardware resources added to a system in the form of computing nodes.

2.4.1. Measuring accuracy
For assessing accuracy, we used five evaluation metrics that are commonly used in the BDCA domain (Ullah and Babar, 2019a). These metrics include False Positive Rate, F-Score, Recall, Accuracy, and Precision. Table 2 provides the definition and a brief description of each of the metrics.

Table 2
Evaluation metrics for assessing accuracy and their descriptions. TP – True Positive, FP – False Positive, TN – True Negative, and FN – False Negative.

Metric                Definition                            Description
Precision             P = TP / (TP + FP)                    Proportion of instances correctly classified as attack instances
Recall                R = TP / (TP + FN)                    Proportion of attack instances that are correctly classified
F-score               F = 2 × (P × R) / (P + R)             Harmonic mean of precision and recall
Accuracy              A = (TP + TN) / (TP + TN + FP + FN)   Proportion of correctly classified instances
False Positive Rate   FPR = FP / (FP + TN)                  Proportion of normal instances classified as attack instances

2.4.2. Measuring scalability
Several studies (e.g., (Grama et al., 1993; Sun et al., 2005; Jogalekar and Woodside, 2000)) have proposed metrics for measuring the scalability of a system. However, the previous metrics are not suitable for the scalability analysis in our study for two reasons: 1) these metrics do not quantify scalability with respect to ideal scalability, which is required for evaluating the effectiveness of our adaptation approach presented in Section 3; 2) these metrics are primarily suitable for measuring scalability in cases where a system is partly executed in parallel mode and partly in sequential mode, whereas our system (implemented using Apache Spark (Zaharia et al., 2016; Spark, 2011)) is executed fully in parallel mode. For this study, we used Eq. (1) to measure the scalability of a BDCA system. In Eq. (1), S(c) denotes the scalability score for curve 'c'. Gap denotes the quantified gap value between the achieved and ideal response time (i.e., training time or testing time), which is calculated using Eq. (2). In Eq. (2), ω_n represents the user-defined weight that specifies the importance of the gap between achieved and ideal response time at 'n' worker nodes. For example, ω_2 is the weight for specifying the importance of the gap at two worker nodes and ω_4 is the weight for specifying the importance of the gap at four worker nodes. In Eq. (1), ω_{n+1} denotes the weight for specifying the importance of the gap at one size larger than the existing cluster size, e.g., to specify the importance of the gap beyond eight nodes if n (cluster size) equals eight. The sum of all weights is equal to 1 (as presented in Eq. (5)). In Eq. (2), G_n defines the ratio of the unaccomplished response time improvement to the response time improvement in the ideal case with 'n' worker nodes. G_n is calculated using Eq. (3), where AT_n denotes the achieved response time with 'n' worker nodes and IT_n denotes the ideal response time with 'n' worker nodes. In Eq. (1), Trend, which is calculated using Eq. (4), denotes how the response time decreases between the last two cluster setups, such as from six to eight worker nodes in a cluster of size 8; the higher
of which indicates the probability that the response time tends to decrease with more than eight nodes.

S(c) = 1 − Gap − ω_{n+1} × (1 − Trend)    (1)

Gap = Σ_{i=1}^{n} ω_{2i} G_{2i}    (2)

G_n = (AT_n − IT_n) / (IT_1 − IT_n)    (3)

Trend = (AT_{n−1} − AT_n) / (IT_{n−1} − IT_n)    (4)

Σ_{i=1}^{n} ω_{2i} = 1    (5)

Example Scenario: We illustrate the use of the scalability metric with an example, which includes eight hypothetical scalability scenarios for a software system. Table 3 presents the hypothetical response times for the eight scenarios with respect to five different cluster configurations, i.e., 1 worker, 2 workers, 4 workers, 6 workers, and 8 workers. Fig. 2 shows the eight scalability curves drawn using the response times reported in Table 3. The eight scenarios (i.e., Ideal Scenario and Scenario-1 – Scenario-7) presented in Table 3 and Fig. 2 differ from each other with respect to two parameters – the number of worker nodes and the response time. The number of worker nodes is the independent parameter that we change to observe the impact on the dependent parameter, i.e., response time. As shown in Fig. 2, the impact of a change in the number of worker nodes is not consistent across scenarios. This could be due to multiple reasons in a real-world setting. For example, in the Ideal Scenario, the system utilizes the underlying resources such as CPU and RAM more efficiently as compared to Scenario-1. Therefore, the response time in the Ideal Scenario reduces more significantly with the increase in the number of worker nodes as compared to the reduction in response time in Scenario-1.

The Ideal Scenario underlines the case where, each time the number of nodes is doubled, the response time is reduced to half. For calculating the scalability score, we use a value of 0.2 for all weights (i.e., ω_2, ω_4, ω_6, ω_8, and ω_10). For calculating Trend in this scenario, AT_6 = 1.33, AT_8 = 1, IT_6 = 1.33, and IT_8 = 1, as shown in Table 3; hence, Trend = 1 using Eq. (4). Since there is no gap between the achieved and ideal response times, the value of all gaps is equal to zero (i.e., G_2 = 0, G_4 = 0, G_6 = 0, and G_8 = 0). Thus, the overall gap calculated using Eq. (2) is zero (i.e., Gap = 0). Feeding these values into Eq. (1) gives us S(ideal) = 1.00. For Scenario-1, AT_6 = 4, AT_8 = 3, IT_6 = 1.66, and IT_8 = 1.25; hence, Trend = 2.44, which is high – indicating a positive trend of scalability after eight worker nodes. The values of G_n calculated using Eq. (3) are G_2 = 1.2, G_4 = 0.46, G_6 = 0.28, and G_8 = 0.2, which gives Gap = 0.42. Hence, the scalability score for Scenario-1 is 0.85, which indicates poor scalability as compared to ideal scalability. This is also observable from the comparison of the two curves, i.e., Ideal Scenario and Scenario-1, as depicted in Fig. 2. As compared to Scenario-1, there is a higher reduction trend in response time with the increase in the number of nodes in the ideal case. Thus, the scalability score of Scenario-1 is smaller as compared to the ideal scenario.

The scalability score for Scenario-2 is 0.83, which is slightly lower than the scalability score for Scenario-1. The slight difference is mainly due to the difference in Trend for the two scenarios, i.e., a reduction from 4 to 3 in Scenario-1 and a reduction from 3.75 to 3.2 in Scenario-2. The scalability score for Scenario-4 is 0.31, which is quite low as compared to Scenario-1 and Scenario-2. If we observe the scalability curve for Scenario-4 in Fig. 2, the response time reduces quite significantly as we increase the number of worker nodes from 1 to 4. However, there is almost no reduction as the number of worker nodes is increased from 4 to 8, which is why the scalability score is much lower as compared to smoother curves such as those for Scenario-1 and Scenario-2. In Scenario-5, the sudden upward jump in the curve from 2 nodes to 4 nodes impacts the scalability score of the whole curve. Therefore, the scalability is quite low, i.e., 0.16. In Scenario-6, the response time increases (contrary to expectation) at two transitions, i.e., from 2 nodes to 4 nodes and from 6 nodes to 8 nodes. The spike in response time from 6 to 8 nodes is quite high. Therefore, the negative impact on the response time at the two transitions significantly affects the scalability score, and Eq. (1) generates a much lower scalability score (i.e., −0.56) for Scenario-6. The response time in Scenario-7 does not change with the change in the number of nodes; therefore, the scalability score for Scenario-7 is 0.00.

Table 3
Response time (in seconds) and scalability score for the eight hypothetical scalability scenarios. S-1, S-2, and so on denote Scenario-1, Scenario-2, and so on.

Number of Worker Nodes   Ideal   S-1     S-2    S-3    S-4    S-5    S-6     S-7
1                        8.00    10.00   10.00  8.00   8.00   9.50   8.00    8.00
2                        4.00    11.00   6.87   5.50   7.00   5.00   7.00    8.00
4                        2.00    6.00    5.00   4.00   6.00   8.00   7.20    8.00
6                        1.33    4.00    3.75   3.00   5.80   5.50   6.00    8.00
8                        1.00    3.00    3.20   2.90   5.70   6.00   7.20    8.00
Scalability Score        1.00    0.85    0.83   0.61   0.31   0.16   −0.56   0.00

3. Our adaptation approach

To optimize the scalability of a BDCA system, we present SCALER – an adaptation approach that automatically triggers the tuning process and tunes Spark configuration parameters. By tuning, we mean selecting a combination of parameter values that generates a scalability score above the predefined threshold (Section 3.3).

Spark parameters control most of the application settings and directly impact the way an application runs (Spark, 2014b). All Spark parameters have a default configuration; however, the default configuration is not suitable for every application (Spark, 2014b). Therefore, the parameters need to be configured separately for each application. The Spark parameters investigated in this study for their impact on the scalability of a system are presented in Table 4. We selected 11 parameters based on the following criteria – (i) the parameters have a proven impact on different aspects of Spark such as scheduling, compression, and serialization, (ii) the parameters contribute to Spark running time as highlighted in (Gounaris and Torres, 2018) and (Nguyen et al., 2018), and (iii) the parameters impact multiple levels (e.g., machine level and cluster level) of a BDCA system as reported through industry practices (Spark, 2014b, 2016). Although SCALER considers 11 Spark parameters, it is worth noting that SCALER can be easily extended to incorporate more parameters if needed. In the following, we describe our adaptation approach that automatically tunes Spark configuration parameters for improving scalability. We present our adaptation approach as per the guidelines for adaptation approaches presented by Villegas et al. (2011).
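To make the notion of "tuning" concrete, a selected combination of parameter values ultimately has to be handed to Spark, for example as spark-submit --conf flags. The following is a minimal sketch, not the authors' tooling: only spark.executor.memory with its 1024m/1250m values appears in the paper, while the compression-codec entry and its values are illustrative placeholders.

```python
# Sketch: render one Combination of Parameter Values (CPV) as
# spark-submit --conf flags. 'A' selects the default value and
# 'B' the modified value, mirroring the Table 5/6 notation.
DEFAULTS = {"spark.executor.memory": "1024m",
            "spark.io.compression.codec": "lz4"}       # placeholder entry
MODIFIED = {"spark.executor.memory": "1250m",
            "spark.io.compression.codec": "snappy"}    # placeholder entry


def cpv_to_flags(cpv):
    """cpv maps a parameter name to 'A' (default) or 'B' (modified)."""
    flags = []
    for param, choice in sorted(cpv.items()):
        value = DEFAULTS[param] if choice == "A" else MODIFIED[param]
        flags.append(f"--conf {param}={value}")
    return " ".join(flags)


flags = cpv_to_flags({"spark.executor.memory": "B",
                      "spark.io.compression.codec": "A"})
```

The resulting string can be appended to a spark-submit invocation; equivalently, the same key/value pairs could be set programmatically on the Spark configuration before the job starts.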
5
F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294
Fig. 2. Hypothetical scalability scenarios (drawn based on Table 3) to illustrate the use of the scalability metric.
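The scoring scheme of Eqs. (1)-(5) can be made concrete with a short sketch (plain Python, not the authors' implementation), assuming equal weights of 0.2 and ideal times IT_n = AT_1/n as in the example scenario:

```python
def scalability_score(times, weight=0.2):
    """Score a scalability curve per Eqs. (1)-(5).

    `times` maps worker-node counts (1, 2, 4, 6, 8) to measured
    response times AT_n. Ideal times assume perfect halving when the
    node count doubles: IT_n = AT_1 / n. Every weight is 0.2, as in
    the example scenario.
    """
    nodes = sorted(times)                      # [1, 2, 4, 6, 8]
    at1 = times[nodes[0]]
    ideal = {n: at1 / n for n in nodes}        # IT_n = AT_1 / n

    # Eq. (3): G_n = (AT_n - IT_n) / (IT_1 - IT_n), for n > 1
    g = {n: (times[n] - ideal[n]) / (ideal[1] - ideal[n])
         for n in nodes[1:]}

    # Eq. (2): Gap is the weighted sum of the per-cluster-size gaps
    gap = sum(weight * v for v in g.values())

    # Eq. (4): Trend over the last transition (6 -> 8 nodes)
    prev, last = nodes[-2], nodes[-1]
    trend = (times[prev] - times[last]) / (ideal[prev] - ideal[last])

    # Eq. (1): S(c) = 1 - Gap - w_{n+1} * (1 - Trend)
    return 1.0 - gap - weight * (1.0 - trend)


# Response times from Table 3
ideal_curve = {1: 8.0, 2: 4.0, 4: 2.0, 6: 8.0 / 6, 8: 1.0}
s1 = {1: 10.0, 2: 11.0, 4: 6.0, 6: 4.0, 8: 3.0}
s6 = {1: 8.0, 2: 7.0, 4: 7.2, 6: 6.0, 8: 7.2}
s7 = {1: 8.0, 2: 8.0, 4: 8.0, 6: 8.0, 8: 8.0}
```

Fed with the Table 3 response times, this sketch reproduces the reported scores: 1.00 for the Ideal Scenario, roughly 0.85 for Scenario-1, −0.56 for Scenario-6, and 0.00 for Scenario-7.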
6
F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294
Algorithm 1. Algorithm for adapting the configuration setting of a Spark-based BDCA system.
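Since the body of Algorithm 1 is not reproduced here, the following is a plausible sketch assembled from the behaviour described in the surrounding text: adaptation triggers when the score is below the threshold, parameters are flipped one at a time, CPVs with a negative 6-to-8-node Trend are eliminated without a full scoring run, and value changes with a positive impact are kept. The helpers score_fn and trend_fn are hypothetical stand-ins for the five-run scalability measurement.

```python
def scaler_adapt(score_fn, trend_fn, params, threshold):
    """Hedged sketch of the adaptation loop described for Algorithm 1.

    params:    parameter names, each taking 'A' (default) or 'B' (modified).
    score_fn:  scalability score of a CPV (full 1/2/4/6/8-node runs).
    trend_fn:  Trend over the 6 -> 8 node transition (Eq. (4)).
    """
    cpv = {p: "A" for p in params}          # start from the default CPV
    best = score_fn(cpv)
    for p in params:
        if best >= threshold:               # good enough: stop adapting
            break
        candidate = dict(cpv, **{p: "B"})   # flip one parameter
        if trend_fn(candidate) < 0:         # eliminate CPVs with negative Trend
            continue
        score = score_fn(candidate)
        if score > best:                    # keep a change with positive impact
            cpv, best = candidate, score
    return cpv, best


# Toy illustration: P2's modified value degrades the 6->8 trend,
# while P1 and P3 each improve the score by 0.2.
def toy_score(cpv):
    return 0.5 + 0.2 * (cpv["P1"] == "B") + 0.2 * (cpv["P3"] == "B")

def toy_trend(cpv):
    return -1.0 if cpv["P2"] == "B" else 1.0

best_cpv, best_score = scaler_adapt(toy_score, toy_trend,
                                    ["P1", "P2", "P3"], threshold=0.8)
```

In the toy run, the P2 flip is discarded by the trend check while the P1 and P3 flips are kept, mirroring the greedy "keep what helped" strategy described for CPV 2 and CPV 3.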
Table 6
Combination of Parameter Values (CPV) executed at runtime for identifying CPV with scalability score above the threshold. ‘A’ and ‘B’ specify the default and modified
value respectively.
CPV ID   Combination of Parameter Values (CPV)   P1 P2 P3 P4 P5 P6 P7 P8 P9
1        {A, A, A, A, A, A, A, A, A}             A  A  A  A  A  A  A  A  A
2        {B, A, A, A, A, A, A, A, A}             B  A  A  A  A  A  A  A  A
3        {A, B, A, A, A, A, A, A, A}             A  B  A  A  A  A  A  A  A
…        …                                       …
512      {B, B, B, B, B, B, B, B, B}             B  B  B  B  B  B  B  B  B
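For illustration, the full CPV space of Table 6 (nine parameters, each taking its default 'A' or modified 'B' value) can be enumerated in a few lines. The ordering of the intermediate combinations below is arbitrary; only the 2^9 = 512 count, the all-default first entry, and the all-modified last entry mirror Table 6.

```python
from itertools import product

def all_cpvs(n_params=9):
    """Enumerate every Combination of Parameter Values (CPV):
    each of the n parameters is either 'A' (default) or 'B' (modified)."""
    names = [f"P{i}" for i in range(1, n_params + 1)]
    return [dict(zip(names, combo)) for combo in product("AB", repeat=n_params)]

cpvs = all_cpvs()
```

The size of this list is what makes exhaustive search impractical here: each CPV would require at least five full system executions (1, 2, 4, 6, and 8 worker nodes) before its scalability score is known.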
TRUE and FALSE). The value options for numerical parameters presented in Table 5 are chosen based on academic and industrial recommendations (Spark, 2014b, 2016). For example, the execution memory can be set to 1024 m (default value) or 1250 m (modified value). In Table 5, 'A' represents the default value and 'B' represents the modified value for a parameter. For instance, P1-A and P1-B are the default and modified values for parameter P1 (spark.executor.memory). Some sample Combinations of Parameter Values (CPV) are shown in Table 6.

Since our approach considers a total of nine parameters, each with two possible values, there are a total of 512 (2^9) CPVs. Executing this many CPVs to find a CPV with a scalability score above the threshold is a computationally expensive task. The time required to execute and search through the large search space of 512 potential CPVs would outweigh the gain expected from adaptation. It is worth noting that state-of-the-art tuning approaches (e.g., (Zhu et al., 2017; Herodotou et al., 2011)) do follow the strategy of searching through the entire search space. However, this is computationally feasible for those approaches because they aim to optimize response time: calculating response time requires a system to be executed only once, whereas calculating the scalability score, as required for our approach, needs a system to be executed at least five times with different numbers of computing nodes. We, therefore, employed the following optimization techniques to reduce the computational time with minimal impact on the accuracy of our approach.

Eliminating CPVs with negative Trend: Before calculating the scalability score of a CPV for an entire curve obtained by executing the system with 1, 2, 4, 6, and 8 worker nodes, we calculate the scalability trend (Eq. (4)) from six to eight nodes. If the trend is negative, the response time increases as the cluster size changes from six to eight nodes. This implies that the CPV is not a candidate for the potential CPV […] 4 to 6 nodes and from 6 to 8 nodes in Eq. (4). We then apply the two sample cases to the hypothetical scenarios presented in Fig. 2. On average, the time required to calculate Trend in case (a) and case (b) is 9.24 seconds and 14.9 seconds, respectively. Hence, the time required to calculate the trend in case (a) is 38.12% less than in case (b). This difference increases as more transitions (e.g., from 1 to 2 and 2 to 4 nodes) are included in Eq. (4). Furthermore, the results presented in Section 4.2 show that the region from 6 to 8 nodes is the most accurate region for determining whether the scalability curve shows any unexpected variation (see Section 4.2 for details). Consequently, other transitions (e.g., from 4 to 6 nodes) in Eq. (4) would have minimal impact on accuracy but a far more significant impact on the time required to calculate Trend. Hence, Eq. (4) only considers the transition from 6 to 8 nodes for calculating Trend.

Keeping Change of Parameter Value with a Positive Impact: If changing the value of a parameter improves the scalability score, the changed value is kept for the next CPV. For example, CPV 2 achieves a better scalability score than CPV 1 by changing the value of spark.executor.memory from 1024 MB to 1250 MB. However, since the scalability score of CPV 2 is not above the threshold value, our algorithm will not select CPV 2; rather, it will execute CPV 3, but with the spark.executor.memory value of 1250 MB, as this value has already shown a better scalability score.

Our adaptation algorithm is presented as Algorithm 1. If the scal
with a scalability score above the threshold. The reason we include parameter value is changed to its default value (lines 17–20). On the
merely the transition from 6 to 8 nodes in Eq. (4) is the time overhead. other hand, if the Trend is positive (response time decreases as the
To illustrate the impact of including more transitions in Eq. (4) on the number of nodes increases from six to eight), a BDCA system is executed
time to calculate Trend, we take two sample cases - (a) using the only with two and four worker nodes to get the entire scalability curve (lines
transition from 6 to 8 nodes and (b) including two transitions, i.e., from 21–24). After getting the scalability curve, the scalability score is
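The pruning and early-stopping logic described above can be sketched as follows. This is an illustrative reconstruction, not the authors' Algorithm 1: the `run` and `score` callables and the exact form of the Trend test are assumptions; only the repetition count (Rt = 3), the node sizes, the 6-to-8-node pruning rule, and the threshold-based early stop come from the text.

```python
from statistics import mean

RT = 3                     # repetitions per execution, as in Algorithm 1 (Rt = 3)
NODES = [1, 2, 4, 6, 8]    # worker-node counts used for the full scalability curve

def avg_response_time(run, cpv, nodes, rt=RT):
    """Run the system `rt` times on `nodes` workers and average the response time."""
    return mean(run(cpv, nodes) for _ in range(rt))

def trend(run, cpv):
    """Assumed analogue of Eq. (4): positive when response time drops
    as the cluster grows from six to eight nodes."""
    return avg_response_time(run, cpv, 6) - avg_response_time(run, cpv, 8)

def adapt(run, score, cpvs, threshold=0.58):
    """Scan CPVs; prune those whose 6-to-8-node trend is not positive,
    track the best-scoring CPV seen, and stop as soon as a CPV clears
    the threshold (good enough, not necessarily optimal)."""
    best_cpv, best_score = None, float("-inf")
    for cpv in cpvs:
        if trend(run, cpv) <= 0:   # response time grew from 6 to 8 nodes: prune
            continue
        s = score(run, cpv)        # full curve over 1, 2, 4, 6, 8 nodes
        if s > best_score:
            best_cpv, best_score = cpv, s
        if s >= threshold:         # early stop
            break
    return best_cpv, best_score
```

Pruning on the averaged six-to-eight-node transition means the full five-point curve is only ever computed for promising candidates, which is the source of the time savings discussed above.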
Table 7
Mean accuracy achieved by our BDCA system for the four datasets and four ML/DL algorithms.
ML Algorithm | Dataset | Precision | Recall | F-Measure | False Positive Rate | Accuracy
F. Ullah and M.A. Babar Journal of Network and Computer Applications 198 (2022) 103294
After getting the scalability curve, the scalability score is calculated for the CPV. If the scalability score is higher than the previous best scalability score, the optimal CPV is updated. Finally, the scalability score of the CPV is compared with the threshold scalability score (line 29). If the scalability score is above the threshold, the CPV is selected for the future operations of the system. The variable Rt in Algorithm 1 specifies the number of times each execution is repeated; such repetition is required to remove any experimental fluctuations. We set Rt equal to three and take the mean of the response times determined in the three executions for the subsequent calculation of the scalability score. It is important to note that Algorithm 1 only restricts the adaptation trigger based on the predefined threshold, i.e., adaptation is triggered only if the scalability score is less than the predefined threshold. Algorithm 1 ensures that it will return a CPV with a scalability score equal to or better than that of the previously running CPV; it does not guarantee that it will always return a CPV with a scalability score above the predefined threshold. However, we did not observe any such case in the results presented in Section 4.3.

4. Results

In this section, we present the results from our study aimed at answering the three research questions.

4.1. RQ1: How does a BDCA system scale with default Spark configuration settings?

This research question investigates the very premise of the work reported in this paper, i.e., to confirm whether or not a BDCA system scales ideally with the default configuration. Ideal scalability implies that a BDCA system makes full use of the additional resources provided by scaling (Williams and Smith, 2004). For instance, when the number of worker nodes is doubled, the response time of a system should reduce to half. If a BDCA system scaled ideally, there would be no value added by our work.

Classification accuracy: Before presenting scalability findings, we first present the accuracy of our BDCA system in Table 7 for the four datasets and four ML algorithms. This is because accuracy is one of the main quality measures for a BDCA system and needs to be considered before scalability (Ullah and Babar, 2019a). According to the accuracy presented in Table 7, our system achieves a mean accuracy of 92.7% for KDD, 73.6% for DARPA, 98.2% for CIDDS, and 92.3% for CICIDS2017. With respect to the algorithms, our system achieves a mean accuracy of 84.75% for Naïve Bayes, 95.74% for Random Forest, 83.71% for Support Vector Machine, and 92.8% for Multilayer Perceptron (MLP). The mean accuracy of our system across the four datasets and four algorithms is 89.2%, which is a decent level of accuracy compared to that of the state-of-the-art BDCA systems (Gupta and Kulariya, 2016; Kumari et al., 2016; Marchal et al., 2014; Las-Casas et al., 2016; Zhang et al., 2016; Böse et al., 2017). Table 8 shows how the accuracy of the ML/DL models varies with respect to the number of nodes in the cluster. Whilst there is no significant change in the accuracy for most of the cases, the general trend shows that the accuracy slightly decreases as the number of nodes in the cluster increases. This could be attributed to the way data is distributed among the nodes during the training and testing process: a comparatively larger number of nodes in the cluster requires the generation of larger data blocks and vice versa. Such a data partitioning and distribution strategy slightly impacts the accuracy, as presented in Table 8.

Table 8
Accuracy achieved by our BDCA system for the four datasets and four ML/DL algorithms in 2, 4, 6, and 8 node clusters (columns: number of worker nodes in the cluster).
ML Algorithm | Dataset | 2 | 4 | 6 | 8
Naïve Bayes | KDD | 89.6% | 81.7% | 82.5% | 80.4%
Naïve Bayes | DARPA | 78.9% | 76.6% | 75.4% | 73.5%
Naïve Bayes | CIDDS | 94.8% | 98.4% | 90.7% | 89.7%
Naïve Bayes | CICIDS2017 | 88.4% | 84.7% | 83.5% | 76.8%
Random Forest | KDD | 96.8% | 96.4% | 85.9% | 84.7%
Random Forest | DARPA | 88.6% | 89.4% | 88.4% | 81.7%
Random Forest | CIDDS | 99.9% | 98.8% | 98.9% | 99.1%
Random Forest | CICIDS2017 | 99.9% | 99.7% | 99.4% | 99.2%
Support Vector Machine | KDD | 88.7% | 87.6% | 89.4% | 87.4%
Support Vector Machine | DARPA | 77.6% | 68.4% | 49.7% | 48.8%
Support Vector Machine | CIDDS | 98.1% | 97.5% | 96.2% | 94.5%
Support Vector Machine | CICIDS2017 | 91.4% | 90.8% | 90.4% | 85.4%
Multilayer Perceptron | KDD | 97.4% | 96.5% | 97.4% | 94.3%
Multilayer Perceptron | DARPA | 86.4% | 77.8% | 79.8% | 77.9%
Multilayer Perceptron | CIDDS | 99.9% | 99.8% | 99.9% | 96.8%
Multilayer Perceptron | CICIDS2017 | 98.6% | 97.8% | 98.7% | 93.7%

Following the approach reported in (Qiu et al., 2016), we trained and evaluated the ML/DL algorithms in a distributed manner. In other words, the cluster consists of a total of 10 nodes in our case; one acts as a master and nine act as workers. The master node distributes the process of training and testing among the nine workers, which perform the training and testing in a distributed and parallel manner. On the contrary, the same job can be performed in a centralized manner, termed centralized learning, in which the training and testing of the algorithms is performed centrally on a single node instead of a cluster of nodes. In addition to distributed and centralized learning of ML algorithms, deep learning approaches have gained tremendous attention in recent times (Pouyanfar et al., 2018). Therefore, we also incorporate a Deep Learning (DL) algorithm, Multilayer Perceptron (MLP), to assess how it performs as compared to the traditional ML algorithms. We selected MLP based on its widespread usage in the cyber security domain. The accuracy and training time of the ML and DL algorithms trained and tested in centralized and distributed manners are presented in Table 9. The mean accuracy for centralized learning, distributed learning, and deep learning differs noticeably (Table 9).
Table 9
Accuracy and training time achieved by our BDCA system with centralized learning, distributed learning, and deep learning. SVM denotes Support Vector Machine and MLP denotes MultiLayer Perceptron. Each dataset column gives Accuracy (%) / Training Time (sec).
Learning Type | Algorithm | KDD | DARPA | CIDDS | CICIDS2017
Centralized Learning | Naïve Bayes | 90.6 / 395 | 76.4 / 2855 | 95.7 / 2141 | 88.4 / 3377
Centralized Learning | Random Forest | 90.7 / 355 | 88.1 / 2048 | 99.6 / 2122 | 99.9 / 3741
Centralized Learning | SVM | 88.1 / 230 | 78.9 / 104 | 99.2 / 306 | 99.0 / 314
Distributed Learning | Naïve Bayes | 80.4 / 331 | 73.5 / 1853 | 89.7 / 968 | 76.8 / 1521
Distributed Learning | Random Forest | 84.7 / 245 | 81.7 / 228 | 99.1 / 265 | 99.2 / 1243
Distributed Learning | SVM | 87.4 / 184 | 48.8 / 58 | 94.5 / 120 | 85.4 / 44
Deep Learning | MLP | 96.3 / 312 | 78.1 / 2978 | 99.6 / 2749 | 97.3 / 3104
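The mean training times quoted in the text (1499 seconds for centralized, 588 seconds for distributed, and 2285 seconds for deep learning) follow directly from the training-time columns of Table 9:

```python
# Training times (seconds) from Table 9, ordered KDD, DARPA, CIDDS, CICIDS2017.
training_time = {
    "centralized": [395, 2855, 2141, 3377,   # Naive Bayes
                    355, 2048, 2122, 3741,   # Random Forest
                    230, 104, 306, 314],     # SVM
    "distributed": [331, 1853, 968, 1521,
                    245, 228, 265, 1243,
                    184, 58, 120, 44],
    "deep":        [312, 2978, 2749, 3104],  # MLP
}

means = {k: sum(v) // len(v) for k, v in training_time.items()}
print(means)  # {'centralized': 1499, 'distributed': 588, 'deep': 2285}
```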
Fig. 3. Ideal and achieved scalability with default Spark settings for the four datasets – KDD, DARPA, CIDDS, CICIDS2017 with (A) Naïve Bayes – training phase (B)
Naïve Bayes – testing phase (C) Random Forest – training phase (D) Random Forest – testing phase (E) Support Vector Machine – training phase and (F) Support
Vector Machine – testing phase (G) MultiLayer Perceptron - training phase (H) MultiLayer Perceptron - testing phase. The number in the legend specifies the
scalability score.
The summary answer to RQ1: A BDCA system with default Spark configuration settings does not scale ideally. The deviation from ideal
scalability is around 59.5%.
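Eq. (1) is not reproduced in this excerpt, so the scoring function below is only an illustrative stand-in (assuming ideal response time falls as 1/n); it is not necessarily the authors' exact metric. The deviation arithmetic, however, reproduces the reported numbers: per-algorithm scores and deviations are linked by deviation = (1 − score) × 100, and the mean of the four per-algorithm deviations gives the 59.5% quoted above. The Naïve Bayes score (0.24) is inferred from its reported 76% deviation.

```python
def scalability_score(response_times, nodes=(1, 2, 4, 6, 8)):
    """Illustrative stand-in for Eq. (1): returns 1.0 for ideal scaling
    (response time falls as 1/n); the score drops with the mean relative
    deviation from the ideal curve."""
    ideal = [response_times[0] / n for n in nodes]
    deviation = sum(abs(t, ) if False else abs(t - i) / i
                    for t, i in zip(response_times, ideal)) / len(nodes)
    return 1.0 - deviation

assert scalability_score([100, 50, 25, 100 / 6, 12.5]) == 1.0  # ideal curve

# Reported mean scalability scores per algorithm with default Spark settings;
# the Naive Bayes value (0.24) is inferred from its reported 76% deviation.
scores = {"Naive Bayes": 0.24, "Random Forest": 0.70, "SVM": 0.36, "MLP": 0.32}
deviations = {k: round((1 - s) * 100) for k, s in scores.items()}  # % from ideal
print(sum(deviations.values()) / len(deviations))  # 59.5, as in the abstract
```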
The difference in accuracy is due to the way the algorithms are trained based on the partitioning of the data during the training process. On the other hand, the mean training time for centralized, distributed, and deep learning is 1499 seconds, 588 seconds, and 2285 seconds respectively. The difference in training time is due to resource allocation; for example, centralized learning takes more time than distributed learning.

Scalability with default settings: Fig. 3 shows the ideal and achieved scalability with the default Spark settings for the training time and testing time of the four datasets and four algorithms. The dotted line in Fig. 3 denotes ideal scalability and the solid line denotes achieved scalability. In the training phase with Naïve Bayes, Random Forest, and MLP, the system scales almost ideally as the number of worker nodes increases from one to two. The impact of adding further nodes until six nodes is negligible. After six nodes, the addition of nodes has a negative impact on scalability, i.e., the training time slightly increases. With Support Vector Machine, the trend is a bit different: unexpected spikes can be observed at three nodes for KDD and five nodes for CICIDS2017. A potential reason for such spikes with Support Vector Machine is its short training time as compared to the training time of the other three algorithms. We used Eq. (1) to quantify the ideal and achieved scalability; the resulting scalability scores, calculated as described in Section 2.4.2, are shown in the legend for training in Fig. 3.

The mean scalability score with the default Spark settings largely follows the size of the datasets. For instance, CIDDS, having the largest number of instances, achieves the best scalability, and KDD, with the smallest number of instances, achieves the lowest scalability. The mean scalability score for each algorithm is found to be: Naïve Bayes – 0.24, Random Forest – 0.70, Support Vector Machine – 0.36, and MLP – 0.32. The deviation from ideal scalability for each dataset is found to be: KDD – 69%, DARPA – 61%, CIDDS – 47%, and CICIDS2017 – 48%. The deviation from ideal scalability for each algorithm is found to be: Naïve Bayes – 76%, Random Forest – 30%, Support Vector Machine – 64%, and MLP – 68%. The change in scalability from training time to testing time is abrupt; this is because of the very quick response in the testing phase as compared to the training time.

4.2. RQ2: Which Spark configuration parameters impact the scalability of a BDCA system?

Table 10 presents the studied parameters (described in Section 3) and the scalability scores achieved with the default and modified settings. Figs. 4–7 show the scalability graphs with the default and modified values for each of the 11 parameters for the 16 use cases, i.e., 4 datasets × 4 algorithms. For example, modifying the value of a parameter changes the scalability score of the Naïve Bayes based BDCA system from −0.14 to 0.92 for KDD, 0.53 to 0.38 for DARPA, 0.68 to 0.69 for CIDDS, and 0.61 to 0.64 for CICIDS2017. The same trend continues for the first nine parameters shown in Table 10, where modifying the value of the parameters leads to a significant change in the scalability score. The last two parameters (i.e., P10 - spark.driver.memory and P11 - spark.shuffle.memoryFraction) bring only an insignificant change in the scalability score.

Table 10
Scalability scores of the BDCA system with the default and the modified setting for each of the parameters P1–P11 (including spark.executor.memory, spark.shuffle.file.buffer, spark.memory.fraction, spark.driver.memory, spark.rdd.compress, spark.shuffle.memoryFraction, spark.reducer.maxSizeInFlight, spark.memory.storageFraction, and spark.serializer.objectStreamReset) across the four datasets and four algorithms.
Fig. 4. Impact of modifying the value of parameters on the scalability score of the Naïve Bayes based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies the scalability score.
Fig. 5. Impact of modifying the value of parameters on the scalability of the Random Forest based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies the scalability score.
Fig. 6. Impact of modifying the value of parameters on the scalability of the Support Vector Machine based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies the scalability score.
Fig. 7. Impact of modifying the value of parameters on the scalability of the Multilayer Perceptron based BDCA system for the four datasets - (A) KDD, (B) DARPA, (C) CIDDS, and (D) CICIDS2017. The number in the legend specifies the scalability score.
For example, as presented in Table 10, changing the value of spark.shuffle.memoryFraction from '/' (default) to 0.4 for the Naïve Bayes based BDCA system brings an insignificant change in scalability score for KDD (−0.14 to −0.13), DARPA (0.53 to 0.54), CIDDS (0.68 to 0.54), and CICIDS2017 (0.61 to 0.6).

Unexpected variations: We also assess the regions for each of the 16 use cases (4 datasets × 4 algorithms) where unexpected variations happen. By unexpected variation, we mean a variation (e.g., from 2 to 4 nodes) where the training time increases instead of decreasing. The number of unexpected variations in each of the four regions for the 16 use cases is presented in Table 11. Among the 192 scalability curves (12 parameter settings × 4 datasets × 4 algorithms), 8/192 show unexpected variation in the region from 1 to 2 nodes, 23/192 in the region from 2 to 4 nodes, 37/192 in the region from 4 to 6 nodes, and 41/192 in the region from 6 to 8 nodes. This trend shows that the rate of unexpected variation increases with the number of nodes.

Positive/negative impact on scalability: We assess whether modifying a parameter value has a positive or negative impact on scalability. Table 10 shows that the positive or negative impact of modifying a parameter value varies from one dataset to another as well as from one algorithm to another. The bold values in Table 10 indicate a negative impact of changing the default value of a parameter, i.e., the scalability score decreases in comparison to the scalability score with the default settings. For example, changing the value of P3 - spark.shuffle.compress for the Naïve Bayes based BDCA system from TRUE to FALSE has a positive impact on scalability with KDD, DARPA, and CIDDS but a negative impact on scalability with CICIDS2017. Similarly, with the Random Forest based BDCA system, the default value (100) of P8 - spark.serializer.objectStreamReset achieves better scalability for CIDDS and CICIDS2017, while the modified value (−1) achieves better scalability for KDD and DARPA. This finding underlines a correlation between the dataset and the Spark configuration parameters. We observe a similar trend for the algorithms, where the optimal values of the parameters do not necessarily remain the same for different algorithms. For example, the default value of P3 - spark.shuffle.compress obtains better scalability with Random Forest, but the modified value achieves better scalability with Naïve Bayes and Support Vector Machine for the CIDDS dataset. Hence, it can be asserted that the Spark configuration parameters need to be configured as per the type of the dataset and algorithm. In other words, this finding invalidates the reuse of a single Spark configuration setting across multiple datasets and algorithms.

Spark parameter ranking: Table 12 presents the ranking of the studied Spark parameters, based on their impact on scalability, with respect to the four datasets and four algorithms. The impact is calculated as the difference between the scalability score with the default Spark parameter setting and the modified one. Such a ranking is useful in prioritizing the tuning of particular parameters, i.e., parameters with a significant impact. Overall, our findings show that, with respect to scalability, spark.reducer.maxSizeInFlight is the most impactful and spark.shuffle.memoryFraction is the least impactful Spark parameter. It is worth noting that spark.shuffle.compress significantly impacts the training time, as can be observed from Figs. 4–7. However, a significant impact on training time does not necessarily mean a significant impact on scalability (Section 4.1); that is why it is not ranked as the most impactful with respect to scalability. Table 12 also depicts that the ranking of parameters varies with respect to datasets and algorithms. For example, spark.shuffle.sort.bypassMergeThreshold is ranked 1st and 3rd for the DARPA and CICIDS2017 datasets respectively; however, this parameter is respectively ranked 9th and 10th for the KDD and CIDDS datasets. As stated earlier and illustrated by the ranking, the two parameters spark.driver.memory and spark.shuffle.memoryFraction are ranked at the bottom due to their insignificant or minor impact on the scalability score.

Table 11
Number of unexpected variations (i.e., where training time increases unlike the expected decrease) in each of the four transitions – 1 to 2 nodes, 2 to 4 nodes, 4 to 6 nodes, and 6 to 8 nodes. The value in brackets specifies the percentage of unexpected variations, calculated as the number of unexpected variations divided by the number of total variations. SVM and MLP stand for Support Vector Machine and MultiLayer Perceptron, respectively.
ML Algorithm | Dataset | 1 to 2 nodes | 2 to 4 nodes | 4 to 6 nodes | 6 to 8 nodes
Naïve Bayes | KDD | 0 (0%) | 9 (75.0%) | 3 (25.0%) | 5 (41.6%)
Naïve Bayes | DARPA | 0 (0%) | 5 (41.6%) | 6 (50.0%) | 4 (33.3%)
Naïve Bayes | CIDDS | 0 (0%) | 0 (0%) | 7 (58.3%) | 7 (58.3%)
Naïve Bayes | CICIDS2017 | 0 (0%) | 0 (0%) | 8 (66.6%) | 3 (25.0%)
Random Forest | KDD | 1 (8.3%) | 2 (16.6%) | 1 (8.3%) | 2 (16.6%)
Random Forest | DARPA | 0 (0%) | 0 (0%) | 2 (16.6%) | 3 (25.0%)
Random Forest | CIDDS | 0 (0%) | 2 (16.6%) | 1 (8.3%) | 2 (16.6%)
Random Forest | CICIDS2017 | 0 (0%) | 1 (8.3%) | 4 (33.3%) | 3 (25.0%)
SVM | KDD | 1 (8.3%) | 2 (16.6%) | 2 (16.6%) | 3 (25.0%)
SVM | DARPA | 0 (0%) | 1 (8.3%) | 2 (16.6%) | 2 (16.6%)
SVM | CIDDS | 0 (0%) | 1 (8.3%) | 1 (8.3%) | 3 (25.0%)
SVM | CICIDS2017 | 0 (0%) | 0 (0%) | 0 (0%) | 4 (33.3%)
MLP | KDD | 2 (16.6%) | 1 (8.3%) | 2 (16.6%) | 0 (0%)
MLP | DARPA | 2 (16.6%) | 0 (0%) | 2 (16.6%) | 2 (16.6%)
MLP | CIDDS | 0 (0%) | 1 (8.3%) | 2 (16.6%) | 2 (16.6%)
MLP | CICIDS2017 | 2 (16.6%) | 2 (16.6%) | 1 (41.6%) | 1 (8.3%)
Total Number of Unexpected Variations | | 8 (5.5%) | 23 (18.7%) | 37 (30.5%) | 41 (32.0%)

4.3. RQ3: How to improve the scalability of a BDCA system?

We have proposed a parameter-driven adaptation approach, SCALER, for improving the scalability of a BDCA system. The adaptation approach has already been described in Section 3. Here, we evaluate the effectiveness of our approach with respect to the following research questions.

4.3.1. RQ3.1: How much scalability of a BDCA system is improved using SCALER (scalability improvement)?
Adaptation scenarios: We assess the scalability improvement by comparing the scalability scores achieved by our system immediately before and after adaptation. In order to realize adaptation, we experimented with two scenarios, i.e., baseline and change in input data. In the baseline scenario, a BDCA system is processing a particular dataset such as KDD with the optimal CPV determined for KDD based on Algorithm 1. In the change in input data scenario, the input to the system is changed from one dataset to another, e.g., from KDD to CIDDS. Upon the change in the dataset, SCALER calculates the scalability score for the new dataset (i.e., CIDDS), which is presented as the scalability score before adaptation in Table 13. If the scalability score is lower than the predefined threshold (0.58), the adaptation process is triggered. Given that we have four security datasets, a total of 12 (change in input data) use cases are possible, as shown in Table 13.

Scalability improvement: Table 13 shows the scalability scores before and after adaptation for each of the 12 possible use cases and the mean scalability improvement for each of the four datasets. On average, SCALER improves scalability by 20.8%. With respect to datasets, the highest improvement is 27.83% for CIDDS, followed by 25.83% for CICIDS2017, 22.71% for KDD, and 7.86% for DARPA. Since the scalability score of DARPA with Naïve Bayes is higher than the threshold score of 0.58 (Section 3.3), adaptation is not triggered for the associated three use cases. It is important to note that the scalability score after adaptation is the same for all three cases associated with each dataset. This is because SCALER selects a CPV for a dataset irrespective of the dataset previously being processed by the system. For example, in use cases 1 and 2, SCALER aims to select an optimal CPV for KDD and does not pay any attention to the previously processed datasets (i.e., DARPA and CIDDS).
Table 12
Ranking of the studied parameters based on their impact on scalability (the number in brackets specifies the difference between the scalability score with the default settings and the modified settings). SVM and MLP stand for Support Vector Machine and MultiLayer Perceptron, respectively.
ID | Spark Parameter | Overall Ranking | Ranking with Respect to Datasets | Ranking with Respect to ML Algorithms
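The impact measure behind Table 12 is simply the difference between the default-setting and modified-setting scalability scores. In the sketch below the numeric scores are hypothetical placeholders; only the facts that spark.reducer.maxSizeInFlight ranks first and spark.shuffle.memoryFraction ranks last come from the text.

```python
# (default_score, modified_score) per parameter; the values are hypothetical,
# but the resulting first/last ranks match Table 12 as reported in the text.
scores = {
    "spark.reducer.maxSizeInFlight": (0.30, 0.75),
    "spark.executor.memory":         (0.30, 0.55),
    "spark.shuffle.memoryFraction":  (0.30, 0.31),
}

# Impact = absolute change in scalability score when the default is modified.
impact = {p: abs(m - d) for p, (d, m) in scores.items()}
ranking = sorted(impact, key=impact.get, reverse=True)
print(ranking[0])   # spark.reducer.maxSizeInFlight (most impactful)
print(ranking[-1])  # spark.shuffle.memoryFraction (least impactful)
```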
The summary answer to RQ2: Modifying the default value of 9 out of 11 studied Spark parameters impacts the scalability of a BDCA system.
Each security dataset and algorithm requires a separate configuration of Spark parameters for achieving optimal scalability. With respect to
scalability, Spark.reducer.maxSizeInFlight is the most impactful and Spark.shuffle.memoryFraction is the least impactful Spark parameter.
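Since no single configuration transfers across datasets and algorithms, a selected CPV has to be applied per workload. A minimal sketch of rendering a CPV as `spark-submit --conf` flags follows; the 1250m value is the modified P1 option quoted in the text, while the spark.rdd.compress entry and its value are illustrative.

```python
def cpv_to_args(cpv):
    """Render a Combination of Parameter Values as spark-submit --conf flags."""
    args = []
    for key, value in sorted(cpv.items()):
        args += ["--conf", f"{key}={value}"]
    return args

# Modified value of P1 (spark.executor.memory); second entry is illustrative.
cpv = {"spark.executor.memory": "1250m", "spark.rdd.compress": "false"}
print(" ".join(cpv_to_args(cpv)))
# --conf spark.executor.memory=1250m --conf spark.rdd.compress=false
```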
Table 13
Scalability score before and after adaptation.
Naïve Bayes Random Forest Support Vector Machine Multilayer Perceptron
Use Case ID Use Case Before After Before After Before After Before After Mean Improvement for Dataset (%)
1 DARPA → KDD 0.47 0.63 0.38 0.61 0.41 0.59 0.24 0.61 26.61%
2 CIDDS → KDD 0.52 0.63 0.49 0.61 0.47 0.59 0.38 0.61
3 CICIDS2017 → KDD 0.33 0.63 0.54 0.61 0.52 0.59 0.51 0.61
4 KDD → DARPA 0.70 0.70 0.50 0.59 0.60 0.60 0.59 0.59 7.61%
5 CIDDS → DARPA 0.70 0.70 0.59 0.59 0.72 0.72 0.61 0.61
6 CICIDS2017 → DARPA 0.70 0.70 0.41 0.59 0.54 0.72 0.54 0.68
7 KDD → CIDDS 0.51 0.72 0.45 0.63 0.41 0.60 0.47 0.65 26.01%
8 DARPA → CIDDS 0.54 0.72 0.63 0.63 0.53 0.60 0.55 0.65
9 CICIDS2017 → CIDDS 0.47 0.72 0.39 0.63 0.29 0.60 0.53 0.65
10 KDD → CICIDS2017 0.54 0.60 0.54 0.71 0.50 0.64 0.47 0.64 22.80%
11 DARPA → CICIDS2017 0.51 0.60 0.47 0.71 0.39 0.64 0.58 0.58
12 CIDDS → CICIDS2017 0.53 0.60 0.49 0.71 0.38 0.64 0.51 0.64
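The adaptation trigger itself is a plain threshold test on the pre-adaptation score (threshold 0.58, Section 3.3); the two sample values below come from Table 13.

```python
THRESHOLD = 0.58  # predefined scalability-score threshold (Section 3.3)

def should_adapt(score_before, threshold=THRESHOLD):
    """Adaptation is triggered only when the current scalability score
    falls below the predefined threshold."""
    return score_before < threshold

assert should_adapt(0.47)       # use case 1 (DARPA -> KDD, Naive Bayes): triggered
assert not should_adapt(0.70)   # DARPA with Naive Bayes (0.70): never triggered
```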
The trend in scalability improvement largely correlates with the size of each dataset: the larger the size, the larger the improvement. A higher scalability improvement is recorded for the large datasets (i.e., CIDDS and CICIDS2017) and a lower scalability improvement is recorded for the small datasets, i.e., KDD and DARPA. With respect to algorithms, Support Vector Machine (SVM) benefits the most from SCALER, achieving a mean scalability improvement of 23.68%. The mean scalability improvement for Random Forest, Naïve Bayes, and MLP is 22.50%, 16.54%, and 20.3% respectively.

Comparison with related studies: We compare the optimization potential of SCALER with that of the state-of-the-art approaches that also aim to improve the scalability of different software systems. For such a comparison, we collected the data (e.g., response time) reported in those studies and then calculated the scalability scores, using our scalability metric (Section 2.4.2), before and after the applied optimization. The scalability scores and the achieved optimization in scalability for the various studies are presented in Table 14. We could not make a comparison with all studies discussed in Section 6.3 due to the lack of required data in the reported studies. It is important to note the following points before we analyse the findings presented in Table 14: (i) since some of the studies (e.g., Kyong et al., 2017) only report throughput, we first calculated the response time for those studies based on the reported throughput and data size; (ii) some studies report findings for cluster sizes greater than eight nodes; given that our study considers a cluster size of a maximum of eight nodes, we only selected (and scaled where required) the response times of up to eight nodes to make a fair comparison; (iii) the studies presented in Table 14 use different datasets and different workloads. For example, Joohyun Kyong et al. (2017) use BigDataBench (Wang et al., 2014) and Chen et al. (2010) use DaCapo (Blackburn et al., 2006) in their experiments. Given that our study is focussed on security analytics, we used the datasets and algorithms used in security analytics. Therefore, owing to the usage of different datasets and algorithms in the related studies, an apple-to-apple comparison is quite challenging.
Table 14
Comparison of the scalability improvement achieved through SCALER with the scalability improvement achieved by the state-of-the-art approaches. The scalability improvement is calculated based on our scalability metric.
Study | Workload | Scalability Score Before Optimization | Scalability Score After Optimization | Improvement (%) | Mean Improvement (%)
Joohyun Kyong et al. (Kyong et al., 2017) | Wordcount | 0.36 | 0.74 | 38.00 | 18.00%
 | Naïve Bayes | 0.34 | 0.61 | 27.00 |
 | Grep | 0.63 | 0.79 | 16.00 |
 | K-means | 0.12 | 0.03 | −9.00 |
Hasan Jamal et al. (Jamal et al., 2009) | 16 KB | 0.83 | 0.86 | 3.00 | 1.75%
 | 512 KB | 0.97 | 1.0 | 3.00 |
 | 6000 KB | 0.11 | 0.13 | 2.00 |
 | 16000 KB | 0.62 | 0.61 | −1.00 |
Chen et al. (Chen et al., 2010) | Eclipse | 0.74 | 0.70 | −4.00 | 13.13%
 | Hsqldb | 0.25 | 0.11 | −14.00 |
 | Lusearch | −0.57 | 0.59 | 116 |
 | Xalan | 0.95 | 0.79 | −16 |
 | MolDyn | 0.75 | 0.78 | 3.00 |
 | MonteCarlo | 0.85 | 0.80 | −5.00 |
 | RayTracer | 0.84 | 0.85 | 1.00 |
 | SPECjbb2005 | 0.37 | 0.61 | 24.00 |
SCALER | Naïve Bayes | 0.49 | 0.65 | 16.54 | 20.81%
 | Random Forest | 0.44 | 0.66 | 22.50 |
 | SVM | 0.42 | 0.63 | 23.68 |
 | MLP | 0.49 | 0.62 | 20.34 |
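Each study's mean improvement in Table 14 is the plain average of its workload-level improvements; two of the reported means can be reproduced directly from the table's rows:

```python
# Workload-level improvements (%) from Table 14.
improvements = {
    "Kyong et al. (2017)": [38.00, 27.00, 16.00, -9.00],
    "Jamal et al. (2009)": [3.00, 3.00, 2.00, -1.00],
}

means = {s: sum(v) / len(v) for s, v in improvements.items()}
print(means)  # {'Kyong et al. (2017)': 18.0, 'Jamal et al. (2009)': 1.75}
```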
4.3.2. RQ3.2: How long does it take for SCALER to adapt a BDCA system for optimal scalability (i.e., adaptation time)?
Fig. 8. Adaptation time of SCALER for the four datasets and four ML/DL algorithms.
Adaptation time: The adaptation time underlines the speed with which SCALER adapts a system. It is calculated as the time between the point at which adaptation is triggered and the point at which the system regains a stable state, i.e., the adaptation process is terminated (Villegas et al., 2011). Fig. 8 shows the adaptation time of SCALER for the 16 use cases, i.e., 4 datasets × 4 algorithms. On average, it takes around 170 min for SCALER to adapt a system, i.e., to bring the system to a state where its scalability is above the predefined threshold. It is worth noting that, unlike previous studies (e.g., Ullah and Babar, 2019b; Ullah and Babar, 2019c) that adapt for improving response time, our approach takes more time due to the generation of a scalability curve instead of the single point required for response time optimization. The adaptation time is mainly spent in executing the system with different CPVs (Table 6) to identify the CPV with which the system has a scalability score above the threshold. With respect to datasets, the mean adaptation time is the longest (i.e., 374.02 minutes (min)) for CICIDS2017,

Table 15
Comparison of the number of iterations required by SCALER and other state-of-the-art approaches to converge towards an optimal configuration.
Approach | Number of iterations/trials to converge
Perez et al. (Perez et al., 2018) | 4.06
Zhu et al. (Zhu et al., 2017) | 5
Gounaris et al. (Gounaris and Torres, 2018) | 9
Liao et al. (Liao et al., 2013) | 15.75
SCALER | 2.27
The summary answer to RQ3.2: Our adaptation approach takes around 2 iterations to adapt a BDCA system to a dataset. The time taken to
adapt a system is directly proportional to the size of the dataset.
followed by CIDDS (133.51 min), KDD (88.03 min), and DARPA (77.75 min). This trend is largely in accordance with the size of each dataset. For example, our approach takes the longest time to adapt the BDCA system for CICIDS2017, which is the largest in size, and takes the shortest time to adapt for the small datasets such as KDD and DARPA. SVM is quite fast, with a mean adaptation time of 29.59 min, followed by Random Forest with a mean adaptation time of 149.15 min, MLP with a mean adaptation time of 178 min, and Naïve Bayes with a mean adaptation time of 326.25 min.

Adaptation time vs. training time: The adaptation time is larger than the actual job completion time (i.e., training time). For example, the mean training time for the SVM based BDCA system with the DARPA dataset is 84.33 min, while the mean adaptation time for the SVM based BDCA system with the DARPA dataset is 223 min. This is because, in order to determine the training time, a system needs to be executed only once. However, to determine the scalability score, a system needs to be executed multiple times with different numbers of nodes (i.e., 1, 2, 4, 6, and 8 nodes in our case). Adaptation time is spent in determining the scalability score for different parameter combinations; therefore, adaptation time is larger than training time. However, this factor does not invalidate the advantages of SCALER. Similar to most tuning approaches (e.g., Gounaris and Torres, 2018; Zhu et al., 2017; Alipourfard et al., 2017), the real advantage of SCALER lies in the execution of recurring jobs (the same job executed by a system multiple times over a period of time).

Convergence: We also compare SCALER with the other state-of-the-art approaches in terms of the number of iterations required by each approach to converge towards a stable configuration. Such a comparison is presented in Table 15. On average, SCALER requires only 2.1 iterations to find the desired configuration, which is the smallest number of iterations compared to the other state-of-the-art approaches. One of the reasons for such a small number of iterations is that, instead of searching for the most optimal CPV in the search space, SCALER only searches for a CPV that has a scalability score above the threshold. As soon as the desired CPV is found, the search process is stopped.

4.3.3. RQ3.3: Does the number of parameters and their value options impact the optimization capability and adaptation time of SCALER?
Scenarios: For this research question, we assess the impact of the number of parameters considered and their value options on the performance (i.e., scalability improvement and adaptation time) of SCALER. We considered four scenarios, as shown in Table 16. Scenario-1 is the default scenario, as presented in the rest of the paper, which considers nine parameters with each parameter having two potential values, as shown in Table 16 and previously presented in Table 5. In scenario-2, we reduced the number of parameters from nine to five, considering only the most impactful parameters as determined from the average ranking presented in Table 12. Scenario-3 considers the same nine parameters as scenario-1, but unlike scenario-1, each parameter has four value options except the binary parameters, such as
period of time), which is a common phenomenon and equally applicable Spark.shuffle.compress. The value options for the parameters are selected
to security analytics (Ferguson et al., 2012; Agarwal et al., 2012). Some based on academic and industrial recommendations (Spark, 2014b,
recent studies (Ferguson et al., 2012; Agarwal et al., 2012) reveal that 2016). Similarly, scenario-4 considers the same five parameters as
around 40% of data analytics jobs are recurrent jobs. The current job is considered in scenario-2, but unlike scenario-2, each parameter has four
executed for the sake of tuning; therefore, it does not benefit from value options.
tuning. However, the recurring and/or subsequent jobs benefit from the Scalability improvement: Fig. 9 (A) presents the improvement in
already tuned system. For example, SCALER improves the scalability scalability achieved by SCALER for each of the four studied scenarios.
score of a job (i.e., training SVM based BDCA system with CIDDS data On average, SCALER improves the scalability of a BDCA system by
set) from 0.41 to 0.60 – an improvement of around 19%, which in turn 20.8% in scenario-1, 8.74% in scenario-2, 28.27% in scenario-3, and
translates into a reduction of training time from 121 min to 97.2 min 23.62% in scenario-4. The improvement in scalability increases as the
with an eight nodes cluster. Now, since the system is tuned, when the number of parameters and their value options increases. For example,
system executes the same or similar job, it will take 97.2 min to complete scalability improvement is the highest (28.27%) in scenario-3, where a
the job instead of 121 min. total of nine parameters are considered, each with four value options (i.
Comparison with related studies: In Fig. 8, the number of itera e., 9 parameters – 4 value options). On the contrary, scalability
tions indicates the number of CPVs tried to identify the CPV with which improvement is the lowest (8.74%) in scenario-2, where SCALER ex
a system has a scalability score above the threshold. On average, it takes plores combinations of only five parameters each with only two value
2.1 iterations/trails for SCALER to find the desired CPV from the search options (i.e., 5 parameters – 2 value options). However, it is worth
space. Since the related optimization approaches (e.g. (Gounaris and noting that the improvement in scalability is not directly proportional to
Torres, 2018; Zhuet al., 2017; Perez et al., 2018; Liao et al., 2013),) use the number of parameters and their value options considered in each
different datasets and algorithms, we cannot make a direct comparison scenario. For example, in scenario-3, SCALER explores almost twice the
of the adaptation time of SCALER with the related approaches. However, number of parameter combinations as in scenario-1 but achieves merely
we can make a direct comparison of the number of iterations required by 7.36% higher improvement than in scenario-1. A potential reason for
Table 16
Spark parameters and their value options considered in the four scenarios. The value in bold denotes the default value of the parameter.
ID Spark Parameter Scenario 1 Scenario 2 Scenario 3 Scenario 4
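As a back-of-the-envelope illustration of the recurring-job argument above, the sketch below (our own, not part of SCALER) computes how many recurring jobs are needed before the one-off adaptation cost is recouped; the minute figures are taken from the examples in this section and combined purely for illustration.

```python
# Illustrative sketch (not from the paper): amortizing a one-off adaptation
# cost over recurring jobs. The function and figures below are assumptions
# drawn from the examples in the text, not a general model of SCALER.

def break_even_jobs(adaptation_min: float, time_before_min: float,
                    time_after_min: float) -> float:
    """Number of recurring jobs needed before tuning pays for itself."""
    saving_per_job = time_before_min - time_after_min
    if saving_per_job <= 0:
        raise ValueError("tuning must reduce per-job time")
    return adaptation_min / saving_per_job

# Adaptation time of 223 min (SVM/DARPA example) combined with the
# 121 -> 97.2 min training-time reduction (SVM/CIDDS example).
jobs = break_even_jobs(adaptation_min=223, time_before_min=121,
                       time_after_min=97.2)
print(round(jobs, 1))  # -> 9.4: roughly ten recurring jobs recoup the cost
```

Under these assumed numbers, a workload where the same job recurs more than about ten times comes out ahead, which is consistent with the reported finding that around 40% of analytics jobs are recurrent.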
Fig. 9. (A) Scalability improvement achieved by SCALER in each of the four scenarios presented in Table 16 and (B) adaptation time (in minutes) and the number of iterations to converge towards the optimal configuration.
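The scalability scores discussed in this section (e.g., 0.41 improved to 0.60) are derived from speedup measurements on 1, 2, 4, 6, and 8 nodes. The paper defines its metric earlier; the sketch below uses a simple stand-in (mean ratio of observed to ideal linear speedup), so the function, the metric, and the timing values are our assumptions for illustration, not SCALER's implementation.

```python
# Hypothetical stand-in for a scalability score in [0, 1]: the mean ratio of
# observed speedup to ideal (linear) speedup across measured cluster sizes.
# The metric and the timings below are illustrative assumptions only.

def scalability_score(exec_times: dict[int, float]) -> float:
    """exec_times maps node count -> execution time (minutes)."""
    base_nodes = min(exec_times)
    base_time = exec_times[base_nodes]
    ratios = []
    for nodes, t in exec_times.items():
        if nodes == base_nodes:
            continue
        observed_speedup = base_time / t
        ideal_speedup = nodes / base_nodes
        ratios.append(observed_speedup / ideal_speedup)
    return sum(ratios) / len(ratios)

# A made-up run: execution time improves with nodes, but sub-linearly.
times = {1: 240.0, 2: 150.0, 4: 95.0, 6: 75.0, 8: 65.0}
print(round(scalability_score(times), 2))  # -> 0.61
```

A score of 1.0 would mean perfectly linear scaling; the gap to 1.0 plays the role of the "deviation from ideal scalability" that the paper quantifies.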
5. Discussion
systems for automatic tuning to improve the scalability of a system. Another potential domain is healthcare analytics, where big data technologies are frequently employed to deal with massive volumes of healthcare data (e.g., patient records) (Belle et al., 2015). For healthcare big data systems, the workload fluctuates frequently, as a heavy workload must be handled in an emergency (e.g., natural disasters) (Nambiar et al., 2013). Hence, we believe that a healthcare big data system can also benefit from using our adaptation approach.

5.3. Experimental bottlenecks

During our experimentation, we faced several bottlenecks related to the hardware resources and the Spark processing framework. We discuss those issues for the benefit of interested readers (i.e., researchers) who may come across the same issues. Our experiments produced temporary data during job execution on Spark, which consumes a lot of disk space on worker nodes. Since each worker node has limited disk space (10 GB in our case), the temporary data can exceed the disk limit of a worker node. In such a case, the worker node becomes unhealthy, and the master node does not assign it any further tasks. To deal with this issue, we regularly deleted the temporary data produced on worker nodes. However, we ensured that critical data, such as the HDFS block files and the block metafiles, were not deleted. Debugging is a serious concern in distributed data processing. We also initially faced the challenge of debugging failures (e.g., node failure) during our experiments. For instance, running a data processing job with eight worker nodes where two worker nodes are already unhealthy consumes time but produces useless results. This is because the experiment was designed for eight nodes, but two nodes were in an unhealthy state and we were not aware of it. To handle this issue, we designated a path through the variable yarn.nodemanager.log-dirs as the path for saving operational logs. We constantly checked the logs to identify any issues before running an experiment.

5.4. Threats to validity

In this study, we have investigated a specific BDCA system that uses particular algorithms and a big data framework (Spark). Therefore, our findings may not generalize to all kinds of BDCA systems. However, it is important to note that the aim of this study is not to show results that generalize to all BDCA systems but to show that the parameter configuration of the underlying big data framework impacts a BDCA system’s scalability. Nonetheless, future research aimed at obtaining more generalizable results will be useful. The number of value options (i.e., two and four) we investigated for the parameters limits the exploration of the parameter space. Since the modification of a parameter value (e.g., from 1024 MB to 1250 MB) shows a significant impact on scalability, investigating other modifications (e.g., from 1024 MB to 2056 MB) can only strengthen our findings but cannot contradict them. Our adaptation approach takes around 170 min to select a Spark configuration with a scalability score above the threshold. Although the real advantage of our approach is the reduction in the execution time of recurring jobs, the adaptation time can be reduced in the future by (i) reducing the number of parameters considered during tuning through techniques such as Lasso linear regression (Van Aken et al., 2017) and (ii) similar to (Alipourfard et al., 2017) and (Venkataraman et al., 2016), using representative datasets of smaller size instead of the original datasets. Our study has investigated only a limited number of parameters for their impact on scalability. Even if other Spark parameters do not impact scalability, our findings for the studied parameters still remain valid. For feature engineering, we have used StringIndexer (from org.apache.spark.ml.feature) to transform the label features (i.e., normal and attack) in the KDD dataset from strings to indices, as described in Section 2.2.1. However, this approach introduces an order in features that do not have a natural order, which biases the results of the machine learning model. Therefore, using the ordinal encoding of string features limits our approach and consequently poses a threat to the validity of our findings related to the KDD dataset. In the future, it will be interesting to incorporate and assess other encoding techniques, such as one-hot encoding, to discard such bias and make a comparison with the existing results.

6. Related work

In this section, we compare our study with the existing studies on BDCA systems, scalability investigation, scalability optimization, and adaptation approaches.

6.1. BDCA systems

Given the exponentially growing number of cyber-attacks and the increasing emphasis on real (or near real)-time cybersecurity data analytics, there is strong interest in the strategies and tools for engineering and operating optimal BDCA systems. However, there is relatively little literature on this topic (Ullah and Babar, 2019a). Spark-based BDCA systems are rapidly surpassing Hadoop-based BDCA systems in popularity and adoption: 70% of the BDCA systems in 2014 were Hadoop-based and only 30% were Spark-based; this changed to 50% Hadoop-based and 50% Spark-based in 2017 (Ullah and Babar, 2019a). Recently, several studies (e.g., Gupta and Kulariya, 2016; Kumari et al., 2016; Marchal et al., 2014; Las-Casas et al., 2016; Zhang et al., 2016; Böse et al., 2017) have proposed Spark-based BDCA systems. Gupta et al. (Gupta and Kulariya, 2016) present a Spark-based BDCA system that leverages two feature selection algorithms (i.e., correlation-based feature selection and Chi-squared feature selection) and several ML algorithms for detecting cyber intrusions. The system was evaluated with the KDD dataset. The Spark-based BDCA system presented in (Kumari et al., 2016) used K-means clustering for intrusion detection. Marchal et al. (2014) propose a Spark-based BDCA system for collecting different types of security data (e.g., HTTP, DNS, and IP flow data) and correlating the data to detect cyber-attacks. Las-Casas et al. (Las-Casas et al., 2016) present a Spark-based BDCA system that leverages Apache Pig, Apache Hive, and SparkSQL to collect emails from honeypots installed in different countries and analyse the emails to detect phishing attacks. Another Spark-based BDCA system presented in (Zhang et al., 2016) analyses abnormal network packets for unveiling DoS attacks. RADISH (Böse et al., 2017) is another Spark-based BDCA system that aims to detect abnormal user and resource behaviour in an enterprise to detect insider threats. Similarly, Wang et al. (Wang and Jones, 2021) focussed on the 3 Vs (volume, variety, and veracity) of cyber security big data to explore the impact of missing values, duplicates, variable correlation, and general data quality on the detection of cyber-attacks. The authors used the R language and several datasets, such as KDD-Cup and MAWILab, in their study. Like the previous studies, our study also uses Spark and ML algorithms for detecting cyber intrusions. However, unlike the previous studies, our study has been evaluated with four ML/DL algorithms and four different security datasets in a fully distributed mode, which enables us to assert that our findings are based on a more rigorous study and are more generalizable. Since the previous studies use different ML algorithms and security datasets for evaluation, an apple-to-apple comparison of our findings with theirs is not possible.

6.2. Scalability of BDCA systems

Despite the increasing importance of the scalability of BDCA systems, as reported in several studies (Ullah and Babar, 2019a), there have been only a few efforts (e.g., Lee and Lee, 2013; Las-Casas et al., 2016; Aljarah and Ludwig, 2013b; Du et al., 2014; Xiang et al., 2014) aimed at investigating the scalability of BDCA systems. Lee et al. (Lee and Lee, 2013) investigated the scalability of a Hadoop-based BDCA system on a 30-node cluster and observed that the execution time improves in
proportion to the hardware resources from 5 to 30 nodes. Du et al. (2014) studied the scalability of a Storm-based BDCA system on a five-node cluster and observed that the system failed to achieve an ideal level of scalability due to extra task scheduling and communication overhead between the spout and bolt phases of the Storm execution environment. Aljarah et al. (Aljarah and Ludwig, 2013b) also studied the scalability of a Hadoop-based BDCA system on an 18-node cluster and found that the system scaled abruptly as the number of nodes in the cluster was increased. For example, an ideal speedup is observed from two to four nodes and from 14 to 16 nodes, while a non-ideal speedup is observed for the rest of the scalability curve. The non-ideal speedup is attributed to the start-up of MapReduce jobs and the storing of intermediate results in HDFS. Las-Casas et al. (2016) compared the scalability of two BDCA systems – one Hadoop-based and another Spark-based. They found that Spark scales better than Hadoop due to the efficient use of caching in Spark. Xiang et al. (2014) also explored the scalability of a BDCA system on a 30-node cluster and found that the execution time decreases, although not ideally, with an increase in the number of nodes up to 25 nodes. After 25 nodes, the execution time increases, which the authors attribute to the excessive communication among nodes and disk read/write operations during MapReduce tasks. Whilst the previous studies have investigated the scalability of a BDCA system, none of them have quantified the scalability, nor have they calculated the deviation from the ideal scalability. Furthermore, the previous studies have only investigated scalability with default settings. Our study is the first that has (i) quantified the scalability with respect to four datasets, (ii) assessed the deviation from the ideal scalability, and, most importantly, (iii) investigated the impact of Spark parameters on the scalability of a BDCA system.

6.3. Scalability improvement

Several studies (e.g., Kyong et al., 2017; Jamal et al., 2009; Chen et al., 2010; Wu et al., 2009; Senger, 2009; Canali and Lancellotti, 2014) have proposed methods for improving the scalability of software systems in different domains. Kyong et al. (2017) proposed a Docker container-based architecture for a Spark-based scale-up server, where the original scale-up server is partitioned into several small servers to reduce memory access overheads. Wu et al. (2009) propose a scalability improvement technique that learns from the interaction patterns among the services of a service-based application and accordingly adopts an optimized task assignment strategy to reduce the communication bandwidth and improve scalability. Senger et al. (Senger, 2009) defined a scalability measure called input file affinity that quantifies the level of file sharing among tasks belonging to a bag-of-tasks application (e.g., a data mining algorithm). In (Senger, 2009), the authors proposed a scalability improvement method that leverages the input file affinity measure to increase the degree of file sharing among tasks. Chen et al. (2010) first studied the scalability of Java applications with default JVM settings and then proposed a tuning approach that alleviates JVM bottlenecks to improve the scalability of Java applications. Hasan et al. (Jamal et al., 2009) studied the scalability of virtual machine-based systems on a multicore processor setup. This study revealed that excessive communication among virtual machines impacts the scalability of multicore systems. Canali et al. (Canali and Lancellotti, 2014) present a scalability improvement approach for cloud-based systems, which leverages the resource usage patterns (e.g., CPU, storage, and network) among virtual machines and accordingly groups the virtual machines in a cloud-based infrastructure. It is important to note that some studies (Gounaris and Torres, 2018; Nguyen et al., 2018; Perez et al., 2018; Wang et al., 2016) proposed tuning techniques for Spark-based systems with the objective of reducing execution time. Such studies are largely impertinent to ours as they are focussed on execution time (response time). Our study focusses on scalability; response time and scalability are two very different quality attributes of a software system and are treated differently in the state-of-the-art (Sun, 2002). Therefore, the approaches presented in (Gounaris and Torres, 2018; Nguyen et al., 2018; Perez et al., 2018; Wang et al., 2016) are not aimed at improving scalability. In general, the previous studies are largely orthogonal to our study. This is because (i) our study is the first of its kind that aims to improve the scalability of Spark-based BDCA systems and (ii) it employs a parameter-driven adaptation approach that, unlike the previous studies, automatically improves scalability at runtime.

6.4. Parameter-driven adaptation

Parameter-driven adaptation is one of the commonly used adaptation approaches. Several studies (e.g., Calinescu et al., 2010; Epifani et al., 2009; Tongchim and Chongstitvatana, 2002; Jiang et al., 2018) attempted to modify the values of a system’s parameters to achieve various objectives, such as high accuracy and improved security. Calinescu et al. (2010) used the KAMI model, based on a Bayesian estimator, to modify model parameters at runtime for achieving reliability and quick response in a service-based medical assistance system. Another study (Epifani et al., 2009) argued that the parameters of software abstraction models, such as Discrete-Time Markov Chains (DTMC), should be constantly updated to achieve better accuracy. The authors of (Epifani et al., 2009) proposed an adaptation method that leverages the real-time operational data of a system to keep the parameters up to date. Similarly, parameter-based adaptation is quite common in the ML domain for adjusting a model’s parameters. For example, Tongchim et al. (Tongchim and Chongstitvatana, 2002) proposed a parameter-driven adaptation approach for adjusting the control parameters of genetic algorithms to achieve optimal accuracy. The approach reported in (Tongchim and Chongstitvatana, 2002) divides the parameter space into sub-spaces, and each sub-space evolves on separate computing nodes in parallel. Jiang et al. (2018) proposed a parameter-driven adaptation approach that uses the temporal and spatial correlations among characteristics (such as the size and velocity of objects) to find the best set of configuration parameters for a convolutional neural network employed in a video analytics system. The adaptation approach proposed in (Jiang et al., 2018) aims to balance resource consumption and a system’s accuracy. From the adaptation point of view, our study differs from the previous studies in two ways. First, our study is the first of its kind that applies a parameter-driven adaptation approach in the domain of Spark-based systems. Second, unlike the previous studies that aim to achieve accuracy or quick response time, our adaptation approach aims to achieve improved scalability.

7. Conclusion

Big Data Cyber Security Analytics (BDCA) systems use big data technologies (such as Apache Spark) to collect and analyse security event data (e.g., NetFlow) for detecting cyber-attacks such as SQL injection and brute force. The exponential growth in the volume and the unpredictable velocity of security event data require BDCA systems to be highly scalable. Therefore, in this paper, we have studied (i) how a Spark-based BDCA system scales with default Spark settings, (ii) how tuning the configuration parameters (e.g., execution memory) of Spark impacts the scalability of a BDCA system, and (iii) proposed SCALER – a parameter-driven adaptation approach to improve a BDCA system’s scalability. For this study, we have developed an experimental infrastructure using a large-scale OpenStack cloud. We have implemented a Spark-based BDCA system and used four security datasets to find out how a BDCA system scales, how Spark parameters impact scalability, and to evaluate our adaptation approach aimed at improving scalability. Based on our detailed experiments, we have found that:

• With default Spark settings, a BDCA system does not scale ideally. The deviation from ideal scalability is around 59.5%. The system
scales better with large size datasets (e.g., CICIDS2017) as compared to small size datasets (e.g., KDD).
• 9 out of 11 studied Spark parameters impact the scalability of a BDCA system. The impact of configuration parameters on scalability varies from one security dataset to another.
• Our parameter-driven adaptation approach improves the mean scalability of a BDCA system by 20.8%.

From our findings, we conclude that practitioners should first tune the parameters of Spark before putting a Spark-based BDCA system into operation. Such parameter tuning can improve the scalability of the system. We also recommend that practitioners should not use someone else’s pre-tuned parameter settings, because the best combination of Spark parameters varies from dataset to dataset. Our proposed adaptation approach is the first step towards enabling practitioners to automatically tune Spark parameters for achieving optimal scalability. More generally, we assert that the field of big data analytics should pay attention to the impact of the configuration parameters of big data frameworks on various system qualities, such as reliability, response time, and scalability. Federated machine learning has recently gained tremendous attention in various domains due to its ability to perform on-device collaborative training in a privacy-preserving manner (Wahab et al., 2021). It would be worth exploring how our proposed approach performs with respect to federated machine learning.

Based on our study, we highlight the following areas for future research. Investigating the parameters’ impact in other big data frameworks: Although Spark is currently the most popular big data framework, there exist several other big data frameworks (such as Hadoop (2009), Storm (2011), Samza (2014), and Flink (Carbone et al., 2015)) with different sets of configuration parameters. Therefore, future research should investigate how the configuration parameters of these frameworks impact the scalability of a BDCA system. Approximate analytics for tuning big data frameworks: Approximate analytics is an emerging concept that encourages computing over a representative sample instead of computing over the entire dataset (Quoc et al., 2017). The rationale behind approximate analytics is to make a trade-off between accuracy and computational time. In our study, we used the entire security datasets for system execution and subsequent tuning. Therefore, an interesting avenue for future research is to explore the applicability of approximate analytics for tuning big data frameworks. Investigating the parameters’ impact on other system qualities: The focus of our study was only on scalability; however, there exist several other quality attributes (e.g., reliability, security, and interoperability) that are also important for a BDCA system. Hence, it is worth investigating how the configuration parameters of Spark impact other quality attributes of a BDCA system.

Author contribution

Faheem Ullah: Conceptualization, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. M. Ali Babar: Conceptualization, Writing – original draft, Writing – review and editing, Project administration, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

The authors would like to thank Anying Xiang for her help in conducting the experiments.

References

Aceto, G., Ciuonzo, D., Montieri, A., Persico, V., Pescapé, A., 2019. Know your big data trade-offs when classifying encrypted mobile traffic with deep learning. In: 2019 Network Traffic Measurement and Analysis Conference (TMA). IEEE, pp. 121–128.
Agarwal, S., Kandula, S., Bruno, N., Wu, M.-C., Stoica, I., Zhou, J., 2012. Reoptimizing data parallel computing. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 281–294.
Alexander, C.A., Wang, L., 2017. Big data analytics in heart attack prediction. J. Nurs. Care 6 (393).
Alipourfard, O., Liu, H.H., Chen, J., Venkataraman, S., Yu, M., Zhang, M., 2017. CherryPick: adaptively unearthing the best cloud configurations for big data analytics. In: 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pp. 469–482.
Aljarah, I., Ludwig, S.A., 2013a. Towards a scalable intrusion detection system based on parallel PSO clustering using MapReduce. In: Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation. ACM, pp. 169–170.
Aljarah, I., Ludwig, S.A., 2013b. MapReduce intrusion detection system based on a particle swarm optimization clustering algorithm. In: 2013 IEEE Congress on Evolutionary Computation. IEEE, pp. 955–962.
Cloud Security Alliance, 2013. Big Data Analytics for Security Intelligence. Big Data Working Group. Available at: https://bit.ly/211P7jj. (Accessed 11 February 2020).
Apache Flink, 2011. Apache Flink. Available at: https://bit.ly/2v7.
Baaziz, A., Quoniam, L., 2014. How to use big data technologies to optimize operations in upstream petroleum industry. Int. J. Innov. 1 (1).
Batista, G.E., Prati, R.C., Monard, M.C., 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter 6 (1), 20–29.
Belle, A., Thiagarajan, R., Soroushmehr, S., Navidi, F., Beard, D.A., Najarian, K., 2015. Big data analytics in healthcare. BioMed Res. Int. 2015.
Blackburn, S.M., et al., 2006. The DaCapo benchmarks: Java benchmarking development and analysis. In: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, pp. 169–190.
Böse, B., Avasarala, B., Tirthapura, S., Chung, Y.-Y., Steiner, D., 2017. Detecting insider threats using RADISH: a system for real-time anomaly detection in heterogeneous data streams. IEEE Syst. J. 11 (2), 471–482.
Calinescu, R., Grunske, L., Kwiatkowska, M., Mirandola, R., Tamburrelli, G., 2010. Dynamic QoS management and optimization in service-based systems. IEEE Trans. Softw. Eng.
Canali, C., Lancellotti, R., 2014. Improving scalability of cloud monitoring through PCA-based clustering of virtual machines. J. Comput. Sci. Technol. 29 (1), 38–52.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi, S., Tzoumas, K., 2015. Apache Flink: stream and batch processing in a single engine. Bull. IEEE Comput. Soc. Tech. Commit. Data Eng. 36 (4).
Chen, K.-Y., Chang, J.M., Hou, T.-W., 2010. Multithreading in Java: performance and scalability on multicore systems. IEEE Trans. Comput. 60 (11), 1521–1534.
Cheng, L., Wang, Y., Ma, X., Wang, Y., 2016. GSLAC: a general scalable and low-overhead alert correlation method. In: Trustcom/BigDataSE/ISPA. IEEE.
KDD, 1999. KDD Cup 99 Knowledge Discovery in Databases. https://goo.gl/Jz2Un6. (Accessed 11 February 2020).
Davidson, A., Or, A., 2013. Optimizing Shuffle Performance in Spark. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Tech. Rep.
Du, Y., Liu, J., Liu, F., Chen, L., 2014. A real-time anomalies detection system based on streaming technology. In: 2014 Sixth International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 2. IEEE, pp. 275–279.
Economist, T., 2017. The World’s Most Valuable Resource Is No Longer Oil, but Data. Available at: https://econ.st/2Gtfztg. (Accessed 11 February 2020).
Epifani, I., Ghezzi, C., Mirandola, R., Tamburrelli, G., 2009. Model evolution by run-time parameter adaptation. In: Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, pp. 111–121.
Ferguson, A.D., Bodik, P., Kandula, S., Boutin, E., Fonseca, R., 2012. Jockey: guaranteed job latency in data parallel clusters. In: Proceedings of the 7th ACM European Conference on Computer Systems, pp. 99–112.
Gantz, J., Reinsel, D., 2012. The Digital Universe in 2020: Big Data, Bigger Digital Shadow, and Biggest Growth in the Far East. IDC Country Brief. Available at: https://bit.ly/2rqPWaw. (Accessed 11 February 2020).
Gounaris, A., Torres, J., 2018. A methodology for Spark parameter tuning. Big Data Res. 11, 22–32.
Gounaris, A., Kougka, G., Tous, R., Montes, C.T., Torres, J., 2017. Dynamic configuration of partitioning in spark applications. IEEE Trans. Parallel Distr. Syst. 28 (7), 1891–1904.
Grama, A.Y., Gupta, A., Kumar, V., 1993. Isoefficiency: measuring the scalability of parallel algorithms and architectures. IEEE Parallel Distr. Technol. Syst. Appl. 1 (3), 12–21.
Greene, C.S., Tan, J., Ung, M., Moore, J.H., Cheng, C., 2014. Big data bioinformatics. J. Cell. Physiol. 229 (12), 1896–1900.
Groves, P., Kayyali, B., Knott, D., Kuiken, S.V., 2016. The 'Big Data' Revolution in Healthcare: Accelerating Value and Innovation.
Gupta, G.P., Kulariya, M., 2016. A framework for fast and efficient cyber security network intrusion detection using Apache spark. Proc. Comput. Sci. 93, 824–831.
Hadoop, A., 2009. Apache Hadoop. Available at: https://goo.gl/GLWG9Q. (Accessed 11 February 2020).
Herodotou, H., et al., 2011. Starfish: a self-tuning system for big data analytics. Cidr 11 (2011), 261–272.
Holtz, M.D., David, B.M., de Sousa Júnior, R.T., 2011. Building scalable distributed intrusion detection systems based on the mapreduce framework. Rev. Telecommun. 13 (2), 22.
Hong, K.-F., Chen, C.-C., Chiu, Y.-T., Chou, K.-S., 2015. Ctracer: uncover C&C in advanced persistent threats based on scalable framework for enterprise log data. In: 2015 IEEE International Congress on Big Data. IEEE, pp. 551–558.
Jamal, M.H., Qadeer, A., Mahmood, W., Waheed, A., Ding, J.J., 2009. Virtual machine scalability on multi-core processors based servers for cloud computing workloads. In: 2009 IEEE International Conference on Networking, Architecture, and Storage. IEEE, pp. 90–97.
Jiang, J., Ananthanarayanan, G., Bodik, P., Sen, S., Stoica, I., 2018. Chameleon: scalable adaptation of video analytics. In: Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. ACM, pp. 253–266.
Jogalekar, P., Woodside, M., 2000. Evaluating the scalability of distributed systems. IEEE Trans. Parallel Distr. Syst. 11 (6), 589–603.
Johnsirani Venkatesan, N., Nam, C., Shin, R., 2019. Deep learning frameworks on Apache spark: a review, 36 (2), 164–177.
Kumar, M., Hanumanthappa, M., 2013. Scalable intrusion detection systems log analysis using cloud computing infrastructure. In: 2013 IEEE International Conference on Computational Intelligence and Computing Research. IEEE, pp. 1–4.
Kumari, R., Singh, M., Jha, R., Singh, N., 2016. Anomaly Detection in Network Traffic Using K-Mean Clustering. Recent Advances in Information Technology (RAIT).
Kyong, J., Jeon, J., Lim, S.-S., 2017. Improving scalability of Apache spark-based scale-up server through docker container-based partitioning. In: Proceedings of the 6th International Conference on Software and Computer Applications. ACM, pp. 176–180.
Las-Casas, P.H., Dias, V.S., Meira, W., Guedes, D., 2016. A Big Data architecture for security data and its application to phishing characterization. In: Big Data Security on Cloud (BigDataSecurity). IEEE, pp. 36–41.
Lee, Y., Lee, Y., 2013. Toward scalable internet traffic measurement and analysis with hadoop. Comput. Commun. Rev. 43 (1), 5–13.
Liao, G., Datta, K., Willke, T.L., 2013. Gunther: search-based auto-tuning of mapreduce. In: European Conference on Parallel Processing. Springer, pp. 406–419.
Marchal, S., Jiang, X., Engel, T., 2014. A Big Data Architecture for Large Scale Security Monitoring. Congress on Big Data.
Mehta, N., Pandit, A., 2018. Concurrence of big data analytics and healthcare: a systematic review, 114, 57–65.
Nambiar, R., Bhardwaj, R., Sethi, A., Vargheese, R., 2013. A look at challenges and opportunities of big data analytics in healthcare. In: 2013 IEEE International Conference on Big Data. IEEE, pp. 17–22.
Nguyen, N., Khan, M.M.H., Wang, K., 2018. Towards automatic tuning of Apache spark configuration. In: 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE, pp. 417–425.
Nguyen, T., Gosine, R.G., Warrian, P., 2020. A systematic review of big data analytics for oil and gas industry 4.0. IEEE Access 8, 61183–61201.
Obitade, P.O., 2019. Big data analytics: a link between knowledge management capabilities and superior cyber protection. J. Big Data 6 (1), 71.
Oussous, A., Benjelloun, F.-Z., Lahcen, A.A., Belfkih, S., 2018. Big Data technologies: a survey. J. King Saud Univ. Comput. Inform. Sci. 30 (4), 431–448.
Partners, N., 2019. Big Data and AI Executive Survey. Available at: https://bit.ly/2Y5XcZ6. (Accessed 11 February 2020).
Perez, T.B., Chen, W., Ji, R., Liu, L., Zhou, X., 2018. Pets: bottleneck-aware spark tuning with parameter ensembles. In: 2018 27th International Conference on Computer Communication and Networks (ICCCN). IEEE, pp. 1–9.
Persico, V., Pescapé, A., Picariello, A., Sperlí, S., 2018. Benchmarking big data architectures for social networks data processing using public cloud platforms, 89, 98–109.
Pouyanfar, S., et al., 2018. A survey on deep learning: algorithms, techniques, and applications 51 (5), 1–36.
Pramanik, M.I., Lau, R.Y., Azad, M.A.K., Hossain, M.S., Chowdhury, M.K.H., Karmaker, A., 2020. Healthcare informatics and analytics in big data 152, 113388.
Qiu, J., Wu, Q., Ding, G., Xu, Y., Feng, P., 2016. A survey of machine learning for big data processing, 2016 (1), 1–16.
Quoc, D.L., Chen, R., Bhatotia, P., Fetzer, C., Hilt, V., Strufe, T., 2017. StreamApprox: approximate computing for stream analytics. In: Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pp. 185–197.
Ring, M., Wunderlich, S., Grüdl, D., Landes, D., Hotho, A., 2017. Flow-based Benchmark Data Sets for Intrusion Detection. ECCWS. Available at: https://bit.ly/3ad1CQc/. (Accessed 11 February 2020).
Samza, A., 2014. Apache Samza. Available at: https://bit.ly/37fFCSR.
Senger, H., 2009. Improving scalability of Bag-of-Tasks applications running on master–slave platforms. Parallel Comput. 35 (2), 57–71.
Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A., 2018. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. ICISSP. Available at: https://bit.ly/30qWkft. (Accessed 11 February 2020).
Shvachko, K., Kuang, H., Radia, S., Chansler, R., 2010. The hadoop distributed file system. MSST 10.
Spark, A., 2011. Spark Programming Guide. Available at: https://bit.ly/37DETeF.
Spark, A., 2014a. Apache Spark. Available at: https://spark.apache.org/. (Accessed 11 February 2020).
Spark, A., 2014b. Spark Configuration. Available at: https://bit.ly/2rXR4NK. (Accessed 11 February 2020).
Spark, A., 2016. SparkHub: A Community Site for Apache Spark. Available at: https://bit.ly/2lS8Vs5. (Accessed 11 February 2020).
Srivastava, U., Gopalkrishnan, S., 2015. Impact of big data analytics on banking sector: learning for Indian banks. Proc. Comput. Sci. 50, 643–652.
Storm, A., 2011. Apache Storm. Available at: https://bit.ly/2tEvqox.
Sun, X.-H., 2002. Scalability versus execution time in scalable systems. J. Parallel Distr. Comput. 62 (2), 173–192.
Sun, X.-H., Rover, D.T., 1994. Scalability of parallel algorithm-machine combinations. IEEE Trans. Parallel Distr. Syst. 5 (6), 599–613.
Sun, X.-H., Chen, Y., Wu, M., 2005. Scalability of heterogeneous computing. In: 2005 International Conference on Parallel Processing (ICPP'05). IEEE, pp. 557–564.
Sun, N., Morris, J.G., Xu, J., Zhu, X., Xie, M., 2014. iCARE: a framework for big data-based banking customer analytics. IBM J. Res. Dev. 58 (5/6), 4:1–4:9.
MIT, 1998. DARPA Intrusion Detection Evaluation Data Set. Available at: https://goo.gl/jYBYNe. (Accessed 11 February 2020).
Tongchim, S., Chongstitvatana, P., 2002. Parallel genetic algorithm with parameter adaptation. Inf. Process. Lett. 82 (1), 47–54.
Ullah, F., Babar, M.A., 2019a. Architectural tactics for big data cybersecurity analytics systems: a review. J. Syst. Software.
Ullah, F., Babar, M.A., 2019b. An architecture-driven adaptation approach for big data cyber security analytics. In: International Conference on Software Architecture, pp. 41–50.
Ullah, F., Babar, M., 2019c. QuickAdapt: scalable adaptation for big data cyber security analytics. In: International Conference on Engineering of Complex Computer Systems.
Van Aken, D., Pavlo, A., Gordon, G.J., Zhang, B., 2017. Automatic database management system tuning through large-scale machine learning. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1009–1024.
Venkataraman, S., Yang, Z., Franklin, M., Recht, B., Stoica, I., 2016. Ernest: efficient performance prediction for large-scale advanced analytics. In: 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pp. 363–378.
Villegas, N.M., Müller, H.A., Tamura, G., Duchien, L., Casallas, R., 2011. A framework for evaluating quality-driven self-adaptive software systems. In: Symposium on Software Engineering for Adaptive and Self-Managing Systems.
Wahab, O.A., Mourad, A., Otrok, H., Taleb, T., 2021. Federated machine learning: survey, multi-level classification, desirable criteria and future directions in communication and networking systems. IEEE Commun. Surv. Tutorials 23 (2), 1342–1397.
Wang, L., Jones, S., 2021. Big data analytics in cyber security: network traffic and attacks, 61 (5), 410–417.
Wang, G., Xu, J., He, B., 2016. A novel method for tuning configuration parameters of spark based on machine learning. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, pp. 586–593.
Wang, L., et al., 2014. Bigdatabench: a big data benchmark suite from internet services. In: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). IEEE, pp. 488–499.
Williams, L.G., Smith, C.U., 2004. Web application scalability: a model-based approach. In: Int. CMG Conference, pp. 215–226.
Wu, J., Liang, Q., Bertino, E., 2009. Improving scalability of software cloud for composite web services. In: 2009 IEEE International Conference on Cloud Computing. IEEE, pp. 143–146.
Xiang, J., Westerlund, M., Sovilj, D., Pulkkis, G., 2014. Using Extreme Learning Machine for Intrusion Detection in a Big Data Environment. Artificial Intelligent and Security Workshop.
Zaharia, M., et al., 2016. Apache spark: a unified engine for big data processing. Commun. ACM.
Zhang, J., Zhang, Y., Liu, P., He, J., 2016. A spark-based DDoS attack detection model in cloud services. In: International Conference on Information Security Practice and Experience. Springer, pp. 48–64.
Zhao, S., Chandrashekar, M., Lee, Y., Medhi, D., 2015. Real-time network anomaly detection system using machine learning. In: 2015 11th International Conference on the Design of Reliable Communication Networks (DRCN). IEEE, pp. 267–270.
Zhu, Y., et al., 2017. Bestconfig: tapping the performance potential of systems via automatic configuration tuning. In: Proceedings of the 2017 Symposium on Cloud Computing. ACM, pp. 338–350.
Faheem Ullah is a postdoctoral researcher with the School of Computer Science, The University of Adelaide, Australia. He completed his PhD, focused on the intersection of big data and cyber security, at the University of Adelaide, Australia. He is a member of CREST - Centre for Research on Engineering Software Technologies, an interdisciplinary research centre at the University of Adelaide. He has been actively involved in teaching undergraduate and master's courses in computer science and software engineering, and has supervised/co-supervised more than 20 undergraduate/master's projects. His current research primarily focuses on cyber security, big data analytics, and cloud computing.

M. Ali Babar is a Professor in the School of Computer Science, University of Adelaide, and an honorary visiting professor at the Software Institute, Nanjing University, China. Prof Babar established an interdisciplinary research centre, CREST - Centre for Research on Engineering Software Technologies, where he leads the research and research training of more than 30 members (including 15 PhD students). He also leads the Platform and Architecture for Cyber Security as a Service theme of the Cyber Security Cooperative Research Centre (CSCRC), one of the largest Australian initiatives for building sovereign cyber security capability through world-class applied R&D with industrial impact. Prof Babar has authored/co-authored more than 250 peer-reviewed publications in premier software technology journals and conferences. Apart from the industrial relevance of his work, evidenced by several R&D projects and collaborations with industry and government agencies in Australia and Europe, his publications have been highly cited within the discipline of Software Engineering, as evidenced by an H-index of 46 with 11,028 citations on Google Scholar as of December 9, 2021. Prior to joining the University of Adelaide in November 2013, he spent almost 7 years in Europe (Ireland, Denmark, and the UK) working as a senior researcher and an academic.