18th Aug

Uploaded by

richa010phd23

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

12 views13 pages

18th Aug

Uploaded by

richa010phd23

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Computers & Security 116 (2022) 102659

Contents lists available at ScienceDirect

Computers & Security

journal homepage: www.elsevier.com/locate/cose

FeSA: Feature selection architecture for ransomware detection under

concept drift
Damien Warren Fernando∗, Nikos Komninos
Department of Computer Science, School of Mathematics, Computer Science and Engineering, City, University of London, UK

a r t i c l e i n f o a b s t r a c t

Article history: This paper investigates how different genetic and nature-inspired feature selection algorithms operate in
Received 12 April 2021 systems where the prediction model changes over time in unforeseen ways. As a result, this study pro-
Revised 30 November 2021
poses a feature section architecture, namely FeSA, independent of the underlying classification algorithm
Accepted 11 February 2022
and aims to find a set of features that will improve the longevity of the machine learning classifier. The
Available online 15 February 2022
feature set produced by FeSA is evaluated by creating scenarios in which concept drift is presented to
Keywords: our trained model. Based on our results, the generated feature set remains robust and maintains high
Ransomware detection rates of ransomware malware. Throughout this paper, we will refer to the true-positive rate
Concept-drift of ransomware as detection; this is to clearly define what we focus on, as the high true positive rate
Detection for ransomware is the main priority. Our architecture is compared to other nature-inspired feature se-
Learning-algorithms lection algorithms such as evolutionary search, genetic search, harmony search, best-first search and the
Features
greedy stepwise feature selection algorithm. Our results show that FeSA displays the least degradation
on average when exposed to concept drift. FeSA is evaluated based on ransomware detection rate, recall,
false positives and precision. The FeSA architecture provides a feature set that shows competitive recall,
false positives and precision under concept drift while maintaining the highest detection rate from the
algorithms it has been compared to.
Crown Copyright © 2022 Published by Elsevier Ltd. All rights reserved.

1. Introduction generated approximately $320million (De Groot, 2017). Overall it is

estimated that in 2020, organisations will pay up to $11 billion in
In recent years ransomware has emerged as one of the most paying ransoms or dealing with the damage caused by ransomware
potent malware threats out there. Ransomware uses tactics to attacks (Sanders, 2020). Popular and infamous ransomware like
reduce the victim’s access to their system or prevent files by Petya encrypts the Master Boot Record of a Windows system in
encrypting them. Victims pay for various reasons, whether it is a terms of behavioural diversity. Modern ransomware variants like
business that needs access to its files and does not have sufficient Maze encrypts files, steals sensitive information from companies
backups (Cook, 2020a) or a single person who has ”lost” personal and then exposes it if organisations and individuals do not pay the
files due to a ransomware attack. There are two types of ran- ransom (Saxena, 2018). Ransomware malware evolves to become
somware, the first being locker ransomware. Locker-ransomware more dangerous and damaging, as history has shown us. In the
will stop users from accessing their systems by displaying a lock context of machine-learning detection systems, the constant evo-
screen when they log into their systems. The second type of lution of malware can be classed as concept drift, a phenomenon
ransomware is crypto-ransomware that our research is focused that means the rules and logic learned by the classifier to classify
on. Crypto-ransomware is a highly sophisticated malware type, a malware becomes outdated and incorrect.
more common form of ransomware used in ransomware attacks.
Crypto-ransomware will use complex encryption schemes to en-
1.1. Malware evolution
crypt a victim’s files, rendering them unusable and unrecoverable
unless the ransom is paid and the attacker provides the subse-
According to DataProt (Jovanovic, 2019), around 980 million
quent decryption keys with a decryption tool. A popular example
malware programs on the internet today and 350,0 0 0 new mal-
of crypto-ransomware CryptoWall appeared in 2014, and it has
ware pieces are detected every day. The recent boom in mal-
ware evolution is traced back to 2013, in which the number
∗
Corresponding author. of malicious files on the web doubled, this growth may have
E-mail address: [email protected] (D.W. Fernando). slowed, but it has not stopped. The statistics show that malware

https://doi.org/10.1016/j.cose.2022.102659
0167-4048/Crown Copyright © 2022 Published by Elsevier Ltd. All rights reserved.
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

is not only emerging at a rapid rate; this is also acknowledged in sifier expects shows true evolutionary characteristics, as the dy-
Hayes et al. (2009) recognised the diversity in malware in 2008, namic behaviour of the malware has changed. The transcend sys-
which implies malware has been evolving and changing for years. tem (Jordaney et al., 2018) acknowledges that malware can evolve
Singh et al. described three types of malware evolution, the first in ways that make it difficult for even machine learning detection
being a natural evolution, the second being environmental evolu- systems to detect; this evolution is described as concept drift in
tion, and the third being polymorphic evolution. Most aspects of a machine learning system. If malware’s behavioural patterns and
malware evolution are due to adaption to avoid anti-virus (AV) de- statistical properties change beyond the scope of what a machine
tection. Environmental evolution occurs when software develop- learning system defines as malicious behaviour, detection rates will
ment changes, such as compiler changes. If malware uses differ- start to decrease. Changes in statistical properties and dynamic be-
ent libraries to fulfil its goals, its behaviour may appear signifi- haviour during execution is what we would classify as true mal-
cantly different from what detection systems expect. The defini- ware evolution, as even machine learning systems would strug-
tion of environmental change depends heavily on compiler and gle to detect them. An example of the differences between evo-
library changes as defined in Singh et al. (2012), which means lution and zero-day is the WannaCry ransomware attack, consid-
these changes will be far less frequent than natural evolution. ered a zero-day threat. The attack was carried out using the eternal
Polymorphic evolution occurs in the form of transformation and blue and double pulsar exploits. These two exploits are windows
obfuscation (Singh et al., 2012). The use of packers and protec- SMB and privilege based and allow the ransomware to execute, the
tors create an artificial diversity that is designed to evade detec- zero-day aspect of this attack. If WannaCry was loaded into a sys-
tors. Packing will not help track drift, as the packers will encrypt tem using eternalBlue; however, it behaved the same as previous
and compress code; drift tracking should be carried out on un- ransomware, its characteristics would not be considered evolution-
packed malware. Malware evolution poses a large threat to sys- ary, only that it had been propagated using zero-day exploits, eter-
tems due to the rate of evolution not slowing down, according nalBlue and doublePulsar. WannaCry could be considered evolved
to Symantec (Cook, 2020b). Enterprise ransomware like SamSam because of the way it encrypted files and propagated itself through
and Dharma are coordinated hits on organisations using a man- networks; these aspects of the ransomware were behavioural evo-
ual attack methodology (Whitepaper, 2019). Doxxing is also a new lutions and would display dynamic behavioural characteristics not
methodology in ransomware attacks, threatening to expose sen- previously associated with ransomware.
sitive data of attack victims (Goodchile, 2020), yet another ex-
ample of ransomware’s dangerous evolution. When high diversity 1.3. Contributions of this paper
and evolution rates exist in a destructive malware type like ran-
somware, the consequences for victims become severe. · Behavioural analysis of ransomware characteristics that change
or ”evolve” over time.
1.2. Motivation · Proposal of a feature selection architecture, which provides an
optimal feature set showing promising performance when
Our main motivation for this research is the need for robust exposed to concept drift. FeSA’s feature set remains robust
features which will allow ransomware detection systems to remain over time; the main element is maintaining a slower degra-
effective when exposed to concept drift. It is observable that fea- dation rate in detection rate.
tures can quickly be rendered ineffective by concept drift; there-
fore, this creates the need for an architecture that can create ro- 1.4. Paper organisation
bust feature sets for ransomware detection under which will not
degrade excessively under concept drift. A zero-day vulnerability is The remainder of this paper is as follows: Section 2 covers
a software vulnerability that attackers discover before the software work related to our research. Section 3 covers our proposal and the
vendor is aware of it. A zero-day exploit is a method of exploit- background information that accompanies our work. Section 4 de-
ing a zero-day vulnerability (Kaspersky, 2021). A zero-day malware scribes our experiments, and Section 5 discusses the results of the
threat is a threat that has not been seen by the detection sys- experiments shown in Section 4. Section 6 concludes and expands
tem before and can be a variant or malware type for which no our work.
signatures exist (Carson, 2007). Machine learning classifiers have
been proven effective in detecting zero-day malware threats, as 2. Related work
shown in Shaukat and Ribeiro (2018) and Sgandurra et al. (2016b);
however, zero-day ransomware is not necessarily an example of This section explores related work which has influenced our
ransomware that has evolved and will be difficult for a classifier research. Section 2.1 investigates studies that apply concept drift
to identify correctly. Machine learning detection systems like ran- with pros and cons. This study also investigates evolutionary algo-
somwall (Shaukat and Ribeiro, 2018) and the system designed by rithms and how they can be used in concept drift detection and
Sgandurra et al. use dynamic features like API calls to differentiate adaption. This study also investigates the use of machine learning
ransomware from benign files; therefore, machine learning detec- detection for ransomware and how these systems tackle zero-day
tion systems effectively detect ransomware rapidly without relying threats. The related studies identify the gaps in ransomware detec-
on signatures or heuristics. Machine learning systems will be able tion and concept drift in ransomware detection systems, and the
to identify patterns and statistical properties of malware that dis- genetic algorithms in Section 2.2 point us towards possible solu-
tinguish them from benign files, hence why they are effective at tions.
detecting zero-day threats. Zero-day ransomware may be consid-
ered a zero-day due method of delivery or how it is obfuscated to 2.1. Concept drift
evade anti-virus detection; however, once it begins executing, its
behaviour determines whether it is an evolved variant or not. A Ransomware variants that display evolutionary qualities that
ransomware variant may be delivered via a zero-day attack that are different from their predecessors are always emerging, which
exploits a new vulnerability, but its behavioural patterns during may be difficult for ransomware detection systems to identify.
execution may not deviate much or at all from the patterns of pre- Good examples of evolving ransomware are the MedusaLocker and
vious ransomware. Ransomware that displays behavioural patterns WannaCry ransomware families. MedusaLocker is a ransomware
during execution that differ from what the machine learning clas- variant that targets antivirus and ransomware detection modules

2
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

to turn them off and disable them from running in safe mode ing URL matches its allocated classification. Thus, if a concept drift
(Collins, 2019); this variant of ransomware is extremely evasive is detected, the system will be immediately retrained.
and effective in disabling endpoint protection and preventing ran-
somware detection modules from working. The WannaCry ran- 2.2. Genetic algorithms
somware variant was propagated through a Windows SMB vulner-
ability that the public had not seen before the infection, although A genetic algorithm is a search heuristic that takes Charles
it was known to the NSA at the time. The two variants mentioned Darwin’s natural evolution theory (Fatima et al., 2019). This algo-
behaved vastly differently from the ransomware before them and rithm mimics the process of natural selection, which will select the
made it clear that a zero-day that showed characteristics far be- strongest to survive and produce offspring. A genetic algorithm will
yond the current behavioural profile can cause detection systems apply this logic to a dataset and can be used to produce an op-
to fail. timal feature set. The system proposed in Vivekanandan and Ne-
The Transcend System proposed in Jordaney et al. (2018) is a dunchezhian (2011) uses a genetic algorithm to produce an opti-
framework that can work with any machine learning algorithm to mal feature set for malware detection. A typical genetic algorithm
output confidence values for predictions. Predictions can be mod- will repeat its evaluation and crossover phase, creating numerous
elled differently; confidence values can be extracted from a ran- features to obtain optimal features. This research uses a genetic en-
dom forest depending on how many trees vote for the chosen pre- gineering approach to reduce the number of generations and fea-
diction. Confidence values can be extracted from a Support Vec- tures needed to produce the optimal feature set, otherwise known
tor Machine by measuring a prediction’s distance from the hyper- as a feature set. The structure of a genetic algorithm is shown be-
plane. Obtaining confidence from a clustering approach would in- low.
volve measuring the distance of a prediction from a centroid. The
· Fitness Function: The fitness function determines the ability
transcend system aims to identify how similar the classified in-
each individual has to compete, in the context of a detection
stance is to the rest of the instances in its class and how simi-
system, this would be determined by how accurate a feature
lar the instance is to samples in the other classes. The transcend
set is.
system measures the confidence of a prediction and combines the
· Population Generation: The initial population of individuals
value with the confidence the predictor has in other classes to de-
is generated randomly from the pool of available chromo-
termine how credible the prediction is. Predictions that fall below
somes; in most cases, chromosomes represent features that
the credibility threshold will have to be investigated manually by
will create a feature set.
an IT team or some administrative presence. Transcend does not
· Selection The selection phase is designed to take the fittest
use any evolutionary feature selection algorithms to train algo-
individuals and allow them to pass their genes onto the next
rithms; however, the framework proposed by them uses a similar
generation. In the context of a detection system, these would
structure and approach to an evolutionary algorithm.
be feature sets that achieve the highest accuracy.
The system proposed in Kantchelian et al. (2013) combines
· Crossover: Crossover is the process of two selected individuals
human intervention with underlying machine algorithms to ad-
being mated to produce a child, which will be a combina-
dress concept drift in an adversarial machine learning scenario;
tion of both parents. This phase can be repeated with the
this system attempts to classify adversarial learning as an evo-
offspring and so forth but can be limited to a select number
lutionary family of the training dataset. The system proposed in
of generations.
Kantchelian et al. (2013) stresses the need for retraining and hu-
· Mutation: Genes of the offspring can be subject to mutation
man interaction to handle concept drift effectively. The type of
with a low random probability, in the context of a feature
concept drift addressed in Kantchelian et al. (2013) is a type of
selection algorithm, this can mean inheriting a random fea-
drift introduced by adversarial techniques that are not addressed
ture that does not exist in either parent.
in any other related studies referenced in this study. The system
proposed in Maggi et al. (2009) uses anomaly detection to dis- The StreamGP algorithm (Folino et al., 2007) constructs an
tinguish between genuine changes in a web application and ma- ensemble Genetic Programming and a boosting algorithm. The
licious changes; however, this system also relies on retraining to StreamGP system generates decision trees that are trained on dif-
adapt to concept drift and reduce false-positive rates. The system ferent parts of a data stream. StreamGP has a concept drift detec-
proposed in Maggi et al. (2009) is unique because it looks specifi- tion system inbuilt, which, once triggered, will build a new clas-
cally for malign and benign changes; despite this, the retraining of sifier using CGPC, the cellular genetic programming method de-
parts of the model is necessary to adapt to the detected changes. scribed in Folino et al. (2007). The populations of data in this algo-
The systems that address concept drift seem to rely on retrain- rithm are sets of individual data blocks which are initially drawn
ing and human intervention instead of having a specifically con- randomly. The newly created classifier is added to the ensemble,
structed mechanism to counteract the effects of concept drift. and the weights of each classifier are then updated; this system
The system used in Singhal et al. (2020) uses the Heterogeneous creates a new classifier when concept drift is detected rather than
Euclidean Overlap Metric (HEOM) to detect concept drift in detect- constantly adapted to the newest block of data like the EACD pro-
ing malicious web URLs. The system combines Gradient Boosted posed in Ghomeshi et al. (2019).
Trees to detect malicious URLs and the GTB algorithm with the The EACD system (Ghomeshi et al., 2019) proposes a genetic al-
HEOM measurement. The concept drifts detection component of gorithm approach to combatting concept drift. This evolutionary
the system in Singhal et al. (2020) attempts to identify the differ- algorithm is multi-layered with a base and a genetic layer; both
ences between the data distribution between the old training data layers act as a natural selection mechanism to find the strongest
and the new incoming data. The distance between the training set feature set. The base layer will select a set number of features
and the newer data is calculated using the HEOM. The research and save them as feature sets. These feature sets are saved and
presented in Tan et al. (2018) attempts to detect concept drift in evaluated. The highest performing feature sets are passed into a
malicious URL detection systems and uses the Wilcoxon Rank-Sum. secondary genetic layer that will ”breed” feature sets by randomly
The Wilcoxon Rank-Sum test is a non-parametric test that allows crossing strong feature sets to create strong offspring. This breed-
the user to determine whether two samples are from the same ing step is carried out until the overall system’s accuracy is higher
population. In the context of malicious web URLs, the Wilcoxon on the newest data. The number of repetitions of the breeding step
Rank-Sum test allows the system to determine whether the incom- is defined by the maximum number of generations the system will

3
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

allow. This genetic approach produces promising results, finding 3.1.1. Concept drift
optimal feature sets for systems that model scenarios that present Concept drift is defined as the change in relationships be-
concept drift. tween inputs and output data in the underlying problem over time
The Online Genetic Algorithm (OGA) Folino et al. (2007) is (Brownlee, 2017). Concept drift will make classifiers degrade over
a rule-based learner that updates its ruleset based on the data time leading to more incorrect classifications. Incorrect classifica-
stream’s evolution. Like the base layer in the EACD system, the tions in the context of malware detection can cause problems. A
initial rulesets are chosen randomly, and the genetic algorithm is malware analysis team would have high standards for abandon-
applied when a new block of data is encountered to update the ing an ageing classification model (Jordaney et al., 2018). In the
rulesets. This process is repeated until the end of the data stream. context of a ransomware classification, a model would have to be
Each block of data is a different iteration, which leads to a large constantly monitored for signs of concept drift due to the damage
number of iterations. OGA does not limit the number of iterations one ransomware infection can cause. Concept drift can occur grad-
the algorithm can go through, which means it can become very ually over time or artificially to cause classifiers’ errors, as stated
expensive. in Kantchelian et al. (2013).
Concept drift can fall into the following three categories;

2.3. Ransomware detection · Gradual Concept Drift: A gradual change over time.
· Cyclical Concept Drift: A recurring or cyclical change.
Ransomware detection research that integrates concept drift is · Abrupt Concept Drift: A sudden or abrupt change.
a rarity in the research space. The Elderan system described in
Sgandurra et al. (2016b) considers zero-day attacks and tests on The relationship between a classifier and its predictions is de-
samples that the model has not been trained on. The Elderan sys- fined as p(y|x ) and concept drift can be defined as changes in
tem’s accuracy drops from 96 to 93% when exposed to zero-day p(x, y ) (Zuhair et al., 2019). The changes in this joint probability
threats; however, it is unclear if the zero-day threats are more than can be identified through its components, suggesting that different
a couple of months ahead of the training set. The explicit test- detection aspects can cause concept drift.
ing on zero-day threats is explored in Takeuchi et al. (2018) and The FeSA system is built to adapt to sudden concept drift and
VinayKumar et al. (2017), similar to the Elderan system, which can gradual concept drift. Sudden concept drift is the type of concept
be considered testing under concept drift; however, the zero-day drift that poses the biggest threat to a malware detection system.
samples are not guaranteed to display concept drift in regards to The sudden appearance of new ransomware which does not con-
the training samples. form to a model’s current configuration is a problem that cannot
The RansHunt system described in Hasan and Rah- be solved by retraining unless the retraining is done before the
man (2017) attempts to predict future ransomware trends by system is exposed to the new ransomware. FeSA is effective when
training on ”Ransomwall”, a ransomware hybrid that authors dealing with gradual concept drift because the system is built us-
predicted to be a future ransomware type. According to the cre- ing features from different distributions. Using different distribu-
ators of RansHunt, a worm component would be used to spread tions to build the FeSA feature set allows the system to capture the
ransomware through a compromised network. This prediction best possible feature set, which applies to ransomware from differ-
approach and preparation for future trends could prevent models ent eras. Capturing common features from many different types of
from degrading under concept drift. The system explored in [35]) ransomware from different periods gives FeSA the best chance of
explores using a generative adversarial system to produce vari- having features that will remain relevant in the future.
ations of ransomware that might deceive ransomware detection
systems; this approach is designed to highlight the need for 3.2. FeSA architecture
ransomware detection systems to be reinforced.
We propose FeSA, a feature selection architecture for ran-
3. FeSA- feature selection architecture somware detection under concept-drift. The FeSA architecture is
shown in Fig. 1 and is comprised of three main components. FeSA
The previous sections in this study discussed malware evolution architecture is built following the structure of a genetic algorithm.
and ransomware. This section introduces our proposal to combat The FeSA architecture needs to be provided with an initial fea-
the concept drift in ransomware detection systems. The FESA sys- ture pool to create feature sets with. The number of features in
tem proposes using an architecture, which generates feature sets this initial feature pool is user-defined. The larger the number
for ransomware detection systems through information gain and of features in the initial feature pool, the larger the number of
a genetic algorithm. Our approach relies on the user’s underlying unique and diverse feature sets the base layer can create. The fea-
machine learning algorithm but is compatible with any machine ture ranker selects a set of ”important” features from the feature
learning approach. The underlying machine learning algorithm will pool to pass onto the feature base layer. The base layer gener-
be the classifier trained on the feature set produced by the FeSA ates a set of random feature sets from the feature pool, ensuring
architecture. Genetic algorithms are proven effective for concept these feature sets include the important features. The feature sets
drift scenarios when used by the systems described in Folino et al. in the base layer are evaluated, and their detection rate and over-
(20 06, 20 07) and Vivekanandan and Nedunchezhian (2011); the all accuracy are calculated. The feature sets that achieve accuracy
obtained results lead us to FeSA, which does not entirely rely on and detection rates above the average accuracy and detection rates
the natural selection mechanism to produce an optimal feature set. of all of the feature sets in the base layer are defined as high-
performance and passed onto the genetic layer. The genetic layer
performs a breeding crossover procedure involving selecting two
3.1. Preliminaries high-performance feature sets from the base layer and combining
them to produce a new feature set; the user defines the number
This section contains necessary background information on of times the crossover process is repeated. In theory, the combina-
concept drift and genetic algorithms. This section also presents tion of high-performance feature sets from the base layer should
Table 1, which gives the notation of the symbols used throughout produce new feature sets which can achieve higher accuracy and
the paper. detection rates than feature sets combined to create them.

4
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

Table 1
Notations.

Symbol Explanation

xi A feature in a feature set.

x The feature x does not appear
IG(xi ) Information Gain for a feature xi
ci The classiﬁcation of an instance into category i
p( c i | x ) Conditional probability of the ith category given the feature x appears.
p( c i | x ) The conditional probability of the ith category given the feature x does not appear.
|Z | The size of the set of important features. The important feature set is added to every feature set produced by FeSA.
a The proportion of features from the feature set which meet the requirements for being important features.
T(f) Total features in the initial feature set.
|N | The size of a feature set generated by the base layer, this feature set is part of the ﬁrst generation of feature sets produced.
r A proportion of the original feature pool.
H High performance feature sets.
m Maximum feature set limit.
dr Average detection rate of feature sets in the base layer.
ar Average accuracy of feature sets in the base layer.
Yi A feature set produced in the base layer.
Hrand1 · A selected parent feature set in the genetic layer.
Hrand2 A selected parent feature set in the genetic layer.
Oi An offspring feature set in the genetic layer.
T A set of offspring feature sets.

Fig. 1. FeSA: Feature Selection Architecture.

3.2.1. FeSA feature ranker algorithm appear.

The initial population of feature sets is randomly generated;
however, the FeSA architecture uses a feature ranker to identify
m
m

the highest information gain features. Information gain reduces the IG(xi ) = − p( c i ) + p( x ) p(ci |x ) · log p(ci |x )
i=1 i=1
complexity of the generated important features because random
feature selection requires multiple selections to ﬁnd the optimal
m

set. FeSA controls the base each feature set is built upon, ensur-
+ p( x ) p(ci |x ) · log p(ci |x ) (1)
i=1
ing strong feature sets. Before generating the initial population,
a feature ranking algorithm is proposed to decide which features
The feature set, taken from the feature ranker, is defined in Eq.
are most important. The feature ranker is the base component of
(2). The variable a is dependent on the features defined by the
our system because it ranks features in order of their importance
feature ranker as essential. T ( f ) represents the total features in
and attaches a numerical value to this ranking. Information gain
the original feature pool. Our FeSA implementation chooses fea-
is calculated according to Eq. (1). The feature importance step is
tures with an information gain equal to or greater than 0.5 as ”im-
designed to provide the initial ”building blocks” for each feature
portant” features. The decision to set 0.5 as the threshold value
set. The ranker algorithm uses information gain to isolate the most
was based on the fact that information gain is a reduction in en-
important features; it determines information gain and then ranks
tropy, a measure of randomness; therefore, features were chosen,
features to gain information. Information gain is the reduction in
which took away at least half of the data’s randomness. Based on
entropy after a dataset is split on an attribute. Entropy is defined
experimental observations, very few features exceeded or matched
as a measure of randomness in information; therefore, the higher
this value during our experiments which meant a threshold of 0.5
the entropy, the harder it is to draw any conclusions from the data
would mean only some features are selected as ”important”. Z is
Zhou (2019).
defined as the set of important features, which every feature set
Information gain (IG ) is a reduction in entropy when splitting
must contain, |Z | is the cardinality of this set. An algorithmic rep-
on an attribute and is calculated in Eq. (1), ci represents the ith
resentation of the feature ranker is shown in Algorithm 1. In addi-
class category i.e. ransomware or benign, and p(ci ) is the proba-
tion to defining the key features included in each feature set, the
bility of ith category. p(ci |x ) is the conditional probability of the
ranker eliminates features deemed to provide 0 information gain.
ith category given the feature x appears. and p(ci |x ) is the condi-
The FeSA system uses the ranker to identify features that present
tional probability of the ith category given the feature x does not
zero information gain and excludes them from the base layer and

5
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

Algorithm 1 FeSA Feature Ranker to avoid massive computational costs. FeSA evaluates the feature
sets that have been generated using the initial population with a
Input: Initial features x0 , ..., xi
random forest classifier. The user’s use of the underlying algorithm
Output: Important feature set Z
is flexible and determined based on their features and data. The
1: for Initial Features x0 to xi do
random forest performed best with our features; therefore, it is
2: Calculate Information Gain for each feature using:IG(xi ) =
m m chosen as our underlying algorithm. The feature sets will be eval-
− m i=1 p(ci ) + p(x ) i=1 p(ci |x ) · log p(ci |x ) + p(x ) i=1 p(ci |x ) · uated on overall accuracy and ransomware detection rate; there-
log p(ci |x )
fore, only the most accurate feature sets with the highest detec-
3: if IG(xi ) 0.5 then
tion rates are passed onto the next phase. The highest perform-
4: Add xi to important feature set Z
ing feature sets are determined by calculating the average accuracy
5: end if
and ransomware detection rates of all feature sets and passing on
6: Return important feature set Z
the feature sets with accuracy and detection rate above the aver-
7: end for
age. The structure of the feature set generation layer is shown in
Algorithm 2. The abbreviations used in Algorithm 2 are as follows,
subsequent genetic layer.
Algorithm 2 FeSA Base Layer
a
|Z | = (2) Input: Initial features x0 , ..., xi ,Important feature set Z
T(f)
Output: High performance feature sets H
Algorithm 1 shows the operation of the feature ranker. The fea- 1: m → maximum feature sets
ture ranker takes an initial set of features x0 to xi and calculates 2: Average detection rate dr = 0
each feature’s information gain IG. If the feature xi has an informa- 3: Average accuracy ar = 0
tion gain value above or equal to 0.5, it is added to the important 4: Total detection rate td = 0
feature set Z. Each feature in the important feature set is denoted 5: while feature set count< m do
as zi . 6: Generate feature set Yi
7: Add important features Z to Yi
3.2.2. FeSA fitness function 8: end while
The fitness function used by FeSA calculates the average detec- 9: Calculate detection rate of Yi using:
tion rate and accuracy amongst all feature sets in the current gen- 10:
TP
True Positive Rate (TPR) = T P+ FN
eration. The highest performing feature sets which display above 11: Calculate accuracy of Yi using:
average detection rates and accuracy are passed onto the next gen- 12:
T PR
Accuracy (ACC) = T PR+T N+F P+F N
eration by the fitness function. The fitness function is used in the 13: Calculate average detection rate dr using:

base layer and the subsequent genetic layer. Our fitness function m
T PR (Y )
14: dr = i=0 m i
uses the ranker’s values, but indirectly as opposed to directly in
15: Calculate average accuracy ar using:
its calculations. The ranker will enforce features with the highest m
ACC (Y )
information gain and eliminate features with no information gain, 16: ar = i=0 m i
thus ensuring that the feature sets produced in the base and ge- 17: for Y0 ..., Yi do
netic layers will provide as much information as possible while re- 18: if detection rate & accuracy of Yi > dr & ar then
moving excess features that provide no information. 19: Add Yi to high performance feature sets H
20: end if
3.2.3. FeSA base layer 21: end forreturn H
The FeSA base layer acts as the initial population generation re-
quired by a genetic algorithm. The base layer randomly generates True Positive (TP), False Positives (FP), False Negatives (FN), True
feature sets from a pool of initial features. The initial population Negatives(TN).
is a requirement of a genetic algorithm and is needed in order to Algorithm 2 shows the operation of the FeSA architecture base
generate strong feature sets in the genetic layer; the main differ- layer. The initial features are taken, and new feature sets are gen-
ence between the base layer and a regular population layer is that erated, including the important feature sets Z, denoted as Yi . The
the ranker has already defined a set of features that are enforced important features from the feature ranker are added to each
in each generated feature set. The ranker enforcing important fea- generated feature set. The average detection rate and accuracy
tures in the base layer feature sets means the base layer feature of every feature set generated, Yi , is calculated, and if it shows
sets will already have higher accuracy than if the feature sets were above-average performance, it is placed in the H, the set of high-
randomly generated. The base layer follows on from the ranker. performance features.
r
|N | = ( · T ( f )) + |Z | (3)
100 3.2.4. FeSA genetic layer algorithm
The number of features selected per feature set is shown in Eq. The FeSA genetic layer acts as the crossover phase in a ge-
(3), where r is the proportion of the initial feature pool in each netic algorithm. The genetic layer is needed to produce strong
generated feature set and N is a feature set generated by the base feature sets. The feature sets are expected to reach optimal per-
layer. N is calculated as r divided by 100, which obtains a propor- formance after the crossover phase has been completed multi-
tion of T ( f ), T ( f ) being all the of the features in the initial fea- ple times. The feature sets produced in the genetic layer will be
ture pool plus |Z | which is the important feature set chosen by the candidates for the optimal feature set. The genetic layer contains
ranker. the high-performance feature sets from the base layer and will
This process of generating feature sets is repeated as often as combine these high-performance feature sets using the uniform
the user defines and will define the population size for each gen- crossover method to yield more accurate feature sets. The genetic
eration. The number of repetitions will be balanced with a defined layer has the advantage of enforcing important features in each
size for each feature set. There are many features in the feature feature set; therefore, the iterations needed for the feature sets to
pool; therefore, possible combinations should be heavily regulated reach optimal performance is reduced in theory.

6
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

The genetic selection layer is a breeding mechanism for the 4. Experimental setup
highest performing feature sets taken from the initial feature selec-
tion layer. The genetic layer is made up of ”parent” and ”offspring” Our experiments’ main aim was to test the FeSA architecture’s
feature sets. The ”parent” feature sets are the high-performance effectiveness and compare it to a genetic and an evolutionary-
feature sets from the base layer. The ”offspring” feature sets are based feature selection algorithm. The architecture is evaluated in
produced by choosing two-parent feature sets and combining them scenarios where the test samples display concept drift and do not
with a crossover function. High performing ”parent” feature sets behave according to what the classifier expects and compare it
will be combined using uniform crossover, generating ”offspring” with other similar feature selection algorithms. FeSA has also com-
feature sets. In theory, the offspring feature sets will display a pared a greedy stepwise algorithm, genetic search, evolutionary
higher performance level than the preceding generation. Uniform search, best-first search and harmony search. Table 2 shows the
crossover takes two-parent feature sets and combines them. For formulas used to calculate the performance metrics. We refer to
each corresponding feature in each parent feature set, the feature the TPR of ransomware as detection rate throughout this paper.
the offspring feature set receives is determined by a coin-flip; this
is a probability of 0.5. The crossover function used by FeSA is user- 4.1. Testbed
defined; however, for our purpose and need to enforce particular
features into feature sets, the uniform crossover function proved to 4.1.1. Environment
be the most efficient. The resulting offspring feature sets are eval- Our test environment consists of a Cuckoo sandbox analysis en-
uated, and the feature set with the highest average detection rate vironment that generates the data used for our datasets. Each ran-
and overall accuracy is chosen as the optimal feature set. An im- somware and benign executable in our dataset was executed in a
portant factor in this phase is that only one generation generates virtual machine running Windows 7 in VirtualBox. The virtual ma-
the optimal feature set. The structure of the genetic layer is shown chines were cloaked and hardened by VMCloak and Paranoid Fish.
in Algorithm 3. The virtual machines were hardened to make them look and be-
have as close to a physical machine as possible. The hardening pro-
Algorithm 3 FeSA Genetic Layer cess was undertaken due to modern malware using anti-sandbox
technology to prevent proper execution in a sandbox environment.
Input: High performance feature sets H
Each execution was limited to two minutes; this was the default
Output: Optimal feature set Oi
for Cuckoo Sandbox. The machine learning platform used for the is
1: m → Max feature set count WEKA(Waikato Environment for Knowledge Analysis), a collection
2: n → Current feature set count of machine learning algorithms for data analysis.
3: Offspring feature set T
4: while n < m do 4.1.2. Data
5: Select random base feature set 1 Hrand1 from set H The ransomware samples used are from 2013 to 2019; the sam-
6: Select random base feature set 2 Hrand2 from set H ples from 2013 to 2015 were gathered using the Elderan dataset
7: Perform uniform crossover using Hrand1 and Hrand2 & Gener- (Sgandurra et al., 2016a), which contained a list of hashes for each
ate mixed feature set Oi ransomware sample they used. These samples are used as there
8: if Duplicate features are detected thenReplace duplicate fea- was a wide range of ransomware from 2013 to 2015. The samples
ture with a random feature xi from feature pool from 2016 to 2019 were gathered based on popularity and how
9: end if much each ransomware family has made in ransoms. The dataset
10: Add Oi to offspring set T consists of 639 ransomware files and 531 benign files; the benign
11: end while files are a mix of windows executables that includes legitimate
12: return Optimal Oi ∈ T which has the highest average detection software that behaves similarly to ransomware, such as AxCrypt,
rate & accuracy. Bitlocker 7zip and VerCrypt. Our experiments are carried out using
a random forest and 10-fold cross-validation. The random forest al-
gorithm is used because it is the algorithm that performs the best
Algorithm 3 shows the genetic layer of the FeSA architec-
with our API features. Our base data set contains 400 benign and
ture. The High-performance feature sets from the base layer. Two
531 ransomware samples. New ransomware files and new benign
random high-performance feature sets are selected Hr and1 and
files were added for each round of experiments in concept drift
Hr and2, and uniform crossover is carried out to mix the two high-
conditions. Each round of experiments uses the most prominent
performance feature sets to create a new feature set Oi . The pro-
ransomware samples from 2017 to 2019. API (Application Program-
cess of mixing the high-performance feature sets is repeated m
ming Interface) data is extracted for each sample. The API calls dic-
times until completion. The best performing of these newly gen-
tate how an executable interacts with the OS and what functions
erated feature sets is stored in set T . The best performing feature
an executable invokes.
set out of the newly generated feature sets in T is selected as op-
This research uses a nonconformity measure to prove concept
timal.
drift exists in the datasets used. The credibility p-value of each
prediction is to measure the credibility of each prediction a clas-
3.3. Mutation sifier makes. The p-value measures the proportion of instances,
which are as different or more different from the rest of the in-
The mutation function in a genetic algorithm is the introduction stances in the dataset as the new instance z. A high credibility
of diversity. A mutation would mean an offspring feature set in- value means that z is very similar to the objects in the class chosen
heriting a feature not present in either parent feature set in a fea- by the classifier, and low credibility would imply the opposite. The
ture selection context. During the crossover phase, duplicate fea- experiments were carried out using the random forest classifier;
tures are prohibited from being in a feature set. If there is a fea- therefore, the prediction probabilities extracted from the random
ture set with duplicate features, duplicates will be replaced with forest are used to calculate the p-values needed to prove drift. For
a random feature from the feature pool, leading to a 0.01% muta- example, it can be observed that, when trained on data from 2013–
tion rate. The low mutation rate is used to eliminate unnecessary 15, the average credibility of predictions on ransomware from 2015
randomness from the FeSA architecture. was 0.9. When the 2015 model is tested on ransomware from

7
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

Table 2
Performance metrics.

Metric Calculation Value

TP
TPR (True Positive Rate) / Recall Correct classiﬁcation of Ransomware.
(T P + F N )
FP
False Positive Rate (FPR) Benign software classed as Ransomware.
FP + TN
False Negative Rate (FNR) 1-TPR Ransomware classed as benign.
TP
Precision Proportion of ransomware classiﬁcations, that are actually ransomware.
TP + FP

2016–17, predictions’ credibility drops to an average of 0.74. The 4.2.1. Detection phase
drop in credibility is present in every scenario in the series of ex- The detection phase of the framework is tested in the experi-
periments carried out. There is an average drop in the credibility mental test phase. The optimal feature set chosen by FeSA is tested
of predictions of 0.21. The drop in credibility shows that the clas- on a dataset made up of ransomware and benign files from a fu-
sifier becomes more uncertain of its predictions, which indicates ture time to the training data. Our detection phase simulates the
the data is behaving in a way it is not prepared for; this indicates system coming into contact with ransomware which displays con-
concept drift. cept drift and is from a different distribution and may behave dif-
ferently to what the classifier expects of ransomware. The break-
down of the datasets is described in Section 4.2. The detection
4.1.3. Features
phase is repeated for the algorithms FeSA is compared with.
The feature pool of 320 features consisted of API calls used by
Windows programs during execution. The 320 features fall into API
4.2.2. Genetic search algorithm
16 call categories, which the feature set sizes are based on. We
The genetic algorithm used as a benchmark for our experiments
aim to capture two features per category on average; however, this
was a basic genetic algorithm that followed the structure described
does not always prove the case due to the natural selection mech-
in section 3.1. The experiments used the configuration suggested
anism. We choose to capture two features from each category to
by WEKA genetic search algorithm (2020), a generational and pop-
limit the feature size and complexity of the crossover phase.
ulation limit of 20, a crossover probability of 0.6, and a muta-
tion probability of 0.033. The default settings in both the genetic
4.2. Experiments search and evolutionary meant these algorithms had a significant
advantage over FeSA in population generation and feature set size.
Our experiments are set up to test the strength of the feature The machine learning algorithm used with the Genetic Search fea-
sets produced by FeSA in concept drift scenarios. The experiments ture selection algorithm is the random forest to maintain consis-
compare the performance of the FeSA feature sets with other tency and fairness compared with the FeSA algorithm. The ge-
nature-inspired feature selection algorithms. The datasets used in netic algorithm used in the experiments used both overall accu-
these experiments are structured to display real-life concept drift racy and overall information gain as fitness functions as provided
scenarios, and the validity of the concept drift in these datasets are by WEKA and found minimal difference between the two; there-
tested by p-values, as mentioned in Section 4. The concept drift ef- fore, the overall accuracy was chosen as it was closest to the fit-
fect is achieved by having ransomware and benign software from ness function used by FeSA.
different periods. Each classifier is tested on data produced after
the data the classifier is trained on. The process runs on our base 4.2.3. Evolutionary search algorithm
dataset that contains ransomware and benign samples from 2013 As presented in WEKA, the evolutionary algorithm also fol-
to 2015. The optimal feature set is produced, and FeSA trains a ran- lowed a similar structure to the genetic algorithm described in
dom forest using 10-fold cross-validation and observes the results. Section 3.1; however, it uses a different configuration. The evolu-
The benchmark algorithms are tuned and run by us. The settings tionary algorithm used a tournament selection method with a mu-
used for the benchmark algorithms are tuned to compare them to tation probability of 0.1. The tournament selection approach en-
FeSA as fairly as possible. The closest algorithms to FeSA, the ge- sures that the fittest feature sets are passed onto the next gen-
netic search and the evolutionary search, are given an advantage eration. The generation and population limit were set to 20, re-
over FeSA, in which they use more generations to generate their spectively, like the genetic algorithm. The machine learning algo-
feature sets. The underlying classification algorithm used with FeSA rithm used with the evolutionary algorithm feature selection is the
and the benchmark feature selection algorithms is the random for- random forest to maintain consistency and fairness compared with
est classifier. FeSA’s results are compared to the results obtained the FeSA algorithm. The fitness function used for the evolutionary
from the greedy stepwise algorithm, genetic search, evolutionary algorithm is the overall accuracy to maintain consistency with the
search, best-first search and harmony search. After observing re- FeSA algorithm.
sults on data the classifiers would see as up to date data, FeSA tests
how the classifiers perform on ransomware and benign data from 5. Experimental results and discussion
2016 and 2017. Ransomware and benign samples from 2016 and
2017 would represent concept drift as their behavioural patterns 5.1. Experimental results
are different, as per our observations. The process of observing de-
tection rates under concept drift is repeated by training on data This section explores and elucidates our results. In our experi-
from 2013–2017 and testing on data from 2018 and again, repeated ments, the FeSA architecture used the random forest classifier with
for data up to 2019. It is observed how the feature sets produced 10-fold cross-validation. Our implementation used the WEKA API
by the FeSA architecture perform compared to the feature sets pro- and an in-house feature extraction program to create our dataset.
duced by the greedy stepwise algorithm, genetic search, evolution- The experimental findings are presented in Tables 3, 4 and Fig. 2.
ary search, best-first search and harmony search. The results of our Table 3 shows the key statistics of the detection algorithms tested,
experiments are shown in Section 5. including the FeSA architecture when concept drift is not applied.

8
D.W. Fernando and N. Komninos Computers & Security 116 (2022) 102659

Table 3
Experimental results without concept drift.