Machine Learning Algorithms and Frameworks in Ransomware Detection

This paper provides a review of machine learning algorithms and frameworks that are used for ransomware detection. It discusses common ransomware types and challenges in detecting ransomware. The paper analyzes existing ransomware detection frameworks that utilize machine learning algorithms, including the algorithms used, year of creation, results and challenges. It aims to serve as a reference for future ransomware detection work by consolidating information on current frameworks.


Received 24 September 2022, accepted 27 October 2022, date of publication 1 November 2022, date of current version 11 November 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3218779

Machine Learning Algorithms and Frameworks in Ransomware Detection
DARYLE SMITH, SAJAD KHORSANDROO, AND KAUSHIK ROY
Department of Computer Science, North Carolina A&T State University, Greensboro, NC 27411, USA
Corresponding author: Daryle Smith ([email protected])

ABSTRACT Ransomware has been one of the biggest cyber threats against consumers in recent years. It can
leverage various attack vectors while it also evolves in terms of finding more innovative ways to invade
different cyber security systems. There have been many efforts to detect ransomware within the workforce
and academia leveraging machine learning algorithms, which have shown promising results. Accordingly,
there is a considerably large body of literature addressing various solutions on how ransomware threats
can be detected and mitigated. Such a large and rapidly growing body of scientific and technical material makes
it difficult to know which ML algorithm(s) are actually being used. Hence, the aim of this paper is to give
insight about ransomware detection frameworks and those ML algorithms which are typically being used to
extract ever-evolving characteristics of ransomware. In addition, this study will provide the cyber security
community with a detailed analysis of those frameworks. This will be augmented with information such as
datasets being used along with the challenges that each framework may be faced with in detecting a wide
variety of ransomware accurately. To summarize, this paper delivers a comparative study which can be used
by peers as a reference for future work in ransomware detection.

INDEX TERMS Artificial Neural Network (ANN), cyber security, deep convolutional neural network
(DCNN), deep neural network (DNN), Hardware Performance Counter (HPC), Long Short Term Memory
(LSTM), machine learning (ML), ransomware, Recurrent Neural Network (RNN), Sum of Product (SOP),
Support Vector Machine (SVM), Term Frequency and Inverse Document Frequency (TF-IDF), The Onion
Routing (TOR).

I. INTRODUCTION

Ransomware has been a threat against typical end users, business units, and the government in recent years. For example, it has targeted medical centers, schools [1], universities [2], and police departments [3], to name a few. It was even predicted that ransomware would account for around $20 billion in loss alone towards organizations in 2021 [4]. Ransomware is a form of malware designed to control access to data or a system until a requested ransom amount from the attacker is satisfied [5]. Detection of ransomware is tricky and a resource hungry task because it is hidden within the application layer payload. Mitigation can also be difficult because of the use of encryption against the application. Though more studies and evaluations have been involved in other areas of malware, ransomware specifically has not been the focal point, and the push to improve security measures and discovery have been stagnant [5]. There are two types of ransomware: first, locker-ransomware, which is designed to lock the victims’ computer, to prevent them from using it; second, and most common nowadays, is crypto-ransomware, which encrypts personal files to make them inaccessible to its victims [55]. Frameworks applying static and dynamic analysis, as well as ML algorithms, have been aiding with ransomware detection, and due to the nature of executing the ransomware, a high accuracy rate would be expected. However, analysis takes a relatively long time, leaving gaps where the malicious payload can intrude the sandbox system without detection. This entire process alone is overly complicated. In this paper, we focus on those ML algorithms that are mostly used in ransomware detection. We also provide a brief review of common ransomware frameworks using such algorithms, along with their results.

When it comes to ransomware attacks, cybercriminals have perfected these techniques over the years. However, both academia and industry have been trying to address these threats and protect victims by learning from past experiences and utilizing technological advancements over time [6].

The associate editor coordinating the review of this manuscript and approving it for publication was Jun Wu.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 10, 2022
D. Smith et al.: Machine Learning Algorithms and Frameworks in Ransomware Detection

Nonetheless, these attacks are growing daily. The underlying reason is that combating ransomware is challenging [7]. Ransomware typically relies on strong encryption that is easy to accommodate due to the vast amounts of open-source implementations. It also makes use of most infection techniques that are employed by modern malware families. Ransomware benefits from the common elusive methods utilized by modern malware and it frequently uses application programming interfaces (APIs) to carry out malicious actions that make it difficult to differentiate from benign applications. Furthermore, it uses TOR networks (The Onion Routing networks) to keep its communication anonymous, and unregulated payment techniques like cryptocurrencies to get paid without easily disclosing the identity of the attackers [8].

The remaining structure of this paper is organized as follows and investigated perspectives can be found in Figure 1: Section 2 reviews required research background and provides a comprehensive review of different ML algorithms used in ransomware detection. Section 3 will provide an analysis of ransomware frameworks that use ML algorithms, their challenges, evaluations, and results. This section also expands on the importance of this research, consolidating all the mentioned frameworks in a table, providing a description of each framework’s name, the algorithm(s) of choice, the year it was created, overall results, and challenges. Section 4 will speak about some future concerns and defense topics of ransomware, ending with concluding the paper in section 5.

FIGURE 1. An overview of discussed ransomware.

II. RESEARCH BACKGROUND

This section will briefly cover different types of ransomware that are common across the cyber security community. It will also cover typical machine learning algorithms used in ransomware detection.

A. Ransom32
With the emergence of social media and its popularity in the younger generations, new ransomware families are being created by cybercriminals to target web pages with the use of JavaScript. One such example is Ransom32, which first appeared in late 2015. Until its discovery, no other ransomware attack was used with that programming language [9]. This type of ransomware-as-a-service is unique because, being written in JavaScript, it uses a web browser to initiate its attack. The impact of this threat is far superior in nature because it can be used theoretically on any operating system where a web browser exists. This grants Ransom32 so-called "write-once-infect-all" capabilities [9]. Nonetheless, Ransom32 has only been detected on a Windows-based platform thus far. It can be found on most underground TOR sites and can be downloaded by the affiliated user. To download the executable, one must have a bitcoin address, as this is the way payments of ransom are made.

FIGURE 2. Ransom32 membership and attack process flows.

The developers of the Ransom32 software take a 25% cut of any ransom made, and the rest goes to the user of the affiliated program [9]. When the Ransom32 executable runs, it extracts several files. During this process, a shortcut is created in the start menu, and the ransomware will start at login which guarantees the malware will be executed every time the system is started. The shortcut points to a chrome.exe executable file that is typically an NW.js package. This package contains JavaScript code used for encryption using AES and extracts to folders such as %AppData% and %Temp%. Furthermore, this is the piece that contributes to performing the harmful events towards the compromised system [10]. With NW.js being a legitimate framework and application, antivirus coverage in this area is still very weak in nature. Any black hat or white hat developer can use this executable to create and distribute native apps that run just like normal executables [11]. Furthermore, when looking closer into Ransom32, it runs under the context of the user without having any administrative rights or permissions. Figure 2 gives a general idea of how a member can join the affiliate network, then ultimately be granted access to download the malicious code for use. The member would also be able to see statistics related to the software such as the number of payments that


have been made and the number of installations that have been completed. The figure also shows the process of how a Ransom32 attack can occur.

FIGURE 3. A deobfuscated function installing Pony.

B. RAA
The second example of JavaScript ransomware is RAA, which spreads via email attachments pretending to be legitimate document files. These files typically have a valid file format with a JavaScript extension, making the victim believe in their authenticity. Once the file is opened, it works just as any other ransomware attack. The victim host’s files will be encrypted, and a ransom will be demanded. RAA also infects the victim’s computer by installing Pony, a well-known password-stealing malware embedded in a JavaScript file. A sample code of how this happens can be found in Figure 3. This malware can collect browser passwords and other critical information on infected systems. Two security researchers initially discovered RAA and according to them, it encrypts files using code from an open-source library called CryptoJS [12]. This code handles cipher algorithms such as AES and DES, to name a few [12]. RAA targets images, Ms-Word, Ms-Excel, Photoshop, .zip, and .rar files, appending ".locked" to the end of their filenames, while sparing program files, Windows files, AppData, and Microsoft files [12]. Upon further analysis, Trend Micro™ discovered that the RAA ransomware is written in Jscript and not JavaScript [13]. Jscript is designed for Windows® systems and executed by the Windows Scripting Host Engine through Microsoft Internet Explorer (aka IE), but not via the Microsoft Edge browser. Jscript carries some semblances with JavaScript because they are both derived from ECMAScript.¹ Jscript is the Microsoft implementation of ECMAScript while JavaScript is the Mozilla implementation of ECMAScript [53]. Jscript can access objects exposed by IE and some systems objects such as the Wscript² [54]. It is believed that the attackers behind the RAA ransomware are using the Jscript scripting language to make detection more difficult and to make complications easier. Most malware attacks are written in compiled programming languages with ransomware often disguised as executables. Nonetheless, using languages which are not typically used to deliver malware, such as scripting languages, could be less prone to detection [13].

C. CoronaVirus
Ransomware affiliates switched to COVID-19-themed social engineering tactics during the 2020 pandemic to carry out threats [14]. Mobile applications that looked legitimate would download various forms of ransomware using spam attachments that claim to provide health and safety information about COVID-19 [15]. As the global pandemic increased the need for health centers, the exposure to cyber-attacks also grew. This situation increased the number of ransomware attacks within the health sectors, and Corona ransomware was born [16]. This was a new strain that focused specifically on hospitals and the encryption of patients’ health records. After the host became infected, it displayed a COVID-19-themed ransom message and demanded payment in Bitcoin [14].

FIGURE 4. CoronaVirus delivery flow.

The Covid-19 pandemic also opened the doors for many ransomware attacks against employees. Due to the threat of catching the virus, many companies began to offer employees the opportunity to work remotely [43]. This increases exposure to cyber-risks because individuals connect through less reliable and unsecured Internet connections. Employees that accessed corporate networks using personal devices provided a way for data to get into the hands of unauthorized individuals through unsanctioned channels [43]. Attackers also focused heavily on sophisticated phishing techniques. According to [44], an APWG report showed 267,372 phishing campaigns were reported in the first quarter of 2020, an increase of 19.06% over the same period in 2019. In Figure 4, the CoronaVirus delivery flow begins with a phishing website, locking the file system, then fully compromising the hard drive until the ransom has been paid.

¹ ECMAScript is a Javascript standard that helps ensure the interoperability of web pages across different browsers.
² Wscript are generic scripts specifically executed in Windows based platforms.


D. WannaCry
WannaCry targets computers running the Windows operating system [45]. It spreads through EternalBlue, an exploit for a flaw in Microsoft’s SMB v1 protocol, encrypts a victim’s data, and then demands payment in Bitcoin once the infection has taken place [46]. This vulnerability allows the adversaries to execute remote code on the infected machines by sending specially crafted messages to an SMB v1 server, connecting to TCP ports of unpatched Windows systems [47]. WannaCry also works as a network worm because it includes a transport mechanism to automatically spread itself; Figure 5 illustrates this. This feature makes the attacks more effective and requires defense mechanisms that can react quickly and in real time. Furthermore, WannaCry has an encryption component that is based on public-key cryptography. The virus impacted more than 200,000 computers in over 150 countries [48]; Ukraine, India, Russia, and Taiwan were the four most affected countries [49]. Vladimir Putin, president of Russia, blamed the United States for the attack due to their involvement with developing EternalBlue, but it was later determined a group of North Korean hackers were responsible [50]. Microsoft began providing patches for older system versions on the day of the outbreak, but the count of attacked systems continued to rise as new versions and variants of the ransomware were constantly released [51]. The spread of the virus was slowed by the work of Marcus Hutchins, who discovered a "kill switch" inside the virus [52], [60].

FIGURE 5. WannaCry process flow as a network worm.

E. RYUK
In mid to late 2018, a new type of ransomware started targeting specific victims and carried out its attacks against enterprises [56]. Ryuk mostly infected its targets via other malware [57] and attacks would disable the Windows system restore option, making it impossible to restore encrypted files without a backup [58]. During infection, Ryuk first shuts down 180 services and 40 processes [59]. These services and processes could prevent Ryuk from doing its own work and are needed to facilitate the attack. At that moment, the encryption logic can begin. Ryuk encrypts files such as photos, videos, databases, and documents – all the data you care about – using AES-256 encryption. The symmetric encryption keys are then encrypted using asymmetric RSA-4096. Ryuk can encrypt remotely and perform Wake-on-LAN, waking computers for encryption [59]. These abilities contribute to the effectiveness and reach of its encryption and the damage it can cause. Ryuk accounts for three of the top 10 largest ransom demands of the year in the CrowdStrike 2020 Global Threat Report, with amounts of USD $5.3 million, $9.9 million, and $12.5 million [59]. The Russian hacker group, WIZARD SPIDER, is said to be the creator of Ryuk, and in 2020 during the coronavirus pandemic an attack was targeted against Universal Health Services [59]. The Fortune 500 company has health care facilities in both the US and UK, with exposure stemming from a phishing email [61]. Figure 6 demonstrates how Ryuk can attack an Active Directory that has been misconfigured.

FIGURE 6. Common Ryuk attack with regards to a misconfigured active directory.

F. MACHINE LEARNING WITH RANSOMWARE
Machine Learning (ML) has become a mature technology that is being applied to a wide range of business problems such as web search, online advertising, product recommendations, object recognition, and so on. As a result, it has become imperative for researchers and practitioners to have a fundamental understanding of ML concepts and practical knowledge of end-to-end modeling [17]. Machine Learning involves the use of statistical methods for the detection of patterns within data and those patterns are constructed against mathematical models [17]. These models are then used to make predictions against future data. Machine Learning is being used extensively by companies across a broad spectrum of applications and there are many other areas such as game playing, unmanned cars, and automated question answering where ML is poised to drastically change the way technology affects our lives.

There have been multiple attempts to detect ransomware, and a plethora of researchers have tested against several frameworks. Various surveys have condensed ransomware characteristics and attacks to provide a full spectrum of what ransomware really is, how it works, and how to limit its threat. However, a study against Machine Learning algorithms specifically used to detect ransomware and the classification of those frameworks have not been addressed directly. Due to the advancement of social media and enriched websites involving user input and interaction, it is important to understand what algorithms are providing the best results in detecting ransomware so that the research community can improve the areas that are not working. Furthermore, as the threat of ransomware continues to grow, having a direct go-to guide of algorithms that have proven to be effective will


help the research community spend less time analyzing irrelevant information. To fill this research gap, a broad study of ransomware detection frameworks and tools that utilize ML algorithms is presented. This research shows how those frameworks are used in relation to ransomware detection, the ML algorithm of choice, and the accuracy with which they predict and classify ransomware.

FIGURE 7. A breakdown of decision tree using ransomware sample data.

The following sections will discuss the typical algorithms that are used in detecting ransomware. These algorithms are performed under multiple trials of data sets and are often combined with other algorithms using cross validation analysis.
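As a rough illustration of the cross validation analysis mentioned above, the following sketch splits a data set into k folds, trains on k-1 of them, and scores on the held-out fold. It is a minimal sketch under stated assumptions, not the procedure of any surveyed framework: the feature names, the sample values, and the `NearestCentroid` stand-in classifier are all hypothetical and chosen only to keep the example self-contained.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

class NearestCentroid:
    """Toy stand-in classifier: predicts the class with the closest feature mean."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))

def cross_validate(make_model, X, y, k=5):
    """Return per-fold accuracy; each model is trained on the other k-1 folds."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(model.predict(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Hypothetical features: (file writes per second, mean entropy of written data).
X = [[5, 3.1], [8, 3.4], [6, 2.9], [7, 3.0], [4, 3.3],
     [90, 7.8], [120, 7.9], [85, 7.6], [110, 7.7], [95, 7.5]]
y = ["benign"] * 5 + ["ransomware"] * 5
scores = cross_validate(NearestCentroid, X, y, k=5)
print(sum(scores) / len(scores))
```

Averaging the per-fold scores gives a less optimistic estimate than a single train-test split, because every sample is used for testing exactly once.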

1) DECISION TREE
Decision Tree algorithm belongs to the family of supervised learning algorithms where learning and prediction steps are performed. The model in the learning step is developed based on given training data, and in the prediction step the model is used to predict the response for given data [18]. Unlike some other supervised learning algorithms, the decision tree algorithm can be used for solving both regression and classification problems. When using a Decision Tree, the goal is to build a training model that can predict the class or value of the target variable by learning simple decision rules inferred from prior data. For predicting a class label for a record, one would start from the root of the tree. The values of the root attribute are then compared with the record’s attribute. The branch which links to that value is followed, based on the comparison, and jumps to the next node [18]. The name itself suggests that it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with a decision made by leaves. There are two types of Decision Trees based on the type of target variable being used: Categorical Variable Decision Tree and Continuous Variable Decision Tree. A categorical variable decision tree is illustrated by Figure 7.

Decision Trees follow the Sum of Product (SOP) representation. Every branch from the root of the tree to a leaf node having the same class is the conjunction (product) of values, and different branches ending in that class form a disjunction (sum). The main purpose of a Decision Tree is to detect which attributes are needed to consider as the root node. The tree’s accuracy is dependent upon its decisions on how it splits its nodes, and they use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes, meaning the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes [18]. One problem that must be accounted for is overfitting, which happens when a tree is overly complex and does not generalize beyond the training data. To correct this problem, a data compression technique called pruning is performed to reduce the size of the tree by removing sections that provide limited value [62].

2) RANDOM FOREST
Random Forest is a supervised learning algorithm that builds a forest of decision trees, usually trained with the bagging method. The idea behind bagging is that a combination of learning models improves the overall result. Random Forest can be used for both classification and regression problems and typically has the same hyperparameters as a Decision Tree [19]. While growing trees, this algorithm adds additional randomness to the model in hopes of producing an even better model. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features [63]. Trees can become more random by using random thresholds for each feature rather than searching for the best possible thresholds. Random forest is a great algorithm to train early in the model development process, to see how it performs. Its ease makes building a good random forest quite simple. The algorithm is also a great choice for developing quick models and showing good metrics of the importance it assigns to features. The performance of Random Forest is quite consistent and is difficult for other algorithms to achieve [19]. Random Forest algorithms are not ideal in the extrapolation of data, nor do they produce satisfactory results with sparse data. They typically will spend more time when compared to a decision tree and require more resources for computation [20]. Figure 8 provides an example of how ransomware data may be used with the random forest tree.

FIGURE 8. Random Forest tree operations.

3) LONG SHORT TERM MEMORY
Long Short Term Memory is a type of recurrent neural network (RNN) capable of learning order dependence in


sequence prediction problems. LSTMs use special units in addition to standard units; these special units include a memory cell that can maintain information in memory for long periods of time. The core concept of an LSTM is the cell state and three gate phases [21]. The cell state acts as a gateway that transmits relevant information all the way down the sequence chain. It can be thought of as the memory of the network. The cell state carries information throughout the sequence processing, allowing data from an earlier time step to be present later. This process helps reduce the effects of short-term memory. As the cell state goes on its journey, information is either added to or removed from the cell state via gates. These gates are different neural networks that decide which information is allowed on the cell state [21].
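The cell-state update can be sketched for a single LSTM unit. This is a minimal sketch using the standard LSTM equations with scalar weights, not code from any surveyed framework; the `params` layout and the weight values are assumptions made for illustration. The forget, input, and output gates it computes are the three gate phases this section describes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell.

    p maps each gate name to (input weight, hidden weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # share of the old cell state to keep
    i = sigmoid(pre("input"))         # share of the candidate to write
    g = math.tanh(pre("candidate"))   # candidate values in (-1, 1)
    o = sigmoid(pre("output"))        # share of the cell state to expose

    c = f * c_prev + i * g            # updated cell state (the "memory")
    h = o * math.tanh(c)              # new hidden state passed to the next step
    return h, c

# Illustrative weights only; a trained network would learn these values.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
```

With a forget gate near 1 and an input gate near 0, the cell state passes through a step almost unchanged, which is how information from an early time step can remain available later in the sequence.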
The first gate, the forget gate, decides whether the data should be kept from the previous timestamp or forgotten. Information from the previous hidden state and information from the current input is passed through a sigmoid function where values come out between 0 and 1. The closer to 0 means to forget the data, whereas the closer to 1 means to keep it [22]. The second part is called the input gate, and it is used to quantify the importance of the new information carried by the input. It passes the previous hidden state and current input into a sigmoid function that decides which values will be updated. It also transfers the hidden state and current input into the tanh function which helps regulate the network. The sigmoid output will decide which information is important to keep from the tanh output and it uses the same 0 and 1 approach as the forget gate. The output gate passes the updated information from the current timestamp to the next timestamp, deciding what the next hidden state should be. Whereas a plain RNN gives accurate predictions mainly from recent information, LSTM addresses the problem of long-term dependencies by retaining earlier information in its cell state. LSTM can maintain information for a long period of time and is used for processing, predicting, and classifying time-series data [21], [22], [64].

4) NAÏVE BAYES
Naïve Bayes is a classification algorithm based on the Bayes Theorem for calculating probabilities and conditional probabilities [23]. It is not a single algorithm but a family of algorithms that share a common principle. This algorithm is extremely fast and is mainly used with large datasets. It assumes that the occurrence of a particular feature is independent of the occurrence of the others and is known to outperform some of the better classification methods [24]. A Naive Bayes model consists of a large block that includes an input field name, an input field value, and a target field value. The model is used to record how often a target field value appears together with a value of an input field. The value of the probability-threshold parameter is used if one of these fields of the block is empty, which occurs when a training-data record with the combination of an input field value and target value does not exist. The NB algorithm using ransomware data is shown in Figure 9.

FIGURE 9. Code snippet of a NB algorithm using ransomware data.

5) GRADIENT TREE BOOSTING
Gradient Tree boosting is a machine learning algorithm used for building predictive models and valued for its prediction speed and accuracy, especially with large and complex data. It works with both classification and regression problems and combines weaker learners to generate a more accurate predictor. It relies on the intuition that the best possible next model, when combined with previous models, minimizes the overall prediction error. A gradient-boosted trees model is made in a stage-wise fashion as in other boosting methods, but it generalizes the other methods by allowing optimization of an arbitrary differentiable loss function [25]. It is composed of three elements: a loss function to be optimized, a weak learner to make predictions, and an additive model to add weak learners to minimize the loss function.

Overfitting is a concern: fitting the training set too closely can lead to degradation of the model’s generalization ability. Some regularization techniques reduce this overfitting effect by constraining the fitting procedure. One way to achieve this goal is by using the number of gradient boosting iterations as the regularization parameter [25]. Increasing this reduces the error within the training set. However, it must not be set too high. Monitoring the prediction error on a separate validation data set can also aid in selecting the correct number of iterations. Several other regularization parameters can be used, such as the depth of trees. A higher value of this regularization parameter typically means that the model is more likely to overfit the training data.

6) SUPPORT VECTOR MACHINE
Support Vector Machine is an algorithm used for both regression and classification tasks but is primarily used in classification objectives. It does not require high computation power but still produces significant accuracy. The support vector machine algorithm finds a hyperplane in an N-dimensional space that classifies the data points. The classification is


performed by finding the hyper-plane that differentiates the two classes very well [26]. SVMs are effective in high dimensional spaces and can still be useful in cases where the number of dimensions is greater than the number of samples. SVMs use support vectors, a subset of training points in the decision function, providing memory efficiency. The versatility of the algorithm is also a key point as it can use different kernel functions against the decision function or even use custom kernels. However, overfitting will occur if the number of features is much greater than the number of samples. SVMs do not directly provide probability estimates; these are typically calculated using five-fold cross validation. The SVM algorithm has been applied in biological science for use in the categorization of protein; it has also been widely used with the classification of images, producing higher accuracy results than traditional query refinement schemes [26]. In Figure 10, pseudo code of the SVM algorithm using ransomware sample data is shown.

By doing so, the most relevant features that provided optimal performance in detecting new ransomware were extracted. Detection models were also developed for HSR, and they utilized supervised machine learning algorithms on many prominent features. It was proven that this framework’s detection method achieved high accuracy and a lower false positive rate for detecting HSR in the early phases of ransomware. These methods have also been validated with an extensive experimental evaluation to show their effectiveness. Lastly, the capabilities of the proposed method were compared to the results of previous work, other classifiers, and VirusTotal. The framework itself is broken into 3 phases. The first phase includes gathering ransomware and benign data from a variety of sources. Once gathered, the data is checked and labeled under a particular malware family using VirusTotal software. The second phase analyzes the samples using a Cuckoo sandbox and generates a report in JSON format of its
findings. Within the sandbox, log files are submitted through
pre-processing tasks, and when finished, the relevant features
are extracted to get valuable feature sets. Those features
are applied against the term frequency and inverse docu-
ment frequency (TF-IDF) algorithm for feature selection. The
last phase uses supervised machine learning algorithms Sup-
port vector machine (SVM), and Artificial Neural Network
(ANN) to derive statistics of the data.
Three different experimental evaluations were conducted
to measure the performance of the framework. The first
evaluation used the train-test splitting method which divides
the whole data set into training and testing data. The dataset
was split randomly with a uniform distribution of 80:20
ratio as training and testing, respectively. The experimental
results of ANN showed an accuracy of 0.958 with 0.101 false
positive rates, while SVM presented a higher false positive
of 0.109 compared to ANN and an accuracy of 0.932. In the
train-test splitting method, the data can become obscure and
FIGURE 10. Code snippet of a SVM algorithm using ransomware data. irrelevant, which is why the second experimental evaluation
is used. The 10-Fold cross-validation technique can prevent
the overfitting problem and estimate the effectiveness of the
III. RANSOMWARE DETECTION FRAMEWORKS
detection models. The best accuracy obtained by SVM is
This section investigates several ML-based frameworks
by presenting 0.982 of area under the curve (AUC) with
which are widely used in detecting ransomware. Some of
less than 0.035 of false positive rate [27]. It is important
the reviewed frameworks utilize one ML algorithm while the
to examine the ability of the classifiers to distinguish the
others might use multiple. These frameworks yield promising
ransomware from benign samples. Therefore, precision and
results in detection of different types of ransomware and
recall are applied to both datasets and presents 0.945 and
have potential to be used in future research works by the
0.942 respectively. SVM also showed better accuracy of
cyber security community. In what follows, eight state-of-
0.952 when compared to MLP that showed 0.945 of detection
the-art frameworks including Behavior Based [27], DNAact-
rate and 0.036 of false positive rates.
Ran [28], RANDS [30], RATAFIA [32], RansomWall [33],
The last evaluation used selected subset features which
CryptoKnight [34], EldeRan [35], and DRTHIS [40] will be
eliminate the redundant and irrelevant features and reduces
studied and compared.
the dimensionality of the dataset. The features were divided
A. BEHAVIOR BASED into seven subset features (top20, top30, top40, top50, top60,
A proposed behavior-based framework was built for defin- top70, top80) by considering their importance and ranking
ing dynamically monitored valuable features of high sur- based on phase 2 processing. The results of the experi-
vivable ransomware (HSR) [27]. The analysis of HSR was ment demonstrated that ANN showed the highest accuracy
conducted within an isolated sandbox environment, through of 98.79% when the top30 of the feature set was used
the Term Frequency-Inverse document frequency (TF-IDF). as training and testing [27]. However, this classification


accuracy decreased markedly to 95.63% when the top20 of the feature set was used. The best SVM model presented an accuracy of 97.6% when the top40 of the feature set was applied while training the model [27]. ANN and SVM both had low classification accuracy when the top80 of the feature set was used to train and test, which indicates that more features do not improve the performance of the classifiers.

B. DNAACT-RAN
DNAact-Ran is a Machine Learning-based digital DNA sequencing engine used in classifying and detecting ransomware. It uses an active ML approach for sequencing its digital DNA and detects ransomware in three key process steps: Feature Selection, DNA Sequence Generation and Ransomware Detection. Feature selection removes irrelevant features and reduces storage and computational cost. It is considered the most important process of machine learning. Multi-Objective Grey Wolf Optimization (MOGWO) and Binary Cuckoo Search (BCS) algorithms are used to select the relevant features from the collected dataset. MOGWO uses a grid and archive approach to select the most dominant features, while BCS uses a heuristic search approach to determine its features [28]. Figure 11 gives the complete architecture of DNAact-Ran.

FIGURE 11. DNAact-Ran architecture.

In the digital DNA sequence generation step, a new dataset is used to generate the digital DNA sequence after the feature selection process is completed. The design constraint of digital DNA is then computed, and the k-mer frequency vector is generated for the DNA sequence. A new dataset is then generated for the ransomware detection training phase based on those calculations. The result is a synthetic DNA representation of a digital artifact; it does not represent the content of biological DNA. DNA is represented computationally by character strings containing only the characters A, G, C and T; therefore, Pedersen et al. [29] created a reversible translation of the byte sequence of a digital artifact which mapped binary pairs into those string characters [28]. The DNA sequence design is used as an approach of control, and DNAact-Ran uses 3 constraints (Tm, GC Content, and AT_GC Ratio) to avoid inadequate data.

The last step used by DNAact-Ran is the actual ransomware detection step. The dataset is trained using an active learning classifier. Once this process is done, digital DNA sequences are randomly generated from the test data and classified as goodware or ransomware. Finally, the ransomware family is detected using the traditional ML classification algorithms. Machine Learning applications struggle with the amount of time and effort required to interpret the large data sets that supervised learning needs in order to train a high-accuracy classifier. To solve this issue, active learning has been proposed and designed to decrease the cost by finding data points to be used by the learning algorithm. The active learning algorithm uses three parameters for determining accuracy: Smoothing Parameter (SP), Regularization Parameter (RP), and Learning Rate (LR). The data was tested against traditional ML algorithms and the experiment showed a 78.5% detection accuracy for Naïve Bayes, 75.8% for Decision Stump, 83.2% for AdaBoost, and 87.9% for the proposed solution [28]. This experiment suggests that active learning classifiers are better at detecting ransomware.

C. RANDS
RANDS is a Windows-based anti-ransomware tool that implements a multi-tier framework with a ransomware traits archive and machine learning algorithms. The architecture of RANDS is organized in three tiers: Analysis, Learning, and Detection. The first tier checks the traits of different ransomware families in a recursive test routine in a virtual test environment. The virtual environment is utilized to avoid the severe damage and malfunctions of ransomware on the platform system. The second tier studies the combined traits from the archive using a hybrid machine learning algorithm to generate the classification model. The generated classification models are also used in the last tier to detect any suspicious activity against the actions or traits. The last tier applies the classification model to detect any unknown ransomware variant via a computer scan [30], which alerts the system’s user that ransomware may be about to infect the system. The RANDS machine learning algorithm uses a hybrid approach. It uses both Decision Tree and Naïve Bayes decision functions due to their pruning margins for more accurate categorization. The Decision Tree algorithm generates its predictions of the traits within a tree structure of nodes, leaves, and branches throughout the pruning and tree building process [31]. The Naïve Bayes algorithm is used for predicting the actual category of the overlooked traits in the vague nodes of the Decision Tree. Bayes’ theorem is applied when a trait cannot be classified by the tree.

To test and demonstrate whether RANDS could adapt to zero-day ransomware variants and their corresponding families, performance metrics including Detection Accuracy Rate, Mistake Rate, Miss Rate, and Elapsed Time along


with plots of the ROC curve have been utilized through experiments [31]. The ten-fold cross evaluation routine indicated that RANDS could provide adaptive and effective classification against zero-day ransomware. This was attributed to the hybrid machine learning approach that RANDS utilized. RANDS implemented ransomware traits to distinguish different ransomware families and identify their related variants. However, the performance trend line reported on certain days showed erratic behavior, caused by the varying qualities of ransomware families and their corresponding variants. Results showed a 96.27% average accuracy rate and 1.32% average mistake rate throughout the real-time assessment [31].

D. RATAFIA
RATAFIA is an unsupervised detection framework that uses a DNN architecture and the Fast Fourier Transform (FFT) to develop a highly accurate, fast, and reliable solution to ransomware detection using minimal trace-points. The advantage of using an unsupervised technique is that the learning process does not require a labeled dataset, which is often difficult to obtain considering the occurrences of several newer unknown varieties of ransomware. RATAFIA specifically was created to learn the behavior of a system under observation with the statistics obtained from a cluster of Hardware Performance Counter (HPC) events [32]. The first phase of RATAFIA tests its robustness and uses an analysis in the presence of expensive SPEC benchmarks. It is observed that the execution behaviors of HPC events are significantly different from normal observations, and the sequences of time-series data that RATAFIA processes are treated as being malicious due to reaching computational thresholds. However, these are simply the SPEC programs creating false positive errors. The second phase uses FFT to try to eliminate the false positives by changing the HPC values from the time domain to the frequency domain. This is done to understand the repetitive pattern within the values, because a ransomware executable runs encryption repeatedly on multiple files. The entire detection procedure does not need any template of the malicious process beforehand. Instead, it relies on an anomaly detection procedure to detect infectious ransomware in as little as 5 seconds with almost zero false positives, using frequency analysis [32]. The proposed detection method works on any platform having HPCs. However, the tunable hyper-parameters will be different for different systems. The determination of values for these parameters is a one-time process, which will be accomplished during the training of autoencoders. RATAFIA uses a template of the normal system behavior in terms of HPC values to train the autoencoders. The advantage of using HPCs is that they are difficult to tamper with. While one may increase some HPC values by a program, it is difficult to reduce the HPC values without explicitly targeting the HPC registers.

E. RANSOMWALL
RansomWall protects against cryptographic ransomware using a layered defense system. The features that are generated during the sample’s execution aid in organizing the layers in a computational order. It is implemented solely for the Windows operating system. RansomWall’s architecture follows a hybrid approach, utilizing joint static and dynamic analysis to compute values of the selected feature set [33]. The Machine Learning Engine is used to develop a generalized model which is effective against zero-day ransomware attacks. It takes feature values collected by the static, dynamic and trap layers as input and classifies the executable as ransomware or benign. The engine is trained offline using supervised algorithms, and the training data consists of feature values with ransomware and benign labels. The Trained Machine Learning Engine then uses the learned model to classify executables in real-time based on input feature values. The following supervised machine learning algorithms are evaluated based on performance: Logistic Regression, SVM, ANN, Random Forests, and Gradient Tree Boosting. A 12-fold cross-validation is performed on the ransomware sample set. In each test run, the Machine Learning Layer is trained on all samples from 11 out of 12 cryptographic ransomware families and 221 out of 442 samples from benign software [33]. The learned model is tested against the remaining benign samples on the evaluation setup and all samples from the last ransomware family. Since most of the successful ransomware attacks are zero-day intrusions, this process of evaluation is selected, with samples from an entirely new ransomware family or its upgraded variant. During the evaluation, the functionality of the File Backup Layer is verified to check if the files are correctly backed up for suspicious processes after receiving classification output from the Machine Learning Layer.

The metrics show the best results with the Gradient Tree Boosting algorithm. RansomWall attains a detection rate of 98.25% with near-zero false positives with this algorithm. The Gradient Tree Boosting algorithm provides effective handling of heterogeneous data, high predictive power and robustness to outliers, resulting in high performance [GG]. Analysis of false negatives shows that two ransomware samples abruptly terminated after encrypting only a few files. As the resulting file system activity is reduced, such samples do not get detected. Limited file system activity also leads to false negatives when there is a low number of such files on the user’s system. The rest of the false negatives come from decision boundary errors.

F. CRYPTOKNIGHT
CryptoKnight was built to classify cryptographic primitives in compiled binary executables using the Dynamic Convolutional Neural Network (DCNN) algorithm. It introduces a learning system that can easily integrate new samples through the scalable synthesis of customizable cryptographic algorithms. CryptoKnight’s architecture is intended to limit human interaction, allowing an effective model to be structured at run-time [34]. The entire system is comprised of three stages:


1. Procedural generation guides the synthesis of unique cryptographic binaries with variable obfuscation and alternate compilation.
2. Assumptions of cryptographic code aid the discrimination of diagnostics from the dynamic analysis of synthetic or reference binaries to build an ‘image’ of execution.
3. A DCNN fits variable-length matrices for ease of training and the immediate classification of new samples.

CryptoKnight was tested on many applications using non-library linked functionality, and analysis showed that it is a viable solution that can quickly learn from new cryptographic execution patterns to classify unknown software [34]. CryptoKnight also demonstrated that it could classify results faster than previous methodologies and is considerably re-usable. At a 96% accuracy rate, CryptoKnight confirmed that cryptographic primitive classification in compiled binary executables could be successfully achieved using a DCNN algorithm.

G. ELDERAN
In 2016, EldeRan was developed to identify the most significant ransomware features and use them to detect ransomware [35]. The framework is based on the observation of actions or events that typically occur within ransomware and goodware samples in their early stages. Ransomware and goodware sample datasets are dynamically analyzed in a sandbox environment first. From the two datasets, EldeRan retrieves and analyses the following classes of features: Windows API calls, Registry Key Operations, File System Operations, the set of file operations performed per File Extension, Directory Operations, Dropped Files, and Strings. Other than Strings, the features are collected while dynamically analyzing the ransomware. Once the monitoring phase has completed, the Mutual Information (MI) criterion [37], a feature selection algorithm, is used to choose the features that are most relevant. The feature selection process is not always used in machine learning algorithms, but for EldeRan, it helps with performance and provides more competence in the algorithm [38]. Finally, the matrices containing these features are used in a Regularized Logistic Regression classifier. This classifier returns ransomware or goodware and is also run online on a PC to classify new samples, which can come from infected websites or multiple infection vectors. The training set is analyzed offline and completed within minutes in the sandbox environment, while new applications are classified at run-time through an online classifier, which is also fast [36].

EldeRan was evaluated in three different experiments. The first experiment tested how well EldeRan performed compared to two other classifiers, SVM (Support Vector Machine) and NB (Naive Bayes) [36]. It was determined that both SVM and EldeRan outperformed NB, and EldeRan slightly edged SVM. These metrics were evaluated against the AUC (Area Under the Curve) using between 50 and 1500 features, all selected by the MI criterion. A scheme of 100 random splits was used for each explored value, where 80% of the samples were used for training and 20% as test samples. It is determined that all three classifiers show maximum performance at 400 features, and the accuracy showed no improvement beyond that number [36]. The second experiment observes the performance of the previous classifiers along with VirusTotal. The top 400 features were used for the original three classifiers, and the test covered all the methods, averaging the results over 100 independent train/test splits with 80% of samples for training. It is determined that VirusTotal shows better performance when compared to the other algorithms, although EldeRan is just slightly behind. The last experiment tests how effectively EldeRan can detect new families of ransomware. It is common for new families of ransomware to share the same characteristics and goals as previous classes [37]. Datasets were clustered into 11 classes using their known family name, since the naming conventions of the antivirus (AV) vendors are not always consistent or compatible amongst them. Two cases were considered by selecting the top 100 and 400 features according to the MI criterion. For eight of the ransomware families the detection rate is above 90% and for ten families the detection rate is above 80%, both occurring when using 100 features. When using 400 features, the detection rates become worse, with only five families achieving 90% and eight families achieving 80%. The average detection rate is higher (93.3%) when using only the top 100 features than in the case of using 400 features (87.1%) [36]. Figure 12 below shows an average ROC of the test samples.

FIGURE 12. Average ROC for the test samples over 100 random splits for EldeRan, the SVM, NB, and VirusTotal [36].

EldeRan has some limitations. If the ransomware remains silent or waits for user interaction within the sandbox environment, EldeRan cannot properly analyze the sample, which therefore goes undetected. Secondly, no other applications were running within the sandbox environment, which was purposely done to eliminate the checks ransomware uses to evade detection. Lastly, the original ransomware and goodware data samples were limited because EldeRan could not process empty API calls efficiently during the dynamic analysis phase. This reduced the dataset by half the samples. Ultimately, EldeRan can only detect ransomware once the infection occurs [39].
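The EldeRan pipeline described in this section (feature ranking by the Mutual Information criterion, followed by a regularized logistic regression classifier evaluated on an 80/20 split) can be illustrated with a minimal sketch. This is not EldeRan's implementation: the toy binary data, the function names (mutual_information, top_k_features, train_logreg, predict) and the hyperparameters are assumptions made for illustration, and a single fixed split stands in for the 100 random splits reported in [36].

```python
import math
import random
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px, py = Counter(feature), Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def top_k_features(X, y, k):
    """Indices of the k columns of X with the highest MI against y."""
    columns = list(zip(*X))
    scores = [mutual_information(col, y) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, l2=0.01, lr=0.5, epochs=300):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return 1 (ransomware) or 0 (goodware) for one feature vector."""
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0


# Toy data (an assumption, not the EldeRan dataset): column 0 copies the label,
# column 1 copies it with 10% of the entries flipped, columns 2-5 are random noise.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = []
for i, label in enumerate(y):
    noisy = 1 - label if i % 20 in (0, 1) else label
    X.append([label, noisy] + [rng.randint(0, 1) for _ in range(4)])

# Fixed 80/20 split; features are ranked on the training part only.
X_train, y_train = X[:160], y[:160]
X_test, y_test = X[160:], y[160:]

selected = top_k_features(X_train, y_train, k=2)
project = lambda row: [row[j] for j in selected]
w, b = train_logreg([project(r) for r in X_train], y_train)
accuracy = sum(predict(w, b, project(r)) == t for r, t in zip(X_test, y_test)) / len(y_test)
print("selected features:", selected, "test accuracy:", accuracy)
```

Ranking on the training split only keeps the test samples out of feature selection. Repeating the split with different random seeds and averaging the scores would approximate the repeated-split protocol described above.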


H. DRTHIS
DRTHIS was developed to distinguish ransomware from goodware and to identify ransomware families. It uses LSTM and CNN deep learning techniques for classification. Based on the application’s sequence of activities, a binary classifier, a Deep Feature Extractor (DFE), and One Class Classifiers (OCCs) are all used for hunting ransomware samples and identifying their families. The system records executed events when a user launches an application, and within the first 10 seconds of application execution, the captured sequence is transformed to detect if a given sample is considered ransomware [40]. Ransomware samples that have been identified are sent to the system to categorize their family. During the threat intelligence phase, the DFE is used to extract a vector to feed the OCC, and it contains the pre-trained LSTM or CNN model. This is the step that produces the family. DRTHIS performs a data transformation task to transform textual sequences of events into a numerical form. Then, the combining and labeling component combines input datasets into one integrated dataset suitable for the deep learning tasks. It is notable that combining and labeling creates two separate datasets with the same samples but different class labels [40].

DRTHIS uses both ransomware samples from new families and unforeseen benign applications. The created OCC generated 24% wrong predictions when it classified 16 out of 66 Locky samples as goodware. DRTHIS also wrongly classified one Cerber sample and two TeslaCrypt samples as a new family of ransomware [40]. DRTHIS takes advantage of the One Class Classifier to determine if a sample belongs to a known family of ransomware or whether it belongs to a new family. Samples from the CryptoWall, TorrentLocker and Sage families are used for evaluating the performance of the system against samples from unforeseen families. DRTHIS correctly recognized these three families as a new family without any conflict with the trained families. 99% of CryptoWall, 75% of TorrentLocker and 92% of Sage samples are correctly detected as samples from a new family of ransomware. DRTHIS wrongly classifies 1 CryptoWall sample (1%), 4 TorrentLocker samples (14.2%), and 1 Sage sample (1.2%) as goodware. DRTHIS identifies 2 TorrentLocker samples (7%) and 1 Sage sample (1.2%) as Cerber samples. Three samples (3.8%) of Sage and 1 sample (3.5%) of TorrentLocker are also detected as TeslaCrypt [40]. Due to the fast classification of new instances, DRTHIS can be considered as a basic method in the cyber security industry for implementing new threat hunting and intelligence tools.

I. A COMPARATIVE ANALYSIS OF FRAMEWORKS
Section III has provided details of several different frameworks. The table below captures those frameworks and consolidates data points and metrics for quick reference. The table is composed of the author’s reference of work, the name of the framework, the dataset quantity, the number of features (if applicable), the type(s) of machine learning algorithms used, the year it was created, the results, and the challenges each framework faced. The first important factor about these frameworks is the datasets. Most have been manipulated based on feature selection to improve the accuracy of each of their respective models. However, the framework RATAFIA is very different from the others. For example, it does not follow the traditional procedures of using ML algorithms because it strictly factors in performance based on unsupervised learning.

Another observation between these frameworks in Table 1 below is the ML algorithms used. Each framework in this research was specifically chosen to bring out nuances in the related field; therefore, the approach of each framework is different and all use different algorithms. As this paper collects research geared towards ransomware detection, providing a variety of different frameworks that do not produce the same results gives better insight as to what is happening in this area of research. Lastly, this paper gives awareness as to how ransomware detection was conducted years ago versus how it has been advanced and improved in recent years.

IV. FUTURE OF RANSOMWARE
Trends show that ransomware attacks will continue to surge in 2022 and will continue to be a top threat for all sectors [41]. Ransomware-as-a-Service, which is a powerful asset, will also boost malicious attacks on end users, as it does not require anyone to have any technical knowledge. An increase in the usage of Initial Access Brokers (IAB) is also projected to peak, as they gain access to a victim’s network and then sell it to open ransomware markets for profit. The Covid-19 pandemic will continue contributing to more ransomware related activities as it has caused a drop in employment. Many companies currently do not have the workforce to increase their cyber security measures or provide awareness, and mitigation tactics are lacking. Financial impacts have reduced funds for companies to invest in such software and are steering smaller companies in the direction of bypassing security measures altogether. It is also noted that ransomware operators are likely to intensify the ways that they pressure victims into paying ransoms by contacting customers of interest, engaging with media sources and journalists, or simply calling victims directly [41].

Defenders against ransomware will have to stay ahead of the advancement of ransomware schemes. For instance, IoT devices and 5G networks create many loopholes for ransomware intrusions. Integrating security technology, secure design principles, and governance at each phase of an organization’s IoT and 5G landscapes will also need to be set forth in each sublayer. Quantum computing is also being used in the detection of ransomware and is projected to reach maturity within the next few years [42]. Detection algorithms could be enriched with quantum technology to expedite the identification and decryption of encrypted malware. Defenders would also be able to use quantum computers to decrypt malicious communications using proactive monitoring. Leveraging quantum could also disrupt the flow of ransomware in its attack sequences once it has encrypted its targeted file.


TABLE 1. Consolidated view of frameworks.

In [65], the authors present a new type of framework called Detection Avoidance Mitigation (DAM). It can handle classification, detection, and mitigation all in one go. Its architecture consists of typical detection techniques using static and dynamic analysis, avoidance techniques such as system updates and patches, and mitigation techniques such as reverse engineering. DAM evaluated different combat strategies for preventing ransomware attacks and widespread financial losses, concluding that avoidance techniques are the most desirable in protecting users and organizations from ransomware.

Lastly, the first blockchain-based ransomware schemes were introduced in [66]. The authors focused on smart contracts to contribute to the paying of single files or refunding the ransom payment back to the victim if the decryption keys were not sent within a reasonable time. The results of this research showed no practical countermeasures when using public blockchains; therefore, more research in this area is needed, as this is a concern.

V. CONCLUSION
In recent years, ransomware has continuously been a top topic in cybersecurity, and attacks are now taking place not only on individuals but on organizations as well. Ransomware has evolved from elementary scareware and locker related user interfaces to cryptographic and fileless ransomware. In this paper, we provide a comprehensive survey of ransomware types, common frameworks that are used to detect ransomware, and the ML algorithms they use. A detailed list of all pertinent information is gathered and arranged in a table. Though other research papers have provided reviews with similar concepts, these surveys have not captured the explicit details in one place as this research does.

By collecting such material and providing a comparative study, this paper provides a means for others to foresee an area of interest and investigate parts where improvements can be made due to poor results or limitations. Ultimately, this paper can provide direction to those who are looking to utilize one of the mentioned frameworks for advancement in future work.

REFERENCES
[1] L. Abrams. (2020). SunCrypt Ransomware Shuts Down North Carolina School District. Accessed: Jan. 11, 2021. [Online]. Available: https://www.bleepingcomputer.com/news/security/suncrypt-ransomware-shuts-down-north-carolina-schooldistrict/
[2] BBC News. (2020). Northumbria University Hit by Cyber Attack. Accessed: Jan. 11, 2021. [Online]. Available: https://www.bbc.com/news/uk-england-tyne-53989404
[3] B. Fraga. (2013). Swansea Police Pay $750 ‘Ransom’ After Computer Virus Strikes. Accessed: Jan. 11, 2021. [Online]. Available: https://www.heraldnews.com/x2132756948/Swansea-police-pay-750-ransom-after-computer virus-strikes
[4] L. Freedman. (2020). Ransomware Attacks Predicted to Occur Every 11 Seconds in 2021 With a Cost of $20 Billion. Accessed: Jan. 11, 2021. [Online]. Available: https://www.dataprivacyandsecurityinsider.com/2020/02/ransomwareattacks-predicted-to-occur-every-11-seconds-in-2021-with-a-cost-of-20-billion/
[5] K. Savage, P. Coogan, and H. Lau. (2015). The Evolution of Ransomware. [Online]. Available: https://its.fsu.edu/sites/g/files/imported/storage/images/information-security-and-privacy-office/the-evolution-of-ransomware
[6] I. Segun, B. I. Ujioghosa, S. O. Ojewande, F. O. Sweetwilliams, S. N. John, and A. A. Atayero, “Ransomware: Current trend, challenges, and research directions,” in Proc. World Congr. Eng. Comput. Sci., 2017, pp. 169–174.
[7] A. Kharaz, S. Arshad, C. Mulliner, W. Robertson, and E. Kirda, “UNVEIL: A large-scale, automated approach to detecting ransomware,” in Proc. 25th USENIX Secur. Symp., 2016, pp. 757–772.
[8] D. Y. Huang, M. M. Aliapoulios, V. G. Li, L. Invernizzi, E. Bursztein, K. McRoberts, J. Levin, K. Levchenko, A. C. Snoeren, and D. McCoy,
[17] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer-Verlag, 2006.
[18] N. Chauhan. (Jan. 2020). Decision Tree Algorithm, Explained. KDnuggets. Accessed: Jan. 22, 2021. [Online]. Available: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
[19] N. Donges. (Jun. 22, 2021). A Complete Guide to the Random Forest Algorithm. Accessed: Jan. 22, 2021. [Online]. Available: https://builtin.com/data-science/random-forest-algorithm
[20] O. Mbaabu. (Dec. 11, 2020). Introduction to Random Forest in Machine Learning. Accessed: Jan. 22, 2021. [Online]. Available: https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
[21] J. Brownlee. (Jul. 7, 2021). A Gentle Introduction to Long Short-Term Memory Networks by the Experts. Machine Learning Mastery. Accessed: Jan. 24, 2021. [Online]. Available: https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
[22] S. Saxena. (Mar. 16, 2021). Introduction to Long Short Term Memory (LSTM). Analytics Vidhya. Accessed: Jan. 24, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-long-short-term-memory-lstm/
[23] S. Ray. (Sep. 11, 2017). 6 Easy Steps to Learn Naive Bayes Algorithm With Codes in Python and R. Analytics Vidhya. Accessed: Jan. 24, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[24] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Mach. Learn., vol. 29, pp. 103–130, Nov. 1997.
[25] C. Li. (2016). A Gentle Introduction to Gradient Boosting. Accessed: Jan. 26, 2021. [Online]. Available: https://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
[26] R. Gandhi. (Jul. 7, 2018). Support Vector Machine - Introduction to Machine Learning Algorithms. Accessed: Jan. 26, 2021. [Online]. Available: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learningalgorithms-934a444fca47
[27] Y. A. Ahmed, B. Kocer, and B. A. S. Al-rimy, “Automated analysis approach for the detection of high survivable ransomware,” KSII Trans. Internet Inf. Syst., vol. 14, no. 5, pp. 2236–2257, 2020, doi: 10.3837/TIIS.2020.05.021.
[28] F. Khan, C. Ncube, L. K. Ramasamy, S. Kadry, and Y. Nam, “A digital DNA sequencing engine for ransomware detection
‘‘Tracking ransomware end-to-end,’’ in Proc. IEEE Symp. Secur. Privacy using machine learning,’’ IEEE Access, vol 8, pp. 119710–119719,
(SP), May 2018, pp. 618–631. 2020.
[9] L. Abrams. (Jan. 4, 2016). Ransom32 is the First Ransomware Written in [29] J. Pedersen, D. Bastola, K. Dick, R. Gandhi, and W. Mahoney, ‘‘Blast your
Javascript. BleepingComputer. Accessed: Jan. 12, 2021. [Online]. Avail- way through malware analysis assisted by bioinformatics tools,’’ in Proc.
able: https://www.bleepingcomputer.com/news/security/ransom32-is-the- Int. Conf. Secur. Manage., 2012, p. 1.
first-ransomware-written-in-javascript/ [30] H. Zuhair and A. Selamat, ‘‘RANDS: A machine learning-based
[10] KnowBe4. (2021). Ransom32 Ransomware-as-a-Service. anti-ransomware tool for Windows platforms,’’ in Advancing Technology
Accessed: Jan. 12, 2021. [Online]. Available: https://www.knowbe4.com/ Industrialization Through Intelligent Software Methodologies, Tools and
ransom-32-ransomware-as-a-service Techniques, vol. 318, 2019.
[11] S. Sjouwerman. (Feb. 5, 2019). First Javascript-Only Ransomware as [31] N. Hampton, Z. Baig, and S. Zeadally, ‘‘Ransomware behavioural anal-
a Service Poses New Threat. TechBeacon. Accessed: Jan. 12, 2021. ysis on Windows platforms,’’ J. Inf. Secur. Appl., vol. 40, pp. 44–51,
[Online]. Available: https://techbeacon.com/security/first-javascript-only- Jun. 2018.
ransomware-service-poses-new-threat [32] M. Alam, S. Bhattacharya, S. Dutta, S. Sinha, D. Mukhopadhyay,
[12] M. J. Schwartz and R. Ross. (Jun. 20, 2016). Latest Ransomware Relies on and A. Chattopadhyay, ‘‘RATAFIA: Ransomware analysis using time
JavaScript. Bank Information Security. Accessed: Dec. 2, 2021. [Online]. and frequency informed autoencoders,’’ in Proc. IEEE Int. Symp.
Available: https://www.bankinfosecurity.com/latest-ransomware-relies- Hardw. Oriented Secur. Trust (HOST), May 2019, pp. 218–227, doi:
on-javascript-a-9212 10.1109/HST.2019.8740837.
[13] (Jun. 16, 2016). New RAA Ransomware Uses Only JavaScript to [33] S. K. Shaukat and V. J. Ribeiro, ‘‘RansomWall: A layered defense system
Infect Computers. Accessed: Jan. 12, 2021. [Online]. Available: against cryptographic ransomware attacks using machine learning,’’ in
https://www.trendmicro.com/vinfo/mx/security/news/cybercrime-and- Proc. 10th Int. Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2018,
digital-threats/new-raa-ransomware-uses-only-javascript-to-infect- pp. 356–363.
computers [34] G. Hill and X. Bellekens, ‘‘CryptoKnight: Generating and modelling
[14] J. Tolbert. (2020). Malicious Actors Exploiting Coronavirus Fears. compiled cryptographic primitives,’’ Information, vol. 9, no. 9, p. 231,
Accessed: Jan. 12, 2021. [Online]. Available: https://www.kuppingercole. Sep. 2018.
com/blog/tolbert/maliciousactors-exploiting-coronavirus-fears [35] Z.-G. Chen, H.-S. Kang, S.-N. Yin, and S.-R. Kim, ‘‘Automatic ran-
[15] Brooke Crothers. (2020). Apps Designed to Track COVID-19 somware detection and analysis based on dynamic API calls flow graph,’’
Might be Full of Ransomware, Report Says. [Online]. Available: in Proc. Int. Conf. Res. Adapt. Convergent Syst., Sep. 2017, pp. 196–201,
https://www.foxnews.com/tech/apps-track-covid-19-full-ransomware doi: 10.1145/3129676.3129704.
[16] Acronis. (2020). Digital CoronaVirus: Yet Another Ransomware Com- [36] D. Sgandurra, L. Muñoz-González, R. Mohsen, and E. C. Lupu, ‘‘Auto-
bined With Infostealer. Accessed: Jan. 12, 2021. [Online]. Available: mated dynamic analysis of ransomware: Benefits, limitations and use for
https://www.cbronline.com/news/tesla-cyber-attack detection,’’ 2016, arXiv:1609.03020.

VOLUME 10, 2022 117609


D. Smith et al.: Machine Learning Algorithms and Frameworks in Ransomware Detection

[37] A. Kharraz, W. Robertson, D. Balzarotti, L. Bilge, and E. Kirda, [58] H. Ke, H. Wu, and D. Yang, ‘‘Towards evolving security requirements
‘‘Cutting the gordian knot: A look under the hood of ransomware of industrial internet: A layered security architecture solution based
attacks,’’ in Detection of Intrusions and Malware, and Vulnerability on data transfer techniques,’’ in Proc. Int. Conf. Cyberspace Innov.
Assessment. Cham, Switzerland: Springer, 2015, pp. 3–24, Adv. Technol., New York, NY, USA, Dec. 2020, pp. 504–511, doi:
doi: 10.1007/978-3-319-20550-2_1. 10.1145/3444370.3444620.
[38] J. Z. Kolter and M. A. Maloof, ‘‘Learning to detect and classify malicious [59] Trend Micro. What is Ryuk Ransomware. Accessed: Apr. 5, 2021. [Online].
executables in the wild,’’ J. Mach. Learn. Res., vol. 7, pp. 2721–2744, Available: https://www.trendmicro.com/en_us/what-is/ransomware/ryuk-
Dec. 2006. ransomware.html
[39] G. Cusack, O. Michel, and E. Keller, ‘‘Machine learning-based detection [60] WannaCry Ransomware. (May 15, 2017). WannaCry Ransom
of ransomware using SDN,’’ in Proc. ACM Int. Workshop Secur. Softw. ware—LogRhythm. Accessed: Apr. 14, 2021. [Online]. Available:
Defined Netw. Netw. Function Virtualization, Mar. 2018, pp. 1–6, doi: https://logrhythm.com/blog/wannacry-ransomware/
10.1145/3180465.3180467. [61] A. Kujawa. (Jan. 8, 2019). Ryuk Ransomware Attacks Businesses Over the
[40] S. Homayoun, A. Dehghantanha, M. Ahmadzadeh, S. Hashemi, Holidays. Malwarebytes Labs. Accessed: Apr. 14, 2021. [Online]. Avail-
R. Khayami, K.-K. R. Choo, and D. E. Newton, ‘‘DRTHIS: Deep able: https://blog.malwarebytes.com/cybercrime/malware/2019/01/ryuk-
ransomware threat hunting and intelligence system at the fog layer,’’ ransomware-attacks-businesses-over-the-holidays/
Future Gener. Comput. Syst., vol. 90, pp. 94–104, Jan. 2019. [62] R. Nimbalkar. (Jul. 13, 2021). Decision Tree Algorithms-Machine Learn-
[41] QuoIntelligence. (Jan. 18, 2022). Ransomware is Here to Stay and ing. Accessed: Apr. 14, 2021. [Online]. Available: https://medium.com/
Other Cybersecurity Predictions for 2022. Accessed: Jan. 31, 2021. appengine-ai/decision-tree-algorithms-machine-learning-9e2e8cadfcae
[Online]. Available: https://quointelligence.eu/2022/01/ransomware-and- [63] S. India. (Jul. 4, 2020). Hands-on Training With Machine Learn-
other-cybersecurity-predictions-for-2022/ ing Algorithms: Decision Tree and Random Forest. Springboard Blog.
[42] D. Golden and K. Norton. (2021). Defending Against Ransomware in an Accessed: Apr. 14, 2021. [Online]. Available: https://in.springboard.
Age of Emerging Technology. Deloitte. Accessed: Jan. 31, 2021. [Online]. com/blog/machine-learning-algorithms-decision-tree-random-forest/
Available: https://www2.deloitte.com/us/en/pages/risk/articles/defending- [64] G. Van Houdt, C. Mosquera, and G. Npoles, ‘‘A review on the long short-
against-ransomware.html term memory model,’’ Artif. Intell. Rev., vol. 53, no. 8, pp. 5929–5955,
[43] L. Simonovich. (Jan. 15, 2020). Are Utilities Doing Enough to 2020, doi: 10.1007/s10462-020-09838-1.
Protect Themselves From Cyberattack?. World Economic Forum. [65] A. Kapoor, A. Gupta, R. Gupta, S. Tanwar, G. Sharma, and I. E. Davidson,
Accessed: Apr. 4, 2021. [Online]. Available: https://www.weforum.org/ ‘‘Ransomware detection, avoidance, and mitigation scheme: A review and
agenda/2020/01/are-utilities-doing-enough-to-protect-themselves-from- future directions,’’ Sustainability, vol. 14, no. 1, p. 8, Dec. 2021.
cyberattack/ [66] O. Delgado-Mohatar, J. M. Sierra-Cámara, and E. Anguiano, ‘‘Blockchain-
[44] APWG. (May 11, 2020). Phishing Activity Trends Report in Q1 based semi-autonomous ransomware,’’ Future Gener. Comput. Syst.,
of 2020. Accessed: Apr. 4, 2021. [Online]. Available: https://docs. vol. 112, pp. 589–603, Nov. 2020.
apwg.org/reports/apwg_trends_report_q1_2020.pdf
[45] Q. Chen and R. A. Bridges, ‘‘Automated behavioral analysis of mal- DARYLE SMITH was born in Lenoir, NC, USA,
ware: A case study of WannaCry ransomware,’’ in Proc. 16th IEEE in 1985. He received the B.S. and M.S. degrees
Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2017, pp. 454–460, doi:
in computer science from Winston-Salem State
10.1109/ICMLA.2017.0-119.
University. He is currently pursuing the Ph.D.
[46] (May 22, 2017). WannaCry Ransomware Campaign Exploiting SMB
Vulnerability. Accessed: Apr. 4, 2021. [Online]. Available: https://cert. degree in computer science with North Carolina
europa.eu/static/SecurityAdvisories/2017/CERT-EU-SA2017-012.pdf A&T State University. Since 2010, he has been
[47] M. Akbanov, V. G. Vassilakis, and M. D. Logothetis, ‘‘WannaCry ran- starting his IT career as a Software Engineer and
somware: Analysis of infection, persistence, recovery prevention and prop- has been involved in every aspect of e-commerce
agation mechanisms,’’ J. Telecommun. Inf. Technol., vol. 1, no. 2019, since. He is currently an E-Commerce Architect
pp. 113–124, Apr. 2019. with the Peapod Digital Laboratories, Salisbury,
[48] L. J. Trautman and P. Ormerod, ‘‘Wannacry, ransomware, and the emerging NC headquarters.
threat to corporations,’’ SSRN Electron. J., vol. 86, p. 503, Jan. 2018, doi:
10.2139/ssrn.3238293. SAJAD KHORSANDROO received the Ph.D.
[49] S. Jones and T. Bradshaw. (May 14, 2017). Global Alert to Prepare degree in computer science from The University
for Fresh Cyber Attacks. Accessed: Apr. 4, 2021. [Online]. Available: of Texas at San Antonio, in 2019. Currently, he is
https://www.ft.com/content/bb4dda38-389f-11e7-821a-6027b8a20f23 an Assistant Professor of computer science at
[50] M. V. Liy. (May 15, 2017). Putin Culpa a Los Servicios Secretos de
North Carolina A&T State University, where he is
EE UU Por el Virus ’WannaCry’ Que Desencaden? el Ciberataque
also an Associate Director of the Cyber Defense
Mundial. Accessed: Apr. 4, 2021. [Online]. Available: https://elpais.
com/internacional/2017/05/15/actualidad/1494855826_022843.html and AI Laboratory. He has already secured $1.7M
[51] S. K. Sahi, ‘‘A study of wannacry ransomware attack,’’ Int. J. Eng. Res. in funds from NSF, DoD, Palo Alto Networks,
Comput. Sci. Eng., vol. 4, no. 9, pp. 5–7, 2017. and Carolina Cyber Center. His current research
[52] R. Collier, ‘‘NHS ransomware attack spreads worldwide,’’ Can. Med. interests include the application of AI/ML in cyber
Assoc. J., vol. 189, no. 22, pp. E786–E787, 2017. security, next-generation network infrastructures, cloud computing, and
[53] JavaScript|MDN. (Feb. 18, 2022). JavaScript Language Resour secure cyber physical systems.
ces—JavaScript: MDN. Accessed: Apr. 4, 2021. [Online]. Available:
https://developer.mozilla.org/enUS/docs/Web/JavaScript/Langu KAUSHIK ROY is currently a Professor and the
age_Resources Interim Chair of the Department of Computer
[54] J. Gerend. (Mar. 3, 2021). Wscript. Microsoft Docs. Science, North Carolina A&T State University
Accessed: Apr. 4, 2021. [Online]. Available: https://docs.microsoft.com/ (NCAT). He has over 140 publications, including
en-us/windows-server/administration/windows-commands/wscript 35 journal articles and a book. His current research
[55] T. McIntosh, A. S. M. Kayes, Y.-P.-P. Chen, A. Ng, and P. Watters,
interests include cybersecurity, cyber identity, bio-
‘‘Ransomware mitigation in the modern era: A comprehensive review,
metrics, machine learning (deep learning), data
research challenges, and future directions,’’ ACM Comput. Surv., vol. 54,
no. 9, pp. 1–36, Dec. 2022, doi: 10.1145/3479393. science, the IoT, cyber-physical systems, and big
[56] H. Oz, A. Aris, A. Levi, and A. S. Uluagac, ‘‘A survey on ransomware: data analytics. His research is funded by the
Evolution, taxonomy, and defense solutions,’’ ACM Comput. Surv., vol. 54, National Science Foundation (NSF), Department
no. 11s, pp. 1–37, Jan. 2022, doi: 10.1145/3514229. of Defense (DoD), National Security Agency (NSA), and Department of
[57] CIS Security. (2019). Fall 2019 Threat of the Quarter: Ryuk Ransomware. Energy (DoE). He is the Director of the Center for Cyber Defense (CCD).
Accessed: Apr. 5, 2021. [Online]. Available: https://www.cisecurity. He also directs the Cyber Defense and AI Laboratory.
org/white-papers/fall-2019-threat-ofthe-quarter-ryuk-ransomware/