Machine Learning Algorithms and Frameworks in Ransomware Detection
ABSTRACT Ransomware has been one of the biggest cyber threats against consumers in recent years. It can
leverage various attack vectors while it also evolves in terms of finding more innovative ways to invade
different cyber security systems. There have been many efforts within industry and academia to detect
ransomware using machine learning (ML) algorithms, and these efforts have shown promising results. Accordingly,
there is a considerably large body of literature addressing how ransomware threats
can be detected and mitigated. This large and rapidly growing body of scientific and technical material makes
it difficult to know which ML algorithm(s) are actually being used. Hence, the aim of this paper is to give
insight about ransomware detection frameworks and those ML algorithms which are typically being used to
extract ever-evolving characteristics of ransomware. In addition, this study will provide the cyber security
community with a detailed analysis of those frameworks. This will be augmented with information such as
datasets being used along with the challenges that each framework may be faced with in detecting a wide
variety of ransomware accurately. To summarize, this paper delivers a comparative study which can be used
by peers as a reference for future work in ransomware detection.
INDEX TERMS Artificial Neural Network (ANN), cyber security, deep convolutional neural network
(DCNN), deep neural network (DNN), Hardware Performance Counter (HPC), Long Short Term Memory
(LSTM), machine learning (ML), ransomware, Recurrent Neural Network (RNN), Sum of Product (SOP),
Support Vector Machine (SVM), Term Frequency and Inverse Document Frequency (TF-IDF), The Onion
Routing (TOR).
Nonetheless, these attacks are growing daily. The underlying reason is that combating ransomware is challenging [7]. Ransomware typically relies on strong encryption that is easy to accommodate due to the vast amounts of open-source implementations. It also makes use of most infection techniques that are employed by modern malware families. Ransomware benefits from the common elusive methods utilized by modern malware and it frequently uses application programming interfaces (APIs) to carry out malicious actions that make it difficult to differentiate from benign applications. Furthermore, it uses TOR networks (The Onion Routing networks) to keep its communication anonymous, and unregulated payment techniques like cryptocurrencies to get paid without easily disclosing the identity of the attackers [8].
The remaining structure of this paper is organized as follows and investigated perspectives can be found in Figure 1: Section 2 reviews required research background and provides a comprehensive review of different ML algorithms used in ransomware detection. Section 3 will provide an analysis of ransomware frameworks that use ML algorithms, their challenges, evaluations, and results. This section also expands on the importance of this research, consolidating all the mentioned frameworks in a table, providing a description of each framework's name, the algorithm(s) of choice, the year it was created, overall results, and challenges. Section 4 will speak about some future concerns and defense topics of ransomware, ending with concluding the paper in section 5.
created by cybercriminals to target web pages with the use of JavaScript. One such example is Ransom32, which first appeared in late 2015. Until its discovery, no other ransomware attack was used with that programming language [9]. This type of ransomware-as-a-service is unique because, being written in JavaScript, it uses a web browser to initiate its attack. The impact of this threat is far superior in nature because it can be used theoretically on any operating system where a web browser exists. This grants Ransom32 so-called “write-once-infect-all” capabilities [9]. Nonetheless, Ransom32 has only been detected on a Windows-based platform thus far. It can be found on most underground TOR sites and can be downloaded by the affiliated user. To download the executable, one must have a bitcoin address, as this is the way payments of ransom are made.
The developers of the Ransom32 software take a 25% cut of any ransom made, and the rest goes to the user of the affiliated program [9]. When the Ransom32 executable runs, it extracts several files. During this process, a shortcut is created in the start menu, and the ransomware will start at login which guarantees the malware will be executed every time the system is started. The shortcut points to a chrome.exe executable file that is typically an NW.js package. This package contains JavaScript code used for encryption using AES and extracts to folders such as %AppData% and %Temp%. Furthermore, this is the piece that contributes to performing harmful events towards the compromised system [10].
With NW.js being a legitimate framework and application,
antivirus coverage in this area is still very weak in nature. Any black hat or white hat developer can use this executable to create and distribute native apps that run just like normal executables [11]. Furthermore, when looking closer into Ransom32, it runs under the context of the user without having any administrative rights or permissions. Figure 2 gives a general idea of how a member can join the affiliate network, then ultimately be granted access to download the malicious code for use. The member would also be able to see statistics related to the software such as the number of payments that have been made and the number of installations that have been completed. The figure also shows the process of how a Ransom32 attack can occur.
B. RAA
The second example of JavaScript ransomware is RAA, which spreads via email attachments pretending to be legitimate document files. These files typically have a valid file format with a JavaScript extension, making the victim believe its authenticity. Once the file is opened, it works just as any other ransomware attack. The victim host's files will be encrypted, and a ransom will be demanded. RAA also infects the victim's computer by installing Pony, a well-known password-stealing malware embedded in a JavaScript file. A sample code of how this happens can be found in Figure 3. This malware can collect browser passwords and other critical information on infected systems. Two security researchers initially discovered RAA and according to them, it encrypts files using code from an open-source library called CryptoJS [12]. This code handles cipher algorithms such as AES, DES, to name a few [12]. RAA targets images, Ms-Word, Ms-Excel, Photoshop, .zip, .rar, sparing program files, Windows files, AppData, and Microsoft files by appending a “.locked” to the end of the filenames [12]. Upon further analysis, Trend Micro™ discovered that the RAA ransomware is written in Jscript and not JavaScript [13]. Jscript is designed for Windows® systems and executed by the Windows Scripting Host Engine through Microsoft Internet Explorer (aka IE), but not via the Microsoft Edge browser. Jscript carries some semblances with JavaScript because they are both derived from ECMAScript.¹ Jscript is the implementation of ECMAScript while JavaScript is the Mozilla implementation of ECMAScript [53]. Jscript can access objects exposed by IE and some systems objects such as the Wscript² [54]. It is believed that the attackers behind the RAA ransomware are using the Jscript scripting language to make detection more difficult and to make complications easier. Most malware attacks are written in compiled programming languages with ransomware often disguised as executables. Nonetheless, using languages which are not typically used to deliver malware, such as scripting languages, could be less prone to detection [13].
FIGURE 3. A deobfuscated function installing Pony.
C. CoronaVirus
Ransomware affiliates switched to COVID-19-themed social engineering tactics during the 2020 pandemic to carry out threats [14]. Mobile applications that looked legitimate would download various forms of ransomware using spam attachments that claim to provide health and safety information about COVID-19 [15]. As the global pandemic increased the need for health centers, the exposure to cyber-attacks also boosted. This situation increased the number of ransomware attacks within the health sectors, and Corona ransomware was born [16]. This was a new strain that focused specifically on hospitals and the encryption of patients' health records. After the host became infected, it displayed a COVID-19-themed ransom message and demanded payment in Bitcoin [14].
The Covid-19 pandemic also opened the doors for many ransomware attacks against employees. Due to the threat of catching the virus, many companies began to offer employees the opportunity to work remotely [43]. This increases exposure to cyber-risks because individuals connect through less reliable and unsecured Internet connections. Employees that accessed corporate networks using personal devices provided a way to get into the hands of unauthorized individuals through unsanctioned channels [43]. Attackers also focused heavily on sophisticated phishing techniques. According to [44], an APWG report showed 267,372 phishing campaigns were reported in the first quarter of 2020, increasing (19.06%) over 2019 during the same period. In Figure 4 below, the CoronaVirus delivery flow begins with a phishing website, locking the file system, then fully compromising the hard drive until the ransom has been paid.
FIGURE 4. CoronaVirus delivery flow.
¹ ECMAScript is a Javascript standard that helps ensure the interoperability of web pages across different browsers.
² Wscript are generic scripts specifically executed in Windows based platforms.
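As a defensive illustration of the RAA behavior described in Section B, where encrypted files are renamed with a ".locked" suffix [12], the following minimal sketch scans a directory tree for such files and reports what fraction of files carry the suffix. The function names (`find_locked_files`, `locked_ratio`, `demo`) and the demo file names are hypothetical and chosen only for illustration; this is a simple indicator-of-compromise check, not a method taken from the cited works.

```python
import tempfile
from pathlib import Path


def find_locked_files(root, suffix=".locked"):
    """Return all files under `root` whose names end with `suffix`, sorted by path."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.name.endswith(suffix))


def locked_ratio(root, suffix=".locked"):
    """Fraction of files under `root` that carry `suffix` (0.0 for an empty tree)."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    if not files:
        return 0.0
    return sum(p.name.endswith(suffix) for p in files) / len(files)


def demo():
    """Build a throwaway directory with two renamed files and two untouched ones."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "docs").mkdir()
        (root / "docs" / "report.docx.locked").write_text("x")
        (root / "docs" / "notes.txt").write_text("x")
        (root / "photo.jpg.locked").write_text("x")
        (root / "setup.exe").write_text("x")
        hits = [p.relative_to(root).as_posix() for p in find_locked_files(root)]
        return hits, locked_ratio(root)


if __name__ == "__main__":
    print(demo())
```

A high ratio of suffixed files is only a late indicator, since the files are already encrypted by the time they are renamed; the ML-based frameworks reviewed later aim at earlier detection.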
performed by finding the hyper-plane that differentiates the two classes very well [26]. They are effective in high dimensional spaces and can still be useful in cases where the number of dimensions is greater than the number of samples. SVMs use support vectors, a subset of training points in the decision function, providing memory efficiency. The versatility of the algorithm is also a key point as it can use different kernel functions against the decision function or even use custom kernels. However, overfitting will occur if the number of features is much greater than the number of samples. SVMs do not provide probability metrics and must use five-fold cross validation. The SVM algorithm has been applied in biological science for use in the categorization of protein; it has also been widely used with the classification of images, producing higher accuracy results than traditional query refinement schemes [26]. In Figure 10, pseudo code of the SVM algorithm using ransomware sample data is shown.
FIGURE 10. Code snippet of a SVM algorithm using ransomware data.
III. RANSOMWARE DETECTION FRAMEWORKS
This section investigates several ML-based frameworks which are widely used in detecting ransomware. Some of the reviewed frameworks utilize one ML algorithm while the others might use multiple. These frameworks yield promising results in detection of different types of ransomware and have potential to be used in future research works by the cyber security community. In what follows, eight state-of-the-art frameworks including Behavior Based [27], DNAact-Ran [28], RANDS [30], RATAFIA [32], RansomWall [33], CryptoKnight [34], EldeRan [35], and DRTHIS [40] will be studied and compared.
A. BEHAVIOR BASED
A proposed behavior-based framework was built for defining dynamically monitored valuable features of high survivable ransomware (HSR) [27]. The analysis of HSR was conducted within an isolated sandbox environment, through the Term Frequency-Inverse document frequency (TF-IDF). By doing so, the most relevant features that provided optimal performance in detecting new ransomware were extracted. Detection models were also developed for HSR, and they utilized supervised machine learning algorithms on many prominent features. It was proven that this framework's detection method achieved high accuracy and less false positive rate for detecting HSR in the early phases of ransomware. These methods have also been validated with an extensive experimental evaluation to show their effectiveness. Lastly, the capabilities of the proposed method were compared to the results of previous work, other classifiers, and VirusTotal. The framework itself is broken into 3 phases. The first phase includes gathering ransomware and benign data from a variety of sources. Once gathered, the data is checked and labeled under a particular malware family using VirusTotal software. The second phase analyzes the samples using a Cuckoo sandbox and generates a report in JSON format of its findings. Within the sandbox, log files are submitted through pre-processing tasks, and when finished, the relevant features are extracted to get valuable feature sets. Those features are applied against the term frequency and inverse document frequency (TF-IDF) algorithm for feature selection. The last phase uses supervised machine learning algorithms Support vector machine (SVM), and Artificial Neural Network (ANN) to derive statistics of the data.
Three different experimental evaluations were conducted to measure the performance of the framework. The first evaluation used the train-test splitting method which divides the whole data set into training and testing data. The dataset was split randomly with a uniform distribution of 80:20 ratio as training and testing, respectively. The experimental results of ANN showed an accuracy of 0.958 with 0.101 false positive rates, while SVM presented a higher false positive of 0.109 compared to ANN and an accuracy of 0.932. In the train-test splitting method, the data can become obscure and irrelevant, which is why the second experimental evaluation is used. The 10-Fold cross-validation technique can prevent the overfitting problem and estimate the effectiveness of the detection models. The best accuracy obtained by SVM is by presenting 0.982 of area under the curve (AUC) with less than 0.035 of false positive rate [27]. It is important to examine the ability of the classifiers to distinguish the ransomware from benign samples. Therefore, precision and recall are applied to both datasets and presents 0.945 and 0.942 respectively. SVM also showed better accuracy of 0.952 when compared to MLP that showed 0.945 of detection rate and 0.036 of false positive rates.
The last evaluation used selected subset features which eliminate the redundant and irrelevant features and reduces the dimensionality of the dataset. The features were divided into seven subset features (top20, top30, top40, top50, top60, top70, top80) by considering their importance and ranking based on phase 2 processing. The results of the experiment demonstrated that ANN showed the highest accuracy of 98.79% when the top30 of the feature set was used as training and testing [27]. However, this classification
accuracy had dramatically decreased to 95.63% when the top20 of the feature set was used. The best model of SVM presented an accuracy of 97.6% when top40 of the feature was applied while training the model [27]. ANN and SVM both had low classification accuracy when the top 80 of the feature set was used to train and test, which indicates that more features do not improve the performance of the classifiers.
B. DNAACT-RAN
DNAact-Ran is a Machine Learning-based digital DNA sequencing engine used in classifying and detecting ransomware. It uses an active ML approach for sequencing its digital DNA and detects ransomware in three key process steps: Feature Selection, DNA Sequence Generation and Ransomware Detection. Feature selection removes irrelevant features and reduces storage and computational cost. It is considered the most important process of machine learning. Multi-Objective Grey Wolf Optimization (MOGWO) and Binary Cuckoo Search (BCS) algorithms are used to select the relevant features from the collected dataset. MOGWO uses a grid and archive approach with selecting the most dominant features, while BCS uses a heuristic search approach to determine its features [28]. Figure 11 gives the complete architecture of DNAact-Ran.
FIGURE 11. DNAact-Ran architecture.
In the digital DNA sequence generation step, a new dataset is used to generate the digital DNA sequence after the feature selection process is completed. The design constraint of digital DNA is then computed, and the k-mer frequency vector is generated for the DNA sequence. A new dataset is then generated for the ransomware detection training phase based on those calculations. A synthetic DNA representation of a digital artifact is used because it does not represent the content of biological DNA. DNA is represented computationally by character strings containing only the characters A, G, C and T, therefore, Pedersen et al. [29] created a reversible translation of the byte sequence of a digital artifact which mapped binary pairs into those string characters [28]. The DNA sequence design is used as an approach of control and DNAact-Ran uses 3 constraints (Tm, GC Content, and AT_GC Ratio) to avoid inadequate data.
The last step used by DNAact-Ran is the actual ransomware detection step. The dataset is trained using an active learning classifier. Once this process is done, digital DNA sequences are randomly generated from the test data where it is classified as good-ware or ransomware. Finally, the ransomware family is detected using the traditional ML classification algorithms. Machine Learning applications struggle with the amount of time and effort required to interpret large amounts of data sets that are required for supervised learning in the process of training a high-accuracy classifier. To solve this issue, active learning has been proposed and designed to decrease the cost by finding data points to be used by the learning algorithm. The active learning algorithm uses three parameters for determining accuracy: Smoothing Parameter (SP), Regularization Parameter (RP), and Learning Rate (LR). The data was tested against traditional ML algorithms and the experiment showed a 78.5% detection accuracy for Naïve Bayes, 75.8% for Decision Stump, 83.2% for AdaBoost, and 87.9% for the proposed solution [28]. This experiment partially proves that active learning classifiers are better at detecting ransomware more efficiently.
C. RANDS
RANDS is a windows-based anti-ransomware tool that implements a multi-tier framework with ransomware traits archive and machine learning algorithms. The architecture of RANDs is classified in three tiers: Analysis, Learning, and Detection. The first tier checks the traits of different ransomware families in a recursive test routine in a virtual test environment. The virtual environment is utilized to avoid the severe damage and malfunctions of ransomware on the platform system. The second tier studies the combined traits from the archive using a hybrid machine learning algorithm to generate the classification model. The generated classification models will be used to detect any suspicious activity against the actions or traits in the last tier too. The last tier applies the classification model to detect any unknown ransomware variant via a computer scan [30] which alerts the system's user that a ransomware is going to possibly infect the system. RANDS machine learning algorithm uses a hybrid approach. It uses both Decision Tree and Naïve Bayes decision functions due to their pruning margins for more accurate categorization. The Decision Tree algorithm generates its predictions of the traits within a tree structure of nodes, leaves, and branches throughout the pruning and tree building process [31]. The Naïve Bayes algorithm is used for predicting the actual category of the overlooked traits in the vague nodes of Decision Tree. The Bayes' probabilistic theorem is used when a trait that goes unclassified cannot be classified.
To test and demonstrate whether RANDS could adapt the zero-day ransomware variants and their corresponding families, performance metrics including Detection Accuracy Rate, Mistake Rate, Miss Rate, and Elapsed Time along
with plots of ROC curve have been utilized through experiments [31]. The ten-fold cross evaluation routine inferred that RANDS could manifest its adaptive and effective classification against zero-day ransomware. That was accredited by the hybrid machine learning approach that RANDS utilized. RANDS implemented ransomware traits to distinguish different ransomware families and identify their related variants. However, the performance trend line reported at certain days showed erratic behavior caused by the qualities of ransomware families and their corresponding variants that might be varied. Results showed a 96.27% average accuracy rate and 1.32% average mistake rate throughout the real-time assessment [31].
D. RATAFIA
An unsupervised detection framework RATAFIA uses a DNN architecture and Fast Fourier Transformation to develop a highly accurate, fast, and reliable solution to ransomware detection using minimal trace-points. The advantage of using an unsupervised technique is that the learning process does not require a labeled dataset, which is often difficult to obtain considering the occurrences of several newer unknown varieties of ransomware. RATAFIA specifically was created to learn the behavior of a system under observation with the statistics obtained from a cluster of Hardware Performance Counter (HPC) events [32]. The first phase of RATAFIA tests its robustness and uses an analysis in the presence of expensive SPEC benchmarks. It is observed that the execution behaviors of HPC events are significantly different from normal observations, and the sequences of time-series data in which RATAFIA processes are treated as being malicious due to reaching computational thresholds. However, these are simply the SPEC programs creating false positive errors. The second phase uses FFT to try to eliminate the false positive by changing the HPC values from time domain to frequency domain. This is done to understand the repetitive pattern within the values because ransomware executable runs encryption repeatedly on multiple files. The entire detection procedure does not need any template of the malicious process from beforehand. Instead, it thrives on an anomaly detection procedure to detect infectious ransomware in as less as 5 seconds with almost zero false positives, using frequency analysis [32]. The proposed detection method works on any platform having HPCs. However, the tunable hyper-parameters will be different for different systems. The determination of values for these parameters is a one-time process, which will be accomplished during the training of autoencoders. RATAFIA uses a template of the normal system behavior in terms of HPC values to train the autoencoders. The advantage of using HPCs is that they are difficult to tamper with. While one may increase some HPC values by a program, it is difficult to reduce the HPC values without explicitly targeting the HPC registers.
E. RANSOMWALL
RansomWall protects against cryptographic ransomware using a layered defense system. The features that are generated during the sample's execution aid in organizing the layers in a computational order. It is implemented solely for Windows operating system. RansomWall's architecture models that of a hybrid approach utilizing a joint static and dynamic analysis to compute values of the selected feature set [33]. The Machine Learning Engine is used to develop a generalized model which is effective against zero-day ransomware attacks. It takes feature values collected by static, dynamic and trap layers as input and classifies the executable as ransomware or benign. The engine is trained offline using supervised algorithms and the training data consists of feature values with ransomware and benign labels. The Trained Machine Learning Engine then uses the learned model to classify executables in real-time based on input feature values. The following supervised machine learning algorithms are evaluated based on performance: Logistic Regression, SVM, ANN, Random Forests, and Gradient Tree Boosting. The ransomware sample set has a 12-Fold Cross Validation performed on it. In each test run, Machine Learning Layer is trained on all samples from 11 out of 12 Cryptographic ransomware families and 221 out of 442 samples from benign software [33]. The learned model is tested against remaining benign samples on the evaluation setup and all samples from the last ransomware family. Since most of the successful ransomware attacks are zero-day intrusions, this process of evaluation is selected, with samples from an entirely new ransomware family or its upgraded variant. During the evaluation, the functionality of the File Backup Layer is verified to check if the files are correctly backed up for suspicious processes after receiving classification output from the Machine Learning Layer.
The metrics show the best results with Gradient Tree Boosting Algorithm. RansomWall attains a detection rate of 98.25% with near-zero false positives with this algorithm. The Gradient Tree Boosting algorithm provides effective handling of heterogeneous data, high predictive power and robustness to outliers resulting in high performance [GG]. Analysis of false negatives show that two ransomware samples abruptly terminated after encrypting only a few files. As the resulting file system activity is reduced, samples do not get detected. Limited file system activity is leading to false negatives due to the low number of such files on the user's system. The rest of the false negatives come from decision boundary errors.
F. CRYPTOKNIGHT
CryptoKnight was built to classify cryptographic primitives in compiled binary executables using the Dynamic Convolutional Neural Network (DCNN) algorithm. It introduces a learning system that can easily integrate new samples through the scalable synthesis of customizable cryptographic algorithms. CryptoKnight's architecture is intended to limit human interaction, allowing the structure of an effective model at run-time [34]. The entire system is comprised of three stages:
1. Procedural generation guides the synthesis of unique cryptographic binaries with variable obfuscation and alternate compilation.
2. Assumptions of cryptographic code aid the discrimination of diagnostics from the dynamic analysis of synthetic or reference binaries to build an 'image' of execution.
3. A DCNN fits variable-length matrices for ease of training and the immediate classification of new samples.
CryptoKnight was tested on many applications using non-library linked functionality and analysis showed that it is a viable solution that can quickly learn from new cryptographic execution patterns to classify unknown software [34]. CryptoKnight also demonstrated that it could classify results faster compared to that of previous methodologies and is considerably re-usable. At a 96% accuracy rate, CryptoKnight confirmed that cryptographic primitive classification in compiled binary executables could be successfully achieved using a DCNN algorithm.
G. ELDERAN
In 2016, EldeRan was developed to identify the most significant ransomware features and use them to detect ransomware [35]. The framework is based on the observation of actions or events that typically occur within ransomware and goodware samples in its early stages. Ransomware and goodware sample datasets are dynamically analyzed in a sandbox environment first. From the two datasets, EldeRan retrieves and analyses the following classes of features: Windows API calls, Registry Key Operations, File System Operations, the set of file operations performed per File Extension, Directory Operations, Dropped Files, and Strings. Other than Strings, the rest of the features are collected while dynamically analyzing the ransomware. Once the monitoring phase has completed, the Mutual Information criterion [37], a feature selection algorithm is used to choose the ones that are most relevant. The feature selection process is not always used in machine learning algorithms, but for EldeRan, it helps with performance and provides more competence in the algorithm [38]. Finally, the matrices containing these features are used in a Regularized Logistic Regression classifier. This classifier will return ransomware or goodware once detected and is also run online on a PC to classify new samples, which can come from infected websites or multiple infected vectors. The training set is analyzed offline and completed within minutes in the sandbox environment while new applications are classified at run-time through an online classifier, which is also fast [36].
EldeRan was conducted in three different experiments. The first experiment tested how performant EldeRan was compared to two other classifiers, SVM (Support Vector Machine) and NB (Naive Bayes) [36]. It was determined that both SVM and EldeRan outperformed NB, and EldeRan slightly edged SVM. These metrics were evaluated against the AUC (Area Under the Curve) using between 50 and 1500 features, all supported by MI criterion. A structure of 100 random splits was introduced for each explored value, where 80% of the samples were used for training and 20% as test samples. It is determined that all three classifiers show maximum performance at 400 features, and the accuracy showed no improvement beyond that number [36]. The second experiment observes the performance of the previous classifiers along with VirusTotal. The top 400 features were used for the original three classifiers, and the test covered all the methods averaging the results over 100 independent train/test splits with 80% of samples for training. It is determined VirusTotal shows better performance when compared to the other algorithms, although EldeRan is just slightly behind. The last experiment tests how effective EldeRan can detect new families of ransomware. For new families of ransomware, it is common for them to share the same characteristics and goals of previous classes [37]. Datasets were clustered into 11 classes using their known family name, since the naming conventions of the antivirus (AV) vendors are not always consistent or compatible amongst them. Two cases were considered by selecting the top 100 and 400 features according to the MI criterion. For eight of the ransomware families the detection rate is above 90% and for ten families the detection rate is above 80%, both occurring when using 100 features. When using 400 features, the detection rates become worse with only five families achieving 90% and eight families achieving 80%. The average detection rate is higher (93.3%) when using only the top 100 features than in the case of using 400 features (87.1%) [36]. Figure 12 below shows an average ROC of the test samples.
FIGURE 12. Average ROC for the test samples over 100 random splits for EldeRan, the SVM, NB, and VirusTotal [36].
Some limitations of EldeRan existed. If the ransomware remained silent or waited for user interaction within the sandbox environment, EldeRan does not properly extract the ransomware, therefore, goes undetected. Secondly, no other applications were running within the sandbox environment, which was purposely done to eliminate the ransomware checks to evade detection. Lastly, the original ransomware and goodware data samples were limited because EldeRan could not process empty API calls efficiently during the dynamic analysis phase. This reduced the dataset by half the samples. Ultimately, EldeRan can only detect ransomware once the infection occurs [39].
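The EldeRan pipeline described in this section (rank features by the Mutual Information criterion, keep the top k, then fit a Regularized Logistic Regression on an 80/20 random split) can be illustrated with a small, dependency-free sketch. Everything concrete below is an assumption made for illustration: the features are taken to be binary, the toy dataset is synthetic and unrelated to the EldeRan samples, the regularizer is assumed to be L2, and the function names and hyperparameters (`l2`, `lr`, `epochs`) are hypothetical rather than taken from [36].

```python
import math
import random


def mutual_information(xs, ys):
    """Mutual information (in nats) between a binary feature and binary labels."""
    n = len(xs)
    mi = 0.0
    for xv in (0, 1):
        px = sum(1 for x in xs if x == xv) / n
        for yv in (0, 1):
            py = sum(1 for y in ys if y == yv) / n
            pxy = sum(1 for x, y in zip(xs, ys) if x == xv and y == yv) / n
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    return mi


def select_top_k(X, y, k):
    """Indices of the k columns of X with the highest MI against y."""
    scores = [mutual_information([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, l2=0.01, lr=0.5, epochs=300):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, row):
    """Return 1 ("ransomware") or 0 ("goodware")."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, row)) >= 0 else 0


def make_toy_data(n=200):
    """Synthetic stand-in for sandbox features; NOT the EldeRan dataset."""
    X, y = [], []
    for i in range(n):
        label = (i >> 2) & 1                           # 1 = ransomware, 0 = goodware
        noisy = 1 - label if i % 10 == 0 else label    # agrees with the label 90% of the time
        X.append([label, noisy, i & 1, (i >> 1) & 1])  # last two columns are uninformative
        y.append(label)
    return X, y


def run_experiment(k=2, seed=0):
    """One random 80/20 split: MI selection on the training part, accuracy on the rest."""
    X, y = make_toy_data()
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    cut = int(0.8 * len(order))
    train, test = order[:cut], order[cut:]
    selected = select_top_k([X[i] for i in train], [y[i] for i in train], k)
    project = lambda row: [row[j] for j in selected]
    w, b = train_logreg([project(X[i]) for i in train], [y[i] for i in train])
    correct = sum(predict(w, b, project(X[i])) == y[i] for i in test)
    return selected, correct / len(test)


if __name__ == "__main__":
    print(run_experiment())
```

Selecting features on the training portion only, as done in `run_experiment`, keeps the held-out 20% from influencing which features are kept; repeating the call with different `seed` values mimics the averaging over many random splits mentioned in the text.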
In [65], the authors present a new type of framework called Detection Avoidance Mitigation (DAM). It handles classification, detection, and mitigation together. Its architecture consists of typical detection techniques using static and dynamic analysis, avoidance techniques such as system updates and patches, and mitigation techniques such as reverse engineering. DAM evaluated different strategies for preventing ransomware attacks and widespread financial losses, and found that avoidance techniques are the most desirable in protecting users and organizations from ransomware.
Lastly, the first blockchain-based ransomware schemes were introduced in [66]. The authors focused on smart contracts that support paying for single files or refunding the ransom payment to the victim if the decryption keys are not sent within a reasonable time. The results of this research showed no practical countermeasures when using public blockchains; this is a concern, and more research in this area is therefore needed.
V. CONCLUSION
In recent years, ransomware has remained a leading topic in cybersecurity, and attacks now target not only individuals but organizations as well. Ransomware has evolved from elementary scareware and locker-related user interfaces to cryptographic and fileless ransomware. In this paper, we provide a comprehensive survey of ransomware types, common frameworks that are used to detect ransomware, and the ML algorithms that they use. A detailed list of all pertinent information is gathered and arranged in a table. Though other research papers have provided reviews with similar concepts, those surveys have not captured the explicit details in one place as this research does.
By collecting such material and providing a comparative study, this paper provides a means for others to foresee an area of interest and investigate parts where improvements can be made due to poor results or limitations. Ultimately, this paper can provide direction to those who are looking to utilize one of the mentioned frameworks for advancement in future work.
REFERENCES
[1] L. Abrams. (2020). SunCrypt Ransomware Shuts Down North Carolina School District. Accessed: Jan. 11, 2021. [Online]. Available: https://www.bleepingcomputer.com/news/security/suncrypt-ransomware-shuts-down-north-carolina-schooldistrict/
[2] BBC News. (2020). Northumbria University Hit by Cyber Attack. Accessed: Jan. 11, 2021. [Online]. Available: https://www.bbc.com/news/uk-england-tyne-53989404
[3] B. Fraga. (2013). Swansea Police Pay $750 ‘Ransom’ After Computer Virus Strikes. Accessed: Jan. 11, 2021. [Online]. Available: https://www.heraldnews.com/x2132756948/Swansea-police-pay-750-ransom-after-computer virus-strikes
[4] L. Freedman. (2020). Ransomware Attacks Predicted to Occur Every 11 Seconds in 2021 With a Cost of $20 Billion. Accessed: Jan. 11, 2021. [Online]. Available: https://www.dataprivacyandsecurityinsider.com/2020/02/ransomwareattacks-predicted-to-occur-every-11-seconds-in-2021-with-a-cost-of-20-billion/
[5] K. Savage, P. Coogan, and H. Lau. (2015). The Evolution of Ransomware. [Online]. Available: https://its.fsu.edu/sites/g/files/imported/storage/images/information-security-and-privacy-office/the-evolution-of-ransomware
[6] I. Segun, B. I. Ujioghosa, S. O. Ojewande, F. O. Sweetwilliams, S. N. John, and A. A. Atayero, “Ransomware: Current trend, challenges, and research directions,” in Proc. World Congr. Eng. Comput. Sci., 2017, pp. 169–174.
[7] A. Kharraz, S. Arshad, C. Mulliner, W. Robertson, and E. Kirda, “UNVEIL: A large-scale, automated approach to detecting ransomware,” in Proc. 25th USENIX Secur. Symp., 2016, pp. 757–772.
[8] D. Y. Huang, M. M. Aliapoulios, V. G. Li, L. Invernizzi, E. Bursztein, K. McRoberts, J. Levin, K. Levchenko, A. C. Snoeren, and D. McCoy, “Tracking ransomware end-to-end,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2018, pp. 618–631.
[9] L. Abrams. (Jan. 4, 2016). Ransom32 is the First Ransomware Written in Javascript. BleepingComputer. Accessed: Jan. 12, 2021. [Online]. Available: https://www.bleepingcomputer.com/news/security/ransom32-is-the-first-ransomware-written-in-javascript/
[10] KnowBe4. (2021). Ransom32 Ransomware-as-a-Service. Accessed: Jan. 12, 2021. [Online]. Available: https://www.knowbe4.com/ransom-32-ransomware-as-a-service
[11] S. Sjouwerman. (Feb. 5, 2019). First Javascript-Only Ransomware as a Service Poses New Threat. TechBeacon. Accessed: Jan. 12, 2021. [Online]. Available: https://techbeacon.com/security/first-javascript-only-ransomware-service-poses-new-threat
[12] M. J. Schwartz and R. Ross. (Jun. 20, 2016). Latest Ransomware Relies on JavaScript. Bank Information Security. Accessed: Dec. 2, 2021. [Online]. Available: https://www.bankinfosecurity.com/latest-ransomware-relies-on-javascript-a-9212
[13] (Jun. 16, 2016). New RAA Ransomware Uses Only JavaScript to Infect Computers. Accessed: Jan. 12, 2021. [Online]. Available: https://www.trendmicro.com/vinfo/mx/security/news/cybercrime-and-digital-threats/new-raa-ransomware-uses-only-javascript-to-infect-computers
[14] J. Tolbert. (2020). Malicious Actors Exploiting Coronavirus Fears. Accessed: Jan. 12, 2021. [Online]. Available: https://www.kuppingercole.com/blog/tolbert/maliciousactors-exploiting-coronavirus-fears
[15] B. Crothers. (2020). Apps Designed to Track COVID-19 Might be Full of Ransomware, Report Says. [Online]. Available: https://www.foxnews.com/tech/apps-track-covid-19-full-ransomware
[16] Acronis. (2020). Digital CoronaVirus: Yet Another Ransomware Combined With Infostealer. Accessed: Jan. 12, 2021. [Online]. Available: https://www.cbronline.com/news/tesla-cyber-attack
[17] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer-Verlag, 2006.
[18] N. Chauhan. (Jan. 2020). Decision Tree Algorithm, Explained. KDnuggets. Accessed: Jan. 22, 2021. [Online]. Available: https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
[19] N. Donges. (Jun. 22, 2021). A Complete Guide to the Random Forest Algorithm. Accessed: Jan. 22, 2021. [Online]. Available: https://builtin.com/data-science/random-forest-algorithm
[20] O. Mbaabu. (Dec. 11, 2020). Introduction to Random Forest in Machine Learning. Accessed: Jan. 22, 2021. [Online]. Available: https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
[21] J. Brownlee. (Jul. 7, 2021). A Gentle Introduction to Long Short-Term Memory Networks by the Experts. Machine Learning Mastery. Accessed: Jan. 24, 2021. [Online]. Available: https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/
[22] S. Saxena. (Mar. 16, 2021). Introduction to Long Short Term Memory (LSTM). Analytics Vidhya. Accessed: Jan. 24, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-long-short-term-memory-lstm/
[23] S. Ray. (Sep. 11, 2017). 6 Easy Steps to Learn Naive Bayes Algorithm With Codes in Python and R. Analytics Vidhya. Accessed: Jan. 24, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[24] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Mach. Learn., vol. 29, pp. 103–130, Nov. 1997.
[25] C. Li. (2016). A Gentle Introduction to Gradient Boosting. Accessed: Jan. 26, 2021. [Online]. Available: https://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
[26] R. Gandhi. (Jul. 7, 2018). Support Vector Machine—Introduction to Machine Learning Algorithms. Accessed: Jan. 26, 2021. [Online]. Available: https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learningalgorithms-934a444fca47
[27] Y. A. Ahmed, B. Kocer, and B. A. S. Al-rimy, “Automated analysis approach for the detection of high survivable ransomware,” KSII Trans. Internet Inf. Syst., vol. 14, no. 5, pp. 2236–2257, 2020, doi: 10.3837/TIIS.2020.05.021.
[28] F. Khan, C. Ncube, L. K. Ramasamy, S. Kadry, and Y. Nam, “A digital DNA sequencing engine for ransomware detection using machine learning,” IEEE Access, vol. 8, pp. 119710–119719, 2020.
[29] J. Pedersen, D. Bastola, K. Dick, R. Gandhi, and W. Mahoney, “Blast your way through malware analysis assisted by bioinformatics tools,” in Proc. Int. Conf. Secur. Manage., 2012, p. 1.
[30] H. Zuhair and A. Selamat, “RANDS: A machine learning-based anti-ransomware tool for Windows platforms,” in Advancing Technology Industrialization Through Intelligent Software Methodologies, Tools and Techniques, vol. 318, 2019.
[31] N. Hampton, Z. Baig, and S. Zeadally, “Ransomware behavioural analysis on Windows platforms,” J. Inf. Secur. Appl., vol. 40, pp. 44–51, Jun. 2018.
[32] M. Alam, S. Bhattacharya, S. Dutta, S. Sinha, D. Mukhopadhyay, and A. Chattopadhyay, “RATAFIA: Ransomware analysis using time and frequency informed autoencoders,” in Proc. IEEE Int. Symp. Hardw. Oriented Secur. Trust (HOST), May 2019, pp. 218–227, doi: 10.1109/HST.2019.8740837.
[33] S. K. Shaukat and V. J. Ribeiro, “RansomWall: A layered defense system against cryptographic ransomware attacks using machine learning,” in Proc. 10th Int. Conf. Commun. Syst. Netw. (COMSNETS), Jan. 2018, pp. 356–363.
[34] G. Hill and X. Bellekens, “CryptoKnight: Generating and modelling compiled cryptographic primitives,” Information, vol. 9, no. 9, p. 231, Sep. 2018.
[35] Z.-G. Chen, H.-S. Kang, S.-N. Yin, and S.-R. Kim, “Automatic ransomware detection and analysis based on dynamic API calls flow graph,” in Proc. Int. Conf. Res. Adapt. Convergent Syst., Sep. 2017, pp. 196–201, doi: 10.1145/3129676.3129704.
[36] D. Sgandurra, L. Muñoz-González, R. Mohsen, and E. C. Lupu, “Automated dynamic analysis of ransomware: Benefits, limitations and use for detection,” 2016, arXiv:1609.03020.
[37] A. Kharraz, W. Robertson, D. Balzarotti, L. Bilge, and E. Kirda, “Cutting the gordian knot: A look under the hood of ransomware attacks,” in Detection of Intrusions and Malware, and Vulnerability Assessment. Cham, Switzerland: Springer, 2015, pp. 3–24, doi: 10.1007/978-3-319-20550-2_1.
[38] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” J. Mach. Learn. Res., vol. 7, pp. 2721–2744, Dec. 2006.
[39] G. Cusack, O. Michel, and E. Keller, “Machine learning-based detection of ransomware using SDN,” in Proc. ACM Int. Workshop Secur. Softw. Defined Netw. Netw. Function Virtualization, Mar. 2018, pp. 1–6, doi: 10.1145/3180465.3180467.
[40] S. Homayoun, A. Dehghantanha, M. Ahmadzadeh, S. Hashemi, R. Khayami, K.-K. R. Choo, and D. E. Newton, “DRTHIS: Deep ransomware threat hunting and intelligence system at the fog layer,” Future Gener. Comput. Syst., vol. 90, pp. 94–104, Jan. 2019.
[41] QuoIntelligence. (Jan. 18, 2022). Ransomware is Here to Stay and Other Cybersecurity Predictions for 2022. Accessed: Jan. 31, 2021. [Online]. Available: https://quointelligence.eu/2022/01/ransomware-and-other-cybersecurity-predictions-for-2022/
[42] D. Golden and K. Norton. (2021). Defending Against Ransomware in an Age of Emerging Technology. Deloitte. Accessed: Jan. 31, 2021. [Online]. Available: https://www2.deloitte.com/us/en/pages/risk/articles/defending-against-ransomware.html
[43] L. Simonovich. (Jan. 15, 2020). Are Utilities Doing Enough to Protect Themselves From Cyberattack? World Economic Forum. Accessed: Apr. 4, 2021. [Online]. Available: https://www.weforum.org/agenda/2020/01/are-utilities-doing-enough-to-protect-themselves-from-cyberattack/
[44] APWG. (May 11, 2020). Phishing Activity Trends Report in Q1 of 2020. Accessed: Apr. 4, 2021. [Online]. Available: https://docs.apwg.org/reports/apwg_trends_report_q1_2020.pdf
[45] Q. Chen and R. A. Bridges, “Automated behavioral analysis of malware: A case study of WannaCry ransomware,” in Proc. 16th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2017, pp. 454–460, doi: 10.1109/ICMLA.2017.0-119.
[46] (May 22, 2017). WannaCry Ransomware Campaign Exploiting SMB Vulnerability. Accessed: Apr. 4, 2021. [Online]. Available: https://cert.europa.eu/static/SecurityAdvisories/2017/CERT-EU-SA2017-012.pdf
[47] M. Akbanov, V. G. Vassilakis, and M. D. Logothetis, “WannaCry ransomware: Analysis of infection, persistence, recovery prevention and propagation mechanisms,” J. Telecommun. Inf. Technol., vol. 1, no. 2019, pp. 113–124, Apr. 2019.
[48] L. J. Trautman and P. Ormerod, “Wannacry, ransomware, and the emerging threat to corporations,” SSRN Electron. J., vol. 86, p. 503, Jan. 2018, doi: 10.2139/ssrn.3238293.
[49] S. Jones and T. Bradshaw. (May 14, 2017). Global Alert to Prepare for Fresh Cyber Attacks. Accessed: Apr. 4, 2021. [Online]. Available: https://www.ft.com/content/bb4dda38-389f-11e7-821a-6027b8a20f23
[50] M. V. Liy. (May 15, 2017). Putin Culpa a Los Servicios Secretos de EE UU Por el Virus ‘WannaCry’ Que Desencadenó el Ciberataque Mundial. Accessed: Apr. 4, 2021. [Online]. Available: https://elpais.com/internacional/2017/05/15/actualidad/1494855826_022843.html
[51] S. K. Sahi, “A study of wannacry ransomware attack,” Int. J. Eng. Res. Comput. Sci. Eng., vol. 4, no. 9, pp. 5–7, 2017.
[52] R. Collier, “NHS ransomware attack spreads worldwide,” Can. Med. Assoc. J., vol. 189, no. 22, pp. E786–E787, 2017.
[53] JavaScript|MDN. (Feb. 18, 2022). JavaScript Language Resources—JavaScript: MDN. Accessed: Apr. 4, 2021. [Online]. Available: https://developer.mozilla.org/enUS/docs/Web/JavaScript/Language_Resources
[54] J. Gerend. (Mar. 3, 2021). Wscript. Microsoft Docs. Accessed: Apr. 4, 2021. [Online]. Available: https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/wscript
[55] T. McIntosh, A. S. M. Kayes, Y.-P.-P. Chen, A. Ng, and P. Watters, “Ransomware mitigation in the modern era: A comprehensive review, research challenges, and future directions,” ACM Comput. Surv., vol. 54, no. 9, pp. 1–36, Dec. 2022, doi: 10.1145/3479393.
[56] H. Oz, A. Aris, A. Levi, and A. S. Uluagac, “A survey on ransomware: Evolution, taxonomy, and defense solutions,” ACM Comput. Surv., vol. 54, no. 11s, pp. 1–37, Jan. 2022, doi: 10.1145/3514229.
[57] CIS Security. (2019). Fall 2019 Threat of the Quarter: Ryuk Ransomware. Accessed: Apr. 5, 2021. [Online]. Available: https://www.cisecurity.org/white-papers/fall-2019-threat-ofthe-quarter-ryuk-ransomware/
[58] H. Ke, H. Wu, and D. Yang, “Towards evolving security requirements of industrial internet: A layered security architecture solution based on data transfer techniques,” in Proc. Int. Conf. Cyberspace Innov. Adv. Technol., New York, NY, USA, Dec. 2020, pp. 504–511, doi: 10.1145/3444370.3444620.
[59] Trend Micro. What is Ryuk Ransomware. Accessed: Apr. 5, 2021. [Online]. Available: https://www.trendmicro.com/en_us/what-is/ransomware/ryuk-ransomware.html
[60] WannaCry Ransomware. (May 15, 2017). WannaCry Ransomware—LogRhythm. Accessed: Apr. 14, 2021. [Online]. Available: https://logrhythm.com/blog/wannacry-ransomware/
[61] A. Kujawa. (Jan. 8, 2019). Ryuk Ransomware Attacks Businesses Over the Holidays. Malwarebytes Labs. Accessed: Apr. 14, 2021. [Online]. Available: https://blog.malwarebytes.com/cybercrime/malware/2019/01/ryuk-ransomware-attacks-businesses-over-the-holidays/
[62] R. Nimbalkar. (Jul. 13, 2021). Decision Tree Algorithms-Machine Learning. Accessed: Apr. 14, 2021. [Online]. Available: https://medium.com/appengine-ai/decision-tree-algorithms-machine-learning-9e2e8cadfcae
[63] S. India. (Jul. 4, 2020). Hands-on Training With Machine Learning Algorithms: Decision Tree and Random Forest. Springboard Blog. Accessed: Apr. 14, 2021. [Online]. Available: https://in.springboard.com/blog/machine-learning-algorithms-decision-tree-random-forest/
[64] G. Van Houdt, C. Mosquera, and G. Nápoles, “A review on the long short-term memory model,” Artif. Intell. Rev., vol. 53, no. 8, pp. 5929–5955, 2020, doi: 10.1007/s10462-020-09838-1.
[65] A. Kapoor, A. Gupta, R. Gupta, S. Tanwar, G. Sharma, and I. E. Davidson, “Ransomware detection, avoidance, and mitigation scheme: A review and future directions,” Sustainability, vol. 14, no. 1, p. 8, Dec. 2021.
[66] O. Delgado-Mohatar, J. M. Sierra-Cámara, and E. Anguiano, “Blockchain-based semi-autonomous ransomware,” Future Gener. Comput. Syst., vol. 112, pp. 589–603, Nov. 2020.
DARYLE SMITH was born in Lenoir, NC, USA, in 1985. He received the B.S. and M.S. degrees in computer science from Winston-Salem State University. He is currently pursuing the Ph.D. degree in computer science with North Carolina A&T State University. He started his IT career as a Software Engineer in 2010 and has been involved in every aspect of e-commerce since. He is currently an E-Commerce Architect with the Peapod Digital Laboratories, Salisbury, NC headquarters.
SAJAD KHORSANDROO received the Ph.D. degree in computer science from The University of Texas at San Antonio, in 2019. Currently, he is an Assistant Professor of computer science at North Carolina A&T State University, where he is also an Associate Director of the Cyber Defense and AI Laboratory. He has already secured $1.7M in funds from NSF, DoD, Palo Alto Networks, and Carolina Cyber Center. His current research interests include the application of AI/ML in cyber security, next-generation network infrastructures, cloud computing, and secure cyber physical systems.
KAUSHIK ROY is currently a Professor and the Interim Chair of the Department of Computer Science, North Carolina A&T State University (NCAT). He has over 140 publications, including 35 journal articles and a book. His current research interests include cybersecurity, cyber identity, biometrics, machine learning (deep learning), data science, the IoT, cyber-physical systems, and big data analytics. His research is funded by the National Science Foundation (NSF), Department of Defense (DoD), National Security Agency (NSA), and Department of Energy (DoE). He is the Director of the Center for Cyber Defense (CCD). He also directs the Cyber Defense and AI Laboratory.