Malware Detection Using Machine Learning and Performance Evaluation
Master of Computer Application
Submitted by
STUDENT_NAME
ROLL_NO
Under the esteemed guidance of
GUIDE_NAME
Assistant Professor
CERTIFICATE
This is to certify that the project report entitled “PROJECT_NAME” is the bonafide record of
project work carried out by STUDENT_NAME, a student of this college, during the academic
year 2014 - 2016, in partial fulfillment of the requirements for the award of the degree of Master
of Computer Application from St. Mary's Group of Institutions Guntur of Jawaharlal Nehru
Technological University, Kakinada.
GUIDE_NAME, Asst. Professor (Project Guide)
Associate Professor (Head of Department, CSE)
DECLARATION
We hereby declare that the project report entitled “PROJECT_NAME” is an original work done at
St. Mary's Group of Institutions Guntur, Chebrolu, Guntur, and submitted in fulfillment of the
requirements for the award of Master of Computer Application to St. Mary's Group of Institutions
Guntur, Chebrolu, Guntur.
STUDENT_NAME
ROLL_NO
ACKNOWLEDGEMENT
We consider it a privilege to thank all those who helped us in the successful
completion of the project “PROJECT_NAME”. We extend special gratitude to our guide
GUIDE_NAME, Asst. Professor, whose stimulating suggestions, encouragement, and
comprehensive assistance helped us to coordinate our project, especially in
writing this report.
We would also like to acknowledge with much appreciation the crucial role of our
Co-ordinator GUIDE_NAME, Asst. Professor, in helping us complete our project.
We thank you for being such a wonderful educator as well as a person.
We express our heartfelt thanks to HOD_NAME, Head of the Department, CSE, for his
spontaneous sharing of knowledge, which helped us in carrying this project through
the academic year.
STUDENT_NAME
ROLL_NO
CONTENTS
1 INTRODUCTION................................................................................................5
2 THEORETICAL BACKGROUND.....................................................................6
2.1 Malware types.............................................................................................6
2.2 Detection methods.......................................................................................8
2.3 Need for machine learning........................................................................10
2.4 Related work.............................................................................................11
3 MACHINE LEARNING METHODS................................................................12
3.1 Machine Learning Basics..........................................................................12
3.1.1 Feature extraction........................................................................14
3.1.2 Supervised and Unsupervised Learning......................................15
3.2 Classification methods..............................................................................16
3.2.1 K-nearest neighbours...................................................................17
3.2.2 Support Vector Machines............................................................19
3.2.3 Naive Bayes.................................................................................21
3.2.4 J48 Decision Tree........................................................................22
3.2.5 Random Forest.............................................................................24
3.3 Cross-validation........................................................................................26
4 PRACTICAL PART...........................................................................................27
4.1 Data...........................................................................................................28
4.1.1 Dridex..........................................................................................28
4.1.2 Locky...........................................................................................30
4.1.3 Teslacrypt....................................................................................32
4.1.4 Vawtrak........................................................................................34
4.1.5 Zeus..............................................................................................36
4.1.6 DarkComet...................................................................................37
4.1.7 CyberGate....................................................................................38
4.1.8 Xtreme.........................................................................................39
4.1.9 CTB-Locker.................................................................................40
4.2 Cuckoo Sandbox.......................................................................................41
4.2.1 Scoring system.............................................................................44
4.2.2 Reports and features....................................................................46
4.3 Feature representation...............................................................................48
4.3.1 Binary representation...................................................................49
4.3.2 Frequency representation.............................................................49
4.3.3 Combining representation............................................................50
4.4 Feature selection.......................................................................................50
4.5 Implementation.........................................................................................51
4.5.1 Sandbox configuration.................................................................52
4.5.2 Feature extraction........................................................................52
4.5.3 Feature selection..........................................................................54
4.5.4 Application of machine learning methods...................................55
5 RESULTS AND DISCUSSION........................................................................56
5.1 K-Nearest Neighbors.................................................................................56
5.2 Support Vector Machines..........................................................................59
5.3 J48 Decision Tree......................................................................................62
5.4 Naive Bayes..............................................................................................67
5.5 Random Forest..........................................................................................69
6 CONCLUSIONS................................................................................................72
6.1. Future Work..............................................................................................73
BIBLIOGRAPHY.........................................................................................................75
APPENDICES..............................................................................................................80
1. Feature Extraction Code (python).............................................................80
2. Feature selection code (R).........................................................................85
3. Classification code (R)..............................................................................86
4. List of MD5 hashes of malware samples..................................................89
1 INTRODUCTION
With the rapid development of the Internet, malware became one of the major cyber
threats nowadays. Any software performing malicious actions, including information
stealing, espionage, etc. can be referred to as malware. Kaspersky Labs (2017) define
malware as “a type of computer program designed to infect a legitimate user's
computer and inflict harm on it in multiple ways.”
While the diversity of malware is increasing, anti-virus scanners cannot fulfill the
need for protection, resulting in millions of hosts being attacked. According to
Kaspersky Labs (2016), 6 563 145 different hosts were attacked, and 4 000 000
unique malware objects were detected in 2015. In turn, Juniper Research (2016)
predicts the cost of data breaches to rise to $2.1 trillion globally by 2019.
In addition, the skill level required for malware development is decreasing, due to
the high availability of attack tools on the Internet. The ready availability of
anti-detection techniques, as well as the ability to buy malware on the black market,
makes it possible for anyone to become an attacker, regardless of skill level.
Current studies show that more and more attacks are issued by script kiddies or are
automated (Aliyev 2010).
The goal of this project is to develop a proof of concept for machine learning
based malware classification built on Cuckoo Sandbox. The sandbox will be utilized
to extract the behavior of the malware samples, which will be used as
input to the machine learning algorithms. The goal is to
determine the best feature representation method, how the features should be
extracted, and the most accurate algorithm that can distinguish the malware families
with the lowest error rate.
The accuracy will be measured both for detecting whether a file is malicious and
for classifying the file into a malware family. The accuracy of the obtained
results will also be assessed against the current scoring implemented in Cuckoo
Sandbox, and a decision on which method performs better will be made. The study
will allow building an additional detection module for Cuckoo Sandbox. However,
the implementation of this module is beyond the scope of this project and will not
be discussed in this paper.
2 THEORETICAL BACKGROUND
This chapter provides the background that is essential to understand the malware
detection and the need for machine learning methods. The malware types relevant to
the study are described first, followed by the standard malware detection methods.
After that, based on the knowledge gained, the need for machine learning is discussed,
along with the relevant work performed in this field.
2.1 Malware types
To have a better understanding of the methods and logic behind malware, it is
useful to classify it. Malware can be divided into several classes depending on its
purpose. The classes are as follows:
Virus. This is the simplest form of malware. It is any piece of software
that is loaded and launched without the user's permission while reproducing itself
or infecting (modifying) other software (Horton and Seberry 1997).
Worm. This malware type is very similar to the virus. The difference is that a
worm can spread over the network and replicate itself to other machines (Smith, et
al. 2009).
Trojan. This class covers malware types that aim to appear as legitimate
software. Because of this, the general spreading vector in this class is social
engineering, i.e. making people think that they are downloading legitimate
software (Moffie, et al. 2006).
Spyware. As the name implies, malware that performs espionage can be referred
to as spyware. Typical actions of spyware include tracking search history to send
personalized advertisements and tracking activities to subsequently sell them to
third parties (Chien 2005).
Rootkit. Its functionality enables the attacker to access data with higher
permissions than allowed. For example, it can be used to give an unauthorized user
administrative access. Rootkits always hide their existence and are quite often
unnoticeable on the system, making detection, and therefore removal, incredibly
hard. (Chuvakin 2003).
Keylogger. The idea behind this malware class is to log all the keys pressed
by the user, and, therefore, store all data, including passwords, bank card
numbers and other sensitive information (Lopez, et al. 2013).
Ransomware. This type of malware aims to encrypt all the data on the
machine and ask a victim to transfer some money to get the decryption key.
Usually, a machine infected by ransomware is “frozen” as the user cannot open
any file, and the desktop picture is used to provide information on attacker’s
demands. (Savage, Coogan and Lau 2015).
2.2 Detection methods
All malware detection techniques can be divided into signature-based and
behavior-based methods. Before going into these methods, it is essential to
understand the basics of the two malware analysis approaches: static and dynamic
malware analysis. As the name implies, static analysis is performed “statically”,
i.e. without executing the file. In contrast, dynamic analysis is conducted while
the file is being executed, for example in a virtual machine.
Static analysis can be viewed as “reading” the source code of the malware and
trying to infer the behavioral properties of the file. Static analysis can include various
techniques (Prasad, Annangi and Pendyala 2016):
1. File Format Inspection: file metadata can provide useful information. For
example, Windows PE (portable executable) files can provide much
information on compile time, imported and exported functions, etc.
2. String Extraction: this refers to the examination of the software output (e.g.
status or error messages) and inferring information about the malware
operation.
Static analysis often relies on certain tools. Beyond the simple analysis, they can
provide information on protection techniques used by malware. The main advantage
of static analysis is the ability to discover all possible behavioral scenarios.
Researching the code itself allows the researcher to see all ways of malware
execution, that are not limited to the current situation. Moreover, this kind of analysis
is safer than dynamic, since the file is not executed and it cannot result in bad
consequences for the system. On the other hand, static analysis is much more
time-consuming. For these reasons it is not usually used in real-world dynamic
environments, such as anti-virus systems, but is often used for research purposes, e.g.
when developing signatures for zero-day malware. (Prasad, Annangi and Pendyala
2016).
Another analysis type is dynamic analysis. Unlike static analysis, here the behavior
of the file is monitored while it is executing, and the properties and intentions of
the file are inferred from that information. Usually, the file is run in a virtual
environment, for example in a sandbox. During this kind of analysis, it is possible
to observe behavioral attributes such as opened files, created mutexes, etc.
Moreover, it is much faster than static analysis. On the other hand, dynamic
analysis only shows the behavioral scenario relevant to the current system
properties. For example, if our virtual machine has Windows 7 installed, the
results might differ from the same malware running under Windows 8.1. (Egele, et
al. 2012).
Now, having the background on malware analysis, we can define the detection
methods. The signature-based analysis is a static method that relies on pre-
defined signatures. These can be file fingerprints, e.g. MD5 or SHA1 hashes, static
strings, file metadata. The scenario of detection, in this case, would be as follows:
when a file arrives at the system, it is statically analyzed by the anti-virus software.
If any of the signatures is matched, an alert is triggered, stating that this file is
suspicious. Very often this kind of analysis is enough, since well-known malware
samples can often be detected based on hash values.
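To make the signature-matching scenario concrete, the following minimal sketch (a toy illustration; the signature database and sample bytes are invented, not a real anti-virus database) computes an MD5 fingerprint and checks it against a set of known signatures:

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the hex MD5 fingerprint of a byte string."""
    return hashlib.md5(data).hexdigest()

def is_known_malicious(data: bytes, signature_db: set) -> bool:
    """Signature-based check: flag the file if its hash is in the database."""
    return md5_of(data) in signature_db

# Toy database built from a sample we pretend is malicious.
bad_sample = b"drop all files"
db = {md5_of(bad_sample)}

assert is_known_malicious(bad_sample, db)               # exact copy is detected
assert not is_known_malicious(b"drop all filesX", db)   # one changed byte evades the signature
```

Note how changing a single byte of the file changes the hash completely, which is exactly why purely signature-based detection fails against polymorphic malware.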
However, attackers started to develop malware in a way that it can change its
signature. This malware feature is referred to as polymorphism. Obviously, such
malware cannot be detected using purely signature-based detection techniques.
Moreover, new malware types cannot be detected using signatures, until the signatures
are created. Therefore, AV vendors had to come up with another way of detection:
behavior-based, also referred to as heuristics-based, analysis. In this method, the
actual behavior of malware is observed during its execution, looking for signs of
malicious behavior: modifying host files, registry keys, establishing suspicious
connections. By itself, each of these actions cannot be a reasonable sign of malware,
but their combination can raise the level of suspiciousness of the file. There is some
threshold level of suspiciousness defined, and any file exceeding this level raises
an alert. (Harley and Lee 2009).
As stated before, malware detectors based on signatures can perform well on
previously known malware that has already been discovered by some anti-virus
vendor. However, they are unable to detect polymorphic malware, which has the
ability to change its signature, as well as new malware, for which signatures have
not been created yet. In turn, the accuracy of heuristics-based detectors is not
always sufficient for adequate detection, resulting in many false positives and
false negatives. (Baskaran and Ralescu 2016).
2.3 Need for machine learning
The need for new detection methods is dictated by the high spreading rate of
polymorphic viruses. One of the solutions to this problem is reliance on the
To take these correlations into account and provide more accurate detection, machine
learning methods can be used.
2.4 Related work
Although not widely implemented, the concept of machine learning methods for
malware detection is not new. Several studies have been carried out in this field,
aiming to determine the accuracy of different methods.
In his paper “Malware Detection Using Machine Learning”, Dragos Gavrilut aimed to
develop a detection system based on several modified perceptron algorithms. For
different algorithms, he achieved an accuracy of 69.90% to 96.18%. It should be
noted that the algorithms with the best accuracy also produced the highest number
of false positives: the most accurate one resulted in 48 false positives. The most
“balanced” algorithm, with appropriate accuracy and a low false-positive rate, had
an accuracy of 93.01%. (Gavrilut, et al. 2009).
The paper “Malware Detection Module using Machine Learning Algorithms to Assist
in Centralized Security in Enterprise Networks” discusses a detection method based
on a modified Random Forest algorithm in combination with Information Gain for
better feature representation. It should be noted that the data set consists purely of
portable executable files, for which feature extraction
is generally easier. The result achieved is an accuracy of 97% and a 0.03
false-positive rate. (Singhal and Raul 2015).
As can be seen, the studies ended up with different results. From this, we can
conclude that no unified methodology has been created yet, either for detection or
for feature representation. The accuracy of each separate case depends on the
specifics of the malware families used and on the actual implementation.
3 MACHINE LEARNING METHODS
This chapter gives the theoretical background on machine learning methods needed
for understanding the practical implementation. First, an overview of the machine
learning field is given, followed by a description of the methods relevant to this
study. These methods include k-Nearest Neighbors, Decision Trees, Random Forests,
Support Vector Machines and Naive Bayes.
3.1 Machine Learning Basics
The rapid development of data mining techniques and methods resulted in Machine
Learning forming a separate field of Computer Science. It can be viewed as a subclass
of the Artificial Intelligence field, where the main idea is the ability of a system
(computer program, algorithm, etc.) to learn from its own actions. It was first
described as the "field of study that gives computers the ability to learn without being
explicitly programmed" by Arthur Samuel in 1959. A more formal definition is given
by T. Mitchell: "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P if
its performance at tasks in T, as measured by P, improves with experience E."
(Mitchell 1997).
The basic idea of any machine learning task is to train a model, based on some
algorithm, to perform a certain task: classification, clusterization, regression, etc.
Training is done on an input dataset, and the model that is built is subsequently
used to make predictions. The output of such a model depends on the initial task
and the implementation. Possible applications are: given data about house
attributes, such as room number, size, and price, predict the price of a previously
unknown house; based on two datasets of medical images, healthy ones and ones with
a tumor, classify a pool of new images; cluster pictures of animals into several
clusters from an unsorted pool.
A typical machine learning workflow consists of the following steps:
1. Data intake. At first, the dataset is loaded from the file and is saved in
memory.
2. Data transformation. At this point, the data that was loaded at step 1 is
transformed, cleaned, and normalized to be suitable for the algorithm. Data is
converted so that it lies in the same range, has the same format, etc. At this
point feature extraction and selection, which are discussed further, are
performed as well. In addition to that, the data is separated into sets – ‘training
set’ and ‘test set’. Data from the training set is used to build the model, which
is later evaluated using the test set.
3. Model Training. At this stage, a model is built using the selected algorithm.
4. Model Testing. The model that was built or trained during step 3 is tested
using the test data set, and the produced result is used for building a new
model, that would consider previous models, i.e. “learn” from them.
5. Model Deployment. At this stage, the best model is selected (either after a
defined number of iterations or as soon as the needed result is achieved).
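The separation into training and test sets performed at step 2 can be sketched as follows (a minimal illustration; the 70/30 ratio and the fixed seed are arbitrary choices for the example):

```python
import random

def train_test_split(dataset, test_ratio=0.3, seed=42):
    """Shuffle the samples and separate them into a training and a test set."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = dataset[:]            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
assert len(train) == 70 and len(test) == 30
assert sorted(train + test) == data   # no sample is lost or duplicated
```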
3.1.1 Feature extraction
In any of the examples mentioned above, we should be able to extract the attributes
from the input data, so that it can be fed to the algorithm. For example, for the housing
prices case, data could be represented as a multidimensional matrix, where each
column represents an attribute and rows represent the numerical values for these
attributes. In the image case, data can be represented as an RGB value of each pixel.
Such attributes are referred to as features, and the matrix is referred to as feature
vector. The process of extracting data from the files is called feature extraction. The
goal of feature extraction is to obtain a set of informative and non-redundant data. It is
essential to understand that features should represent the important and relevant
information about our dataset since without it we cannot make an accurate prediction.
That is why feature extraction is often a non-obvious task, which requires a lot of
testing and research. Moreover, it is very domain-specific, so general methods apply
here poorly.
In addition to that, if the input data is too big to be fed into the algorithm (has too
many features), then it can be transformed to a reduced feature vector
(vector, having a smaller number of features). The process of reducing the vector
dimensions is referred to as feature selection. At the end of this process, we expect the
selected features to outline the relevant information from the initial set so that it can
be used instead of initial data without any accuracy loss.
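A minimal illustration of feature selection is dropping zero-variance features, which carry no information for prediction; this is only one of many possible selection criteria (the practical part of this work uses others):

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def select_features(X, threshold=0.0):
    """Keep the indices of columns whose variance exceeds the threshold;
    constant columns are identical for every sample and can be discarded."""
    n_cols = len(X[0])
    columns = [[row[i] for row in X] for i in range(n_cols)]
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

X = [[1.0, 5.0, 0.0],
     [1.0, 7.0, 2.0],
     [1.0, 6.0, 4.0]]
kept = select_features(X)
assert kept == [1, 2]   # column 0 is constant and gets dropped
```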
Common feature preprocessing steps include the following:
1. Normalization
An example of normalization is dividing an image x, where x_i is the
number of pixels with color i, by the total number of counts to encode the
distribution and remove the dependence on the size of the image.
This translates into the formula: x' = x / ||x|| (Guyon and Elisseef 2006).
2. Standardization
Sometimes, even while referring to comparable objects, features can have
different scales. For example, consider the housing prices example. Here,
feature ‘room size’ is an integer, probably not exceeding 5 and feature ‘house
size’ is measured in square meters. Although both values can be compared,
added, multiplied, etc., the result would be unreasonable before normalization.
The following scaling is often done:
x'_i = (x_i − µ_i) / σ_i, where µ_i and σ_i are the mean and the standard deviation
of feature x_i over the training examples. (Guyon and Elisseef 2006).
3. Non-linear expansions
Although in most cases we want to reduce the dimensionality of data, in some
cases it might make sense to increase it. This can be useful for complex
problems, where first-order interactions are not sufficient for accurate results.
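The normalization and standardization transformations described above can be sketched in plain Python as follows:

```python
import math

def normalize(x):
    """Scale a vector to unit length: x' = x / ||x|| (L2 norm)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def standardize(column):
    """Center a feature to zero mean and unit standard deviation:
    x'_i = (x_i - mu) / sigma."""
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / len(column))
    return [(v - mu) / sigma for v in column]

v = normalize([3.0, 4.0])
assert abs(math.sqrt(sum(x * x for x in v)) - 1.0) < 1e-9   # unit length

sizes = standardize([80.0, 100.0, 120.0])
assert abs(sum(sizes)) < 1e-9   # zero mean after standardization
```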
3.1.2 Supervised and Unsupervised Learning
So far we have discussed machine learning concepts from the point of view where
we have initial labeled data on which the model can be trained. However, this is
not always the case. Here we want to introduce the two machine learning
approaches: supervised and unsupervised learning.
1. Regression
Predict the value based on previous observations, i.e. values of the samples
from the training set. Usually, we can say that if the output is a real number/is
continuous, then it is a regression problem.
2. Classification
Based on the set of labeled data, where each label defines a class, that the
sample belongs to, we want to predict the class for the previously unknown
sample. The set of possible outputs is finite and usually small. Generally, we
can say that if the output is a discrete/categorical variable, then it is a
classification problem.
3. Clustering
Find the hidden patterns in the unlabeled data and separate it into clusters
according to similarity. An example can be the discovery of different customer
groups inside the customer base of the online shop.
3.2 Classification methods
When the training data is labeled with the class to which each sample
belongs, it is easier to identify the proper class, and the result is more accurate
than with clusterization algorithms. In this section, theoretical background is
given on all the classification methods used in this project.
3.2.1 K-Nearest Neighbors
K-Nearest Neighbors (KNN) is one of the simplest, yet accurate, machine learning
algorithms. KNN is a non-parametric algorithm, meaning that it does not make any
assumptions about the data structure. In real-world problems, data rarely obeys
general theoretical assumptions, making non-parametric algorithms a good solution
for such problems. The KNN model representation is as simple as the dataset itself:
there is no learning required, as the entire training set is stored.
KNN can be used for both classification and regression problems. In both problems,
the prediction is based on the k training instances that are closest to the input instance.
In the KNN classification problem, the output would be a class, to which the input
instance belongs, predicted by the majority vote of the k closest neighbors. In the
regression problem, the output would be the property value, which is generally a mean
value of the k nearest neighbors. The schematic example is outlined in Figure 2.
Different distance measurement methods are used for finding the closest neighbors.
The popular ones include the Hamming distance, the Manhattan distance, and the
Minkowski distance, defined as:
D(x, y) = (Σ_{i=1..n} |x_i − y_i|^p)^(1/p)
The most used method for continuous variables is generally the Euclidean distance
(the Minkowski distance with p = 2), defined by the formula below:
D(x, y) = √(Σ_{i=1..n} (x_i − y_i)^2)
The Euclidean distance is good for problems where the features are of the same
type. For features of different types, it is advised to use, for example, the
Manhattan distance.
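The distance measures discussed above can be expressed as short illustrative implementations:

```python
def euclidean(a, b):
    """D(a, b) = sqrt(sum (a_i - b_i)^2) -- suited to features of the same type."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """D(a, b) = sum |a_i - b_i| -- often advised for mixed feature types."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

assert euclidean((0, 0), (3, 4)) == 5.0
assert manhattan((0, 0), (3, 4)) == 7
assert abs(minkowski((0, 0), (3, 4), 2) - 5.0) < 1e-9
```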
For classification problems, the output can also be presented as a set of
probabilities of an instance belonging to each class. For example, for binary
problems, the probabilities can be calculated as P(0) = N_0 / (N_0 + N_1), where
P(0) is the probability of class 0 membership and N_0, N_1 are the numbers of
neighbors belonging to classes 0 and 1 respectively. (Thirumuruganathan 2010).
The value of k plays a crucial role in the prediction accuracy of the algorithm.
However, selecting the k value is a non-trivial task. Smaller values of k will most
likely result in lower accuracy, especially on datasets with much noise, since every
instance of the training set then carries a higher weight in the decision process.
Larger values of k lower the performance of the algorithm. In addition, if the value
is too high, the model over-smooths the data, making the class boundaries less
distinct and again resulting in lower accuracy. As a general approach, it is advised
to select k using the formula below:
k = √n [5]
A drawback of the KNN algorithm is its bad performance on unevenly distributed
datasets. If one class vastly dominates the others, its instances are more likely to
win the majority vote due to their large number, and, therefore, to produce
incorrect predictions. (Laaksonen and Oja 1996).
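A minimal KNN classifier following the description above might look like this (the two-dimensional toy samples and class names are invented for illustration):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    def dist(a, b):   # Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "malware"), ((1.2, 0.8), "malware"),
         ((8.0, 9.0), "benign"), ((9.0, 8.5), "benign"), ((8.5, 9.5), "benign")]
assert knn_predict(train, (1.1, 0.9), k=3) == "malware"
assert knn_predict(train, (8.7, 9.0), k=3) == "benign"
```

Note that with k=3 the two "malware" points outvote the single nearest "benign" neighbor for the first query, illustrating the majority-vote rule.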
3.2.2 Support Vector Machines
A Support Vector Machine (SVM) separates the classes by finding a hyperplane
between them. Intuitively, we understand that the further from the hyperplane our
classes lie, the more accurate predictions we can make. That is why, although
multiple hyperplanes can be found for a problem, the goal of the SVM algorithm is
to find the hyperplane that results in the maximum margins.
In Figure 3, there is a dataset of two classes. Therefore, the problem lies in a
two-dimensional space, and the hyperplane is represented as a line. In general, a
hyperplane can take as many dimensions as needed.
1. We define X and Y as the input and output sets respectively. (𝑥1, 𝑦1),
…,(𝑥𝑚, 𝑦𝑚) is the training set.
2. Given x, we want to be able to predict y. We can refer to this problem as
learning the classifier y = f(x, a), where a are the parameters of the
classification function.
3. f(x, a) can be learned by minimizing the training error of the function on the
training data. Here, l is the loss function, and R_emp is referred to as the
empirical risk:
R_emp(a) = (1/m) Σ_{i=1..m} l(f(x_i, a), y_i) = Training Error [6]
4. We are aiming at minimizing the overall risk, too. Here, P(x,y) is the joint
distribution function of x and y.
SVMs are generally able to achieve good accuracy, especially on ”clean” datasets.
Moreover, they work well with high-dimensional datasets, even when the number of
dimensions is higher than the number of samples. However, for large datasets with
a lot of noise or overlapping classes, they can be less effective. Also, with larger
datasets the training time can be high. (Jing and Zhang 2010).
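A full SVM solver is beyond a short sketch, but the empirical risk from equation [6] that training minimizes can be illustrated as follows (the linear decision rule, its weights, and the sample points are hypothetical values chosen for the example):

```python
def zero_one_loss(prediction, truth):
    """l(f(x, a), y): 1 for a misclassification, 0 otherwise."""
    return 0 if prediction == truth else 1

def empirical_risk(classifier, samples):
    """R_emp(a) = (1/m) * sum_i l(f(x_i, a), y_i) -- the training error."""
    m = len(samples)
    return sum(zero_one_loss(classifier(x), y) for x, y in samples) / m

# Hypothetical linear decision rule: the sign of (w . x + b) picks the class.
w, b = (1.0, 1.0), -10.0
classify = lambda x: 1 if (w[0] * x[0] + w[1] * x[1] + b) >= 0 else -1

samples = [((1, 1), -1), ((2, 3), -1), ((8, 8), 1), ((9, 4), 1), ((3, 3), 1)]
assert empirical_risk(classify, samples) == 0.2   # one of five points is misclassified
```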
3.2.3 Naive Bayes
Naive Bayes is a classification machine learning algorithm that relies on Bayes'
Theorem. It can be used for both binary and multi-class classification problems.
Its main point is the idea of treating each feature independently. The Naive Bayes
method evaluates the probability of each feature independently, regardless of any
correlations, and makes the prediction based on Bayes' Theorem. That is why the
method is called ”naive”: in real-world problems features often have some level of
correlation with each other.
To understand the algorithm of Naive Bayes, the concepts of class probabilities and
conditional probabilities should be introduced first.
P(C) = count(instances in C) / count(all instances) [9]
P(A|B) = P(B|A) P(A) / P(B) [11]
4. The probabilities of the item belonging to each class are compared, and the
class with the highest probability is selected as the result.
The advantages of this method include its simplicity and ease of understanding.
In addition, it performs well on data sets with irrelevant features, since the
probabilities of those features contributing to the output are low, and therefore
they are not taken into account when making predictions. Moreover, this algorithm
usually performs well in terms of consumed resources, since it only needs to
calculate the probabilities of the features and classes; there is no need to fit
any coefficients as in other algorithms. As already mentioned, its main drawback is
that each feature is treated independently, although in most cases this cannot be true.
(Bishop 2006).
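A minimal categorical Naive Bayes classifier following the description above can be sketched as follows (the behavioral features and samples are invented for illustration; no smoothing of zero counts is applied, which a practical implementation would add):

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Estimate class priors P(C) and per-feature conditionals P(x_i | C)
    by counting, treating every feature independently."""
    class_counts = Counter(label for _, label in samples)
    cond = defaultdict(Counter)   # (class, feature index) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            cond[(label, i)][value] += 1
    return class_counts, cond, len(samples)

def predict(model, features):
    """Pick the class with the highest P(C) * prod_i P(x_i | C)."""
    class_counts, cond, total = model
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total                                 # prior P(C)
        for i, value in enumerate(features):
            p *= cond[(label, i)][value] / n          # P(x_i | C), independence assumed
        if p > best_p:
            best, best_p = label, p
    return best

# Toy data: (uses_packer, writes_registry) -> class
data = [(("yes", "yes"), "malware"), (("yes", "yes"), "malware"),
        (("yes", "no"), "malware"), (("no", "no"), "clean"),
        (("no", "no"), "clean"), (("no", "yes"), "clean")]
model = train_naive_bayes(data)
assert predict(model, ("yes", "yes")) == "malware"
assert predict(model, ("no", "no")) == "clean"
```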
3.2.4 J48 Decision Tree
As the name implies, decision trees are data structures that have the structure of
a tree. The training dataset is used to build the tree, which is subsequently used
for making predictions on the test data. In this algorithm, the goal is to achieve
the most accurate result with the least number of decisions that must be made.
Decision trees can be used for both classification and regression problems. An
example can be seen in Table 1:
As can be seen in Figure 4, the model was trained on the dataset and can now classify the tennis playing decision as "yes" or "no". Here, the tree consists of decision nodes and leaf nodes. Decision nodes have several branches leading to leaf nodes, and leaf nodes represent the decisions or classifications. The topmost initial node is referred to as the root node.
The most common algorithm for decision trees is ID3 (Iterative Dichotomiser 3). It relies on the concepts of Entropy and Information Gain. Entropy here refers to the level of uncertainty in the data. For example, the entropy of a fair coin toss is maximal, since there is no way to be sure of the result. Conversely, a toss of a coin with heads on both sides would result in zero entropy, since we can predict the outcome with 100% probability before each toss. (Mitchell 1997).
In simple words, the ID3 algorithm can be described as follows: starting from the root node, at each stage we want to partition the data into homogeneous (similar in their structure) datasets. More specifically, we want to find the attribute that would result in the highest information gain, i.e. return the most homogeneous branches (Swain and Hauska 1977):
1. Calculate the entropy of the dataset before the split.
2. Split the dataset on an attribute and calculate the entropy of each branch. Then calculate the information gain of the split, that is, the difference between the initial entropy and the proportional sum of the entropies of the branches.
3. The attribute with the highest Gain value is selected as the decision node.
4. If one of the branches of the selected decision node has an entropy of 0, it
becomes the leaf node. Other branches require further splitting.
5. The algorithm is run recursively until there is nothing to split anymore.
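The entropy and information gain computations behind these steps can be sketched as follows (a minimal Python illustration of the definitions, not the J48/R implementation used later in the study):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent_labels, branches):
    """Initial entropy minus the proportional (size-weighted) sum of branch entropies."""
    total = len(parent_labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent_labels) - weighted
```

A fair coin toss gives entropy 1.0 bit, while a two-headed coin gives 0.0, matching the examples above; ID3 selects the attribute whose split maximizes `information_gain`.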
J48 is an open-source implementation of the C4.5 algorithm, a successor of ID3, that is included in one of the R packages, and this is the implementation we are going to use in our study.
The decision tree method owes its popularity to its simplicity. It deals well with large datasets and handles noise in the data very well. Another advantage is that, unlike algorithms such as SVM or KNN, decision trees operate as a "white box", meaning that we can clearly see how the outcome is obtained and which decisions led to it. These properties made it a popular solution for medical diagnosis, spam filtering, security screening and other fields. (Mitchell 1997).
Random Forest is one of the most popular machine learning algorithms. It requires almost no data preparation or modeling effort but usually produces accurate results. Random Forests are based on the decision trees described in the previous section. More specifically, a Random Forest is a collection of decision trees that together produce better prediction accuracy. That is why it is called a 'forest' – it is basically a set of decision trees.
The basic idea is to grow multiple decision trees based on the independent subsets of
the dataset. At each node, n variables out of the feature set are selected randomly, and
the best split on these variables is found.
In simple words, the algorithm can be described as follows (Biau 2013):
1. Multiple trees are built, each on roughly two thirds of the training data (about 63.2%). The data is chosen randomly.
2. Several predictor variables are randomly selected out of all the predictor variables. Then, the best split on these selected variables is used to split the node. By default, the number of selected variables is the square root of the total number of predictors for classification, and it is constant for all trees.
3. Using the rest of the data, the misclassification rate is calculated. The total error rate is calculated as the overall out-of-bag error rate.
4. Each trained tree gives its own classification result, casting its own "vote". The class that receives the most "votes" is chosen as the result.
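Steps 1 and 4 can be sketched in a few lines (a simplified illustration of bootstrap sampling and majority voting; the per-tree training itself is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: sample with replacement; each tree sees roughly two thirds
    of the distinct training rows on average."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_votes):
    """Step 4: each tree casts a "vote"; the class with the most votes wins."""
    return Counter(tree_votes).most_common(1)[0][0]
```

The rows left out of a tree's bootstrap sample are the "out-of-bag" data used in step 3 to estimate the error rate.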
The scheme of the algorithm is seen in Figure 5.
As with decision trees, this algorithm removes the need for feature selection for removing irrelevant features – they will not be taken into account in any case. The only need for feature selection with random forest algorithms arises when there is a need for dimensionality reduction. Moreover, the out-of-bag error rate mentioned earlier can be considered the algorithm's own cross-validation method. This removes the need for tedious cross-validation measures that would otherwise have to be taken. (Mitchell 1997).
Random forests inherit many of the advantages of decision tree algorithms. They are applicable to both regression and classification problems; they are easy to compute and quick to fit. They also usually result in better accuracy. However, unlike decision trees, the results are not very easy to interpret. In decision trees, by examining the resulting tree, we can gain valuable information about which variables are important and how they affect the result. This is not possible with random forests. Random forests can also be described as more stable than decision trees – if we modify the data a little, the decision tree will change, most likely reducing the accuracy. This will not happen with random forests: since a random forest is the combination of many decision trees, it will remain stable. (Louppe 2014).
3.3 Cross-validation
The drawback of the accuracy evaluation methods built into the machine learning methods themselves is that they cannot predict how the model will perform on new data. The approach to overcoming this drawback relies on cross-validation. The idea is to split the initial dataset: the model is trained on the bigger part of the dataset and then tested on the smaller part. There are three different classes of cross-validation:
1. Holdout method – here, the dataset is separated into two parts: a training set and a test set. The model is fit on the training set and then tested on the test set, which it has not seen before. The resulting errors are used to compute the mean absolute test error, which is used for model evaluation. The advantage of this method is its high speed. On the other hand, the evaluation result depends highly on how the test set was selected, since the variance is usually high; therefore, the evaluation result can differ significantly between different test sets.
2. The k-fold method can be seen as an improvement over the holdout method. Here, k subsets are selected, and the holdout method is repeated k times, where each time one of the k subsets is used as the test set, and the remaining k-1 subsets together form the training set. The average error is then computed over all k runs of the holdout method. As k increases, the variance is reduced, ensuring that the accuracy will not change much with different datasets. The disadvantage is the complexity and the running time, which is higher compared to the holdout method.
3. The leave-one-out method is the extreme case of the k-fold method, where k equals the number of samples. On each run of the holdout method, the model is trained on all the data points except one, and that one point is subsequently used for testing. The variance, in this case, is as small as possible. The computing complexity, on the other hand, is high. (Schneider 1997).
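The k-fold scheme described above can be sketched as follows (a minimal illustration over sample indices; leave-one-out corresponds to k equal to the number of samples):

```python
def k_fold_splits(indices, k):
    """Yield (train, test) index lists: each fold serves as the test set once,
    and the remaining k-1 folds together form the training set."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The model's error is computed on each of the k test folds and averaged over all runs.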
This chapter provided the background on machine learning that is essential for understanding the practical implementation of the project, which is described in the next chapter. The concepts of the feature set, feature extraction, and selection methods were discussed, along with the machine learning algorithms that will be used in the practical part. The chosen algorithms are K-Nearest Neighbours, Support Vector Machines, Decision Trees, Random Forests and Naive Bayes.
4 PRACTICAL PART
As a reminder, the goal of the project is to determine the most suitable feature representation and extraction methods, to find the most accurate algorithm that can distinguish the malware families with the lowest error rate, and to see how this accuracy relates to the accuracy of the current scoring system. This chapter discusses the practical aspects of the project implementation. This includes data gathering, a description of the malware families that make up the dataset, selection of the features that will be used for the algorithms, finding the optimal feature representation and evaluation methods, and the implementation process.
4.1 Data
For this project, a total of 2 140 files were collected. For most of them, hashes, which uniquely identify files, were found in incident reports or malware reverse engineering reports, and these hashes were subsequently used to get the corresponding samples from the VirusTotal service with the help of external malware researchers (VirusTotal 2017). To be able to operate with a diverse dataset, nine malware families were used, resulting in 1 156 malicious files and 984 benign files. The malicious families used are Dridex, Locky, TeslaCrypt, Vawtrak, Zeus, DarkComet, CyberGate, Xtreme and CTB-Locker. They are discussed in detail further in this chapter. Benign files were mainly software installers of the .exe format, but also included several files of .pdf, .docx and other formats, as these are often used as malware spreading vectors. To achieve the most meaningful and up-to-date results, only malware that has appeared in the last two years is used.
4.1.1 Dridex
The first malware family, with a total of 172 unique files, is Dridex. This malware belongs to the Trojan class; specifically, it is a banking trojan. It caused a huge wave of infections in 2015, resulting in 3 000 - 5 000 infections per month.
Dridex is derived from Cridex, malware that spread in 2012. Cridex was also a banking credentials stealer, but more specifically it was a worm that utilized attached storage devices as a spreading vector. In 2014 a renewed version appeared, switching from command and control communications to peer-to-peer and therefore becoming more resilient to takedown operations.
The Dridex attack targeted users of specific banks, aiming to steal their credentials during banking sessions. It is said to target over 300 institutions in 40 regions, mostly focusing on English-speaking countries with high income rates: most infections happened in the United States, the United Kingdom, and Australia. (O'Brien 2016).
Most of the Dridex malware files were distributed during a massive-scale spam campaign, using real company names as the sender addresses but fake top-level domains matching the location of the targeted users. Most emails were either invoices or orders. The attackers behind Dridex showed a high level of attention to detail: emails with real company names also utilized real employee names and were sent during business hours.
Dridex has a modular architecture, allowing for the attackers to easily add additional
functionality. According to Symantec, there are the following modules (O’Brien
2016):
1. The loader module's only purpose is to install the main module. The loader finds one of the servers in its configuration and requests a binary and configuration data using an HTTPS request.
4.1.2 Locky
The second malware family, represented by 115 unique files, is Locky. This is ransomware that encrypts all data on the victim's system using the RSA-2048 and AES-256 ciphers and adds a .locky extension to it. Locky emerged in February 2016 and has been distributed aggressively since then. The most common distribution vectors are spam campaigns, specifically fake invoices, and phishing websites. These spam campaigns were extremely similar to the ones used to distribute Dridex in their size and their utilization of financial documents and macros, which suggests that the Dridex group is responsible for this malware. The price for decryption of system files varied from 0.5 to 1 bitcoin. (Symantec Security Response 2016). The operation scheme of Locky can be seen in Figure 7.
Upon delivery to the system, the macros embedded into a .docx or .xls file run and download the Locky malware. The malware file, in turn, copies itself into the %temp% folder with a random name and an .exe or .dll file format. A "Run" registry key with the value "Locky" is subsequently added to the registry, pointing to the .exe file in the %temp% folder. The initial file is deleted at this point. A new process is started after that, exploring the volume properties and deleting the shadow copies present on the volume. The recovery instructions and the public key are retrieved with a POST request from a command and control server. After that, all files on the system are encrypted, and the desktop background is changed to an image with the decryption instructions. (McAfee Labs 2016). An additional registry key is created, allowing the malware to run every time the system is started. Figure 8 shows the decryption instructions for Locky.
Figure 8. Recovery instructions of the Locky malware (Symantec Security Response 2016)
4.1.3 Teslacrypt
Teslacrypt is the third malware family, consisting of 115 files and belonging to the
ransomware class. Main distribution vector is compromised websites and emails with
links leading to malicious websites that download the malware once they are visited.
Upon download, the file is executed immediately. The operation scheme of Teslacrypt
can be found in Figure 9.
Payment for a decryption key is requested to be made via PayPal or Bitcoin (1 000 USD or 1.5 bitcoin). Unlike other ransomware families, TeslaCrypt encrypted not only obvious data files, such as .pdf, .doc and .jpg, but also game-related files, including those of Call of Duty, World of Tanks, Minecraft and World of Warcraft. Interestingly, in May 2016 the attackers behind TeslaCrypt announced that they had closed the project and released the master decryption key. Several days later, ESET released a free decryption tool. More details can be found in Figure 11.
Figure 11. Payment page of TeslaCrypt with the master decryption key (Mimoso 2016)
4.1.4 Vawtrak
The fourth malware family, consisting of 74 unique files, is Vawtrak. Also referred to as Neverquest or Snifula, Vawtrak is another example of a banking Trojan. Most infections happened in the Czech Republic, the USA, the UK, and Germany. Spreading vectors include malware downloaders, spam with malicious links and other drive-by downloads. After being downloaded, Vawtrak is capable of gaining access to the banking accounts of a victim, as well as stealing credentials, passwords, private keys, etc.
The operation process of this malware family is outlined in Figure 12. The execution of the initial file, downloaded to the drive, results in the installation of a dropper file into the %ProgramData% folder with a randomly created extension and filename. The initial file is deleted after that. (Křoustek 2015). This dropper file is a DLL that is responsible for unpacking the Vawtrak module and injecting it into the running processes. To do that, the DLL first decrypts the payload with the hardcoded key and decompresses itself, resulting in a new DLL, which replaces the initial one. This DLL, in turn, extracts the final module, which turns out to be a compressed version of two DLLs: 64-bit and 32-bit modifications. These DLLs are injected into the system processes and are responsible for Vawtrak's functionality.
4.1.5 Zeus
Zeus is the fifth malware family and is represented by 116 unique files. It is a botnet
package, which can be easily traded on the black market for around 700 USD. After
its appearance in 2007, Zeus has evolved and remains one of the most common botnet
malware representatives.
The summary of Zeus operation can be found in Figure 13. The infection vectors of Zeus vary dramatically, from spam emails to drive-by downloads. After the download, the malware injects itself into the sdra64.exe process and modifies the registry values so that it is executed upon system startup. After that, Zeus injects itself into the winlogon.exe process and terminates the initial executable. The code injected into winlogon.exe injects additional data into the svchost.exe process and creates two files: local.ds, which contains the up-to-date configuration, and user.ds, which contains data to be transmitted to the command and control server. (Falliere and Chien 2009).
The popularity of Zeus is related to the fact that it is relatively cheap and easy to use. Moreover, it comes as a ready-to-deploy package and as a result can be used by novices and script kiddies.
4.1.6 DarkComet
During the Syrian conflict in 2014, DarkComet was used by the Syrian government for espionage on Syrian citizens who were bypassing the government's censorship of the Internet. In 2015, the "Je Suis Charlie" slogan was used to trick people into downloading DarkComet: it was disguised as a picture, which compromised users once downloaded.
Like most RATs, DarkComet includes two components: the client and the server. However, they have the reverse meaning from the perspective of the attacker: the 'server' is the machine with the malware, and the 'client' is the attacker. DarkComet relies on a remote-connection architecture: once it executes, the server connects to the client, which has a GUI allowing it to control the server (Kujawa 2012). The functionality of DarkComet is broad, including, but not limited to (Kujawa 2012):
Webcam and sound capture
Keylogging
Power off/Shutdown/Restart
Remote Desktop functionality
Active ports discovery
LAN computers discovery
URL download
The communication between the server and the client is outlined in Figure 14.
4.1.7 CyberGate
The operation of the CyberGate is guided by the attacker, and the communication
happens with a client-server model. Again, here the attacker is referred to as a client
and the infected machine is a server. The communication happens in a way similar to
the one outlined in Figure 14.
In addition to that, there are plenty of the tutorials that can be found on the Internet,
allowing people with a limited set of skills to take advantage of this RAT for
malicious purposes. (Aziz 2014).
4.1.8 Xtreme
Another example of a RAT is Xtreme. Developed in Delphi, it is available for free and shares its source code with several other Delphi RAT malware families, including CyberGate. Xtreme was used in several governmental attacks, as well as several attacks targeting Israel and Palestine. The architecture of Xtreme relies on the client-server model, where the attacker is considered to be the client. The configurations are written to the %APPDATA%\Microsoft\Windows folder or the folder named after the created mutex. The data is subsequently encrypted using RC4 with "CONFIG" or "CYBERGATEPASS" as the password. The configurations are stored in a file with the ".ngo" or ".cfg" extension. The configuration data includes the name of the installed file, the injection process, FTP and CnC information, and the mutex name. (Villeneuve and Bennett 2014). The communication between the infected machine and the attacker happens in a way similar to that of DarkComet, which is outlined in Figure 14.
The functionality of Xtreme allows the attacker to (Villeneuve and Bennett 2014):
Read and modify the registry
Interact via the remote shell
Capture the desktop
Capture data from connected devices, such as a microphone, webcam, etc.
Manipulate running processes
Upload and download files
4.1.9 CTB-Locker
The last malware family used was CTB-Locker, represented by 79 unique files. This is another example of ransomware, which encrypts the user's files and asks for money for the decryption key. CTB is an acronym for Curve Tor Bitcoin, referring to the elliptic curve algorithm that was used for encryption.
The propagation of the CTB-Locker samples happened through emails with malicious attachments. The attachments were .zip files with a downloader inside. The initial operation of CTB-Locker is outlined in Figure 15. Upon execution, the malware drops itself into the %temp% folder with a random name and injects itself into the svchost.exe process. Moreover, a mutex with a random name is created, ensuring that there is only one instance of CTB-Locker running on the machine.
The study is based on and targeted at Cuckoo Sandbox. To apply machine learning algorithms to any problem, it is essential to represent the data in some form. For this purpose, Cuckoo Sandbox was used. The reports generated by the sandbox, describing the behavioral data of each sample, were preprocessed, and malware features were extracted from them. However, it is important to understand the functionality of the sandbox and the structure of the reports first.
Cuckoo Sandbox is an open-source malware analysis tool that produces a detailed behavioral report of any file or URL in a matter of seconds. According to the Cuckoo Foundation (2015), the currently supported file formats include:
The main components of Cuckoo’s infrastructure are a host machine (the management
software) and a number of guest machines (virtual or physical machines for analysis).
Its operation scenario is quite straightforward: as soon as the new file is submitted to
the server, a virtual environment is dynamically allocated for it, the file is executed,
and all the actions performed in the system are recorded.
As shown in Figure 17, the sandbox generates the report which outlines all the
behavior of the file in the system. The report is represented as a JSON file, and
currently, it is capable of detecting the following features (Cuckoo Foundation 2015):
After obtaining the behavior of the file, Cuckoo Sandbox makes a decision on the level of maliciousness of the file using pre-defined signatures. This functional part of the sandbox interests us only as a way to compare the performance of the machine learning methods to the currently implemented signature-based methods.
The Cuckoo analysis score is an indication of how malicious an analyzed file is. The
score is determined by measuring how many malicious actions are performed. Cuckoo
uses a set of summarized malicious actions, called signatures, to identify the malicious
behavior. Each of these signatures has its score, which indicates the severity of the
performed action.
In total, there are three levels of severity, each with its own score: 1 for low, 2 for medium, and 3 for high. An example of a low severity signature is performing a query on the computer name. An example of a medium severity signature is the creation of an executable file. An example of a high severity signature is the removal of a shadow copy.
During analysis, all actions are stored to be processed afterwards. In the end, multiple modules, including the signatures module, are used to examine the stored actions. The signatures module examines all the collected data and finds patterns that match a signature. If a signature matches, a counter is incremented by the severity score of that signature (1, 2, or 3). When all signatures have been processed, the value of the counter is divided by 5.0 to create a floating-point score. This score is the Cuckoo analysis score. Examples of signatures of different severity can be found in Figure 18.
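The scoring procedure just described reduces to a few lines of arithmetic (an illustrative sketch, not Cuckoo's actual code; the severity levels follow the ones listed above):

```python
# Severity scores as described in the text: 1 for low, 2 for medium, 3 for high.
SEVERITY = {"low": 1, "medium": 2, "high": 3}

def cuckoo_score(matched_severities):
    """Sum the severity scores of all matched signatures, then divide by 5.0."""
    counter = sum(SEVERITY[s] for s in matched_severities)
    return counter / 5.0
```

For example, a sample matching one signature of each level accumulates 1 + 2 + 3 = 6, giving a score of 1.2.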
The average scores of the malware families used in this project are outlined in Table
2. The color indicates the maliciousness level corresponding to the score.
It is hard to measure the accuracy of the detection, since there is no threshold value indicating whether a sample is malicious or not. Moreover, determining the specific class to which malware belongs is beyond the functionality of the sandbox. In the graphical user interface, there are indicators of green, yellow and red colors, outlined in Figures 18 - 19, indicating how trustworthy the file is. The green indicator is used for samples with a score of 4 and lower, yellow for scores between 4 and 7, and red for scores between 7 and 10. However, this feature is only part of the interface and is not very reliable, as it is still in the alpha state. Moreover, it has some bugs, as outlined in Figure 20.
To apply machine learning algorithms to the problem, we need to figure out what kind
of data should be extracted and how it should be presented.
Some works in the field utilize string properties or file format properties as a basis for feature representation. For example, for Windows-based malware samples, the data contained in PE headers is often used as a basis for analysis. However, implementing format-specific feature extraction is not the best solution, since the formats of the analyzed files can vary dramatically. (Hung 2011).
Other works rely on so-called n-grams. Byte n-grams are overlapping substrings collected in a sliding-window fashion, where a window of fixed size slides one byte at a time. Word n-grams and character n-grams are widely used in natural language processing, information retrieval, and text mining. (Reddy and Pujari 2006).
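A byte n-gram extractor of this kind is only a few lines of Python (a generic sketch of the sliding-window idea, not tied to any particular cited work):

```python
def byte_ngrams(data, n):
    """Collect overlapping n-byte substrings in sliding-window fashion,
    moving the fixed-size window one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]
```

Applied to the first bytes of a binary, each position yields one n-gram, so a file of length L produces L - n + 1 overlapping n-grams.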
Files
Registry keys
Mutexes
Processes
IP addresses and DNS queries
API calls
This section discusses which of the above-mentioned features should be used in our
work.
Files
The reports contain information about opened, written, and created files. This kind of information is good for predicting the malware family, since any malware file triggers many modifications to the file system. It can be used for quite accurate malware classification in most cases. However, in the case of ransomware, for example, relying solely on the file modifications might result in the algorithm not being able to distinguish different families. This is because ransomware encrypts every file on the system; therefore, the feature set consists mostly of encrypted files. The differences between ransomware families would be defined by the files with malware settings, the number of which is vastly lower than the whole feature set, and, therefore, it would be very hard to make predictions based on this data.
Registry keys
On Windows systems, the registry stores the low-level system settings of the operating system and its applications. Any sample that is run on the system triggers a large number of registry changes – the Cuckoo reports can outline the registry keys opened, read, written, and deleted. The information on registry modifications can be a good source of information on the system changes caused by malware and can be used for malware detection.
Mutexes
Mutex stands for Mutual Exclusion. This is a program object that allows multiple threads to share the same resource. Every time a program is started, a mutex with a unique name is created. Mutex names can be good identifiers of specific malware samples. However, for families they cannot produce accurate results on a large scale, since the number of mutexes created per sample is dramatically lower than the size of the dataset. That is why a small change related to a bug or a non-started process would result in a dramatic change of the prediction results.
Processes
A common identifier of a specific malware sample is the name of the created process. However, it can rarely be used for identification of the malware family, since in common cases the process names are the same as the hash of the sample. As an alternative, the malware sample can inject itself into a system process. That is why this feature is poorly suited for family identification.
API calls
API stands for Application Programming Interface and refers to the set of tools that provide an interface for communication between different software components. API calls are recorded during the execution of the malware and refer to specific processes. They outline everything happening in the operating system, including the operations on the files, registry, mutexes, processes and other features mentioned earlier. For example, the API calls OpenFile, OpenFileEx, CreateFile, CopyFileEx, etc. define the file operations, while the calls OpenMutex, CreateMutex and CreateMutexEx describe the mutexes opened and created. API call traces present a wide description of the sample behavior, including all the properties mentioned above. In addition, they include a wide set of distinct values. Moreover, they are simple to describe in numeric format, and that is why they were chosen as features. Here, the feature set will be defined by the number of unique API calls and the return codes. The next section describes the representation in more detail.
Having familiarized ourselves with the features presented in the Cuckoo Sandbox reports, we can now think about the way to represent the features for the machine learning algorithms. Since the feature set, containing the failed and successful APIs as well as the return codes, is quite large, we have to find a way to present it in a clear, compact and non-redundant way. The representations considered for this task are the binary and frequency matrices, discussed in detail in the following sections.
The binary representation is the simplest and most straightforward way to represent the features of the failed and successful API calls. Here, a matrix is created where the rows represent the samples and the columns represent the API calls. A value of 0 represents the 'failed' state of an API call, and a value of 1 represents a successful API call.
Although this approach is simple and straightforward, it takes into account neither the return codes generated nor the number of times a certain API call was triggered, resulting in lower accuracy. (Pirscoveanu 2015).
Here, the horizontal axis represents the samples and the vertical axis represents the API calls, where each number represents the number of times the API call was triggered. This approach clearly provides more details than the binary representation, resulting in better accuracy. (Pirscoveanu 2015).
To utilize the maximum amount of useful data present in the API call information, the best approach is to combine the features of the previous representation methods. The resulting matrix outlines the frequencies of failed APIs, successful APIs, and return codes.
Here the rows represent the samples; the columns 𝑃𝑎𝑠𝑠1…𝑃𝑎𝑠𝑠𝑛 represent the number of times each API call in [𝑃𝑎𝑠𝑠1; 𝑃𝑎𝑠𝑠𝑛] was called, where n is the total number of API calls triggered. Similarly, the columns 𝐹𝑎𝑖𝑙1…𝐹𝑎𝑖𝑙𝑛 represent the number of times each API call failed, and the columns 𝑅𝑒𝑡𝐶1…𝑅𝑒𝑡𝐶𝑛 represent the number of times each return code was returned. (Pirscoveanu 2015).
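Building such a combined matrix can be sketched as follows. The API names, return codes, and call records used here are hypothetical placeholders for illustration; the real features come from the Cuckoo reports:

```python
def combined_matrix(reports, api_names, ret_codes):
    """One row per sample: [pass counts per API | fail counts per API | counts per
    return code]. Each report is a list of (api_name, succeeded, return_code)
    call records extracted from a sandbox report."""
    rows = []
    for calls in reports:
        passed = {a: 0 for a in api_names}
        failed = {a: 0 for a in api_names}
        codes = {c: 0 for c in ret_codes}
        for api, ok, code in calls:
            (passed if ok else failed)[api] += 1
            codes[code] += 1
        rows.append([passed[a] for a in api_names]
                    + [failed[a] for a in api_names]
                    + [codes[c] for c in ret_codes])
    return rows
```

Concatenating the pass, fail, and return-code blocks is what more than doubles the feature count compared with the frequency matrix alone.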
This approach results in fair performance, and that is why it was chosen for our problem. Obviously, the use of the combination method resulted in a dramatic increase in the number of features, since they are now represented by the combination of passed APIs, failed APIs and return codes, instead of relying solely on the APIs triggered. Since the feature set became more than twice as big, some feature selection should be performed.
The goal of feature selection is to remove unimportant features from the feature set when it gets too big. Bigger feature sets are harder to operate with, and some features in the set might not carry any weight in the decision of the algorithm and, therefore, can be removed. For example, in our case some API call might be triggered only once, in one sample. In the case of a wide and varied feature set, this unique API call will not play any role in the algorithm and, therefore, removing it will not affect the accuracy in any way.
After extracting the features and representing them as a combination matrix, we ended up with 70 518 features. This number is too large for processing and accurate predictions. For example, with such a large feature set it takes approximately two to three hours to load the dataset, preprocess it and run the k-nearest neighbors algorithm on an x64 machine with 8 GB of RAM. This resource consumption is unacceptable, and there is a need to remove irrelevant features.
Three general classes of feature selection methods are filtering methods, wrapper
methods, and embedded methods (Guyon and Elisseef 2006).
Filter methods
Filter methods statistically score the features. Features with high scores
are kept in the dataset, while features with low scores are removed.
Wrapper methods
Here, different feature combinations are tried with a prediction model, and
the combination that leads to the highest accuracy is chosen.
Embedded methods
These methods evaluate the features used while the model is being created.
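A filter method of the simplest kind can be sketched as follows; this variance-based scorer is an illustrative assumption, not the scoring actually used in the thesis:

```python
def filter_select(X, k):
    """Simple filter method: score each feature column by its variance
    and keep the indices of the k highest-scoring features."""
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        scores.append(sum((v - mean) ** 2 for v in col) / n)
    # Sort feature indices by score, highest first, and keep k of them.
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

X = [[1, 0, 5], [1, 0, 9], [1, 1, 2]]  # column 0 is constant
kept = filter_select(X, 2)             # the constant column is dropped
```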
4.5 Implementation
During this step, the research plan designed earlier is put into
practice.
To get the malware behavioral reports and to ensure that malware runs correctly,
including all of its functionality, it is important to configure Cuckoo Sandbox. In the
real world different malware samples exploit different vulnerabilities that might be
part of certain software products. Therefore, it is important to include a broad range of
services in the virtual machines created by the sandbox.
The hypervisor used for Cuckoo's virtual machines is VirtualBox. The virtual
machines are created using VMcloak, an automated virtual machine generation
and cloaking tool for Cuckoo Sandbox (Bremer 2015).
As discussed in the previous section, the chosen feature representation method is the
combining matrix that includes successful APIs, failed APIs and their return codes.
This data is extracted from the reports generated by the sandbox.
A threshold for the minimum number of API calls can be specified in the algorithm, e.g. all reports that
triggered fewer than five API calls can be skipped. The file includes the timestamp of
the extraction, and the logs outlining the successful and unsuccessful operations are
stored in a separate file.
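Assuming the usual Cuckoo report layout (behavior → processes → calls; the structure used here is an assumption about the report format), the minimum-API-call threshold could be checked like this:

```python
def enough_calls(report, min_calls=5):
    """Return True when a Cuckoo behavioural report triggered at least
    `min_calls` API calls across all monitored processes."""
    calls = sum(len(p.get("calls", []))
                for p in report.get("behavior", {}).get("processes", []))
    return calls >= min_calls

# A toy report with one process that made three API calls:
report = {"behavior": {"processes": [{"calls": [{}, {}, {}]}]}}
```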
As described in the previous chapter, feature selection is used for removing redundant
and irrelevant features to improve the accuracy of the prediction. In our case, the
feature set is extremely large, and the need for feature selection is, therefore, high.
The R language will be used for performing the feature selection and applying the
machine learning methods. R is a free software environment for statistical computing
and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows,
and MacOS. (Venables and Smith 2016).
A good and simple algorithm for feature selection in classification problems is the
Boruta package. Roughly speaking, it is a wrapper method that works around the
Random Forest algorithm. Its algorithm can be described as follows (Kursa and
Rudnicki 2010):
1. Create shuffled copies of all features (to add more randomness). These are
referred to as shadow copies.
2. Train a Random Forest classifier on the new dataset and apply a feature
importance measure in the form of the Mean Decrease Accuracy algorithm.
The importance of each feature is measured at this stage, and the weights are
assigned.
3. On each iteration check if the feature from the initial feature set has a higher
weight than the highest weight of this feature’s shadow copy. Remove the
features that are ranked as unimportant at each iteration.
Unlike other feature selection methods, Boruta allows identifying all features that are
somehow relevant to the result. Other methods, in turn, rely on a small feature subset
that results in the minimal error. (Kursa and Rudnicki 2010).
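The three steps above can be sketched in Python. Two hedges apply: real Boruta repeats this over many iterations with a statistical test, and it uses mean-decrease-accuracy importance, whereas this single-iteration sketch uses scikit-learn's impurity-based `feature_importances_` as a stand-in:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_select(X, y, seed=0):
    """One Boruta-style iteration: append a shuffled 'shadow' copy of
    every feature, fit a Random Forest, and keep the real features whose
    importance beats the best shadow importance."""
    rng = np.random.default_rng(seed)
    shadows = rng.permuted(X, axis=0)  # shuffle each column independently
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(np.hstack([X, shadows]), y)
    imp = rf.feature_importances_
    n = X.shape[1]
    threshold = imp[n:].max()          # importance of the best shadow
    return [j for j in range(n) if imp[j] > threshold]

# Synthetic demo: feature 0 carries the class signal, feature 1 is noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([np.where(y == 1, 1.0, -1.0) + rng.normal(0, 0.1, 100),
                     rng.normal(0, 1, 100)])
kept = shadow_select(X, y)
```

Shuffling destroys any relationship with the labels, so a shadow's importance estimates the importance a feature can reach by chance alone.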
The problem arises when we start implementing the feature selection. With 70,518
features, the Boruta package exhausts the available memory and is not able to run.
Therefore, we need to divide the dataset randomly into subsets that fit into memory
and run the feature selection on each of them. Then we collect all the features that
were ranked as relevant, merge the subsets, and leave out all the unimportant
features. The next step is to run the feature selection again on the whole dataset.
After running the feature selection algorithm, we ended up with 306 features. The
effect of this change was evaluated based on the KNN accuracy with the given
feature set. KNN was chosen for this check, as it is the only algorithm that can
process the whole feature set: it stores nothing other than the dataset itself and does
not build a model, unlike the other algorithms. After removing the irrelevant
features, the detection accuracy based on KNN improved by approximately 1%, and
a prediction took approximately three seconds.
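The divide-and-merge workaround described above can be sketched generically; here `select` stands for any feature-selection routine (Boruta in the thesis) that returns the indices of the columns to keep:

```python
def chunked_select(X, y, select, chunk_size):
    """Run a feature-selection routine on column chunks that fit into
    memory, then merge the surviving columns and select once more."""
    n_features = len(X[0])
    survivors = []
    for start in range(0, n_features, chunk_size):
        cols = list(range(start, min(start + chunk_size, n_features)))
        sub = [[row[j] for j in cols] for row in X]
        # Map chunk-local kept indices back to global column indices.
        survivors += [cols[j] for j in select(sub, y)]
    # Final selection pass over the merged survivors.
    merged = [[row[j] for j in survivors] for row in X]
    return [survivors[j] for j in select(merged, y)]

# Toy selector: keep any column that is not constant.
def keep_nonconstant(X, y):
    return [j for j in range(len(X[0])) if len({row[j] for row in X}) > 1]

X = [[0, 1, 0, 2], [0, 3, 0, 2], [0, 5, 0, 2]]
y = [0, 1, 0]
kept = chunked_select(X, y, keep_nonconstant, chunk_size=2)
```

Note that chunking is only an approximation: a feature that looks irrelevant within its chunk is discarded before the final pass ever sees it.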
Once the features are extracted and selected, we can apply the machine learning
methods to the obtained data. As discussed previously, the methods to be applied
are K-Nearest Neighbors, Support Vector Machines, the J48 Decision Tree, Naive
Bayes, and Random Forest. The general process is outlined in Figure
22.
This chapter discusses the results of the assessment of the implemented machine
learning methods. The detection accuracy is measured as the percentage of
correctly identified instances:

Accuracy = (correctly identified instances / total instances) * 100%
The results of the K-Nearest Neighbors method can be inferred from the cross-table in Figure 23.
The results outlined there should be read as follows: rows represent the actual
classes of the tested samples, while columns represent the predicted values. Therefore,
the cell in the 1st row and 1st column shows the number of correctly classified
instances of the 1st class. The cell in the 1st row and 2nd column shows the number
of 1st-class instances that were marked as 2nd class, and so on.
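That reading of the cross-table translates directly into overall and per-class accuracy; a small sketch with a hypothetical 3-class table:

```python
def accuracies(table):
    """Cross-table convention: rows = actual class, columns = predicted.
    Diagonal cells are correct predictions."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    overall = correct / total
    per_class = [row[i] / sum(row) for i, row in enumerate(table)]
    return overall, per_class

table = [[8, 1, 1],   # hypothetical counts, not the thesis's data
         [0, 9, 1],
         [2, 0, 8]]
overall, per_class = accuracies(table)
```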
As can be seen, the test set consists of 371 samples, and 1 sample had an error,
resulting in a "0" class. The classification accuracy can be seen in Table 3.
The total accuracy of the K-Nearest Neighbors depends on the k value. In our case,
different values were tested. They produced the following accuracy:
k=1: 87%
k=2: 84.63%
k=3: 81.3%
k=4: 80%
k=5: 80%
k=6: 80%
k=10: 77.8%
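A sweep like this can be reproduced in a few lines; the synthetic dataset below is a stand-in for the thesis's 306 selected API features (scikit-learn assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data roughly matching the test-set size used in the thesis.
X, y = make_classification(n_samples=371, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
# 2/3 of the data for training, as in the thesis.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=2 / 3,
                                          random_state=0)

results = {}
for k in (1, 2, 3, 4, 5, 6, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = knn.score(X_te, y_te)   # test-set accuracy for this k
best_k = max(results, key=results.get)
```

On the real data the thesis reports the best accuracy at k=1; on this synthetic stand-in the best k may of course differ.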
As can be seen, the best accuracy, 87%, was achieved with k=1. This is an unusual
case: the best accuracy being achieved with k=1 can be a sign that either the training
and test sets contain identical samples, or the class boundaries are very clearly
separated. In our case, the training and test sets were selected randomly from the
dataset with a 2/3 ratio, so the data cannot be the same. The most probable
explanation is therefore that the classes are distributed with very clear boundaries
from the point of view of the KNN algorithm.
Two-class classification into malware and benign files was also performed. The
resulting cross-table can be seen in Figure 24. In the table, class 1 represents
the benign files, while class 2 represents malicious files. Again, predictions were made
with different k values:
k=1: 94.6%
k=2: 94.3%
k=3: 93.5%
k=5: 93.5%
k=7: 92.7%
The best accuracy, 94.6%, was achieved with k=1. The detailed accuracy can be
found in Tables 4.1 and 4.2.
Overall, the KNN algorithm achieved a good accuracy of 87% for multi-class
classification and 94.6% for two-class classification. We can conclude that the
algorithm provided good results. Classes are distributed evenly in the multi-class
case, which also contributed to the good accuracy of the predictions. Even though
the distribution is uneven in the two-class case (310 vs. 61), the results are still
accurate.
The next algorithm tested was Support Vector Machines. The results of the
predictions are outlined in Figure 25. The overall accuracy achieved was 87.6% for
multi-class classification and 94.6% for binary classification.
The detailed information about the accuracy of each class can be found in Table 5.

Class  Family       Correctly    Incorrectly   Accuracy   Average
                    classified   classified               Cuckoo score
1      Benign       56           5             91.8%      1.04
2      Dridex       32           5             86.5%      5.26
3      Locky        21           6             77.8%      6.41
4      TeslaCrypt   37           7             84%        6.27
5      Vawtrak      10           8             55.6%      2.66
6      Zeus         31           9             77.5%      6.46
7      DarkComet    48           1             98%        5.15
8      CyberGate    37           1             97.4%      6.57
9      Xtreme       31           3             91.2%      5.15
10     CTB-Locker   22           0             100%       4.76
Figure 26 outlines the cross-table for binary classification. The detailed information
about binary classification can be found in Tables 6.1 and 6.2. As we can see, the
number of correctly identified benign instances (true negatives) was 41; correctly
identified malicious instances (true positives), 310; incorrectly identified benign
instances (false positives), 20; and incorrectly identified malicious instances (false
negatives), 0.
Overall, the resulting accuracies of 87.6% for multi-class classification and 94.6% for
binary classification are almost equal to the K-Nearest Neighbors results. However,
this algorithm produced 0 false negatives in binary classification, meaning that no
malware samples were identified as benign. Therefore, it can prevent malware
infections more effectively than K-Nearest Neighbors.
The third algorithm tested was the J48 Decision Tree. The advantage of the decision
tree method is that it is a "white box" approach: we can see which decisions led to a
given prediction. The decision trees for multi-class classification and binary
classification can be found in Figures 27 and 28, respectively.
The overall accuracy was 93.3% for multiclass classification and 94.6% for binary
classification. The cross-table outlining the results of multiclass classification can be
found in Figure 29.
For the binary classification problem, the algorithm produced 46 correctly identified
benign samples (true negatives), 305 correctly identified malware samples (true
positives), 15 incorrectly identified benign samples (false positives), and 5
incorrectly classified malware samples (false negatives). The details are presented
in Figure 30 and Tables 8.1 and 8.2.
The overall accuracy of the J48 Decision Tree was good: 93.3% for multiclass
classification and 94.6% for binary classification. For multiclass classification, this
result is considerably better than the ones obtained with K-Nearest Neighbors and
Support Vector Machines. For binary classification, however, the result is the same.
The fourth algorithm tested was Naive Bayes. The resulting accuracy was 72.23%
for multiclass classification and 55% for binary classification. The cross-table
related to the Naive Bayes classification can be found in Figure 31.
The detailed results that outline the accuracy of each of the malware families can be
found in Table 9.
For binary classification, the algorithm performed poorly. The number of correctly
identified benign instances (true negatives) was 61; correctly identified malware
instances (true positives), 143; incorrectly identified benign instances (false
positives), 0; and incorrectly identified malware instances (false negatives), 167.
The detailed results can be found in Figure 32 and in Tables 10.1 and 10.2.
Overall, the Naive Bayes algorithm performed poorly. The multiclass accuracy was
72.23% and the binary accuracy only 55%. This result is unacceptable for real-world
detection. In addition, the number of false negatives, in other words, malware files
incorrectly marked as benign, reached 167, or 45% of the total number of files. In a
real environment, such a result would cause a large malware epidemic in a short
amount of time.
Most likely, this poor accuracy is the result of strong dependence between features.
As we know, the main drawback of the Naive Bayes algorithm is that it treats each
feature independently, although in most cases this assumption does not hold. In our
case, certain APIs most likely depend on each other, i.e. API_n cannot be triggered
without API_m. That is the most probable reason for the poor result of the Naive
Bayes algorithm.
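This double-counting effect is easy to demonstrate on synthetic data: duplicating one informative feature ten times adds no information, yet makes Gaussian Naive Bayes far more confident, because each copy is treated as independent evidence (illustrative sketch, not the thesis's experiment):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
y = (base[:, 0] > 0).astype(int)

# The same informative feature, once vs. duplicated ten times:
X1, X10 = base, np.repeat(base, 10, axis=1)
x = np.array([[0.5]])  # a test point inside class 1's region

p1 = GaussianNB().fit(X1, y).predict_proba(x)[0]
p10 = GaussianNB().fit(X10, y).predict_proba(np.repeat(x, 10, axis=1))[0]
# p10 is far more extreme than p1: the duplicated model multiplies the
# same likelihood ratio ten times, although no information was added.
```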
The last algorithm that was implemented was the Random Forest algorithm. The
algorithm resulted in a good accuracy of predictions, 95.69% for multi-class
classification and 96.8% for binary classification. The cross-table related to the
multiclass predictions can be found in Figure 33.
The detailed information about the performance of each class can be found in Table 11.

Class  Family       Correctly    Incorrectly   Accuracy   Average
                    classified   classified               Cuckoo score
1      Benign       58           3             95%        1.04
2      Dridex       35           2             94.6%      5.26
3      Locky        25           2             92.6%      6.41
4      TeslaCrypt   44           0             100%       6.27
5      Vawtrak      15           3             83.3%      2.66
6      Zeus         35           5             87.5%      6.46
7      DarkComet    49           0             100%       5.15
8      CyberGate    38           0             100%       6.57
9      Xtreme       34           0             100%       5.15
10     CTB-Locker   22           0             100%       4.76
In the binary classification problem, the accuracy reached 96.8%. More
specifically, the number of correctly identified benign instances (true negatives)
reached 52; correctly identified malware instances (true positives), 307; incorrectly
identified benign instances (false positives), 9; and incorrectly identified malware
instances (false negatives), 3. The detailed information can be found in Figure 34
and Tables 12.1 and 12.2.
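These counts translate directly into the standard derived metrics; recomputing from the reported values confirms the 96.8% figure:

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard derived metrics from a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # a.k.a. true positive rate
        "fpr": fp / (fp + tn),      # false positive rate
    }

# The Random Forest binary-classification counts reported above.
m = binary_metrics(tp=307, tn=52, fp=9, fn=3)
# accuracy = 359/371, which rounds to the reported 96.8%
```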
The Random Forest algorithm achieved the highest accuracy among all the tested
algorithms: 95.69% for multiclass classification and 96.8% for binary
classification. However, some false negatives are still present: their number is
three.
6 CONCLUSIONS
Overall, the goals defined for this study were achieved. The desired feature extraction
and representation methods were selected and the selected machine learning
algorithms were applied and evaluated.
The desired feature representation method was selected to be the combined matrix,
outlining the frequency of successful and failed API calls along with the return codes
for them. This was chosen, because it outlines the actual behavior of the file. Unlike
other methods, it combines information about different changes in the system,
including the changes in the registry, mutexes, files, etc.
The result achieved by Random Forest is more accurate than the one achieved by the
sandbox. It is hard to compare the results quantitatively, since the sandbox does not
classify samples as malicious or benign, and classification into malware families is
beyond its functionality as well. Instead, the maliciousness of a file is treated as a
regression problem, and the severity score is its output. However, the difference in
accuracy is easy to see. Table 2, outlined in Chapter 4.2.1, shows that none of the
malware families were labeled with the "red" severity level, and one was labeled as
"green". This result is very inaccurate in comparison to the 95.69% and 96.8%
achieved by Random Forest.
Although the overall accuracy is the main concern, the number of false negatives is
an important factor as well, since they can result in massive infections. Random
Forest, despite its high accuracy, produced 3 false negatives. Support Vector
Machines, in turn, produced 0 false negatives, while its accuracy is lower by only
2%. That is why it is recommended to consider implementing Support Vector
Machines for binary classification.
The study performed in this project was a proof of concept. Therefore, several future
improvements related to its practical implementation can be identified: for example,
extracting the features as they are processed by the sandbox, so that there is no need
to go through the reports again.