Malware Detection Using Machine Learning and Performance Evaluation
Master of Computer Application
Submitted by
STUDENT_NAME
ROLL_NO
Under the esteemed guidance of
GUIDE_NAME
Assistant Professor
CERTIFICATE
This is to certify that the project report entitled “PROJECT_NAME” is the bonafide record of
project work carried out by STUDENT_NAME, a student of this college, during the academic
year 2014 - 2016, in partial fulfillment of the requirements for the award of the degree of Master
of Computer Application from St. Mary's Group of Institutions Guntur of Jawaharlal Nehru
Technological University, Kakinada.
GUIDE_NAME, Asst. Professor (Project Guide)
Associate Professor (Head of Department, CSE)
DECLARATION
We hereby declare that the project report entitled “PROJECT_NAME” is an original work done at
St. Mary's Group of Institutions Guntur, Chebrolu, Guntur, and submitted in fulfillment of the
requirements for the award of Master of Computer Application to St. Mary's Group of Institutions
Guntur, Chebrolu, Guntur.
STUDENT_NAME
ROLL_NO
ACKNOWLEDGEMENT
We consider it a privilege to thank all those who helped us in the successful
completion of the project “PROJECT_NAME”. We extend special gratitude to our guide
GUIDE_NAME, Asst. Professor, whose stimulating suggestions, encouragement, and
comprehensive assistance helped us to coordinate our project, especially in
writing this report.
We would also like to acknowledge with much appreciation the crucial role of our
Co-ordinator GUIDE_NAME, Asst. Professor, in helping us complete our project.
We thank you for being such a wonderful educator as well as a person.
We express our heartfelt thanks to HOD_NAME, Head of the Department, CSE, for his
spontaneous sharing of knowledge, which helped us in carrying this project through
the academic year.
STUDENT_NAME
ROLL_NO
CONTENTS
1 INTRODUCTION................................................................................................5
2 THEORETICAL BACKGROUND.....................................................................6
2.1 Malware types.............................................................................................6
2.2 Detection methods.......................................................................................8
2.3 Need for machine learning........................................................................10
2.4 Related work.............................................................................................11
3 MACHINE LEARNING METHODS................................................................12
3.1 Machine Learning Basics..........................................................................12
3.1.1 Feature extraction........................................................................14
3.1.2 Supervised and Unsupervised Learning......................................15
3.2 Classification methods..............................................................................16
3.2.1 K-nearest neighbours...................................................................17
3.2.2 Support Vector Machines............................................................19
3.2.3 Naive Bayes.................................................................................21
3.2.4 J48 Decision Tree........................................................................22
3.2.5 Random Forest.............................................................................24
3.3 Cross-validation........................................................................................26
4 PRACTICAL PART...........................................................................................27
4.1 Data...........................................................................................................28
4.1.1 Dridex..........................................................................................28
4.1.2 Locky...........................................................................................30
4.1.3 Teslacrypt....................................................................................32
4.1.4 Vawtrak........................................................................................34
4.1.5 Zeus..............................................................................................36
4.1.6 DarkComet...................................................................................37
4.1.7 CyberGate....................................................................................38
4.1.8 Xtreme.........................................................................................39
4.1.9 CTB-Locker.................................................................................40
4.2 Cuckoo Sandbox.......................................................................................41
4.2.1 Scoring system.............................................................................44
4.2.2 Reports and features....................................................................46
4.3 Feature representation...............................................................................48
4.3.1 Binary representation...................................................................49
4.3.2 Frequency representation.............................................................49
4.3.3 Combining representation............................................................50
4.4 Feature selection.......................................................................................50
4.5 Implementation.........................................................................................51
4.5.1 Sandbox configuration.................................................................52
4.5.2 Feature extraction........................................................................52
4.5.3 Feature selection..........................................................................54
4.5.4 Application of machine learning methods...................................55
5 RESULTS AND DISCUSSION........................................................................56
5.1 K-Nearest Neighbors.................................................................................56
5.2 Support Vector Machines..........................................................................59
5.3 J48 Decision Tree......................................................................................62
5.4 Naive Bayes..............................................................................................67
5.5 Random Forest..........................................................................................69
6 CONCLUSIONS................................................................................................72
6.1. Future Work..............................................................................................73
BIBLIOGRAPHY.........................................................................................................75
APPENDICES..............................................................................................................80
1. Feature Extraction Code (python).............................................................80
2. Feature selection code (R).........................................................................85
3. Classification code (R)..............................................................................86
4. List of MD5 hashes of malware samples..................................................89
1 INTRODUCTION
With the rapid development of the Internet, malware became one of the major cyber
threats nowadays. Any software performing malicious actions, including information
stealing, espionage, etc. can be referred to as malware. Kaspersky Labs (2017) define
malware as “a type of computer program designed to infect a legitimate user's
computer and inflict harm on it in multiple ways.”
While the diversity of malware is increasing, anti-virus scanners cannot fulfill the
need for protection, resulting in millions of hosts being attacked. According to
Kaspersky Labs (2016), 6 563 145 different hosts were attacked, and 4 000 000
unique malware objects were detected in 2015. In turn, Juniper Research (2016)
predicts the cost of data breaches to rise to $2.1 trillion globally by 2019.
In addition, the skill level required for malware development is decreasing, due to
the high availability of attack tools on the Internet. The ready availability of
anti-detection techniques, as well as the ability to buy malware on the black market,
makes it possible for anyone to become an attacker, regardless of skill level.
Current studies show that more and more attacks are issued by script kiddies or are
automated (Aliyev 2010).
The goal of this project is to develop a proof of concept for machine learning
based malware classification built on Cuckoo Sandbox. The sandbox will be utilized
to extract the behavior of the malware samples, which will be used as
input to the machine learning algorithms. The goal is to
determine the best feature representation method, how the features should be
extracted, and the most accurate algorithm that can distinguish the malware families
with the lowest error rate.
The accuracy will be measured both for detecting whether a file is malicious and
for classifying the file into a malware family. The accuracy of the obtained
results will also be assessed against the current scoring implemented in Cuckoo
Sandbox, and a decision on which method performs better will be made. The study
will allow building an additional detection module for Cuckoo Sandbox. However,
the implementation of this module is beyond the scope of this project and will not
be discussed in this paper.
2 THEORETICAL BACKGROUND
This chapter provides the background that is essential to understand the malware
detection and the need for machine learning methods. The malware types relevant to
the study are described first, followed by the standard malware detection methods.
After that, based on the knowledge gained, the need for machine learning is discussed,
along with the relevant work performed in this field.
2.1 Malware types
To have a better understanding of the methods and logic behind malware, it is
useful to classify it. Malware can be divided into several classes depending on its
purpose. The classes are as follows:
Virus. This is the simplest form of malware. It is any piece of software
that is loaded and launched without the user's permission while reproducing itself
or infecting (modifying) other software (Horton and Seberry 1997).
Worm. This malware type is very similar to the virus. The difference is that a
worm can spread over the network and replicate itself to other machines (Smith, et
al. 2009).
Trojan. This class covers malware types that aim to appear as legitimate
software. Because of this, the general spreading vector in this class is social
engineering, i.e. making people think that they are downloading legitimate
software (Moffie, et al. 2006).
Spyware. As the name implies, malware that performs espionage can be referred
to as spyware. Typical actions of spyware include tracking search history to send
personalized advertisements and tracking activities to subsequently sell them to
third parties (Chien 2005).
Rootkit. Its functionality enables the attacker to access data with higher
permissions than allowed. For example, it can be used to give an unauthorized user
administrative access. Rootkits always hide their existence and are quite often
unnoticeable on the system, making detection, and therefore removal, incredibly
hard. (Chuvakin 2003).
Keylogger. The idea behind this malware class is to log all the keys pressed
by the user, and, therefore, store all data, including passwords, bank card
numbers and other sensitive information (Lopez, et al. 2013).
Ransomware. This type of malware aims to encrypt all the data on the
machine and ask a victim to transfer some money to get the decryption key.
Usually, a machine infected by ransomware is “frozen” as the user cannot open
any file, and the desktop picture is used to provide information on attacker’s
demands. (Savage, Coogan and Lau 2015).
2.2 Detection methods
All malware detection techniques can be divided into signature-based and
behavior-based methods. Before going into these methods, it is essential to
understand the basics of the two malware analysis approaches: static and dynamic
malware analysis. As the name implies, static analysis is performed “statically”,
i.e. without executing the file. In contrast, dynamic analysis is conducted while
the file is being executed, for example in a virtual machine.
Static analysis can be viewed as “reading” the source code of the malware and
trying to infer the behavioral properties of the file. Static analysis can include various
techniques (Prasad, Annangi and Pendyala 2016):
1. File Format Inspection: file metadata can provide useful information. For
example, Windows PE (portable executable) files can provide much
information on compile time, imported and exported functions, etc.
2. String Extraction: this refers to the examination of the software output (e.g.
status or error messages) and inferring information about the malware
operation.
Static analysis often relies on certain tools. Beyond the simple analysis, they can
provide information on protection techniques used by malware. The main advantage
of static analysis is the ability to discover all possible behavioral scenarios.
Researching the code itself allows the researcher to see all ways of malware
execution, that are not limited to the current situation. Moreover, this kind of analysis
is safer than dynamic, since the file is not executed and it cannot result in bad
consequences for the system. On the other hand, static analysis is much more
time-consuming. For these reasons it is not usually used in real-world dynamic
environments, such as anti-virus systems, but is often used for research purposes, e.g.
when developing signatures for zero-day malware. (Prasad, Annangi and Pendyala
2016).
Another analysis type is dynamic analysis. Unlike static analysis, here the behavior
of the file is monitored while it is executing, and the properties and intentions of
the file are inferred from that information. Usually, the file is run in a virtual
environment, for example in a sandbox. During this kind of analysis, it is possible
to observe behavioral attributes such as opened files, created mutexes, etc.
Moreover, it is much faster than static analysis. On the other hand, dynamic
analysis only shows the behavioral scenario relevant to the current system
properties. For example, if our virtual machine has Windows 7 installed, the
results might differ from the same malware running under Windows 8.1. (Egele, et
al. 2012).
Now, having the background on malware analysis, we can define the detection
methods. The signature-based analysis is a static method that relies on pre-
defined signatures. These can be file fingerprints, e.g. MD5 or SHA1 hashes, static
strings, file metadata. The scenario of detection, in this case, would be as follows:
when a file arrives at the system, it is statically analyzed by the anti-virus software.
If any of the signatures is matched, an alert is triggered, stating that this file is
suspicious. Very often this kind of analysis is enough, since well-known malware
samples can often be detected based on hash values.
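To make the signature-matching scenario concrete, the following minimal sketch (a toy illustration; the signature database and sample bytes are invented, not a real anti-virus database) computes an MD5 fingerprint and checks it against a set of known signatures:

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Return the hex MD5 fingerprint of a byte string."""
    return hashlib.md5(data).hexdigest()

def is_known_malicious(data: bytes, signature_db: set) -> bool:
    """Signature-based check: flag the file if its hash is in the database."""
    return md5_of(data) in signature_db

# Toy database built from a sample we pretend is malicious.
bad_sample = b"drop all files"
db = {md5_of(bad_sample)}

assert is_known_malicious(bad_sample, db)               # exact copy is detected
assert not is_known_malicious(b"drop all filesX", db)   # one changed byte evades the signature
```

Note how changing a single byte of the file changes the hash completely, which is exactly why purely signature-based detection fails against polymorphic malware.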
However, attackers started to develop malware in a way that it can change its
signature. This malware feature is referred to as polymorphism. Obviously, such
malware cannot be detected using purely signature-based detection techniques.
Moreover, new malware types cannot be detected using signatures, until the signatures
are created. Therefore, AV vendors had to come up with another way of detection:
behavior-based, also referred to as heuristics-based, analysis. In this method, the
actual behavior of malware is observed during its execution, looking for signs of
malicious behavior: modifying host files, registry keys, establishing suspicious
connections. By itself, each of these actions cannot be a reasonable sign of malware,
but their combination can raise the level of suspiciousness of the file. There is some
threshold level of suspiciousness defined, and any file exceeding this level raises
an alert. (Harley and Lee 2009).
As stated before, malware detectors based on signatures can perform well on
previously known malware that has already been discovered by some anti-virus
vendor. However, they are unable to detect polymorphic malware, which has the
ability to change its signature, as well as new malware, for which signatures have
not been created yet. In turn, the accuracy of heuristics-based detectors is not
always sufficient for adequate detection, resulting in many false positives and
false negatives. (Baskaran and Ralescu 2016).
2.3 Need for machine learning
The need for new detection methods is dictated by the high spreading rate of
polymorphic viruses. One of the solutions to this problem is reliance on the
To take these correlations into account and provide more accurate detection, machine
learning methods can be used.
2.4 Related work
Although not widely implemented, the concept of machine learning methods for
malware detection is not new. Several studies have been carried out in this field,
aiming to determine the accuracy of different methods.
In his paper “Malware Detection Using Machine Learning”, Dragos Gavrilut aimed to
develop a detection system based on several modified perceptron algorithms. For
different algorithms, he achieved an accuracy of 69.90% to 96.18%. It should be
noted that the algorithms with the best accuracy also produced the highest number
of false positives: the most accurate one resulted in 48 false positives. The most
“balanced” algorithm, with appropriate accuracy and a low false-positive rate, had
an accuracy of 93.01%. (Gavrilut, et al. 2009).
The paper “Malware Detection Module using Machine Learning Algorithms to Assist
in Centralized Security in Enterprise Networks” discusses a detection method based
on a modified Random Forest algorithm in combination with Information Gain for
better feature representation. It should be noted that the data set consists purely of
portable executable files, for which feature extraction
is generally easier. The result achieved is an accuracy of 97% and a 0.03
false-positive rate. (Singhal and Raul 2015).
As can be seen, the studies ended up with different results. From this, we can
conclude that no unified methodology has been created yet, either for detection or
for feature representation. The accuracy of each separate case depends on the
specifics of the malware families used and on the actual implementation.
3 MACHINE LEARNING METHODS
This chapter gives the theoretical background on machine learning methods needed
for understanding the practical implementation. First, an overview of the machine
learning field is given, followed by a description of the methods relevant to this
study. These methods include k-Nearest Neighbors, Decision Trees, Random Forests,
Support Vector Machines and Naive Bayes.
3.1 Machine Learning Basics
The rapid development of data mining techniques and methods resulted in Machine
Learning forming a separate field of Computer Science. It can be viewed as a subclass
of the Artificial Intelligence field, where the main idea is the ability of a system
(computer program, algorithm, etc.) to learn from its own actions. It was first
described as the "field of study that gives computers the ability to learn without being
explicitly programmed" by Arthur Samuel in 1959. A more formal definition is given
by T. Mitchell: "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P if
its performance at tasks in T, as measured by P, improves with experience E."
(Mitchell 1997).
The basic idea of any machine learning task is to train a model, based on some
algorithm, to perform a certain task: classification, clusterization, regression, etc.
Training is done on an input dataset, and the model that is built is subsequently
used to make predictions. The output of such a model depends on the initial task
and the implementation. Possible applications are: given data about house
attributes, such as room number, size, and price, predict the price of a previously
unknown house; based on two datasets of medical images, healthy ones and ones with
a tumor, classify a pool of new images; cluster pictures of animals into several
clusters from an unsorted pool.
A typical machine learning workflow consists of the following steps:
1. Data intake. At first, the dataset is loaded from the file and is saved in
memory.
2. Data transformation. At this point, the data that was loaded at step 1 is
transformed, cleaned, and normalized to be suitable for the algorithm. Data is
converted so that it lies in the same range, has the same format, etc. At this
point feature extraction and selection, which are discussed further, are
performed as well. In addition to that, the data is separated into sets – ‘training
set’ and ‘test set’. Data from the training set is used to build the model, which
is later evaluated using the test set.
3. Model Training. At this stage, a model is built using the selected algorithm.
4. Model Testing. The model that was built or trained during step 3 is tested
using the test data set, and the produced result is used for building a new
model, that would consider previous models, i.e. “learn” from them.
5. Model Deployment. At this stage, the best model is selected (either after a
defined number of iterations or as soon as the needed result is achieved).
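The separation into training and test sets performed at step 2 can be sketched as follows (a minimal illustration; the 70/30 ratio and the fixed seed are arbitrary choices for the example):

```python
import random

def train_test_split(dataset, test_ratio=0.3, seed=42):
    """Shuffle the samples and separate them into a training and a test set."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = dataset[:]            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)
assert len(train) == 70 and len(test) == 30
assert sorted(train + test) == data   # no sample is lost or duplicated
```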
3.1.1 Feature extraction
In any of the examples mentioned above, we should be able to extract the attributes
from the input data, so that it can be fed to the algorithm. For example, for the housing
prices case, data could be represented as a multidimensional matrix, where each
column represents an attribute and rows represent the numerical values for these
attributes. In the image case, data can be represented as an RGB value of each pixel.
Such attributes are referred to as features, and the matrix is referred to as feature
vector. The process of extracting data from the files is called feature extraction. The
goal of feature extraction is to obtain a set of informative and non-redundant data. It is
essential to understand that features should represent the important and relevant
information about our dataset since without it we cannot make an accurate prediction.
That is why feature extraction is often a non-obvious task, which requires a lot of
testing and research. Moreover, it is very domain-specific, so general methods apply
here poorly.
In addition to that, if the input data is too big to be fed into the algorithm (has too
many features), then it can be transformed to a reduced feature vector
(vector, having a smaller number of features). The process of reducing the vector
dimensions is referred to as feature selection. At the end of this process, we expect the
selected features to outline the relevant information from the initial set so that it can
be used instead of initial data without any accuracy loss.
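A minimal illustration of feature selection is dropping zero-variance features, which carry no information for prediction; this is only one of many possible selection criteria (the practical part of this work uses others):

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def select_features(X, threshold=0.0):
    """Keep the indices of columns whose variance exceeds the threshold;
    constant columns are identical for every sample and can be discarded."""
    n_cols = len(X[0])
    columns = [[row[i] for row in X] for i in range(n_cols)]
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

X = [[1.0, 5.0, 0.0],
     [1.0, 7.0, 2.0],
     [1.0, 6.0, 4.0]]
kept = select_features(X)
assert kept == [1, 2]   # column 0 is constant and gets dropped
```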
Common feature preprocessing steps include the following:
1. Normalization
An example of normalization is dividing an image x, where x_i is the
number of pixels with color i, by the total number of counts to encode the
distribution and remove the dependence on the size of the image.
This translates into the formula: x' = x / ||x|| (Guyon and Elisseef 2006).
2. Standardization
Sometimes, even while referring to comparable objects, features can have
different scales. For example, consider the housing prices example. Here,
feature ‘room size’ is an integer, probably not exceeding 5 and feature ‘house
size’ is measured in square meters. Although both values can be compared,
added, multiplied, etc., the result would be unreasonable before normalization.
The following scaling is often done:
x'_i = (x_i − µ_i) / σ_i, where µ_i and σ_i are the mean and the standard deviation
of feature x_i over the training examples. (Guyon and Elisseef 2006).
3. Non-linear expansions
Although in most cases we want to reduce the dimensionality of data, in some
cases it might make sense to increase it. This can be useful for complex
problems, where first-order interactions are not sufficient for accurate results.
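The normalization and standardization transformations described above can be sketched in plain Python as follows:

```python
import math

def normalize(x):
    """Scale a vector to unit length: x' = x / ||x|| (L2 norm)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def standardize(column):
    """Center a feature to zero mean and unit standard deviation:
    x'_i = (x_i - mu) / sigma."""
    mu = sum(column) / len(column)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / len(column))
    return [(v - mu) / sigma for v in column]

v = normalize([3.0, 4.0])
assert abs(math.sqrt(sum(x * x for x in v)) - 1.0) < 1e-9   # unit length

sizes = standardize([80.0, 100.0, 120.0])
assert abs(sum(sizes)) < 1e-9   # zero mean after standardization
```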
3.1.2 Supervised and Unsupervised Learning
So far we have discussed machine learning concepts from the point of view where
we have initial labeled data on which the model can be trained. However, this is
not always the case. Here we want to introduce the two machine learning
approaches: supervised and unsupervised learning.
1. Regression
Predict the value based on previous observations, i.e. values of the samples
from the training set. Usually, we can say that if the output is a real number/is
continuous, then it is a regression problem.
2. Classification
Based on the set of labeled data, where each label defines a class, that the
sample belongs to, we want to predict the class for the previously unknown
sample. The set of possible outputs is finite and usually small. Generally, we
can say that if the output is a discrete/categorical variable, then it is a
classification problem.
3. Clustering
Find the hidden patterns in the unlabeled data and separate it into clusters
according to similarity. An example can be the discovery of different customer
groups inside the customer base of the online shop.
3.2 Classification methods
When the training data is labeled with the class to which each sample
belongs, it is easier to identify the proper class, and the result is more accurate
than with clusterization algorithms. In this section, theoretical background is
given on all the classification methods used in this project.
3.2.1 K-Nearest Neighbors
K-Nearest Neighbors (KNN) is one of the simplest, yet accurate, machine learning
algorithms. KNN is a non-parametric algorithm, meaning that it does not make any
assumptions about the data structure. In real-world problems, data rarely obeys
general theoretical assumptions, making non-parametric algorithms a good solution
for such problems. The KNN model representation is as simple as the dataset itself:
there is no learning required, as the entire training set is stored.
KNN can be used for both classification and regression problems. In both problems,
the prediction is based on the k training instances that are closest to the input instance.
In the KNN classification problem, the output would be a class, to which the input
instance belongs, predicted by the majority vote of the k closest neighbors. In the
regression problem, the output would be the property value, which is generally a mean
value of the k nearest neighbors. The schematic example is outlined in Figure 2.
Different distance measurement methods are used for finding the closest neighbors.
The popular ones include the Hamming distance, the Manhattan distance, and the
Minkowski distance, defined as:
D(x, y) = (Σ_{i=1..n} |x_i − y_i|^p)^(1/p)
The most used method for continuous variables is generally the Euclidean distance
(the Minkowski distance with p = 2), defined by the formula below:
D(x, y) = √(Σ_{i=1..n} (x_i − y_i)^2)
The Euclidean distance is good for problems where the features are of the same
type. For features of different types, it is advised to use, for example, the
Manhattan distance.
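The distance measures discussed above can be expressed as short illustrative implementations:

```python
def euclidean(a, b):
    """D(a, b) = sqrt(sum (a_i - b_i)^2) -- suited to features of the same type."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """D(a, b) = sum |a_i - b_i| -- often advised for mixed feature types."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

assert euclidean((0, 0), (3, 4)) == 5.0
assert manhattan((0, 0), (3, 4)) == 7
assert abs(minkowski((0, 0), (3, 4), 2) - 5.0) < 1e-9
```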
For classification problems, the output can also be presented as a set of
probabilities of an instance belonging to each class. For example, for binary
problems, the probabilities can be calculated as P(0) = N_0 / (N_0 + N_1), where
P(0) is the probability of class 0 membership and N_0, N_1 are the numbers of
neighbors belonging to classes 0 and 1 respectively. (Thirumuruganathan 2010).
The value of k plays a crucial role in the prediction accuracy of the algorithm.
However, selecting the k value is a non-trivial task. Smaller values of k will most
likely result in lower accuracy, especially on datasets with much noise, since every
instance of the training set then carries a higher weight in the decision process.
Larger values of k lower the performance of the algorithm. In addition, if the value
is too high, the model over-smooths the data, making the class boundaries less
distinct and again resulting in lower accuracy. As a general approach, it is advised
to select k using the formula below:
k = √n [5]
A drawback of the KNN algorithm is its bad performance on unevenly distributed
datasets. If one class vastly dominates the others, its instances are more likely to
win the majority vote due to their large number, and, therefore, to produce
incorrect predictions. (Laaksonen and Oja 1996).
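A minimal KNN classifier following the description above might look like this (the two-dimensional toy samples and class names are invented for illustration):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among the k nearest training points.
    `train` is a list of (feature_vector, label) pairs."""
    def dist(a, b):   # Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "malware"), ((1.2, 0.8), "malware"),
         ((8.0, 9.0), "benign"), ((9.0, 8.5), "benign"), ((8.5, 9.5), "benign")]
assert knn_predict(train, (1.1, 0.9), k=3) == "malware"
assert knn_predict(train, (8.7, 9.0), k=3) == "benign"
```

Note that with k=3 the two "malware" points outvote the single nearest "benign" neighbor for the first query, illustrating the majority-vote rule.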
3.2.2 Support Vector Machines
A Support Vector Machine (SVM) separates the classes by finding a hyperplane
between them. Intuitively, we understand that the further from the hyperplane our
classes lie, the more accurate predictions we can make. That is why, although
multiple hyperplanes can be found for a problem, the goal of the SVM algorithm is
to find the hyperplane that results in the maximum margins.
In Figure 3, there is a dataset of two classes. Therefore, the problem lies in a
two-dimensional space, and the hyperplane is represented as a line. In general, a
hyperplane can take as many dimensions as needed.
1. We define X and Y as the input and output sets respectively. (𝑥1, 𝑦1),
…,(𝑥𝑚, 𝑦𝑚) is the training set.
2. Given x, we want to be able to predict y. We can refer to this problem as
learning the classifier y = f(x, a), where a are the parameters of the
classification function.
3. f(x, a) can be learned by minimizing the training error of the function on the
training data. Here, l is the loss function, and R_emp is referred to as the
empirical risk:
R_emp(a) = (1/m) Σ_{i=1..m} l(f(x_i, a), y_i) = Training Error [6]
4. We are aiming at minimizing the overall risk, too. Here, P(x,y) is the joint
distribution function of x and y.
SVMs are generally able to achieve good accuracy, especially on ”clean” datasets.
Moreover, they work well with high-dimensional datasets, even when the number of
dimensions is higher than the number of samples. However, for large datasets with
a lot of noise or overlapping classes, they can be less effective. Also, with larger
datasets the training time can be high. (Jing and Zhang 2010).
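A full SVM solver is beyond a short sketch, but the empirical risk from equation [6] that training minimizes can be illustrated as follows (the linear decision rule, its weights, and the sample points are hypothetical values chosen for the example):

```python
def zero_one_loss(prediction, truth):
    """l(f(x, a), y): 1 for a misclassification, 0 otherwise."""
    return 0 if prediction == truth else 1

def empirical_risk(classifier, samples):
    """R_emp(a) = (1/m) * sum_i l(f(x_i, a), y_i) -- the training error."""
    m = len(samples)
    return sum(zero_one_loss(classifier(x), y) for x, y in samples) / m

# Hypothetical linear decision rule: the sign of (w . x + b) picks the class.
w, b = (1.0, 1.0), -10.0
classify = lambda x: 1 if (w[0] * x[0] + w[1] * x[1] + b) >= 0 else -1

samples = [((1, 1), -1), ((2, 3), -1), ((8, 8), 1), ((9, 4), 1), ((3, 3), 1)]
assert empirical_risk(classify, samples) == 0.2   # one of five points is misclassified
```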
3.2.3 Naive Bayes
Naive Bayes is a classification machine learning algorithm that relies on Bayes'
Theorem. It can be used for both binary and multi-class classification problems.
Its main point is the idea of treating each feature independently. The Naive Bayes
method evaluates the probability of each feature independently, regardless of any
correlations, and makes the prediction based on Bayes' Theorem. That is why the
method is called ”naive”: in real-world problems features often have some level of
correlation with each other.
To understand the algorithm of Naive Bayes, the concepts of class probabilities and
conditional probabilities should be introduced first.
P(C) = count(instances in C) / count(all instances) [9]
P(A|B) = P(B|A) P(A) / P(B) [11]
4. The probabilities of the item belonging to each class are compared, and the
class with the highest probability is selected as the result.
The advantages of this method include its simplicity and ease of understanding.
In addition, it performs well on data sets with irrelevant features, since the
probabilities of those features contributing to the output are low, and therefore
they are not taken into account when making predictions. Moreover, this algorithm
usually performs well in terms of consumed resources, since it only needs to
calculate the probabilities of the features and classes; there is no need to fit
any coefficients as in other algorithms. As already mentioned, its main drawback is
that each feature is treated independently, although in most cases this cannot be true.
(Bishop 2006).
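A minimal categorical Naive Bayes classifier following the description above can be sketched as follows (the behavioral features and samples are invented for illustration; no smoothing of zero counts is applied, which a practical implementation would add):

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """Estimate class priors P(C) and per-feature conditionals P(x_i | C)
    by counting, treating every feature independently."""
    class_counts = Counter(label for _, label in samples)
    cond = defaultdict(Counter)   # (class, feature index) -> value counts
    for features, label in samples:
        for i, value in enumerate(features):
            cond[(label, i)][value] += 1
    return class_counts, cond, len(samples)

def predict(model, features):
    """Pick the class with the highest P(C) * prod_i P(x_i | C)."""
    class_counts, cond, total = model
    best, best_p = None, -1.0
    for label, n in class_counts.items():
        p = n / total                                 # prior P(C)
        for i, value in enumerate(features):
            p *= cond[(label, i)][value] / n          # P(x_i | C), independence assumed
        if p > best_p:
            best, best_p = label, p
    return best

# Toy data: (uses_packer, writes_registry) -> class
data = [(("yes", "yes"), "malware"), (("yes", "yes"), "malware"),
        (("yes", "no"), "malware"), (("no", "no"), "clean"),
        (("no", "no"), "clean"), (("no", "yes"), "clean")]
model = train_naive_bayes(data)
assert predict(model, ("yes", "yes")) == "malware"
assert predict(model, ("no", "no")) == "clean"
```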
3.2.4 J48 Decision Tree
As the name implies, decision trees are data structures that have the structure of
a tree. The training dataset is used to build the tree, which is subsequently used
for making predictions on the test data. In this algorithm, the goal is to achieve
the most accurate result with the least number of decisions that must be made.
Decision trees can be used for both classification and regression problems. An
example can be seen in Table 1:
As can be seen in Figure 4, the model was trained on the dataset and can now classify the tennis playing decision as "yes" or "no". Here, the tree consists of decision nodes and leaf nodes. Decision nodes have several branches leading to leaf nodes, and leaf nodes represent the decisions or classifications. The topmost initial node is referred to as the root node.
The most common algorithm for decision trees is ID3 (Iterative Dichotomiser 3). It relies on the concepts of Entropy and Information Gain. Entropy here refers to the level of uncertainty in the data. For example, the entropy of a fair coin toss is maximal, since there is no way to be sure of the result. Conversely, a toss of a coin with heads on both sides would result in zero entropy, since we can predict the outcome with 100% probability before each toss. (Mitchell 1997).
In simple words, the ID3 algorithm can be described as follows: starting from the root node, at each stage we want to partition the data into homogeneous (similar in their structure) datasets. More specifically, we want to find the attribute that would result in the highest information gain, i.e. return the most homogeneous branches (Swain and Hauska 1977):
1. Calculate the entropy of the dataset before the split.
2. Split the dataset on an attribute and calculate the entropy of each branch. Then calculate the information gain of the split, that is, the difference between the initial entropy and the proportional sum of the entropies of the branches.
3. The attribute with the highest Gain value is selected as the decision node.
4. If one of the branches of the selected decision node has an entropy of 0, it
becomes the leaf node. Other branches require further splitting.
5. The algorithm is run recursively until there is nothing to split anymore.
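The entropy and information gain computations behind these steps can be sketched as follows (a minimal Python illustration of the definitions, not the J48/R implementation used later in the study):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent_labels, branches):
    """Initial entropy minus the proportional (size-weighted) sum of branch entropies."""
    total = len(parent_labels)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent_labels) - weighted
```

A fair coin toss gives entropy 1.0 bit, while a two-headed coin gives 0.0, matching the examples above; ID3 selects the attribute whose split maximizes `information_gain`.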
J48 is an open-source implementation of the C4.5 algorithm, a successor of ID3, that is included in one of the R packages, and this is the implementation we are going to use in our study.
The decision tree method owes its popularity to its simplicity. It deals well with large datasets and handles noise in the data very well. Another advantage is that, unlike algorithms such as SVM or KNN, decision trees operate as a "white box", meaning that we can clearly see how the outcome is obtained and which decisions led to it. These properties made it a popular solution for medical diagnosis, spam filtering, security screening and other fields. (Mitchell 1997).
Random Forest is one of the most popular machine learning algorithms. It requires almost no data preparation or modeling effort but usually produces accurate results. Random Forests are based on the decision trees described in the previous section. More specifically, a Random Forest is a collection of decision trees that together produce better prediction accuracy. That is why it is called a 'forest' – it is basically a set of decision trees.
The basic idea is to grow multiple decision trees based on the independent subsets of
the dataset. At each node, n variables out of the feature set are selected randomly, and
the best split on these variables is found.
In simple words, the algorithm can be described as follows (Biau 2013):
1. Multiple trees are built, each on roughly two thirds of the training data (about 63.2%). The data is chosen randomly.
2. Several predictor variables are randomly selected out of all the predictor variables. Then, the best split on these selected variables is used to split the node. By default, the number of selected variables is the square root of the total number of predictors for classification, and it is constant for all trees.
3. Using the rest of the data, the misclassification rate is calculated. The total error rate is calculated as the overall out-of-bag error rate.
4. Each trained tree gives its own classification result, casting its own "vote". The class that receives the most "votes" is chosen as the result.
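Steps 1 and 4 can be sketched in a few lines (a simplified illustration of bootstrap sampling and majority voting; the per-tree training itself is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Step 1: sample with replacement; each tree sees roughly two thirds
    of the distinct training rows on average."""
    return [rng.choice(data) for _ in data]

def forest_predict(tree_votes):
    """Step 4: each tree casts a "vote"; the class with the most votes wins."""
    return Counter(tree_votes).most_common(1)[0][0]
```

The rows left out of a tree's bootstrap sample are the "out-of-bag" data used in step 3 to estimate the error rate.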
The scheme of the algorithm is seen in Figure 5.
As with decision trees, this algorithm removes the need for feature selection for removing irrelevant features – they will not be taken into account in any case. The only need for feature selection with random forest algorithms arises when there is a need for dimensionality reduction. Moreover, the out-of-bag error rate mentioned earlier can be considered the algorithm's own cross-validation method. This removes the need for tedious cross-validation measures that would otherwise have to be taken. (Mitchell 1997).
Random forests inherit many of the advantages of decision tree algorithms. They are applicable to both regression and classification problems; they are easy to compute and quick to fit. They also usually result in better accuracy. However, unlike decision trees, the results are not very easy to interpret. In decision trees, by examining the resulting tree, we can gain valuable information about which variables are important and how they affect the result. This is not possible with random forests. Random forests can also be described as more stable than decision trees – if we modify the data a little, the decision tree will change, most likely reducing the accuracy. This will not happen with random forests: since a random forest is the combination of many decision trees, it will remain stable. (Louppe 2014).
3.3 Cross-validation
The drawback of the accuracy evaluation methods built into the machine learning methods themselves is that they cannot predict how the model will perform on new data. The approach to overcoming this drawback relies on cross-validation. The idea is to split the initial dataset: the model is trained on the bigger part of the dataset and then tested on the smaller part. There are three different classes of cross-validation:
1. Holdout method – here, the dataset is separated into two parts: a training set and a test set. The model is fit on the training set and then tested on the test set, which it has not seen before. The resulting errors are used to compute the mean absolute test error, which is used for model evaluation. The advantage of this method is its high speed. On the other hand, the evaluation result depends highly on how the test set was selected, since the variance is usually high; therefore, the evaluation result can differ significantly between different test sets.
2. The k-fold method can be seen as an improvement over the holdout method. Here, k subsets are selected, and the holdout method is repeated k times, where each time one of the k subsets is used as the test set, and the remaining k-1 subsets together form the training set. The average error is then computed over all k runs of the holdout method. As k increases, the variance is reduced, ensuring that the accuracy will not change much with different datasets. The disadvantage is the complexity and the running time, which is higher compared to the holdout method.
3. The leave-one-out method is the extreme case of the k-fold method, where k equals the number of samples. On each run of the holdout method, the model is trained on all the data points except one, and that one point is subsequently used for testing. The variance, in this case, is as small as possible. The computing complexity, on the other hand, is high. (Schneider 1997).
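The k-fold scheme described above can be sketched as follows (a minimal illustration over sample indices; leave-one-out corresponds to k equal to the number of samples):

```python
def k_fold_splits(indices, k):
    """Yield (train, test) index lists: each fold serves as the test set once,
    and the remaining k-1 folds together form the training set."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

The model's error is computed on each of the k test folds and averaged over all runs.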
This chapter provided the background on machine learning that is essential for understanding the practical implementation of the project, which is described in the next chapter. The concepts of the feature set, feature extraction, and selection methods were discussed, along with the machine learning algorithms that will be used in the practical part. The chosen algorithms are K-Nearest Neighbours, Support Vector Machines, Decision Trees, Random Forests and Naive Bayes.
4 PRACTICAL PART
As a reminder, the goal of the project is to determine the most suitable feature representation and extraction methods, to find the most accurate algorithm that can distinguish the malware families with the lowest error rate, and to see how this accuracy relates to the accuracy of the current scoring system. This chapter discusses the practical aspects of the project implementation. This includes data gathering, a description of the malware families that make up the dataset, selection of the features that will be used for the algorithms, finding the optimal feature representation and evaluation methods, and the implementation process.
4.1 Data
For this project, a total of 2 140 files were collected. For most of them, hashes, which uniquely identify files, were found in incident reports or malware reverse engineering reports, and these hashes were subsequently used to get the corresponding samples from the VirusTotal service with the help of external malware researchers (VirusTotal 2017). To be able to operate with a diverse dataset, nine malware families were used, resulting in 1 156 malicious files and 984 benign files. The malicious families used are Dridex, Locky, TeslaCrypt, Vawtrak, Zeus, DarkComet, CyberGate, Xtreme and CTB-Locker. They are discussed in detail further in this chapter. Benign files were mainly software installers of the .exe format, but also included several files of .pdf, .docx and other formats, as these are often used as malware spreading vectors. To achieve the most meaningful and up-to-date results, only malware that has appeared in the last two years is used.
4.1.1 Dridex
The first malware family, with a total of 172 unique files, is Dridex. This malware belongs to the Trojan class; specifically, it is a banking trojan. It caused a huge wave of infections in 2015, resulting in 3 000 - 5 000 infections per month.
Dridex is derived from Cridex, malware that spread in 2012. Cridex was also a banking credentials stealer, but more specifically it was a worm that utilized attached storage devices as a spreading vector. In 2014 a renewed version appeared, switching from command and control communications to peer-to-peer and therefore becoming more resilient to takedown operations.
The Dridex attack targeted users of specific banks, aiming to steal their credentials during banking sessions. It is said to target over 300 institutions in 40 regions, mostly focusing on English-speaking countries with high income rates: most infections happened in the United States, the United Kingdom, and Australia. (O'Brien 2016).
Most of the Dridex malware files were distributed during a massive-scale spam campaign, using real company names as the sender addresses but fake top-level domains matching the location of the targeted users. Most emails were either invoices or orders. The attackers behind Dridex showed a high level of attention to detail: emails with real company names also utilized real employee names and were sent during business hours.
Dridex has a modular architecture, allowing for the attackers to easily add additional
functionality. According to Symantec, there are the following modules (O’Brien
2016):
1. The loader module's only purpose is to install the main module. The loader finds one of the servers in its configuration and requests a binary and configuration data using an HTTPS request.
4.1.2 Locky
The second malware family, represented by 115 unique files, is Locky. This is ransomware that encrypts all data on the victim's system using the RSA-2048 and AES-256 ciphers and adds a .locky extension to it. Locky emerged in February 2016 and has been distributed aggressively since then. The most common distribution vectors are spam campaigns, specifically fake invoices, and phishing websites. These spam campaigns were extremely similar to the ones used to distribute Dridex in their size and their utilization of financial documents and macros, which suggests that the Dridex group is responsible for this malware. The price for decryption of system files varied from 0.5 to 1 bitcoin. (Symantec Security Response 2016). The operation scheme of Locky can be seen in Figure 7.
Upon delivery to the system, the macros embedded into a .docx or .xls file run and download the Locky malware. The malware file, in turn, copies itself into the %temp% folder with a random name and an .exe or .dll file format. A "Run" registry key with the value "Locky" is subsequently added to the registry, pointing to the .exe file in the %temp% folder. The initial file is deleted at this point. A new process is started after that, exploring the volume properties and deleting the shadow copies present on the volume. The recovery instructions and the public key are retrieved with a POST request from a command and control server. After that, all files on the system are encrypted, and the desktop background is changed to an image with the decryption instructions. (McAfee Labs 2016). An additional registry key is created, allowing the malware to run every time the system is started. Figure 8 shows the decryption instructions for Locky.
Figure 8. Recovery instructions of the Locky malware (Symantec Security Response 2016)
4.1.3 Teslacrypt
Teslacrypt is the third malware family, consisting of 115 files and belonging to the
ransomware class. Main distribution vector is compromised websites and emails with
links leading to malicious websites that download the malware once they are visited.
Upon download, the file is executed immediately. The operation scheme of Teslacrypt
can be found in Figure 9.
Payment for a decryption key is requested to be made via PayPal or Bitcoin (1 000 USD or 1.5 bitcoin). Unlike other ransomware families, TeslaCrypt encrypted not only obvious data files, such as .pdf, .doc and .jpg, but also game-related files, including those of Call of Duty, World of Tanks, Minecraft and World of Warcraft. Interestingly, in May 2016 the attackers behind TeslaCrypt announced that they had closed the project and released the master decryption key. Several days later, ESET released a free decryption tool. More details can be found in Figure 11.
Figure 11. Payment page of TeslaCrypt with the master decryption key (Mimoso 2016)
4.1.4 Vawtrak
The fourth malware family, consisting of 74 unique files, is Vawtrak. Also referred to as Neverquest or Snifula, Vawtrak is another example of a banking Trojan. Most infections happened in the Czech Republic, the USA, the UK, and Germany. Spreading vectors include malware downloaders, spam with malicious links and other drive-by downloads. After being downloaded, Vawtrak is capable of gaining access to the banking accounts of a victim, as well as stealing credentials, passwords, private keys, etc.
The operation process of this malware family is outlined in Figure 12. The execution of the initial file, downloaded to the drive, results in the installation of a dropper file into the %ProgramData% folder with a randomly created extension and filename. The initial file is deleted after that. (Křoustek 2015). This dropper file is a DLL that is responsible for unpacking the Vawtrak module and injecting it into the running processes. To do that, the DLL first decrypts the payload with the hardcoded key and decompresses itself, resulting in a new DLL, which replaces the initial one. This DLL, in turn, extracts the final module, which turns out to be a compressed version of two DLLs: 64-bit and 32-bit modifications. These DLLs are injected into the system processes and are responsible for Vawtrak's functionality.
4.1.5 Zeus
Zeus is the fifth malware family and is represented by 116 unique files. It is a botnet
package, which can be easily traded on the black market for around 700 USD. After
its appearance in 2007, Zeus has evolved and remains one of the most common botnet
malware representatives.
The summary of Zeus operation can be found in Figure 13. The infection vectors of Zeus vary dramatically, from spam emails to drive-by downloads. After the download, the malware injects itself into the sdra64.exe process and modifies the registry values so that it is executed upon system startup. After that, Zeus injects itself into the winlogon.exe process and terminates the initial executable. The code injected into winlogon.exe injects additional data into the svchost.exe process and creates two files: local.ds, which contains the up-to-date configuration, and user.ds, which contains data to be transmitted to the command and control server. (Falliere and Chien 2009).
The popularity of Zeus is related to the fact that it is relatively cheap and easy to use. Moreover, it comes as a ready-to-deploy package and as a result can be used by novices and script kiddies.
4.1.6 DarkComet
During the Syrian conflict in 2014, DarkComet was used by the Syrian government for espionage on Syrian citizens who were bypassing the government's censorship of the Internet. In 2015, the "Je Suis Charlie" slogan was used to trick people into downloading DarkComet: it was disguised as a picture, which compromised users once downloaded.
Like most RATs, DarkComet includes two components: the client and the server. However, they have the reverse meaning from the perspective of the attacker: the 'server' is the machine with the malware, and the 'client' is the attacker. DarkComet relies on a remote-connection architecture: once it executes, the server connects to the client, which has a GUI allowing it to control the server (Kujawa 2012). The functionality of DarkComet is broad, including, but not limited to (Kujawa 2012):
Webcam and sound capture
Keylogging
Power off/Shutdown/Restart
Remote Desktop functionality
Active ports discovery
LAN computers discovery
URL download
The communication between the server and the client is outlined in Figure 14.
4.1.7 CyberGate
The operation of the CyberGate is guided by the attacker, and the communication
happens with a client-server model. Again, here the attacker is referred to as a client
and the infected machine is a server. The communication happens in a way similar to
the one outlined in Figure 14.
In addition to that, there are plenty of the tutorials that can be found on the Internet,
allowing people with a limited set of skills to take advantage of this RAT for
malicious purposes. (Aziz 2014).
4.1.8 Xtreme
Another example of a RAT is Xtreme. Developed in Delphi, it is available for free and shares its source code with several other Delphi RAT malware families, including CyberGate. Xtreme was used in several governmental attacks, as well as several attacks targeting Israel and Palestine. The architecture of Xtreme relies on the client-server model, where the attacker is considered to be the client. The configurations are written to the %APPDATA%\Microsoft\Windows folder or the folder named after the created mutex. The data is subsequently encrypted using RC4 with "CONFIG" or "CYBERGATEPASS" as the password. The configurations are stored in a file with the ".ngo" or ".cfg" extension. The configuration data includes the name of the installed file, the injection process, FTP and CnC information, and the mutex name. (Villeneuve and Bennett 2014). The communication between the infected machine and the attacker happens in a way similar to that of DarkComet, which is outlined in Figure 14.
The functionality of Xtreme allows the attacker to (Villeneuve and Bennett 2014):
Read and modify the registry
Interact via the remote shell
Capture the desktop
Capture data from connected devices, such as a microphone, webcam, etc.
Manipulate running processes
Upload and download files
4.1.9 CTB-Locker
The last malware family used was CTB-Locker, represented by 79 unique files. This is another example of ransomware, which encrypts the user's files and asks for money for the decryption key. CTB is an acronym for Curve Tor Bitcoin, referring to the elliptic curve algorithm that was used for encryption.
The propagation of the CTB-Locker samples happened through emails with malicious attachments. The attachments were .zip files with a downloader inside. The initial operation of CTB-Locker is outlined in Figure 15. Upon execution, the malware drops itself into the %temp% folder with a random name and injects itself into the svchost.exe process. Moreover, a mutex with a random name is created, ensuring that there is only one instance of CTB-Locker running on the machine.
The study is based on and targeted at Cuckoo Sandbox. To apply machine learning algorithms to any problem, it is essential to represent the data in some form. For this purpose, Cuckoo Sandbox was used. The reports generated by the sandbox, describing the behavioral data of each sample, were preprocessed, and malware features were extracted from them. However, it is important to understand the functionality of the sandbox and the structure of the reports first.
Cuckoo Sandbox is an open-source malware analysis tool that produces a detailed behavioral report of any file or URL in a matter of seconds. According to the Cuckoo Foundation (2015), the currently supported file formats include:
The main components of Cuckoo’s infrastructure are a host machine (the management
software) and a number of guest machines (virtual or physical machines for analysis).
Its operation scenario is quite straightforward: as soon as the new file is submitted to
the server, a virtual environment is dynamically allocated for it, the file is executed,
and all the actions performed in the system are recorded.
As shown in Figure 17, the sandbox generates the report which outlines all the
behavior of the file in the system. The report is represented as a JSON file, and
currently, it is capable of detecting the following features (Cuckoo Foundation 2015):
After obtaining the behavior of the file, Cuckoo Sandbox makes a decision on the level of maliciousness of the file using pre-defined signatures. This functional part of the sandbox interests us only as a way to compare the performance of the machine learning methods to the currently implemented signature-based methods.
The Cuckoo analysis score is an indication of how malicious an analyzed file is. The
score is determined by measuring how many malicious actions are performed. Cuckoo
uses a set of summarized malicious actions, called signatures, to identify the malicious
behavior. Each of these signatures has its score, which indicates the severity of the
performed action.
In total, there are three levels of severity, each with its own score: 1 for low, 2 for medium, and 3 for high. An example of a low severity signature is performing a query on the computer name. An example of a medium severity signature is the creation of an executable file. An example of a high severity signature is the removal of a shadow copy.
During analysis, all actions are stored to be processed afterwards. In the end, multiple modules, including the signatures module, are used to examine the stored actions. The signatures module examines all the collected data and finds patterns that match a signature. If a signature matches, a counter is incremented by the severity score of that signature (1, 2, or 3). When all signatures have been processed, the value of the counter is divided by 5.0 to create a floating-point score. This score is the Cuckoo analysis score. Examples of signatures of different severity can be found in Figure 18.
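The scoring procedure just described reduces to a few lines of arithmetic (an illustrative sketch, not Cuckoo's actual code; the severity levels follow the ones listed above):

```python
# Severity scores as described in the text: 1 for low, 2 for medium, 3 for high.
SEVERITY = {"low": 1, "medium": 2, "high": 3}

def cuckoo_score(matched_severities):
    """Sum the severity scores of all matched signatures, then divide by 5.0."""
    counter = sum(SEVERITY[s] for s in matched_severities)
    return counter / 5.0
```

For example, a sample matching one signature of each level accumulates 1 + 2 + 3 = 6, giving a score of 1.2.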
The average scores of the malware families used in this project are outlined in Table
2. The color indicates the maliciousness level corresponding to the score.
It is hard to measure the accuracy of the detection, since there is no threshold value indicating whether a sample is malicious or not. Moreover, determining the specific class to which malware belongs is beyond the functionality of the sandbox. In the graphical user interface, there are indicators of green, yellow and red colors, outlined in Figures 18 - 19, indicating how trustworthy the file is. The green indicator is used for samples with a score of 4 and lower, yellow for scores between 4 and 7, and red for scores between 7 and 10. However, this feature is only part of the interface and is not very reliable, as it is still in the alpha state. Moreover, it has some bugs, as outlined in Figure 20.
To apply machine learning algorithms to the problem, we need to figure out what kind
of data should be extracted and how it should be presented.
Some works in the field utilize string properties or file format properties as a basis for feature representation. For example, for Windows-based malware samples, the data contained in PE headers is often used as a basis for analysis. However, implementing format-specific feature extraction is not the best solution, since the formats of the analyzed files can vary dramatically. (Hung 2011).
Other works rely on so-called n-grams. Byte n-grams are overlapping substrings collected in a sliding-window fashion, where a window of fixed size slides one byte at a time. Word n-grams and character n-grams are widely used in natural language processing, information retrieval, and text mining. (Reddy and Pujari 2006).
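A byte n-gram extractor of this kind is only a few lines of Python (a generic sketch of the sliding-window idea, not tied to any particular cited work):

```python
def byte_ngrams(data, n):
    """Collect overlapping n-byte substrings in sliding-window fashion,
    moving the fixed-size window one byte at a time."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]
```

Applied to the first bytes of a binary, each position yields one n-gram, so a file of length L produces L - n + 1 overlapping n-grams.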
Files
Registry keys
Mutexes
Processes
IP addresses and DNS queries
API calls
This section discusses which of the above-mentioned features should be used in our
work.
Files
The reports contain information about opened, written, and created files. This kind of information is good for predicting the malware family, since any malware file triggers many modifications to the file system. It can be used for quite accurate malware classification in most cases. However, in the case of ransomware, for example, relying solely on the file modifications might result in the algorithm not being able to distinguish different families. This is because ransomware encrypts every file on the system; therefore, the feature set consists mostly of encrypted files. The differences between ransomware families would be defined by the files with malware settings, the number of which is vastly lower than the whole feature set, and, therefore, it would be very hard to make predictions based on this data.
Registry keys
On Windows systems, the registry stores the low-level system settings of the operating system and its applications. Any sample that is run on the system triggers a large number of registry changes – the Cuckoo reports can outline the registry keys opened, read, written, and deleted. The information on registry modifications can be a good source of information on the system changes caused by malware and can be used for malware detection.
Mutexes
Mutex stands for Mutual Exclusion. This is a program object that allows multiple threads to share the same resource. Every time a program is started, a mutex with a unique name is created. Mutex names can be good identifiers of specific malware samples. However, for families they cannot produce accurate results on a large scale, since the number of mutexes created per sample is dramatically lower than the size of the dataset. That is why a small change related to a bug or a non-started process would result in a dramatic change of the prediction results.
Processes
A common identifier of a specific malware sample is the name of the created process. However, it can rarely be used for identification of the malware family, since in common cases the process names are the same as the hash of the sample. As an alternative, the malware sample can inject itself into a system process. That is why this feature is poorly suited for family identification.
API calls
API stands for Application Programming Interface and refers to the set of tools that provide an interface for communication between different software components. API calls are recorded during the execution of the malware and refer to specific processes. They outline everything happening in the operating system, including the operations on the files, registry, mutexes, processes and other features mentioned earlier. For example, the API calls OpenFile, OpenFileEx, CreateFile, CopyFileEx, etc. define the file operations, while the calls OpenMutex, CreateMutex and CreateMutexEx describe the mutexes opened and created. API call traces present a wide description of the sample behavior, including all the properties mentioned above. In addition, they include a wide set of distinct values. Moreover, they are simple to describe in numeric format, and that is why they were chosen as features. Here, the feature set will be defined by the number of unique API calls and the return codes. The next section describes the representation in more detail.
Having familiarized ourselves with the features presented in the Cuckoo Sandbox reports, we can now think about the way to represent the features for the machine learning algorithms. Since the feature set, containing the failed and successful APIs as well as the return codes, is quite large, we have to find a way to present it in a clear, compact and non-redundant way. The representations considered for this task are the binary and frequency matrices, discussed in detail in the following sections.
The binary representation is the simplest and most straightforward way to represent the features of the failed and successful API calls. Here, a matrix is created where the rows represent the samples and the columns represent the API calls. A value of 0 represents the 'failed' state of an API call, and a value of 1 represents a successful API call.
Although this approach is simple and straightforward, it takes into account neither the return codes generated nor the number of times a certain API call was triggered, resulting in lower accuracy. (Pirscoveanu 2015).
Here, the horizontal axis represents the samples and the vertical axis represents the API calls, where each number represents the number of times the API call was triggered. This approach clearly provides more details than the binary representation, resulting in better accuracy. (Pirscoveanu 2015).
To utilize the maximum amount of useful data present in the API call information, the best approach is to combine the features of the previous representation methods. The resulting matrix outlines the frequencies of failed APIs, successful APIs, and return codes.
Here the rows represent the samples; the columns 𝑃𝑎𝑠𝑠1…𝑃𝑎𝑠𝑠𝑛 represent the number of times each API call in [𝑃𝑎𝑠𝑠1; 𝑃𝑎𝑠𝑠𝑛] was called, where n is the total number of API calls triggered. Similarly, the columns 𝐹𝑎𝑖𝑙1…𝐹𝑎𝑖𝑙𝑛 represent the number of times each API call failed, and the columns 𝑅𝑒𝑡𝐶1…𝑅𝑒𝑡𝐶𝑛 represent the number of times each return code was returned. (Pirscoveanu 2015).
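Building such a combined matrix can be sketched as follows. The API names, return codes, and call records used here are hypothetical placeholders for illustration; the real features come from the Cuckoo reports:

```python
def combined_matrix(reports, api_names, ret_codes):
    """One row per sample: [pass counts per API | fail counts per API | counts per
    return code]. Each report is a list of (api_name, succeeded, return_code)
    call records extracted from a sandbox report."""
    rows = []
    for calls in reports:
        passed = {a: 0 for a in api_names}
        failed = {a: 0 for a in api_names}
        codes = {c: 0 for c in ret_codes}
        for api, ok, code in calls:
            (passed if ok else failed)[api] += 1
            codes[code] += 1
        rows.append([passed[a] for a in api_names]
                    + [failed[a] for a in api_names]
                    + [codes[c] for c in ret_codes])
    return rows
```

Concatenating the pass, fail, and return-code blocks is what more than doubles the feature count compared with the frequency matrix alone.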
This approach results in fair performance, and that is why it was chosen for our problem. Obviously, the use of the combination method resulted in a dramatic increase in the number of features, since they are now represented by the combination of passed APIs, failed APIs and return codes, instead of relying solely on the APIs triggered. Since the feature set became more than twice as big, some feature selection should be performed.
The goal of feature selection is to remove unimportant features from the feature set when it gets too big. Bigger feature sets are harder to operate with, and some features in the set might not carry any weight in the decision of the algorithm and, therefore, can be removed. For example, in our case some API call might be triggered only once, in one sample. In the case of a wide and varied feature set, this unique API call will not play any role in the algorithm and, therefore, removing it will not affect the accuracy in any way.
After extracting the features and representing them as a combination matrix, we ended up with 70 518 features. This number is too large for processing and accurate predictions. For example, with such a large feature set it takes approximately two to three hours to load the dataset, preprocess it and run the k-nearest neighbors algorithm on an x64 machine with 8 GB of RAM. This resource consumption is unacceptable, and there is a need to remove irrelevant features.
Three general classes of feature selection methods are filtering methods, wrapper
methods, and embedded methods (Guyon and Elisseef 2006).
Filter methods
Filter methods statistically score the features. Features with high scores
are kept in the dataset, while features with low scores are removed.
Wrapper methods
Here, different feature combinations are tried with a prediction model, and
the combination that leads to the highest accuracy is chosen.
Embedded methods
These methods evaluate the features used while the model is being created.
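A filter method of the simplest kind can be sketched as follows; this variance-based scorer is an illustrative assumption, not the scoring actually used in the thesis:

```python
def filter_select(X, k):
    """Simple filter method: score each feature column by its variance
    and keep the indices of the k highest-scoring features."""
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        scores.append(sum((v - mean) ** 2 for v in col) / n)
    # Sort feature indices by score, highest first, and keep k of them.
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]

X = [[1, 0, 5], [1, 0, 9], [1, 1, 2]]  # column 0 is constant
kept = filter_select(X, 2)             # the constant column is dropped
```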
4.5 Implementation
During this step, the research plan designed earlier is put into
practice.
To get the malware behavioral reports and to ensure that malware runs correctly,
including all of its functionality, it is important to configure Cuckoo Sandbox. In the
real world different malware samples exploit different vulnerabilities that might be
part of certain software products. Therefore, it is important to include a broad range of
services in the virtual machines created by the sandbox.
The hypervisor used for Cuckoo's virtual machines is VirtualBox. The virtual
machines are created using VMcloak, an automated virtual machine generation
and cloaking tool for Cuckoo Sandbox (Bremer 2015).
As discussed in the previous section, the chosen feature representation method is the
combining matrix that includes successful APIs, failed APIs and their return codes.
This data is extracted from the reports generated by the sandbox.
A threshold for the minimum number of API calls can be specified in the algorithm, e.g. all reports that
triggered fewer than five API calls can be skipped. The file includes the timestamp of
the extraction, and the logs outlining the successful and unsuccessful operations are
stored in a separate file.
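Assuming the usual Cuckoo report layout (behavior → processes → calls; the structure used here is an assumption about the report format), the minimum-API-call threshold could be checked like this:

```python
def enough_calls(report, min_calls=5):
    """Return True when a Cuckoo behavioural report triggered at least
    `min_calls` API calls across all monitored processes."""
    calls = sum(len(p.get("calls", []))
                for p in report.get("behavior", {}).get("processes", []))
    return calls >= min_calls

# A toy report with one process that made three API calls:
report = {"behavior": {"processes": [{"calls": [{}, {}, {}]}]}}
```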
As described in the previous chapter, feature selection is used for removing redundant
and irrelevant features to improve the accuracy of the prediction. In our case, the
feature set is extremely large, and the need for feature selection is, therefore, high.
The R language will be used for performing the feature selection and applying the
machine learning methods. R is a free software environment for statistical computing
and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows,
and MacOS. (Venables and Smith 2016).
A good and simple algorithm for feature selection in classification problems is the
Boruta package. Roughly speaking, it is a wrapper method that works around the
Random Forest algorithm. Its algorithm can be described as follows (Kursa and
Rudnicki 2010):
1. Create shuffled copies of all features (to add more randomness). These are
referred to as shadow copies.
2. Train a Random Forest classifier on the new dataset and apply a feature
importance measure in the form of the Mean Decrease Accuracy algorithm.
The importance of each feature is measured at this stage, and the weights are
assigned.
3. On each iteration check if the feature from the initial feature set has a higher
weight than the highest weight of this feature’s shadow copy. Remove the
features that are ranked as unimportant at each iteration.
Unlike other feature selection methods, Boruta allows identifying all features that are
somehow relevant to the result. Other methods, in turn, rely on a small feature subset
that results in the minimal error. (Kursa and Rudnicki 2010).
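The three steps above can be sketched in Python. Two hedges apply: real Boruta repeats this over many iterations with a statistical test, and it uses mean-decrease-accuracy importance, whereas this single-iteration sketch uses scikit-learn's impurity-based `feature_importances_` as a stand-in:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_select(X, y, seed=0):
    """One Boruta-style iteration: append a shuffled 'shadow' copy of
    every feature, fit a Random Forest, and keep the real features whose
    importance beats the best shadow importance."""
    rng = np.random.default_rng(seed)
    shadows = rng.permuted(X, axis=0)  # shuffle each column independently
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(np.hstack([X, shadows]), y)
    imp = rf.feature_importances_
    n = X.shape[1]
    threshold = imp[n:].max()          # importance of the best shadow
    return [j for j in range(n) if imp[j] > threshold]

# Synthetic demo: feature 0 carries the class signal, feature 1 is noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([np.where(y == 1, 1.0, -1.0) + rng.normal(0, 0.1, 100),
                     rng.normal(0, 1, 100)])
kept = shadow_select(X, y)
```

Shuffling destroys any relationship with the labels, so a shadow's importance estimates the importance a feature can reach by chance alone.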
The problem arises when we start implementing the feature selection. With 70,518
features, the Boruta package exhausts the available memory and is not able to run.
Therefore, we need to divide the dataset randomly into subsets that fit into memory
and run the feature selection on each of them. Then we collect all the features that
were ranked as relevant, merge the subsets, and leave out all the unimportant
features. The next step is to run the feature selection again on the whole dataset.
After running the feature selection algorithm, we ended up with 306 features. The
effect of this change was evaluated based on the KNN accuracy with the given
feature set. KNN was chosen for this check, as it is the only algorithm that can
process the whole feature set: it stores nothing other than the dataset itself and does
not build a model, unlike the other algorithms. After removing the irrelevant
features, the detection accuracy based on KNN improved by approximately 1%, and
a prediction took approximately three seconds.
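The divide-and-merge workaround described above can be sketched generically; here `select` stands for any feature-selection routine (Boruta in the thesis) that returns the indices of the columns to keep:

```python
def chunked_select(X, y, select, chunk_size):
    """Run a feature-selection routine on column chunks that fit into
    memory, then merge the surviving columns and select once more."""
    n_features = len(X[0])
    survivors = []
    for start in range(0, n_features, chunk_size):
        cols = list(range(start, min(start + chunk_size, n_features)))
        sub = [[row[j] for j in cols] for row in X]
        # Map chunk-local kept indices back to global column indices.
        survivors += [cols[j] for j in select(sub, y)]
    # Final selection pass over the merged survivors.
    merged = [[row[j] for j in survivors] for row in X]
    return [survivors[j] for j in select(merged, y)]

# Toy selector: keep any column that is not constant.
def keep_nonconstant(X, y):
    return [j for j in range(len(X[0])) if len({row[j] for row in X}) > 1]

X = [[0, 1, 0, 2], [0, 3, 0, 2], [0, 5, 0, 2]]
y = [0, 1, 0]
kept = chunked_select(X, y, keep_nonconstant, chunk_size=2)
```

Note that chunking is only an approximation: a feature that looks irrelevant within its chunk is discarded before the final pass ever sees it.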
Once the features are extracted and selected, we can apply the machine learning
methods to the obtained data. As discussed previously, the methods to be applied
are K-Nearest Neighbors, Support Vector Machines, the J48 Decision Tree, Naive
Bayes, and Random Forest. The general process is outlined in Figure
22.
This chapter discusses the results of the assessment of the implemented machine
learning methods. The detection accuracy is measured as the percentage of
correctly identified instances:

Accuracy = (correctly identified instances / total instances) * 100%
The results of the K-Nearest Neighbors method can be inferred from the cross-table in Figure 23.
The results outlined there should be read as follows: rows represent the actual
classes of the tested samples, while columns represent the predicted values. Therefore,
the cell in the 1st row and 1st column shows the number of correctly classified
instances of the 1st class. The cell in the 1st row and 2nd column shows the number
of 1st-class instances that were marked as 2nd class, and so on.
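That reading of the cross-table translates directly into overall and per-class accuracy; a small sketch with a hypothetical 3-class table:

```python
def accuracies(table):
    """Cross-table convention: rows = actual class, columns = predicted.
    Diagonal cells are correct predictions."""
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    overall = correct / total
    per_class = [row[i] / sum(row) for i, row in enumerate(table)]
    return overall, per_class

table = [[8, 1, 1],   # hypothetical counts, not the thesis's data
         [0, 9, 1],
         [2, 0, 8]]
overall, per_class = accuracies(table)
```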
As can be seen, the test set consists of 371 samples, and 1 sample had an error,
resulting in a "0" class. The classification accuracy can be seen in Table 3.
The total accuracy of the K-Nearest Neighbors depends on the k value. In our case,
different values were tested. They produced the following accuracy:
k=1: 87%
k=2: 84.63%
k=3: 81.3%
k=4: 80%
k=5: 80%
k=6: 80%
k=10: 77.8%
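A sweep like this can be reproduced in a few lines; the synthetic dataset below is a stand-in for the thesis's 306 selected API features (scikit-learn assumed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data roughly matching the test-set size used in the thesis.
X, y = make_classification(n_samples=371, n_features=30, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
# 2/3 of the data for training, as in the thesis.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=2 / 3,
                                          random_state=0)

results = {}
for k in (1, 2, 3, 4, 5, 6, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = knn.score(X_te, y_te)   # test-set accuracy for this k
best_k = max(results, key=results.get)
```

On the real data the thesis reports the best accuracy at k=1; on this synthetic stand-in the best k may of course differ.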
As can be seen, the best accuracy, 87%, was achieved with k=1. This is an unusual
case: the best accuracy being achieved with k=1 can be a sign that either the training
and test sets contain identical samples, or the class boundaries are very clearly
separated. In our case, the training and test sets were selected randomly from the
dataset with a 2/3 ratio, so the data cannot be the same. The most probable
explanation is therefore that the classes are distributed with very clear boundaries
from the point of view of the KNN algorithm.
Two-class classification into malware and benign files was also performed. The
resulting cross-table can be seen in Figure 24. In the table, class 1 represents
the benign files, while class 2 represents malicious files. Again, predictions were made
with different k values:
k=1: 94.6%
k=2: 94.3%
k=3: 93.5%
k=5: 93.5%
k=7: 92.7%
The best accuracy, 94.6%, was achieved with k=1. The detailed accuracy can be
found in Tables 4.1 and 4.2.
Overall, the KNN algorithm achieved a good accuracy of 87% for multi-class
classification and 94.6% for two-class classification. We can conclude that the
algorithm provided good results. Classes are distributed evenly in the multi-class
case, which also contributed to the good accuracy of the predictions. Even though
the distribution is uneven in the two-class case (310 vs. 61), the results are still
accurate.
The next algorithm tested was Support Vector Machines. The results of the
predictions are outlined in Figure 25. The overall accuracy achieved was 87.6% for
multi-class classification and 94.6% for binary classification.
The detailed information about the accuracy of each class can be found in Table 5.

Class  Family       Correctly    Incorrectly   Accuracy   Average
                    classified   classified               Cuckoo score
1      Benign       56           5             91.8%      1.04
2      Dridex       32           5             86.5%      5.26
3      Locky        21           6             77.8%      6.41
4      TeslaCrypt   37           7             84%        6.27
5      Vawtrak      10           8             55.6%      2.66
6      Zeus         31           9             77.5%      6.46
7      DarkComet    48           1             98%        5.15
8      CyberGate    37           1             97.4%      6.57
9      Xtreme       31           3             91.2%      5.15
10     CTB-Locker   22           0             100%       4.76
Figure 26 outlines the cross-table for binary classification. The detailed information
about binary classification can be found in Tables 6.1 and 6.2. As we can see, the
number of correctly identified benign instances (true negatives) was 41; correctly
identified malicious instances (true positives), 310; incorrectly identified benign
instances (false positives), 20; and incorrectly identified malicious instances (false
negatives), 0.
Overall, the resulting accuracies of 87.6% for multi-class classification and 94.6% for
binary classification are almost equal to the K-Nearest Neighbors results. However,
this algorithm produced 0 false negatives in binary classification, meaning that no
malware samples were identified as benign. Therefore, it can prevent malware
infections more effectively than K-Nearest Neighbors.
The third algorithm tested was the J48 Decision Tree. The advantage of the decision
tree method is that it is a "white box" approach: we can see which decisions led to a
given prediction. The decision trees for multi-class classification and binary
classification can be found in Figures 27 and 28, respectively.
The overall accuracy was 93.3% for multiclass classification and 94.6% for binary
classification. The cross-table outlining the results of multiclass classification can be
found in Figure 29.
For the binary classification problem, the algorithm produced 46 correctly identified
benign samples (true negatives), 305 correctly identified malware samples (true
positives), 15 incorrectly identified benign samples (false positives), and 5
incorrectly classified malware samples (false negatives). The details are presented
in Figure 30 and Tables 8.1 and 8.2.
The overall accuracy of the J48 Decision Tree was good: 93.3% for multiclass
classification and 94.6% for binary classification. For multiclass classification, this
result is considerably better than the ones obtained with K-Nearest Neighbors and
Support Vector Machines. For binary classification, however, the result is the same.
The fourth algorithm tested was Naive Bayes. The resulting accuracy was 72.23%
for multiclass classification and 55% for binary classification. The cross-table
related to the Naive Bayes classification can be found in Figure 31.
The detailed results that outline the accuracy of each of the malware families can be
found in Table 9.
For binary classification, the algorithm performed poorly. The number of correctly
identified benign instances (true negatives) was 61; correctly identified malware
instances (true positives), 143; incorrectly identified benign instances (false
positives), 0; and incorrectly identified malware instances (false negatives), 167.
The detailed results can be found in Figure 32 and in Tables 10.1 and 10.2.
Overall, the Naive Bayes algorithm performed poorly. The multiclass accuracy was
72.23% and the binary accuracy only 55%. This result is unacceptable for real-world
detection. In addition, the number of false negatives, in other words, malware files
incorrectly marked as benign, reached 167, or 45% of the total number of files. In a
real environment, such a result would cause a large malware epidemic in a short
amount of time.
Most likely, this poor accuracy is the result of strong dependence between features.
As we know, the main drawback of the Naive Bayes algorithm is that it treats each
feature independently, although in most cases this assumption does not hold. In our
case, certain APIs most likely depend on each other, i.e. API_n cannot be triggered
without API_m. That is the most probable reason for the poor result of the Naive
Bayes algorithm.
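This double-counting effect is easy to demonstrate on synthetic data: duplicating one informative feature ten times adds no information, yet makes Gaussian Naive Bayes far more confident, because each copy is treated as independent evidence (illustrative sketch, not the thesis's experiment):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
y = (base[:, 0] > 0).astype(int)

# The same informative feature, once vs. duplicated ten times:
X1, X10 = base, np.repeat(base, 10, axis=1)
x = np.array([[0.5]])  # a test point inside class 1's region

p1 = GaussianNB().fit(X1, y).predict_proba(x)[0]
p10 = GaussianNB().fit(X10, y).predict_proba(np.repeat(x, 10, axis=1))[0]
# p10 is far more extreme than p1: the duplicated model multiplies the
# same likelihood ratio ten times, although no information was added.
```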
The last algorithm that was implemented was the Random Forest algorithm. The
algorithm resulted in a good accuracy of predictions, 95.69% for multi-class
classification and 96.8% for binary classification. The cross-table related to the
multiclass predictions can be found in Figure 33.
The detailed information about the performance of each class can be found in Table 11.

Class  Family       Correctly    Incorrectly   Accuracy   Average
                    classified   classified               Cuckoo score
1      Benign       58           3             95%        1.04
2      Dridex       35           2             94.6%      5.26
3      Locky        25           2             92.6%      6.41
4      TeslaCrypt   44           0             100%       6.27
5      Vawtrak      15           3             83.3%      2.66
6      Zeus         35           5             87.5%      6.46
7      DarkComet    49           0             100%       5.15
8      CyberGate    38           0             100%       6.57
9      Xtreme       34           0             100%       5.15
10     CTB-Locker   22           0             100%       4.76
In the binary classification problem, the accuracy reached 96.8%. More
specifically, the number of correctly identified benign instances (true negatives)
reached 52; correctly identified malware instances (true positives), 307; incorrectly
identified benign instances (false positives), 9; and incorrectly identified malware
instances (false negatives), 3. The detailed information can be found in Figure 34
and Tables 12.1 and 12.2.
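These counts translate directly into the standard derived metrics; recomputing from the reported values confirms the 96.8% figure:

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard derived metrics from a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # a.k.a. true positive rate
        "fpr": fp / (fp + tn),      # false positive rate
    }

# The Random Forest binary-classification counts reported above.
m = binary_metrics(tp=307, tn=52, fp=9, fn=3)
# accuracy = 359/371, which rounds to the reported 96.8%
```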
The Random Forest algorithm achieved the highest accuracy among all the tested
algorithms: 95.69% for multiclass classification and 96.8% for binary
classification. However, some false negatives are still present: their number is
three.
6 CONCLUSIONS
Overall, the goals defined for this study were achieved. The desired feature extraction
and representation methods were selected and the selected machine learning
algorithms were applied and evaluated.
The desired feature representation method was selected to be the combined matrix,
outlining the frequency of successful and failed API calls along with the return codes
for them. This was chosen, because it outlines the actual behavior of the file. Unlike
other methods, it combines information about different changes in the system,
including the changes in the registry, mutexes, files, etc.
The result achieved by Random Forest is more accurate than the one achieved by the
sandbox. It is hard to compare the results quantitatively, since the sandbox does not
classify samples as malicious or benign, and classification into malware families is
beyond its functionality as well. Instead, the maliciousness of a file is treated as a
regression problem, and the severity score is its output. However, the difference in
accuracy is easy to see. Table 2, outlined in Chapter 4.2.1, shows that none of the
malware families were labeled with the "red" severity level, and one was labeled as
"green". This result is very inaccurate in comparison to the 95.69% and 96.8%
achieved by Random Forest.
Although the overall accuracy is the main concern, the number of false negatives is
an important factor as well, since they can result in massive infections. Random
Forest, despite its high accuracy, produced 3 false negatives. Support Vector
Machines, in turn, produced 0 false negatives, while its accuracy is lower by only
2%. That is why it is recommended to consider implementing Support Vector
Machines for binary classification.
The study performed in this project was a proof of concept. Therefore, several future
improvements related to its practical implementation can be identified: for example,
extracting the features as they are processed by the sandbox, so that there is no need
to go through the reports again.