
Journal of Information Security and Applications 59 (2021) 102828

Contents lists available at ScienceDirect

Journal of Information Security and Applications


journal homepage: www.elsevier.com/locate/jisa

Malware classification and composition analysis: A survey of recent developments
Adel Abusitta ∗, Miles Q. Li, Benjamin C.M. Fung
McGill University, Montréal, Canada

ARTICLE INFO

Keywords:
Malware analysis
Malware classification
Security
Anti-analysis techniques
Composition analysis

ABSTRACT

Malware detection and classification are becoming more and more challenging, given the complexity of malware design and the recent advancement of communication and computing infrastructure. The existing malware classification approaches enable reverse engineers to better understand malware patterns and categorizations, and to cope with their evolution. Moreover, new composition analysis methods have been proposed to analyze malware samples with the goal of gaining deeper insight into their functionalities and behaviors. This, in turn, helps reverse engineers discern the intent of a malware sample and understand the attackers' objectives. This survey classifies and compares the main findings in malware classification and composition analysis. We also discuss malware evasion techniques and feature extraction methods. In addition, we characterize each reviewed paper on the basis of both the algorithms and the features used, and highlight its strengths and limitations. We furthermore present issues, challenges, and future research directions related to malware analysis.

1. Introduction

In recent years, many cyber-security mechanisms have been designed and developed to defend against evolving security threats. Nevertheless, recent statistics [1] indicate that malware is still evolving and becoming more sophisticated than ever. As a result, it becomes harder to detect malware and understand its inner workings. This mainly stems from two essential reasons. The first is that attackers have now become more proficient in launching attacks and hiding their malicious behavior using anti-analysis techniques such as obfuscation and packing. The second reason is that the current communication and computing infrastructure is becoming more and more dynamic and heterogeneous, which enables a single malware to take various forms that are semantically but not structurally similar. This, in turn, makes malware analysis even more challenging.

Malware (or malicious software) is software that is designed to harm users, organizations, and telecommunication and computer systems. More specifically, malware can block internet connections, corrupt an operating system, steal a user's passwords and other private information, and/or encrypt important documents on a computer and demand ransom. In recent years, malware has been a growing threat to computer users: in 2017, the number of new malware samples increased by 22.9% over 2016 to reach 8,400,058 [2–5]. Moreover, malware has become the primary medium to launch large-scale attacks, such as compromising computers, bringing down hosts and servers, sending out spam emails, crippling critical infrastructures, and penetrating data centers [6–8]. These attacks lead to severe damage and significant financial loss [9–11].

Most antivirus engines detect and classify malware by continuously scanning files and comparing their signatures with known malware signatures. The malware signatures are typically created by human antivirus experts (known as malware defenders) who examine the collected malware samples. These malware signatures can be filenames, text strings, or regular expressions of byte code [12,13]. Obviously, signature-based methods can only detect traditional malware that does not change significantly. However, malware can hide its malicious behavior using anti-analysis techniques such as obfuscation, packing, polymorphism, and metamorphism, in such a way that the code would look quite different from its original version. Thus, the primary shortcoming of signature-based methods is that they entail high precision but low recall. Also, the process of creating malware signatures is labor-intensive. Considering the large number of new malware samples that appear every day, there is a pressing need to develop new intelligent malware analysis methods to tackle these challenges.

To alleviate the burden of manual signature crafting, researchers have proposed automatic signature generation methods [14,15]. The content of the signatures can be Windows system call combinations [16], control flow graphs [15], and functions [14].

Researchers have also proposed machine learning models to detect and classify malware [12,17–27]. Different from other machine

∗ Corresponding author.
E-mail addresses: [email protected] (A. Abusitta), [email protected] (M.Q. Li), [email protected] (B.C.M. Fung).

https://doi.org/10.1016/j.jisa.2021.102828

Available online 26 April 2021


2214-2126/© 2021 Elsevier Ltd. All rights reserved.
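The signature-based scanning described in the introduction can be sketched as follows; a toy illustration in which the byte patterns and family names are invented for the example, not drawn from any real antivirus database:

```python
import re

# Hypothetical byte-pattern signatures: family name -> regex over raw bytes.
# Real AV engines use far larger databases and richer signature formats.
SIGNATURES = {
    "Trojan.Example.A": re.compile(rb"\x4d\x5a.{8}EVILPAYLOAD", re.DOTALL),
    "Worm.Example.B": re.compile(rb"CreateRemoteThread.*WriteProcessMemory", re.DOTALL),
}

def scan(data: bytes) -> list[str]:
    """Return the names of all signatures that match the file content."""
    return [name for name, pat in SIGNATURES.items() if pat.search(data)]

sample = b"\x4d\x5a" + b"\x00" * 8 + b"EVILPAYLOAD trailing bytes"
print(scan(sample))  # -> ['Trojan.Example.A']
```

As the survey notes, such matching only catches files that preserve the signed byte pattern; any obfuscation that perturbs those bytes defeats it.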
A. Abusitta et al. Journal of Information Security and Applications 59 (2021) 102828

learning-driven classification tasks, such as image classification, there is a competition between malware creators and defenders. When malware defenders propose a new malware analysis system using some features and machine learning models, malware creators often update their malware design to avoid being detected. Then malware defenders would propose new systems to detect and analyze the new generation of malware, and so forth. The race between malware defenders and attackers may never come to an end.

Recently, many researchers have started to use deep learning models to enhance the detection and classification accuracy of malware classifiers [24–27]. Although promising results have been achieved through the ability to extract robust and useful features using state-of-the-art deep learning architectures, the proposed models were shown to be highly vulnerable to adversarial examples, which can be easily designed (simply by perturbing parts of the inputs) by attackers to fool Artificial Intelligence (AI)-driven malware analysis systems and make them generate erroneous decisions [24–29]. As a result, several methods have been proposed to defend against adversarial examples [28,29].

In addition to malware classification, researchers in malware analysis have developed new techniques and methods to analyze the composition of malware samples by matching their functionalities and behaviors to multiple known malware families. This, in turn, helps reverse engineers discern the intent of a malware sample and the attacker. Moreover, these composition methods enable reverse engineers and organizations to effectively triage their resources.

1.1. The scope

This literature review classifies and compares the recent and main findings in malware classification. Unlike other similar works which focus either on AI-driven malware classification [30–32] or on non-AI-driven malware classification [33,34], this paper includes both AI-driven and non-AI-driven recent works. We also survey methods and approaches that have recently been proposed to analyze the composition of malware samples, in order to understand their functionalities and behaviors. To the best of our knowledge, this is the first work that surveys the existing composition analysis techniques. This survey also aims at identifying the main issues and challenges related to recent malware classification and composition analysis techniques. In particular, our analysis leads us to recognize three major problems to address. The first is the need to overcome modern evasion techniques (or anti-analysis techniques) such as metamorphism. The second relates to the efficiency and scalability of malware search engines, as the number of functions in the repository might need to scale up to millions. The third concerns the vulnerability of malware classification systems to evolving adversarial examples. We also uncover possible topics that need further study and investigation, such as sustainable malware analysis systems. In this regard, we propose a few guidelines for building efficient and trustworthy malware detection and analysis systems.

1.2. Contribution

The main contributions of this survey are:

• Proposing a new taxonomy for describing and comparing the recent and main findings in malware classification and composition analysis.
• Designing a new framework for analyzing the existing malware classification and composition analysis techniques.
• Identifying and presenting open issues and challenges related to malware analysis.
• Identifying a number of trends on the topic, with guidelines on how to improve existing solutions to address new and continuing challenges.

1.3. Organization

The rest of this paper is organized as follows. In Section 2, we discuss the related survey papers. In Sections 3 and 4, we present the proposed taxonomy for organizing the reviewed malware classification and composition analysis approaches, respectively. Section 5 characterizes the reviewed papers according to the proposed taxonomy. The challenges and current issues are pointed out in Section 6. Section 7 suggests possible research topics in malware analysis. Finally, Section 8 concludes the paper.

2. Related surveys

Other works have already surveyed contributions in malware classification. For example, Bazrafshan et al. [33] classify malware detection and classification methods into three types: signature-based, behavior-based, and heuristic-based methods. They also recognize five classes of features used by heuristic-based methods: opcodes, API calls, control flow graphs, n-grams, and hybrid features. Another work, presented by Shabtai et al. [34], studies how to detect malware using static features. In this paper, we study more features (both static and dynamic) used for malware classification.

Ucci et al. [30] survey the literature on machine learning approaches for malware detection and analysis. They classify the surveyed articles into three categories: objectives (expected output), features, and algorithms used. They also highlight a set of problems and challenges and identify new research directions. Similarly, the survey presented by [31] offers a comparative analysis of intelligence-based malware classification. In particular, they report the pros, cons, and problems associated with each machine learning-based malware classification technique. Souri and Hosseini [32] also provide a taxonomy of AI-driven malware detection techniques. Our paper looks at a larger range of articles by including many works on malware classification and composition analysis. We also include other works related to non-AI-driven classification techniques. Furthermore, we include new challenges related to AI-driven malware classification techniques.

Also, Basu et al. [35] study different works relying on AI-powered malware classification techniques. In particular, they identify five types of features: API call graphs, byte sequences, PE header and sections, assembly code frequency, and system calls. Also, Ye et al. [36] study many different aspects of malware classification processes. More specifically, they shed light on a number of issues such as incremental learning and adversarial learning. Recently, Ori et al. [37] surveyed the literature on techniques used for dynamic malware analysis, including a description of each technique. In particular, they present an overview of machine learning methods used to improve the capability of dynamic malware analysis. Compared to the above-mentioned works, this paper determines the main issues and challenges in malware classification and composition analysis. Also, we identify a number of trends on the topic, with guidelines on how to improve solutions to address new and continuing challenges.

In addition, Barriga and Yoo [38] survey the literature on malware evasion techniques and their impact on malware analysis techniques. This paper extends beyond that and includes recent AI-driven works used to overcome malware evasion techniques.

3. Taxonomy of malware classification

We present in this section the taxonomy of malware classification. We define two categories (or dimensions) to organize the existing works. The first category covers the features that the reviewed works are based on. In particular, we discuss the different methodologies used for extracting features, e.g., dynamic and/or static techniques, and what types of features are used, e.g., assembly code. The second is concerned with the type of algorithm that is adopted for the detection and analysis, e.g., artificial intelligence-driven algorithms.
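The static dimension of this taxonomy can be made concrete with a minimal sketch of a format-agnostic static feature extractor (file size and byte entropy are features surveyed later in this paper; the 7.0 "packed" threshold is an illustrative heuristic, not a value from any reviewed work):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: near 8.0 for encrypted/packed data, lower for plain code."""
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def static_features(data: bytes) -> dict:
    """Format-agnostic static features, extracted without executing the sample."""
    entropy = shannon_entropy(data)
    return {
        "file_size": len(data),
        "byte_entropy": entropy,
        "looks_packed": entropy > 7.0,  # illustrative threshold, not from the survey
    }

print(static_features(b"\x00" * 1024))  # constant data: entropy 0, looks_packed False
```

A dynamic extractor, by contrast, would run the sample in a VM or emulator and record behaviors such as API calls rather than inspecting the raw bytes.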


Fig. 1 shows the proposed taxonomy. The rest of this section is organized as follows (according to the proposed taxonomy). Section 3.1 describes malware analysis features, while Section 3.2 discusses existing algorithms.

3.1. Malware analysis features

This subsection presents the features of samples that are used for the analysis. In Section 3.1.1, we show how features are extracted, while in Section 3.1.2, we show the types of features that are taken into account.

3.1.1. Feature extraction methods
In this section, we review the following three feature extraction methods: static, dynamic, and hybrid.

Static method. Static feature extraction is a method to extract features from the content of the executables without running them [39]. The static features can be extracted using the file format, e.g., Portable Executable (PE) and Common Object File Format (COFF) [12,18,22,25]. The static features can also be extracted without any knowledge of the format. Features extracted this way can be byte sequences, file size, byte entropy, etc. [12,17,20,25]. The advantage of the static feature extraction method is that it covers the complete binary content. The problem is that static features are prone to packing and polymorphism, since most of the features that are statically extracted come from encrypted contents rather than the original program body [40].

Dynamic method. Dynamic feature extraction consists of running the executable, usually in an insulated environment such as a virtual machine (VM) or an emulator, and then extracting features from the memory image of the executable or from its behaviors [39]. Since malware equipped with packing and polymorphism has to expose the real malicious code to achieve its goals, dynamic feature extraction is more resistant to those techniques than the static feature extraction method [40].

Anderson et al. [21,41] use Xen,1 while Royal et al. [42], Dai et al. [19], and Islam et al. [22] use VMWare2 to create their VMs and perform dynamic analysis. Kolosnjaji et al. [27] use Cuckoo sandbox,3 an open source automated malware analysis system, to extract API calls. Other researchers who work on anti-virus engines use VMs as parts of their anti-virus engines to dynamically extract features [24,26].

In fact, there are two categories of emulators: full-system emulators and application-level emulators. A full-system emulator is a computer program that emulates every component of a computer, including its memory, processor, graphics card, hard disk, etc., with the purpose of running an unmodified operating system. QEMU4 is a full-system emulator used by several systems [23,40,43]. Considering how time-consuming full-system emulation is, Cesare and Xiang [15] propose to use application-level emulation to unpack malware more efficiently, so that only the parts which are necessary to execute the file are implemented, including the instruction set, API, virtual memory, thread and process management, and OS-specific structures.

One problem of dynamic feature extraction methods is that they do not reveal all the possible execution paths [40]. Malware may have detection routines to check whether it is executed in a virtual machine or emulator. When malware finds itself executing in such an environment, it will halt its execution, so dynamic models will fail to recognize it as malware. Methods to detect whether an executable is executed inside a VM can be found in several papers [44,45]. Another problem of dynamic methods lies in their execution time, which is much longer than that of static feature extraction [40].

Hybrid method. This method is used to achieve a higher detection rate by merging some of the static feature extraction characteristics with some of the dynamic feature extraction characteristics [39].

Our survey has revealed that most of the surveyed papers were based on the dynamic feature extraction approach [21,24,46–63], while the others adopt, in equal proportions, either the static approach alone [64–83] or a hybrid approach [22,23,41,47,84–86].

3.1.2. Type of features
In this section, we classify the features that are used by malware analysts and explain how each type is practically extracted and represented.

Printable strings. A printable string is a sequence of ASCII characters terminated with a null character. Schultz et al. [12] find that malware samples have some similar strings that distinguish them from goodware, and that goodware also has some common strings that distinguish it from malware. Printable strings are represented as binary features, where ‘‘1’’ represents a string that is present in an executable and ‘‘0’’ represents that it is absent [12,22,24,26].

Schultz et al. [12] extract printable strings from the headers of PE files. The extraction is straightforward since the header is in plain text format.

Dahl et al. [24] and Huang and Stokes [26] extract null-terminated objects dumped from images of a file in memory [24,26] as printable strings. The coverage of their methods is better than just extracting printable strings from the header [12], but there could be some false positive results.

Islam et al. [22] use the strings utility in IDA Pro5 to extract printable strings from the whole file.

Different from other works, Saxe and Berlin [25] do not take printable strings as binary features, but use their hash values and the logarithm of the string lengths to create a histogram, and use the counts of printable strings in each bin of the histogram as features. They take all the byte sequences of length six or more that are in the ASCII code range as printable strings, which is also slightly different from other works.

Essentially, the functionality of most malware does not rely on printable strings. Thus, when malware creators find that some strings are accidentally used by malware detectors, they can eliminate them, or, even if the printable strings are necessary, they can break them into characters that are distributed in different positions. Therefore, printable strings are not reliable features.

Byte sequences (byte code). Executable files consist of byte sequences (also known as byte code). A byte sequence may belong to the metadata, code, or data of an executable file. As has been stated, byte sequences are important signatures of malware, since malware may share some common sequences that are exactly the same or follow the same regular expression. Thus, byte sequences are also appropriate features for malware analysis systems [12,17,25,41].

Schultz et al. [12] use bigram byte sequences in the form of binary features, and they claim the byte sequence feature is the most informative feature because it represents the machine code in an executable. In fact, this is not entirely true, since some byte sequences come from the metadata or data section. Even if a byte sequence is from the code section, since instructions have variable length in some architectures, byte sequences may not match machine code. Moreover, their byte sequence feature has the problem of dimension explosion: there are too many different bigram byte sequences and the set is too large to fit into memory, so they could only split the byte sequence set into several sets and feed them to multiple naive Bayes models.

1 https://www.xenproject.org/.
2 https://www.vmware.com/.
3 https://cuckoosandbox.org/.
4 https://www.qemu.org/.
5 https://www.hex-rays.com/products/ida/.
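The printable-string features discussed above can be sketched as follows; a minimal illustration in which the minimum length of six follows Saxe and Berlin [25], while the vocabulary of candidate strings is invented for the example (in practice it would be learned from a training corpus):

```python
import re

def printable_strings(data: bytes, min_len: int = 6) -> set[bytes]:
    """Maximal runs of printable ASCII of at least min_len characters."""
    return set(re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data))

def binary_features(data: bytes, vocab: list[bytes]) -> list[int]:
    """1 if the vocabulary string occurs in the sample, 0 otherwise."""
    found = printable_strings(data)
    return [1 if s in found else 0 for s in vocab]

# Hypothetical vocabulary; real systems mine these from labeled samples.
vocab = [b"CreateRemoteThread", b"GetProcAddress", b"HELLO WORLD!"]
sample = b"\x00\x01GetProcAddress\x00junk\xffHELLO WORLD!\x00"
print(binary_features(sample, vocab))  # -> [0, 1, 1]
```

As the survey observes, such features are fragile: an attacker can simply remove or split the strings the detector keys on.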


Fig. 1. The proposed taxonomy.

To solve the dimension explosion problem, Kolter and Maloof [17] use information gain to select the top 500 informative 4-gram byte sequences as binary features from 255 million distinct 4-grams.

Different from the above two works, Anderson et al. [41] do not use byte sequences per se as features, but fit byte sequences into a Markov model, so essentially the feature they use is the transition probability from one byte to another.

Chen et al. [25] use the byte entropy of each 1024-byte window and the occurrence of each byte to form a histogram, and evenly separate each axis into 16 bins to form a 256-length feature vector.

Nataraj et al. [20] convert the whole byte sequence of a file into a picture in which each byte represents the gray scale of a pixel. They find that malware samples that belong to the same family appear very similar in layout and texture. The width of the image that is used to transform the 1D byte sequence into a 2D matrix is determined by the size of the file. The image feature of the malware image is computed using the algorithm proposed by Oliva and Torralba [87]. The main advantage of image-based techniques is that they are robust against many types of obfuscation [88].

Byte sequences are not reliable in most cases. This is due to the fact that obfuscation techniques such as instruction substitution and register reassignment can change the opcodes and operands respectively, which means that the machine code is changed. In all these works, the byte code is statically extracted, but a main program body encrypted with different algorithms or keys through packing and polymorphism will change the byte sequences.

Assembly code. Machine code and assembly code can be translated to one another through assembly and disassembly. Assembly code has some advantages over machine code as a feature for malware analysis. First, assembly code can be understood by a programmer, and therefore, as a kind of feature, assembly code is more convenient to preprocess (e.g., grouped into categories according to function, filtered, truncated, etc.) so that it becomes a more informative feature. In addition, malicious code is often encrypted by packing or polymorphism, so it is impossible to get it from the original byte sequence; however, dynamically extracted assembly code has been decrypted, so it includes the malicious code.

Moskovitch et al. [18] propose that assembly code can be more robust than machine code for the analysis of malware, since the same malicious engine may be located at different locations in a file, and thus may be linked to different addresses in RAM, or even perturbed slightly; so by dropping the operands and just using opcodes, the robustness is improved. They extract assembly code by disassembling the executables with IDA Pro. They try both the term frequency (TF) and the term frequency–inverse document frequency (TF-IDF) of each opcode n-gram (n = 1, 2, …, 6) as features, and use document frequency (DF), information gain ratio, or Fisher score to select features. Their best result is achieved using TF values of opcode bigrams as features filtered by Fisher score. One disadvantage of their method is that it is still prone to dead code insertion, operation transposition, packing, and polymorphism. Another is that dropping operands causes a loss of information, which may subsequently lead to a loss of precision.
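The opcode n-gram features used by Moskovitch et al. [18] and related works can be sketched as follows; a minimal illustration in which the opcode trace and the vocabulary are invented (a real pipeline would obtain the trace from a disassembler such as IDA Pro and select the vocabulary with DF, information gain ratio, or Fisher score):

```python
from collections import Counter

def opcode_ngrams(opcodes, n=2):
    """Sliding window of n consecutive opcodes."""
    return zip(*(opcodes[i:] for i in range(n)))

def tf_vector(opcodes, vocab, n=2):
    """Term frequency of each vocabulary n-gram in the opcode trace."""
    counts = Counter(opcode_ngrams(opcodes, n))
    total = max(sum(counts.values()), 1)
    return [counts[g] / total for g in vocab]

# Hypothetical opcode trace and bigram vocabulary, for illustration only.
trace = ["push", "mov", "call", "push", "mov", "xor", "ret"]
vocab = [("push", "mov"), ("mov", "call"), ("xor", "ret")]
print(tf_vector(trace, vocab))  # ("push", "mov") occurs twice among six bigrams
```

Dropping operands, as in the code above, trades precision for robustness to address relocation, exactly the trade-off the paragraph describes.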


To counter packing and polymorphism, Dai et al. [19] run malware in a VM and record the sequence of the running byte code, which is then disassembled into assembly code. They use three kinds of two-opcode combinations: unordered opcodes in a block, ordered but not necessarily consecutive opcodes in a block, and consecutive opcodes in a block. This way, their features are more resistant to dead code insertion and reordering of operations. They use the association between the frequency of a feature in the training dataset and a class as the criterion, and apply a variant of Apriori [89] to select the top L features. Although unordered opcodes and ordered (but not necessarily consecutive) opcodes in a block improve the resistance to dead code insertion and reordering of operations, those features are too flexible, so they also bring more false positives.

Royal et al. [42] is another work aiming to detect code that is hidden and can only be seen dynamically. The way they do it is to store the static code of an executable and check whether each operation executed is within the stored static code area. If it is not, it is a part of the hidden code. They claim that the main malware engine should be in the hidden code if both of them exist, and experimental results also illustrate that the hidden code enhances the accuracy of ClamAV6 and McAfee Antivirus.7

Anderson et al. [21,41] use the transition probability from one opcode to another as features, which is similar to how they use the byte sequence feature. In their first paper [21], they extract assembly code by recording the execution of an executable in a VM, which is similar to the approach of Royal et al. [42]. In their second paper [41], they also use IDA Pro to disassemble the executable, and the assembly code from the two sources is used as two independent feature sets. In addition, in their second paper [41], they also group instructions into categories at several granularities according to the functions of the instructions, to reduce the impact of instruction substitution. In their preliminary experiments, they also find that if they use instructions with operands, the performance is worse [21].

Santos et al. [23] disassemble executables to acquire their assembly code and then use weighted opcode n-gram frequencies as one of their features. The weight is the product of the information gain of all opcodes in the n-gram times the normalized TF of the n-gram.

API/DLL system calls. The DLL files, and the functions of DLL files, used by an executable expose the system services it uses. The native system calls and Windows API calls an executable invokes are shown by the functions of the DLL files it depends on. Therefore, what behaviors it may intend to perform, or would be able to perform, can be inferred.

Schultz et al. [12] extract the DLL files used by an executable, the functions in the DLL files, and the number of functions of each DLL as features from metadata, in order to understand how resources affect an executable's behavior and how heavily each DLL is used. The first two are used as binary features and the third is a real-valued feature.

Bayer et al. [40] and Santos et al. [23] extract calls to Windows API functions dynamically using an emulator. Then, they use those API functions to acquire the actions of an executable during execution, including I/O activity, registry modification activity, process creation/termination activity, network connection activity, self-protection behavior, system information stealing, errors caused by the execution, and interactions with the Windows Service Manager.

Fredrikson et al. [43] also use an emulator to monitor system calls. Then, they use the relations between system calls and their parameters to form a dependency graph, in which nodes are system calls and edges connect system calls sharing some parameter. They define a behavior to be a subgraph of it, and behaviors that can be adopted to distinguish malware from goodware are mined and used to detect malware.

Anderson et al. [41] and Huang and Stokes [26] group the system calls into high-level categories, where each category represents a functionally similar group of system calls, such as painting to the screen or writing to files. Anderson et al. [41] then feed the trace of groups of system calls to a Markov chain, so that they use the transition probability of system calls as the feature. Huang and Stokes [26] use those high-level API call events as binary features.

Islam et al. [22] and Dahl et al. [24] extract Windows API function calls and their parameters by running an executable in a VM. Islam et al. [22] treat Windows API functions and parameters as separate entities and use the occurrence frequency of each entity as their feature. Dahl et al. [24] use combinations of a single system API call and one input parameter, as well as API tri-grams which consist of three consecutive API function calls, as binary features which are subsequently selected using mutual information.

Kolosnjaji et al. [27] use the dynamic malware analysis system Cuckoo sandbox to extract the sequence of the Windows system calls invoked by an executable. They use a one-hot representation of the calls and feed the full ordered sequence of system calls to a sequential deep learning model.

Similar to assembly code, Windows API call sequences can also be obfuscated. For instance, malware authors can make an executable invoke some irrelevant API calls and submerge the API calls they use to fulfill their purpose among them. Thus, this feature is not reliable in most cases.

Control flow graphs. A control flow graph is a directed graph that represents the flow of the program, where nodes are instructions while the edge between two nodes represents the order of execution of the two instructions. A vertex in the graph is a basic block, in the middle of which there are no jump or branch instructions. A directed edge represents a jump in the control flow. Control flow graphs are used as features or signatures to detect malware in several papers [15,41].

Cesare and Xiang [15] state that similar malware usually have similar high-level structured control flows. They find that compressed and encrypted data have relatively high entropy, so they first use the entropy of the byte sequence to detect whether an executable is packed. If so, they use an application-level emulator to extract the hidden code. They again use the entropy of the byte sequence to detect completion of hidden code extraction. Then the memory image of the binary is disassembled using speculative disassembly [90]. Finally, they use the process of structuring to recover high-level structured control flows from the control flow graphs of procedures, and represent them using strings of character tokens. The strings representing control flow graphs are all saved as signatures. An example of the relation between a control flow graph and the signature string is shown in Fig. 2.

Anderson et al. [41] also find that it is largely not easy for a polymorphic virus to build a semantically similar version of itself while changing its control flow graph enough to avoid detection. Therefore, they use control flow graphs as features. More specifically, they use the occurrence frequency of each k-graphlet (a subgraph of k nodes) in the control flow graph to represent the control flow graph.

To counter detection using control flow graphs, malware authors can use control flow flattening and bogus control flow obfuscation techniques to change the control flow without affecting the functionality, so that the effectiveness of the control flow graph feature is harmed [91,92].

Function. Some papers (e.g., Islam et al. [22] and Chen et al. [14]) use function-level features for malware classification.

In particular, Islam et al. [22] find that function length contains statistically useful information for distinguishing between families of malware. After obtaining the assembly code of each executable, they calculate the length of each function by measuring the number of bytes of code, and use the occurrence frequency of each function length as a feature. However, obviously, function length is the least robust feature against

6 http://www.clamav.net/.
7 https://www.mcafee.com/en-us/index.html.

5
A. Abusitta et al. Journal of Information Security and Applications 59 (2021) 102828

Fig. 2. The relationship between a control flow graph, a high level structured graph, and a signature.

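To make this feature concrete, a function-length distribution can be turned into a fixed-size occurrence-frequency vector by binning the lengths. The sketch below is illustrative only; the bin boundaries are assumptions of the sketch, not the binning used in [22].

```python
from bisect import bisect_right

# Hypothetical bin boundaries in bytes (an assumption for this sketch).
BINS = [16, 64, 256, 1024, 4096]

def function_length_vector(function_lengths):
    """Map a list of per-function byte lengths to a normalized
    occurrence-frequency vector with len(BINS)+1 buckets."""
    counts = [0] * (len(BINS) + 1)
    for length in function_lengths:
        counts[bisect_right(BINS, length)] += 1
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]

# Example: a toy executable with five disassembled functions.
vec = function_length_vector([10, 20, 300, 300, 5000])
```

The resulting vector can be concatenated with other features before classification; binning makes the feature length-independent across executables.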
Function length can be arbitrarily increased by inserting dead code, or decreased by splitting a function into multiple functions.

One should note that two functions which are semantically similar to each other are considered to be clones of each other. To this end, Chen et al. [14] assume that files belonging to the same malware family share some functions which are connected through the clone relation. They therefore cluster functions into groups in which any two functions are connected directly or indirectly through the clone relation, and pick one function from each group as an exemplar to serve as a signature. They use NiCad [93] to detect whether two functions are clones of each other. However, using one function to represent a group of functions is problematic: since the same function evolves over generations, the newest version may look quite different from the original one. If the older version is picked as the exemplar, the clone detector may fail to identify some unknown new generation of it. Although their system works on Android APK files, the methodology can be directly applied to classifying executable malware.

Miscellaneous file information. Some miscellaneous file properties can help engineers distinguish malware from Goodware, since their average or majority values are significantly different between the two groups, so those properties are also used as features. They include file size [40,41], exit code [40], time consumption [40], entropy [41,94], packed or not [41], number of static/dynamic instructions [41], and number of vertices/edges in the control flow graph [41]. These features may be helpful but are obviously not informative enough on their own.

Conclusive remarks. The effectiveness of any of the aforementioned features can be somewhat diminished, or the features are not informative enough on their own. Many papers therefore use multiple features [12,22–26,41]. The intuition is that any single feature source can be obfuscated to evade detection, but it is extremely difficult to obfuscate all features simultaneously without hindering the functionality [22,41].

3.2. Malware classification algorithms

The extracted features introduced in the previous section are fed into malware detection/classification systems. These systems can be categorized into signature-based approaches and artificial intelligence-based approaches.

3.2.1. Signature-based approaches
Signature-based detection is the most popular approach used in most antivirus engines. The signatures are created by human malware defenders through examining collected malware samples [12,13]. More specifically, the antivirus engines detect or classify malware by checking whether the files to be analyzed contain malware signatures. The signatures of malware can take many formats, including filenames, text strings, or regular expressions over byte code [12,13]. Signatures can also be hashes of the entire file. One should note that signature-based techniques can only detect malware that originates from known malware and does not change significantly. As a result, attackers can evade these techniques by hiding the malicious behavior of malware using anti-analysis techniques such as packing, obfuscation, polymorphism, and metamorphism (Section 6 provides more details about these techniques), so that the code looks quite different from its original version. The main shortcomings of signature-based methods are that they achieve high precision but low recall, and that signature crafting is labor-intensive.

Some works [14–16] address the problem of manual signature crafting by proposing automatic signature generation techniques. The content of the signatures can be Windows system call combinations, control flow graphs, or functions.

3.2.2. Artificial intelligence-based approaches
This section discusses artificial intelligence-based malware classification approaches. These approaches can be categorized as traditional machine learning models, deep learning models, association mining, graph mining and concept analysis, and signature creation and search methods. The existing artificial intelligence-based approaches can also be classified according to the learning method used: supervised, unsupervised, or semi-supervised.

In a supervised malware classification model [21–25,46,50,54,55,57–65,67,69,71,72,74,76,80–82,85,95–99], the classification algorithm learns on a labeled dataset, which enables the algorithm to evaluate its accuracy on training data. In contrast, an unsupervised malware classification model [47,49,53,62,69,75,83,84,100–102] is given unlabeled data that the algorithm tries to make sense of by extracting patterns without guidance. Semi-supervised malware classification models [68,75,78,103] combine both labeled and unlabeled data.

Traditional machine learning models. The most popular traditional machine learning models used by the surveyed papers are the Naive Bayes classifier (NBC) [50,58,60,63–65,81], rule-based classifier [46,59,64,81,95,96], decision tree (DT) [22,23,50,55,58,60,62,65,72,74,80,82,96], K-nearest neighbors (K-NN) [22,50,60,62,71,72,96,97], Bayesian Network [23,72,85], Neural Network (NN) [24,25], Random Forest (RF) [22,54,58,60,63,67,76,80,98,99], Hidden Markov Models (HMM) [9,104–106], and Support Vector Machine (SVM) [21–23,50,54,57,58,60–63,65,69,71,72,76,81,96]. The papers that use traditional machine learning models normally try multiple machine learning models [12,17–19,22,23].
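As a minimal, self-contained illustration of one of the models listed above, the following sketch implements a Bernoulli Naive Bayes classifier over binary API-call-presence features. The toy feature set, the API names, and the Laplace smoothing constants are assumptions of this sketch, not details of any surveyed system.

```python
import math
from collections import defaultdict

class BernoulliNB:
    """Tiny Naive Bayes for binary feature vectors (1 = API call present)."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.prior = {c: y.count(c) / len(y) for c in self.classes}
        # Estimate P(feature_j = 1 | class) with Laplace smoothing.
        self.p = defaultdict(dict)
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            for j in range(len(X[0])):
                ones = sum(r[j] for r in rows)
                self.p[c][j] = (ones + 1) / (len(rows) + 2)
        return self

    def predict(self, x):
        # Pick the class maximizing log P(class) + sum of log-likelihoods.
        def log_post(c):
            s = math.log(self.prior[c])
            for j, v in enumerate(x):
                pj = self.p[c][j]
                s += math.log(pj if v else 1.0 - pj)
            return s
        return max(self.classes, key=log_post)

# Toy data: columns = [CreateRemoteThread, WriteProcessMemory, MessageBoxA]
X = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = ["malware", "malware", "goodware", "goodware"]
clf = BernoulliNB().fit(X, y)
```

Calling `clf.predict([1, 1, 0])` returns "malware" on this toy data: the conditional independence assumption lets each feature contribute a separate log-likelihood term, which is exactly what makes the prediction explainable.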
Below, we briefly introduce the above-mentioned machine learning models.

Naive Bayes Classifier (NBC). An NBC [107] uses Bayes' theorem to determine the conditional probability of a sample belonging to a class given the input features, which can be formally described by the following equation:

    P(C_i | x) = P(x | C_i) P(C_i) / P(x)    (1)

where x is a sample and P(C_i | x) is the probability that the sample belongs to class i. The NBC is based on the naive conditional independence assumption that all the features are independent of each other given the class the sample belongs to:

    P((x_1, x_2, ..., x_n) | C_j) = P(x_1 | C_j) P(x_2 | C_j) ... P(x_n | C_j)    (2)

where x_k is a feature of x. Although the assumption does not hold in general, the prediction results are good on many occasions, and the result is explainable, in the sense that the contribution of each feature is visible.

Decision Tree (DT). A DT classifier [108] uses a tree structure to represent the classification process. Internal nodes of a DT test the values of features, and edges correspond to a choice on the values of a variable. Leaf nodes represent the final class of the samples that fall into them. The tree structure is constructed based on the informativeness of each feature conditioned on the current choices, measured for example by the information gain ratio or the Gini index. A DT is also an interpretable classifier, as it can be translated into sets of if-then-else rules.

K-Nearest Neighbors (K-NN). A K-NN [109] is an instance-based classifier. The model finds the K nearest neighbors of a given sample under some distance metric (e.g., Euclidean, cosine), and predicts the (weighted) majority vote of the classes of the K nearest neighbors.

Support Vector Machine (SVM). An SVM [110] is a binary classifier which calculates a hyperplane that separates the samples of two classes with the largest margin. An important characteristic of an SVM is that it can utilize the kernel trick to map samples from the original feature space to a high-dimensional (even infinite-dimensional) feature space in order to perform non-linear classification.

Bayesian Network (BN). A BN [111] is a probabilistic graphical model which represents variables as vertices and dependencies as directed edges. The graph is used for inferring the probability of any variable.

Rule-based classifier. A rule-based classifier [112] refers to any classification method that uses IF-THEN rules for prediction. An example of a rule-based classifier is RIPPER [113], which builds a set of rules to classify samples while minimizing the number of misclassified training samples.

Neural Network (NN). An NN [114] is a biologically-inspired programming paradigm that allows a computer to learn from observational data. It consists of a network of functions (i.e., parameters) which enables the computer to learn, and to fine-tune itself, by analyzing new data.

Random Forest (RF). An RF classifier [115] constructs a set of DTs from randomly selected subsets of the training set. The votes are then aggregated over the trees in order to decide the final class of a test sample.

Deep learning models. Deep learning models allow us to automatically abstract and extract robust and useful features for efficient and reliable malware classification. This is done using multiple layers of abstraction to learn a "good" representation of the data [116]. Examples of deep learning models are the autoencoder [117], the stacked denoising autoencoder [116], and the restricted Boltzmann machine (RBM) [118].

Dahl et al. [24] apply their 179,000 binary features to a deep learning model. The first layer is a random projection layer which maps the input features to a much lower-dimensional space (4000 dimensions). The difference between the random projection layer and a normal fully connected layer is that the weights of the projection matrix are not updated; its entries are sampled independently and identically distributed over {-1, 0, 1}. On top of that, they apply 1 to 3 fully connected layers with sigmoid activation functions and a 136-way softmax layer as output. They also try using a Gaussian–Bernoulli restricted Boltzmann machine (RBM) to pre-train the hidden layers. The best result, a 9.53% test error rate, is achieved by the model with one hidden layer and no pre-training. They also find that the random projection performs better than Principal Component Analysis (PCA).

Saxe and Berlin [25] propose a deep feed-forward neural network consisting of four fully connected layers, where the dimensions of the first three layers are 1024, followed by a dense layer to produce the output. They apply dropout to the first three layers. The activation functions of the first two layers are parametric rectified linear units (PReLU), to yield an improved convergence rate without loss of performance, and the activation function of the third layer is sigmoid. They also use Bayesian calibration to calculate the unbiased probability that an executable is malware. They achieve a detection rate of 95% and a false positive rate of 0.1% on a dataset of 431,926 samples.

Huang and Stokes [26] propose a neural network for multi-task training. One task is malware detection, i.e., predicting whether an unknown piece of software is malicious or benign, and the other is predicting whether it belongs to one of 98 important malware families. Huang and Stokes [26] also use a random projection layer to reduce the dimension from 50,000 to 4,000, and then normalize each of the 4,000 dimensions to zero mean and unit variance. They then use 4 hidden layers with dropout and ReLU activations. On top of these are two single layers, one for each of the two classification tasks. The final loss function is a weighted sum of the individual loss functions. Experiment results show that multi-task learning only improves the performance of malware detection and harms the performance of malware classification in most experimental settings. Specifically, the best result for malware detection is a 0.3577% test error, obtained with two hidden layers and multi-task learning, and the best result for malware classification is a 2.935% test error, obtained with one hidden layer and either single-task or multi-task learning.

Kolosnjaji et al. [27] propose a combination of convolutional neural network (CNN) and Long Short-Term Memory (LSTM) networks to predict the family of an executable using the dynamically extracted system call sequence. They first use two convolution layers to capture the correlation between consecutive API calls and then apply max-pooling to reduce the dimensionality. The output sequence is fed to an LSTM layer to model the sequential dependencies of the API calls. Then a mean-pooling layer is used to extract important features from the LSTM output. They also use dropout to prevent overfitting and a softmax layer to output the probability of each class. Their proposed deep learning model significantly outperforms feed-forward neural networks, CNNs, SVMs, and Hidden Markov Models, achieving 85.6% precision and 89.4% recall. The advantage of their model is that it can fully utilize the order of system calls, which may also be a drawback if the system call sequence is obfuscated. One questionable choice in their model is the use of mean-pooling rather than max-pooling, since mean-pooling does not extract the features of highest importance produced by the LSTM.

Associative classifier. An associative classifier performs classification by relying on association rules that can distinguish samples between two classes. It is a special case of association rule mining where only the class of a sample can be the consequent (a.k.a. right-hand side) of a rule. Ye et al. [16] propose to use hierarchical associative classifiers (HAC) to classify executables based on API calls. Three techniques are involved in the creation of an associative classifier: (1) adopt the FP-Growth algorithm to find candidate association rules (i.e., combinations of API calls); (2) prune the candidate rules based on the χ² measure, data coverage, pessimistic error estimation, and significance with respect to their ancestors; (3) reorder the rules: first rank the rules whose confidence is 100% by the confidence support size of antecedent (CSA), and then reorder the remaining rules by the χ² measure. Using those three techniques, they create a 2-level associative classifier to detect malware from a gray list labeled by a signature-based anti-virus engine.
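The candidate-rule generation step, i.e., mining frequent combinations of API calls, can be illustrated with any frequent-itemset miner. The sketch below uses a simple Apriori-style levelwise search rather than the FP-Growth algorithm used in [16], and the transactions and support threshold are invented for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing them) is at least min_support, via levelwise search."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    result = {}
    k, candidates = 1, [frozenset([i]) for i in items]
    while candidates:
        frequent = []
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / len(transactions)
            if support >= min_support:
                frequent.append(c)
                result[c] = support
        # Join step: build (k+1)-item candidates from frequent k-itemsets.
        k += 1
        candidates = list({a | b for a, b in combinations(frequent, 2)
                           if len(a | b) == k})
    return result

# Toy API-call transactions, one set per executable (invented names).
calls = [{"OpenProcess", "WriteProcessMemory", "CreateRemoteThread"},
         {"OpenProcess", "WriteProcessMemory"},
         {"RegOpenKey", "RegSetValue"},
         {"OpenProcess", "WriteProcessMemory", "CreateRemoteThread"}]
fi = frequent_itemsets(calls, min_support=0.5)
```

Each frequent itemset whose class-conditional confidence is high enough could then become the antecedent of a candidate rule, to be pruned and reordered as described above.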
The first-level associative classifier aims for higher recall of malware. It only keeps the rules of Goodware with 100% confidence and the rules of malware with confidence greater than a pre-defined threshold; it then uses the rule-pruning technique to reduce the number of generated rules and create the classifier; finally, it uses the "Best First Rule" technique to find samples from the gray list. The samples labeled as malware by the first-level associative classifier are fed to the second-level associative classifier, which aims at optimizing precision. It works with the following steps: select those samples whose malware prediction rules have 100% confidence, marking them as "confident" malware; rank the remaining minority-class files in descending order based on their prediction rules' χ² values; select the first k files from the remaining ranked list and mark them as "candidate" malware; mark the remaining files as "deep gray" files. Experiment results show the proposed HAC is effective. In addition, HAC is an interpretable classifier which can be easily represented as simple if-then rules.

Graph mining and concept analysis. Fredrikson et al. [43] extract behaviors (dependency graphs of system calls and their parameters) that can distinguish malware from Goodware using structural leap mining [119]. Then they use the behaviors to form discriminative specifications. A specification is a set of behaviors together with a characteristic function that describes one or more subsets of the set. A piece of software matches a specification if it matches all of the behaviors in at least one characteristic subset. A specification is entirely discriminative if it matches malicious software but does not match benign software. They use formal concept analysis [120] and the Simulated Annealing algorithm [121] to find an approximately optimal specification, i.e., one whose true positive rate is larger than a threshold and whose false positive rate is the lowest among all specifications exceeding that true positive rate. During testing, if a program matches a specification, it is classified as malware. The created specifications can be used in the detection of unseen malware with an 86% true positive rate and 0 false positives on a dataset of 961 samples.

Signature search methods. Cesare and Xiang [15] first convert the control flow graph of each procedure in an unknown executable to a character string, in the same way they create signatures. Each procedure is assigned a weight using the length of its string:

    weight_x = len(s_x) / Σ_i len(s_i)    (3)

Then they use BK-trees to retrieve the strings in the signature database whose Levenshtein distance to the strings representing the procedures of the target file is below a threshold. For a particular malware, once a matching graph is found, that graph is ignored for subsequent searches over the remaining graphs in the input binary. If a graph has multiple matches in a particular malware and it is uncertain which procedure should be selected as a match, the greedy solution is taken: the graph with the largest weight is selected. For each malware that has matching signatures, the similarity ratios of those signatures:

    w_ed = 1 − ed(x, y) / max(len(x), len(y))    (4)

are accumulated in proportion to the weights of the procedures. The final similarity between the unknown executable and a malware in the database is the product of two asymmetric similarities: one that identifies how much of the input binary is approximately found in the database malware, and one that shows how much of the database malware is approximately found in the input binary. If the similarity of the examined program to any malware in the database equals or exceeds a threshold of 0.6, it is deemed to be a variant. Experiment results show that their method achieves an 86% detection rate with 0 false positives, which is better than the 55% of a commercial signature-based antivirus (AV) and the 62–64% of behavior-based AVs. Since they use a symmetric similarity calculated as the product of two asymmetric similarities, the method cannot handle asymmetric situations. For instance, if a very large unknown executable contains the whole program of a malware sample in the database but that malicious program only takes up 1% of its content, the similarity would still be small and the executable cannot be predicted to be malware.

Chen et al. [14] use NiCad [93] to detect whether an APK file contains any function that is a clone of an exemplar function representing the signature of a malware family. If a match is found, the file is predicted to be an instance of that malware family. They achieve 96.88% accuracy on a dataset of 1170 APK files from 19 malware families.

4. Taxonomy of composition analysis techniques

This section introduces the taxonomy of malware composition analysis techniques. We identify two major dimensions along which the surveyed papers can be conveniently organized. The first one captures the steps used for composition analysis. The second dimension identifies the objective (i.e., strategy) of the analysis. Fig. 3 shows a graphical representation of the proposed taxonomy.

Fig. 3. The proposed taxonomy.

4.1. Steps

Composition analysis allows reverse engineers to analyze the composition of malware samples in order to understand their functionalities and behaviors. This, in turn, allows engineers to discern the intent of malware samples and the attackers. Moreover, it allows reverse engineers to rank malware by severity and to effectively triage their resources.

Basically, there are three main steps used for composition analysis: disassembling, representation, and classification.
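Before moving on, the string-similarity machinery of the signature search method in the previous section (Eqs. (3) and (4)) can be sketched as follows. The pairing of procedures with signatures is simplified here (the actual method retrieves candidates with BK-trees and handles multiple matches greedily), and the toy signature strings are invented for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance ed(a, b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def weighted_similarity(procedures, signatures):
    """Accumulate the Eq. (4) ratios weighted by the Eq. (3) weights,
    assuming a one-to-one pairing of procedure and signature strings."""
    total_len = sum(len(s) for s in procedures)
    score = 0.0
    for p, s in zip(procedures, signatures):
        weight = len(p) / total_len                           # Eq. (3)
        w_ed = 1 - levenshtein(p, s) / max(len(p), len(s))    # Eq. (4)
        score += weight * w_ed
    return score

# Toy control-flow signature strings (invented).
print(weighted_similarity(["ABAB", "CDCD"], ["ABAB", "CDCE"]))  # -> 0.875
```

A score at or above the 0.6 threshold mentioned above would flag the executable as a variant of the matched malware.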
4.1.1. Disassembling
Most software programs are delivered to users as compiled executables rather than source code. Disassemblers make it feasible for reverse engineers to analyze software programs without source code. Technically speaking, disassembly is the process of converting or translating machine language into assembly language. The inverse of a "disassembler" is an "assembler". There are many tools used for this purpose (e.g., IDA Pro⁸).

⁸ https://www.hex-rays.com/products/ida/.

Disassembly methods can be categorized into two classes: static techniques and dynamic techniques. Methods in the first class analyze the binary components statically, parsing the opcodes in the binary file. Methods in the second class monitor the execution traces of a program in order to identify the instructions and recover a disassembled version of the binary.

Both dynamic and static methods have pros and cons. Static analysis takes the whole program into consideration, while dynamic analysis can only focus on the executed instructions. As a result, it is not easy to ensure that the entire executable was visited when adopting dynamic analysis. However, dynamic analysis guarantees that the output (i.e., the disassembly output) only contains actual instructions.

Generally speaking, there are two approaches for static disassembly. The first approach is called linear sweep [122]. This approach begins at the first byte of the binary and decodes one instruction after another. The main shortcoming of linear sweep disassemblers is the high probability of errors resulting from data embedded in the program. The second approach is called recursive traversal [123], which allows engineers to fix the problem of "embedded data" by following the Control Flow (CF) of the program [15,41]. However, the problem with this approach is that it could fail to successfully analyze parts (i.e., functions) of the code. This is due to the fact that the target of a control transfer instruction (e.g., a jump) cannot always be determined statically. This problem can be addressed by using a linear sweep algorithm to analyze the unreachable regions of the code [124].

4.1.2. Representation learning
The success of any malware classification and composition analysis technique generally depends on the data representation. Although specific domain knowledge may help engineers design representations and a feature vector for an executable, a manual feature engineering process fails to consider the relationships between features and to define the unique patterns that can distinguish executables.

Indeed, representation learning is a set of methods and/or techniques that enables a system to automatically extract, from raw data (i.e., assembly code), the representation needed for malware classification. This process replaces manual feature engineering and enables a malware classification system to learn useful features and integrate them to perform classification.

The motivation behind using feature learning is the fact that composition analysis methods often need inputs that are robust against anti-analysis techniques such as obfuscation and packing.

Deep learning approaches (e.g., stacked autoencoders [125], stacked denoising autoencoders [116], deep belief networks [126]) are considered among the best approaches for extracting robust features, which are used for building robust malware classification and similarity analysis tools for large-scale heterogeneous environments.

4.1.3. Classification
After disassembling the executable samples, the assembly code functions are used to feed a representation learning module in order to obtain robust features and a "good" representation of the data. The function representations are then fed into classification algorithms such as the Naive Bayes classifier (NBC) [64], rule-based classifier [64], decision tree (DT) [65], K-nearest neighbors (K-NN) [71], Bayesian Network [85], Neural Network (NN) [24], Random Forest (RF) [67], Hidden Markov models (HMM) [127], and Support Vector Machine (SVM) [65]. The classification method enables us to identify the relationships between functions, taking into account the following three analysis strategies: variants analysis, similarity analysis, and families analysis.

Variants Analysis (VA). VA [46,47,59,79,80,83] enables engineers to realize that a malware sample is actually a variant of a known malware in the repository. This strategy allows us to understand to what extent malware has evolved over time.

Similarity Analysis (SA). SA [48,49,53,56,128] allows engineers to recognize which parts (i.e., functions) of a malware sample are similar to known functions in the repository. This strategy allows us to focus only on new parts and prevent unnecessary investigation.

Families Analysis (FA). FA [22,24,51,55,60–62,70,71,76,97,101,102] enables engineers to associate undefined malware with defined families. This strategy works under the assumption that malware from the same family are similar to each other in terms of functionality. The difficulty in recognizing them comes from the fact that some malware authors use anti-analysis techniques (e.g., obfuscation, packing, polymorphism, and metamorphism) to conceal that similarity.

5. Characterization of surveyed papers

In this section, we characterize each reviewed paper. Table 1 provides information about both the algorithms and the features used in each paper and highlights the main limitations. The table also shows the scalability of each work in terms of its ability to operate in the presence of incremental updates to the repository. The last column shows whether the proposed classification techniques are robust against anti-analysis techniques or not. As can be seen in Table 1, most of the works use more than one classification algorithm for detecting and classifying malware in order to guarantee more accurate results. In Table 2, the different approaches are compared with respect to the main objective: malware detection and similarity analysis, families analysis, and variants analysis.

6. Challenges and issues

Based on the characterization explained in Section 5, we discuss here the challenges and/or issues of the surveyed articles.

6.1. Malware evading techniques

In this section, we introduce the common techniques that are used by malware authors to evade detection.

6.1.1. Obfuscation
The term obfuscation mainly refers to techniques that are used to create a variant of the original code without affecting its functionality. The purpose of obfuscation is usually to hide the real logic of the original code, or to evade a signature-based detector or a function clone detector. A few commonly used obfuscation techniques are as follows:

1. Dead-Code Insertion [13]: insert useless instructions (e.g., nop) or instructions that only affect unused variables.
2. Code Transposition [13]: change the order of independent instructions.
3. Register Reassignment [13]: exchange the usage of registers for the storage of data/addresses within a specific live range.
4. Instruction Substitution [13]: replace an instruction with equivalent instructions.
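The first two techniques can be sketched on a toy instruction listing. The substitution table and the probabilistic nop placement below are illustrative choices for the sketch, not taken from a real obfuscator.

```python
import random

# Illustrative equivalences: each instruction maps to a semantically
# equivalent replacement sequence (an assumption for this sketch).
SUBSTITUTIONS = {
    "xor eax, eax": ["mov eax, 0"],
    "add eax, 1":   ["sub eax, -1"],
}

def obfuscate(instructions, seed=0):
    """Apply instruction substitution plus randomized dead-code
    (nop) insertion to a toy x86-like instruction list."""
    rng = random.Random(seed)
    out = []
    for ins in instructions:
        out.extend(SUBSTITUTIONS.get(ins, [ins]))  # instruction substitution
        if rng.random() < 0.5:                     # dead-code insertion
            out.append("nop")
    return out

original = ["xor eax, eax", "add eax, 1", "ret"]
variant = obfuscate(original)
```

The variant computes the same result as the original, but a byte-level signature taken from the original listing no longer matches it, which is precisely why signature-based detection is fragile against these transformations.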
Table 1
Summary of extraction methods, classification methods, and limitations in malware classification.

Work | Classification method | Features | Limitations | Scalability (Yes/No) | Robust against noisy inputs (Yes/No)
[129] | k-NN and SVM | Byte Code | Not robust against unseen inputs | Yes | No
[130] | NN | Byte Code | Vulnerable to adversarial attacks | Yes | Yes
[131] | k-NN and NN | Byte Code | Vulnerable to adversarial attacks | Yes | Yes
[65] | DT, Naïve Bayes, and SVM | Byte Code | Not robust against noisy inputs | Yes | No
[132] | k-NN, NN, and SVM | Byte Code | Vulnerable to adversarial attacks | Yes | Yes
[73] | RF | Miscellaneous File Information | Needs a large number of labeled examples (malicious and benign) | Yes | Yes
[74] | DT, RF | Miscellaneous File Information | Works only under the assumption that the new samples are not packed | Yes | No
[57] | SVM | Internet Traffic | Not scalable (tested using very small datasets) | No | Yes
[75] | Cluster Analysis | Miscellaneous File Information | Unable to classify new examples/samples | Yes | No
[64] | NBC | Printable Strings and Byte Code | Not robust against noisy inputs | Yes | No
[96] | DT, NBC, SVM | API | Not scalable (tested using very small datasets) | No | Yes
[103] | BN | Miscellaneous File Information | Not efficient given new samples | Yes | No
[50] | DT, NBC, SVM, k-NN, NN and SVM | API and Miscellaneous File Information | Not scalable (tested using small datasets) | No | Yes
[21] | SVM | Byte Code and API | Not scalable (tested using very small datasets) | No | Yes
[41] | SVM | Byte Code, Assembly Codes and API | Not scalable (tested using very small datasets) | No | Yes
[85] | BN | API | Not robust against noisy inputs | Yes | No
[23] | BN, DT, k-NN classification, SVM | Assembly Codes and API | Not robust against noisy inputs | Yes | No
[58] | DT, RF, Naïve Bayes, SVM | Byte Code and API | Not scalable (tested using very small datasets) | No | Yes
[78] | BN | Miscellaneous File Information | Not robust against unseen inputs | Yes | No
[59] | Rule-based classifier | API | Not scalable (tested using very small datasets) | No | Yes
[98] | RF | Internet Traffic | Not robust against unseen inputs | Yes | No
[99] | RF | API and Miscellaneous File Information | Not robust against noisy inputs | Yes | No
[25] | NN | Printable Strings and Miscellaneous File Information | Not robust against noisy inputs and not scalable (tested using very small datasets) | No | Yes
[46] | Rule-based classification | API and Miscellaneous File Information | Not scalable (tested using very small datasets) | No | Yes
[47] | Cluster analysis | API and Miscellaneous File Information | Requires user interactions | Yes | No
[101] | Cluster analysis | Byte Code | Not scalable (tested using small datasets) | No | Yes
[51] | Matching (graph theory) | API | Not robust against noisy inputs | Yes | No
[102] | Cluster analysis | Assembly Codes | Not robust against noisy inputs | Yes | No
[24] | NN | Byte Code and API | High error rate | Yes | No
[70] | Clustering | Assembly Codes | Not robust against noisy inputs | Yes | No
[22] | DT, k-NN classification, RF, SVM | Byte Code and API | Not robust against unseen inputs | Yes | No
[71] | k-NN classification and SVM | Assembly Codes and Miscellaneous File Information | Not robust against unseen inputs | Yes | No
[55] | DT | Internet Traffic | Not scalable (tested using very small datasets) | No | Yes
[76] | SVM, RF and DT | Internet Traffic, Byte Code, Assembly Codes and API | Not robust against noisy inputs | Yes | No
[61] | SVM, RF and DT | Internet Traffic, Byte Code and API | Not scalable (tested using very small datasets) | No | Yes
[60] | DT, RF, k-NN classification and NBC | API | Not robust against unseen inputs | Yes | No
[62] | DT, k-NN classification and SVM | Miscellaneous File Information and network | Not robust against noisy inputs | Yes | No
[133] | k-Means | Assembly Codes | Not robust against noisy inputs | Yes | No
[48] | Hierarchical Clustering | API, Miscellaneous File Information, and Internet Traffic | Not scalable (tested using very small datasets). Not robust against noisy inputs | Yes | No
[49] | Cluster analysis | API | Not robust against noisy inputs | Yes | No
[53] | Cluster analysis | Byte Code and API | Not robust against noisy inputs | Yes | No
[56] | NN | API | Not robust against noisy inputs and not scalable (tested using small datasets) | No | Yes
[72] | DT, k-NN classification, BN and RF | Assembly codes | Not scalable (tested using very small datasets) | No | Yes
[63] | NBC, RF, and SVM | Byte Code, API and file system | Not robust against noisy inputs | Yes | No
[97] | k-NN classification | Byte Code | Not robust against noisy inputs | Yes | No
[104] | HMM | Opcode sequences | Not robust against severe obfuscation techniques | Yes | Yes
[105] | HMM | Mnemonic opcode sequences | Not robust against severe obfuscation techniques | Yes | Yes
[106] | HMM | Opcode sequences | Not robust against severe obfuscation techniques | Yes | Yes
[9] | HMM | Opcode sequences | Not robust against severe obfuscation techniques | Yes | Yes
5. Control Flow Flattening [134]: (1) break up the body of the function into basic blocks; (2) put all basic blocks, which were originally at different nesting levels, next to each other; (3) encapsulate the basic blocks in a selection structure (a switch statement in C++); (4) encapsulate the selection in a loop.
6. Bogus Control Flow [135]: for a basic block, add a new basic block which contains an opaque predicate, and then make a conditional jump to the original basic block.

6.1.2. Packing
Packing is a technique to compress/encrypt an executable; the packed file is uncompressed/decrypted during runtime. This means that a static analyzer cannot see the real code, since it does not run the executable. Packing is used not only for malware but also for the protection of Goodware [15,41]. According to the statistics compiled by Anderson et al. [41], 47.56% of the malware and 19.59% of the Goodware in their dataset are packed.

6.1.3. Polymorphism
Polymorphism is also a technique based on encryption and decryption. A polymorphic malware contains two parts: the polymorphism engine and the real program, which performs the malicious functions. The former mutates the encryption algorithms and keys when the malware replicates; the code of the latter is fixed per se, but it is encrypted by the former in different ways during runtime. This way, the whole polymorphic malware program looks different at each generation [136].

6.1.4. Metamorphism
A metamorphic malware re-programs itself when it replicates. Consequently, in each generation, the whole program body is modified using code obfuscation techniques while the functionality is kept unchanged [136]. Metamorphic malware is considered more difficult to write than polymorphic malware.
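The control-flow-flattening steps listed above can be sketched in a few lines; the Python function below is an illustrative stand-in for the C++ switch dispatcher described in [134], and all names and block labels are made up for this example:

```python
def original(n):
    # Original, nested control flow: a simple loop summing 0..n-1.
    total = 0
    i = 0
    while i < n:
        total += i
        i += 1
    return total

def flattened(n):
    # Flattened version: (1) the body is split into basic blocks,
    # (2) all blocks sit at the same nesting level, (3) a selection
    # construct (emulating a switch) dispatches between them, and
    # (4) the selection is wrapped in a loop.
    state = "entry"
    total = i = 0
    while True:
        if state == "entry":          # block 1: initialization
            total, i = 0, 0
            state = "test"
        elif state == "test":         # block 2: loop condition
            state = "body" if i < n else "exit"
        elif state == "body":         # block 3: loop body
            total += i
            i += 1
            state = "test"
        else:                         # block 4: exit
            return total

print(flattened(10))  # same result as original(10)
```

Both functions compute the same value, but the flattened one no longer exposes the loop's nesting structure to a static analyzer.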


Table 2
Comparison summary (SA: Similarity Analysis; FA: Families Analysis; VA: Variants Analysis).
Paper | Detection | SA | FA | VA
Schultz et al. [64]
Kolter and Maloof [65]
Ahmed et al. [96]
Chau et al. [103]
Firdausi et al. [50]
Anderson et al. [21]
Anderson et al. [41]
Eskandari et al. [85]
Santos et al. [23]
Vadrevu et al. [73]
Bai et al. [74]
Kruczkowski and Szynkiewicz [57]
Tamersoy et al. [75]
Uppal et al. [58]
Chen et al. [78]
Ghiasi et al. [59]
Kwon et al. [98]
Mao et al. [99]
Saxe and Berlin [25]
Wuchner et al. [63]
Raff and Nicholas [97]
Gharacheh et al. [79]
Khodamoradi et al. [80]
Upchurch et al. [83]
Liang et al. [46]
Vadrevu and Perdisci [47]
Huang et al. [101]
Park et al. [51]
Ye et al. [102]
Dahl et al. [24]
Hu et al. [70]
Islam et al. [22]
Kong and Yan [71]
Nari and Ghorbani [55]
Ahmadi et al. [76]
Lin et al. [61]
Kawaguchi and Omote [60]
Mohaisen et al. [62]
Pai et al. [133]
Bailey et al. [48]
Bayer et al. [49]
Chen et al. [14]
Cesare and Xiang [15]
Anderson et al. [41]
Cordy et al. [93]
Fredrikson et al. [43]
Rieck et al. [53]
Palahan et al. [56]
Santos et al. [72]
Egele et al. [128]
Kolter and Maloof [17]
Moskovitch et al. [18]

6.2. Adversarial attack and defense

Since the direction of recent research is to automate the process of malware analysis using machine learning techniques, the proposed solutions should be robust against adversarial examples: inputs designed by an attacker to fool machine learning models and make them generate erroneous decisions (e.g., making the malware analysis tools unable to detect malicious code). It has recently been shown that machine learning models, including deep neural networks, are quite vulnerable to adversarial examples. It is easy for an attacker to create "adversarial examples" [137] to fool a machine learning model through simply perturbing parts of the inputs.

6.2.1. Adversarial attack
Adversarial samples are crafted from normal samples with minimum perturbations on input variables to confuse a classifier without breaking the functionality of the original samples. It is natural that the perturbations should be based on the derivative of the loss function with respect to the classifier's input variables, since derivatives show the directions of change on the input that are the most effective for changing the output. So a differentiable classifier is required to create adversarial samples, and deep learning models are differentiable and effective classifiers. Studies show that adversarial samples generated to fool one model can fool a totally different model [138,139]. Therefore, as deep learning models are proposed for the malware detection field, malware authors have better opportunities to craft adversarial examples to evade the detection of any machine learning model.

A formal description of the problem of crafting an adversarial sample x* to be misclassified by a classifier f is

min ‖δx‖ (5)
s.t. x* = x + δx, f(x*) ≠ f(x) (6)

where ‖·‖ can be any norm and x is the sample to be perturbed. Goodfellow et al. [140] present a fast gradient sign method in which the adversarial perturbation is determined by multiplying the sign of the gradients of the sample by some coefficient to control the scale of the perturbation. Papernot et al. [141] propose a forward derivative method which evaluates the sensitivity of the output to each input component using its Jacobian matrix, and then constructs adversarial saliency maps based on the Jacobian matrix, indicating which input features to include in the perturbation.

Compared with perturbing an image sample, there are additional constraints on perturbing a malware sample, since most of the features of malware are discrete rather than real-valued, and the functionality should stay intact. Thus, previous methods for perturbation of real-valued features need to be adapted, and some binary features cannot be changed from "1" to "0", since "1" means that the feature exists and a change in this direction may break the functionality. Grosse et al. [28] propose a technique to craft adversarial Android malware. Inspired by Papernot et al. [141], Grosse et al. [28] use the Jacobian matrix to examine which features have the greatest potential to lead to the prediction of a malicious program as being Goodware. They only allow distortions to no more than 20 features. All the features are binary features. To maintain the functionality of the adversarial example, they add two constraints: (1) only adjust manifest features that relate to the AndroidManifest.xml file, which is available in any Android application; (2) each change should be done by adding a single line of code to it. Using their method, a state-of-the-art feed-forward neural network which achieves 98% accuracy on the original dataset is misled by 63% of the adversarial malware samples.

6.2.2. Adversarial defense
Grosse et al. [28] try two methods to defend against adversarial attacks. The first is to apply distillation [141,142] to counter adversarial samples, which successfully reduces the misclassification rate by 38.5% in some cases. The second is adversarial training [140], which consists of training the model on the original dataset and then training the model again only on the adversarial samples for a few epochs. The misclassification rate is reduced from 73% to 67% through adversarial training.

Wang et al. [29] defend against adversarial attacks by randomly nullifying input features. Their nullification is similar to dropout, since in both mechanisms some input features are randomly set to 0. The main difference with dropout is that a dropout model does not drop any input feature at test time, whereas nullification still drops some features randomly during testing. Specifically, for each sample in any dataset, a nullification rate is sampled under a Gaussian distribution and the dimensions (features) to drop are sampled uniformly. The intuition is that nullification makes the architecture non-deterministic, so that attackers cannot examine the importance of features, making it hard for them to detect and exploit the "blind spots" of classifiers.
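To make the attack formulation of Eqs. (5) and (6) concrete, the following is a minimal sketch of a fast-gradient-sign-style perturbation against a toy differentiable classifier; the logistic model, its weights, and the sample below are made-up values for illustration, not material from [140]:

```python
import math

# Toy differentiable classifier: logistic regression f(x) = sigmoid(w.x + b).
w = [2.0, -3.0, 1.5]   # hypothetical learned weights
b = 0.1

def predict(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))   # P(sample is malicious)

def fgsm(x, eps):
    # For logistic loss, the sign of the gradient w.r.t. each input
    # feature equals sign(w_i); move each feature by eps in the
    # direction that lowers the predicted malicious probability.
    return [xi - eps * math.copysign(1.0, wi) for xi, wi in zip(x, w)]

x = [0.8, -0.2, 0.5]          # hypothetical feature vector
x_adv = fgsm(x, eps=0.5)
print(predict(x), predict(x_adv))  # the adversarial score is lower
```

As the surrounding text notes, on real malware the perturbation would additionally have to respect discreteness and functionality-preserving constraints, which this real-valued sketch ignores.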

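The random nullification mechanism just described can be sketched as follows; the rate parameters and the feature vector are hypothetical, and this is only an illustration of the idea, not the implementation of Wang et al. [29]:

```python
import random

def nullify(features, mean_rate=0.1, std=0.05, rng=random):
    # Sample a nullification rate from a Gaussian (clamped to [0, 1]),
    # then zero that fraction of feature dimensions at uniformly
    # chosen positions; applied at both training and test time.
    rate = min(max(rng.gauss(mean_rate, std), 0.0), 1.0)
    k = int(round(rate * len(features)))
    drop = set(rng.sample(range(len(features)), k))
    return [0 if i in drop else v for i, v in enumerate(features)]

random.seed(1)
x = [1] * 20                  # toy binary feature vector
x_null = nullify(x)
print(sum(x_null))            # number of features that survived nullification
```

Because the dropped dimensions are re-sampled on every call, the classifier's effective input is non-deterministic, which is the property the defense relies on.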

In their experiments, the features are the invoked Windows system DLL files, and they use a Jacobian-based saliency map to pick up to 10 features to perturb for each sample. Experimental results show that their method can improve the resistance to adversarial samples, and that the best resistance, 64.86%, is achieved with a nullification rate of 10%. However, a theoretic problem of their approach is that adversarial samples are cross-model [138,139]. Thus, even though nullification can harm the ability of an adversary to use this model to craft adversarial samples, the adversary can use other models (i.e., the same neural network without nullification) to craft adversarial samples which can also evade the one equipped with nullification. Therefore, there is no theoretic proof or evidence to show whether nullification can improve the resistance against adversarial samples crafted from other deep learning models.

6.3. Efficiency and scalability

A practical malware search engine can help security engineers obtain malware search results on-the-fly while they perform analysis. Instant feedback provides the engineer with the structure of a given malware sample that is under investigation [92]. One should note that scalability is an important factor, as the number of malware samples in the database needs to scale up to millions. It is also a critical issue for producing a reliable malware search engine. For practical applications, a malware search engine's efficiency and scalability should be evaluated using a large repository in order to measure both its accuracy and latency.

7. Research direction

The above contributions are effective in addressing some interesting research gaps in the literature. However, some points still need further study and investigation. The following research avenues could be further explored based on our literature review:

7.1. Robust solutions

Although the discussed solutions in the literature review have paved the road for a reliable Malware Detection System (MDS) through extracting robust and useful features, the solutions still need to reduce human interaction. Thus, an automated system is required to take the data and automatically abstract and extract robust features from them. For this purpose, deep learning techniques could be the best candidate to replace the existing feature extraction approaches. The solution can be designed and implemented using different deep learning architectures (e.g., Generative Adversarial Networks, Stacked Denoising Autoencoders, Restricted Boltzmann Machines, and Variational Autoencoders) for auto-abstraction and extraction of robust features to significantly enhance detection under heterogeneous, changing and noisy environments.

Recently, Ding et al. [143] propose a robust and accurate assembly clone search platform named Asm2Vec. The proposed platform enables engineers to automatically learn a vector representation of any assembly function by discriminating it from other functions. The platform also allows engineers to jointly learn the semantic relationships of assembly functions based on assembly code [143]. This, in turn, enables the construction of useful and robust features for efficient and reliable assembly clone search. The proposed learning representation is inspired by the Distributed Memory Model of Paragraph Vectors (PV-DM), which is used to learn a vectorized representation of a text paragraph [144]. The PV-DM model is fundamentally based on Word2Vec [145], which is used to learn vector representations of words. This is done by enabling words with similar meaning to be mapped to a similar position in the vector space. For example, "good" and "great" are close to each other, whereas "great" and "Japan" are more distant. Learning the vector representation of words becomes possible thanks to the concept of Distributed Vector Representation (DVR) of words, a well-known method for learning word vectors. In particular, DVR exploits the power of machine learning models (usually neural networks) by training them to predict a word (i.e., the target word) given the other words in a context. In the process of predicting the target word, we learn the vector representation of the target word.

The PV-DM model is inspired by Word2Vec, reusing its idea for learning the word vectors. In the PV-DM model, both word vectors and paragraph vectors are asked to contribute to the prediction of the target word given many contexts sampled from the paragraph [144]. This process (i.e., predicting the target word) allows us to learn the vector representation of the paragraph. Ding et al. [143] exploit the power of the PV-DM model to learn the vector representation of assembly functions based on assembly code. This is done by mapping each assembly function (i.e., repository function) and the function's input tokens (i.e., instructions) to a unique vector. The machine learning model is then trained to predict a target token given the function and its tokens in a context. This process enables us to learn the vector representation of the function.

In fact, the solution should be able not only to accommodate unknown variants of known malware but also to accommodate unknown variants of unknown malware. These solutions should also be robust against adversarial attacks. Although some works have already addressed this problem, these solutions are mostly based on adversarial training [146] and are not mature enough to combine the extraction of robust and useful features to protect the system against adversarial examples. Thus, the solution should not only be robust against complex and noisy data but also against adversarial examples.

7.2. Collaborative solutions

Computer and communication systems are becoming more and more complex and vulnerable to intrusions. Cyber attacks are also becoming more complex and harder to analyze and recognize. In fact, it has become increasingly difficult for a single MDS to recognize all intrusions, because of limited knowledge about the evolution of malware. Recent works in intrusion detection and malware analysis [147–149] have shown experimentally that detection accuracy can be significantly improved, compared to a traditional single MDS, when MDSs cooperate with each other. In a collaborative environment, each MDS can consult other MDSs about suspicious malware to increase decision accuracy. Fig. 4 shows an example of cooperative MDSs.

Recently, Man and Huh [147] and Singh et al. [148] designed collaborative MDSs, which enable malware-detection alerts to be exchanged between different distributed detectors. Moreover, knowledge can be exchanged between nodes. In addition, Dermott et al. [150] propose a collaborative MDS in a cloud-computing environment. The proposed framework uses the Dempster-Shafer theory of evidence [151] in order to combine the decisions from different malware detectors. The received decisions are aggregated to take the final decision regarding a suspicious malware sample. This technique has a shortcoming: its centralized architecture, whereby a reliable third party is used for combining feedback and coordinating the MDSs.

In fact, the design of a cooperative MDS should take into consideration the following three properties (challenges): trustworthiness, fairness and sustainability. By trustworthiness, we mean that the MDS should be able to ensure that it will consult, cooperate and share knowledge with trusted parties (i.e., MDSs). By fairness, we mean that the MDS should be able to guarantee that mutual benefits will be achieved through minimizing the chance of cooperating with selfish MDSs. This is useful to give MDSs the motivation to participate in the community. Finally, by sustainability, we mean enabling an MDS to proactively take decisions about suspicious attacks, regardless of whether complete feedback has been received from the consulted MDSs. Thus, the proposed solution will be applicable in real-time environments, where MDSs should take decisions about suspicious malware quickly.
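As an illustration of the kind of decision fusion applied in such collaborative frameworks, the following sketch implements Dempster's rule of combination [151] for the verdicts of two hypothetical detectors over the frame {malicious, benign}; the mass values are made up, and this is not the implementation of Dermott et al. [150]:

```python
def combine(m1, m2):
    # Frame of discernment: {"mal", "ben"}; "any" denotes the full frame
    # (mass assigned to "any" represents a detector's uncertainty).
    hypotheses = ["mal", "ben", "any"]

    def meet(a, b):
        # Intersection of two hypotheses; None means an empty intersection.
        if a == "any":
            return b
        if b == "any":
            return a
        return a if a == b else None

    fused = {h: 0.0 for h in hypotheses}
    conflict = 0.0
    for a in hypotheses:
        for b in hypotheses:
            inter = meet(a, b)
            if inter is None:
                conflict += m1[a] * m2[b]     # conflicting mass K
            else:
                fused[inter] += m1[a] * m2[b]
    # Dempster's normalization by 1 - K redistributes conflicting mass.
    return {h: v / (1.0 - conflict) for h, v in fused.items()}

# Hypothetical verdicts from two cooperating detectors.
d1 = {"mal": 0.7, "ben": 0.1, "any": 0.2}
d2 = {"mal": 0.6, "ben": 0.2, "any": 0.2}
print(combine(d1, d2))  # fused belief concentrates on "mal"
```

Because both detectors lean toward "malicious", the fused mass on "mal" exceeds either individual verdict, which is the behavior such aggregation schemes rely on.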


Fig. 4. The proposed taxonomy.

7.3. Sustainable solutions

The power of most malware analysis tools is largely based on the amount of knowledge that they have about malware and dangerous attacks. In fact, supervised machine learning algorithms such as SVM, used by MDSs, are heavily dependent on labeled data to learn how to effectively classify malicious and normal behaviors [152]. However, obtaining data on malicious behaviors is challenging and dangerous, especially if we are required to launch real attacks on production systems and put users, applications and systems at risk. To address this problem, we may need an efficient approach to synthesize new malware and augment our training data, in order to improve machine learning-based MDSs.

Generative models such as Generative Adversarial Networks (GANs) [153] can be used to generate synthetic malware and enhance the detection accuracy of machine learning-based MDSs by augmenting malware training sets. We encourage researchers to investigate the use of GANs, which have shown unprecedented ability in generating high-quality synthetic data, to generate malware variants. In particular, they need to design new algorithms to effectively and efficiently train GANs on the existing malware available in a repository in order to learn how to generate variants of them. To this end, researchers are required to collect a large volume of malware samples with different attributes (vulnerabilities, targeted users, targeted hosts, etc.) from the public domain. Since GANs are only defined for real-valued, continuous data, and the design of malware is based on sequences of discrete tokens (bytes), special extensions should be applied to the original GAN theory. For example, we may need to integrate GANs with recurrent neural networks (RNNs) to tackle the problem of sequential data [154]. Moreover, to address the problem of discrete data, we may need to place in parallel a dense layer per categorical variable, followed by Gumbel-Softmax activation and a concatenation to get the final output [155].

8. Conclusion

In this paper, we provide a comprehensive survey of publications that contributed to malware classification and composition analysis. There are four main contributions in our work. First, we proposed an organization of the reviewed papers according to three dimensions: the purpose of the analysis (malware classification or composition analysis), the type of features obtained from samples, and the algorithms used to manipulate these features. Second, we provided a comparative analysis of the existing malware classification and composition analysis techniques, while structuring them according to the proposed taxonomy. Third, we determined the main issues and challenges associated with malware classification and composition analysis. Finally, we identified a number of emergent topics in the discussed field, such as collaborative malware analysis systems, with guidelines on how to improve solutions to address the new challenges.

The above contributions are effective in addressing some interesting research gaps in the literature. However, some points still need further study and investigation. The following research avenues could be further explored in order to achieve better accuracy and more efficient solutions compared to the state-of-the-art. The first avenue is the design of cooperative MDSs to address the problem of limited and incomplete knowledge about malware. Through collaboration, an MDS can consult other MDSs about suspicious malware and increase decision accuracy. To this end, we identify three challenges that should be addressed in cooperative MDSs: trustworthiness, fairness and sustainability. The second avenue is the design of robust MDSs by enabling the automatic extraction of robust features from samples. The solution should be able to accommodate not only unknown variants of known malware but also unknown variants of unknown malware. Moreover, the solution should be robust against adversarial attacks. Finally, the third avenue is the design of sustainable MDSs by enabling an MDS to synthetically generate new malicious and benign code in order to enhance the accuracy of machine learning-based malware classification methods.

CRediT authorship contribution statement

Adel Abusitta: Conceptualization, Methodology, Data curation, Writing - original draft, Validation, Writing - reviewing and editing, Supervision, Visualization, Investigation. Miles Q. Li: Conceptualization, Methodology, Data curation, Writing - original draft, Validation, Writing - reviewing and editing, Visualization, Investigation. Benjamin C.M. Fung: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - original draft, Writing - reviewing and editing, Project administration.


Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is supported in part by the DND Innovation for Defence Excellence and Security, Canada (W7714-207117/001/SV), NSERC, Canada Discovery Grants (RGPIN-2018-03872), and Canada Research Chairs Program (950-230623). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

[1] Malware statistics and facts for 2020. 2020. https://www.comparitech.com/antivirus/malware-statistics-facts/. [Accessed 17 March 2020].
[2] Malware Numbers 2017. 2019. https://www.gdatasoftware.com/blog/2018/03/30610-malware-number-2017. [Accessed 17 August 2019].
[3] Suarez-Tangil G, Tapiador JE, Peris-Lopez P, Ribagorda A. Evolution, detection and analysis of malware for smart devices. IEEE Commun Surv Tutor 2013;16(2):961–87.
[4] Tailor JP, Patel AD. A comprehensive survey: ransomware attacks prevention, monitoring and damage control. Int J Res Sci Innov 2017;4(15):116–21.
[5] Vignau B, Khoury R, Hallé S. 10 years of IoT malware: A feature-based taxonomy. In: 2019 IEEE 19th international conference on software quality, reliability and security companion. IEEE; 2019, p. 458–65.
[6] Xu Z, Wang H, Xu Z, Wang X. Power attack: An increasing threat to data centers. In: NDSS. 2014.
[7] Kimani K, Oduol V, Langat K. Cyber security challenges for IoT-based smart grid networks. Int J Crit Infrastruct Prot 2019;25:36–49.
[8] Jakobsson M, Ramzan Z. Crimeware: understanding new attacks and defenses. Addison-Wesley Professional; 2008.
[9] Wong W, Stamp M. Hunting for metamorphic engines. J Comput Virol 2006;2(3):211–29.
[10] Tariq N. Impact of cyberattacks on financial institutions. J Internet Bank Commer 2018;23(2):1–11.
[11] Chen L, Ye Y, Bourlai T. Adversarial machine learning in malware detection: Arms race between evasion attack and defense. In: 2017 European intelligence and security informatics conference. IEEE; 2017, p. 99–106.
[12] Schultz MG, Eskin E, Zadok F, Stolfo SJ. Data mining methods for detection of new malicious executables. In: Security and privacy, 2001. S&P 2001. Proceedings. 2001 IEEE symposium on. IEEE; 2001, p. 38–49.
[13] Christodorescu M, Jha S. Static analysis of executables to detect malicious patterns. Technical report, Wisconsin Univ-Madison Dept of Computer Sciences; 2006.
[14] Chen J, Alalfi MH, Dean TR, Zou Y. Detecting android malware using clone detection. J Comput Sci Tech 2015;30(5):942–56.
[15] Cesare S, Xiang Y. Classification of malware using structured control flow. In: Proceedings of the eighth Australasian symposium on parallel and distributed computing-volume 107. Australian Computer Society, Inc.; 2010, p. 61–70.
[16] Ye Y, Li T, Huang K, Jiang Q, Chen Y. Hierarchical associative classifier (HAC) for malware detection from the large and imbalanced gray list. J Intell Inf Syst 2010;35(1):1–20.
[17] Kolter JZ, Maloof MA. Learning to detect malicious executables in the wild. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2004, p. 470–8.
[18] Moskovitch R, Feher C, Tzachar N, Berger E, Gitelman M, Dolev S, Elovici Y. Unknown malcode detection using opcode representation. In: Intelligence and security informatics. Springer; 2008, p. 204–15.
[19] Dai J, Guha RK, Lee J. Efficient virus detection using dynamic instruction sequences. J Comput Phys 2009;4(5):405–14.
[20] Nataraj L, Karthikeyan S, Jacob G, Manjunath B. Malware images: visualization and automatic classification. In: Proceedings of the 8th international symposium on visualization for cyber security. ACM; 2011, p. 4.
[21] Anderson B, Quist D, Neil J, Storlie C, Lane T. Graph-based malware detection using dynamic analysis. J Comput Virol 2011;7(4):247–58.
[22] Islam R, Tian R, Batten LM, Versteeg S. Classification of malware based on integrated static and dynamic features. J Netw Comput Appl 2013;36(2):646–56.
[23] Santos I, Devesa J, Brezo F, Nieves J, Bringas PG. Opem: A static-dynamic approach for machine-learning-based malware detection. In: International joint conference CISIS'12-ICEUTE 12-SOCO 12 special sessions. Springer; 2013, p. 271–80.
[24] Dahl GE, Stokes JW, Deng L, Yu D. Large-scale malware classification using random projections and neural networks. In: Acoustics, speech and signal processing, 2013 IEEE international conference on. IEEE; 2013, p. 3422–6.
[25] Saxe J, Berlin K. Deep neural network based malware detection using two dimensional binary program features. In: Malicious and unwanted software, 2015 10th international conference on. IEEE; 2015, p. 11–20.
[26] Huang W, Stokes JW. MtNet: a multi-task neural network for dynamic malware classification. In: International conference on detection of intrusions and malware, and vulnerability assessment. Springer; 2016, p. 399–418.
[27] Kolosnjaji B, Zarras A, Webster G, Eckert C. Deep learning for classification of malware system call sequences. In: Australasian joint conference on artificial intelligence. Springer; 2016, p. 137–49.
[28] Grosse K, Papernot N, Manoharan P, Backes M, McDaniel P. Adversarial examples for malware detection. In: European symposium on research in computer security. Springer; 2017, p. 62–79.
[29] Wang Q, Guo W, Zhang K, Ororbia II AG, Xing X, Liu X, Giles CL. Adversary resistant deep neural networks with an application to malware detection. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2017, p. 1145–53.
[30] Ucci D, Aniello L, Baldoni R. Survey of machine learning techniques for malware analysis. Comput Secur 2018.
[31] Sahu MK, Ahirwar M, Hemlata A. A review of malware detection based on pattern matching technique. Int J Comput Sci Inf Technol 2014;5(1):944–7.
[32] Souri A, Hosseini R. A state-of-the-art survey of malware detection approaches using data mining techniques. Human-centric Comput Inf Sci 2018;8(1):3.
[33] Bazrafshan Z, Hashemi H, Fard SMH, Hamzeh A. A survey on heuristic malware detection techniques. In: The 5th conference on information and knowledge technology. IEEE; 2013, p. 113–20.
[34] Shabtai A, Moskovitch R, Elovici Y, Glezer C. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey. Inf Secur Tech Rep 2009;14(1):16–29.
[35] Basu I, Sinha N, Bhagat D, Goswami S. Malware detection based on source data using data mining: A survey. Am J Adv Comput 2016;3(1):18–37.
[36] Ye Y, Li T, Adjeroh D, Iyengar SS. A survey on malware detection using data mining techniques. ACM Comput Surv 2017;50(3):41.
[37] Or-Meir O, Nissim N, Elovici Y, Rokach L. Dynamic malware analysis in the modern era—A state of the art survey. ACM Comput Surv 2019;52(5):88.
[38] Barriga J, Yoo S. Malware detection and evasion with machine learning techniques: A survey. Int J Appl Eng Res 2017;12(318).
[39] Damodaran A, Di Troia F, Visaggio CA, Austin TH, Stamp M. A comparison of static, dynamic, and hybrid analysis for malware detection. J Comput Virol Hacking Tech 2017;13(1):1–12.
[40] Bayer U, Moser A, Kruegel C, Kirda E. Dynamic analysis of malicious code. J Comput Virol 2006;2(1):67–77.
[41] Anderson B, Storlie C, Lane T. Improving malware classification: bridging the static/dynamic gap. In: Proceedings of the 5th ACM workshop on security and artificial intelligence. ACM; 2012, p. 3–14.
[42] Royal P, Halpin M, Dagon D, Edmonds R, Lee W. Polyunpack: Automating the hidden-code extraction of unpack-executing malware. In: Computer security applications conference, 2006. ACSAC'06. 22nd annual. IEEE; 2006, p. 289–300.
[43] Fredrikson M, Jha S, Christodorescu M, Sailer R, Yan X. Synthesizing near-optimal malware specifications from suspicious behaviors. In: Security and privacy, 2010 IEEE symposium on. IEEE; 2010, p. 45–60.
[44] Force UA. Analysis of the Intel Pentium's ability to support a secure virtual machine monitor. In: Proceedings of the 9th USENIX security symposium. 2000. p. 129.
[45] Rutkowska J. Redpill: Detect VMM using (almost) one CPU instruction. 2004, http://invisiblethings.org/papers/redpill.html.
[46] Liang G, Pang J, Dai C. A behavior-based malware variant classification technique. Int J Inf Educ Technol 2016;6(4):291.
[47] Vadrevu P, Perdisci R. Maxs: Scaling malware execution with sequential multi-hypothesis testing. In: Proceedings of the 11th ACM on Asia conference on computer and communications security. ACM; 2016, p. 771–82.
[48] Bailey M, Oberheide J, Andersen J, Mao ZM, Jahanian F, Nazario J. Automated classification and analysis of internet malware. In: International workshop on recent advances in intrusion detection. Springer; 2007, p. 178–97.
[49] Bayer U, Comparetti PM, Hlauschek C, Kruegel C, Kirda E. Scalable, behavior-based malware clustering. In: NDSS, vol. 9. Citeseer; 2009, p. 8–11.
[50] Firdausi I, Erwin A, Nugroho AS, et al. Analysis of machine learning techniques used in behavior-based malware detection. In: 2010 second international conference on advances in computing, control, and telecommunication technologies. IEEE; 2010, p. 201–3.
[51] Park Y, Reeves D, Mulukutla V, Sundaravel B. Fast malware classification by automated behavioral graph matching. In: Proceedings of the sixth annual workshop on cyber security and information intelligence research. ACM; 2010, p. 45.
[52] Lindorfer M, Kolbitsch C, Comparetti PM. Detecting environment-sensitive malware. In: International workshop on recent advances in intrusion detection. Springer; 2011, p. 338–57.
[53] Rieck K, Trinius P, Willems C, Holz T. Automatic analysis of malware behavior using machine learning. J Comput Secur 2011;19(4):639–68.

15
A. Abusitta et al. Journal of Information Security and Applications 59 (2021) 102828
