
2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Classifying Malware Represented as Control Flow


Graphs using Deep Graph Convolutional Neural
Network
Jiaqi Yan Guanhua Yan Dong Jin
Illinois Institute of Technology Binghamton University, State University of New York Illinois Institute of Technology
[email protected] [email protected] [email protected]

Abstract—Malware has been one of the biggest cyber threats in the digital world for a long time. Existing machine learning-based malware classification methods rely on handcrafted features extracted from raw binary files or disassembled code. The diversity of such features has made it hard to build generic malware classification systems that work effectively across different operational environments. To strike a balance between generality and performance, we explore new machine learning techniques to classify malware programs represented as their control flow graphs (CFGs). To overcome the drawbacks of existing malware analysis methods that use inefficient and non-adaptive graph matching techniques, in this work we build a new system that uses a deep graph convolutional neural network to embed the structural information inherent in CFGs for effective yet efficient malware classification. We use two large independent datasets that contain more than 20K malware samples to evaluate our proposed system, and the experimental results show that it can classify CFG-represented malware programs with performance comparable to that of state-of-the-art methods applied on handcrafted malware features.

Index Terms—malware classification, control flow graph, deep learning, graph convolution

I. INTRODUCTION

While many anti-virus vendors and computer security researchers have fought hard against malicious software for many years, it remains one of the biggest digital threats in the cyberworld. Motivated by the high return on investment, the underground malware industry has been consistently enlarging the sheer volume of threats on the Internet every year. According to a report by AV-TEST [1], the total amount of malware by 2018 is estimated at over 800 million, a 28-fold increase over the past 10 years. Even worse, cybercriminals have recently made great efforts to diversify their avenues of attack [2]. Inspired by the astonishing rise in cryptocurrency values, 47 new coin-mining malware families have emerged since early 2017, which has been credited with a 956% soar in the number of crypto-mining attacks in the past year [3]. Traditional signature-based detection approaches have failed to thwart the ever-evolving malware threats. Therefore, efficient, robust and scalable anti-malware solutions are indispensable for protecting the trustworthiness of the modern cyberworld.

A number of recent research efforts [4]–[15] have shown that machine learning (ML) offers a promising approach to thwart the voluminous malware attacks. These efforts have been inspired by the great success of ML in improving the accuracy of image classification, language translation, and many other applications. There is, however, a dilemma when we investigate an effective ML-based malware classification system. On one hand, if we focus too much on finding discriminative malware features to achieve high classification accuracy, the features constructed as such may lose their generality when a different operational environment is encountered. For instance, although features extracted from the PE headers of malware files have been shown useful in classifying PE malware variants belonging to different families [16], a malware classification system trained with these features cannot be used to detect fileless malware that only exists in memory. On the other hand, if the features extracted from malware programs are too generic, such as the frequencies of n-gram byte sequences [4], it is difficult to train a malware classifier with high accuracy from them due to their lack of discriminative power [16].

To strike a balance between generality and performance, we aim to build a malware classification system from malware programs represented as their control flow graphs (CFGs), a data structure commonly used to characterize the control flow of any computer program. The generality of CFGs for malware classification can be attributed to two factors: (1) the CFG can be extracted from different formats of malware code, such as binary executable files, exploit code discovered in network traffic [17], emulated malware [18], and attack code chained together from gadgets in return-oriented programming attacks [19]; and (2) the CFG can be used to derive a variety of static analysis features widely used in existing works on ML-based malware classification, such as n-grams [4], q-grams [5], opcodes [20], and structural information [21]. Therefore, a malware classification system trained from CFG-represented malware programs can find applications in various operational environments.

Classifying CFG-represented malware programs needs to address two types of performance issues: classification performance (does the malware classifier achieve high accuracy?) and execution performance (can the malware classifier work efficiently in practice?). As discussed earlier, reducing CFGs to vectors that contain simple aggregate features such as
n-grams and opcodes leads to efficient malware classification but usually with poor classification accuracy. Due to the graph nature of CFGs, previous works have studied the use of graph similarity measures to train malware classification models [21]–[23]. However, some techniques for calculating pairwise graph similarity, such as those based on graph matching or isomorphism, can be computationally prohibitive, let alone that the time needed to compute pairwise graph similarity for a malware dataset scales quadratically with its size.

Against this backdrop, in this work we propose to use a state-of-the-art deep learning infrastructure, the graph kernel-based deep neural network, to classify malware programs represented as control flow graphs. Due to their capability of understanding complex graph data, graph kernel-based deep neural networks have found success in a number of application domains, such as protein classification, chemical toxicology prediction, and cancer detection [24]–[27]. Particularly, our work extends a special type of graph kernel-based deep neural network, the Deep Graph Convolutional Neural Network (DGCNN) [27], for classifying CFG-represented malware programs. Different from graph classification techniques based on pairwise graph similarity, DGCNN allows attribute information associated with individual vertices to be aggregated quickly through the neighborhood defined by the graph structure in a breadth-first-search fashion, thus embedding high-dimensional structural information into vectors that are amenable to efficient classification.

To demonstrate the applicability of DGCNN for malware classification, we have developed a new system called MAGIC (an end-to-end malware defense system that classifies CFG-represented malware programs using DGCNN). MAGIC improves the effectiveness of malware classification by extending the standard DGCNN with customized techniques tailored for malware classification. We use two large malware datasets to evaluate the performance of MAGIC, and the experimental results show that it can classify CFG-represented malware programs with accuracy comparable to that of the state-of-the-art methods applied on handcrafted malware features.

The remainder of the paper is organized as follows. In Section II, we introduce the overall design of MAGIC. In Section III, we provide a primer on DGCNN and then present our improvements over the standard DGCNN for classifying CFG-represented malware programs. In Section IV we discuss a few implementation details of MAGIC. We present our experimental results in Section V. We discuss related work in Section VI and draw concluding remarks in Section VII.

II. SYSTEM OVERVIEW OF MAGIC

This section overviews the three main components of MAGIC, whose workflow is illustrated in Figure 1.

Fig. 1. The workflow of MAGIC, a DGCNN-based malware classification system (the pipeline parses assembly code into a control flow graph, extracts block-level attributes to form the attributed CFG, and feeds the attributed CFG to the deep graph convolutional network for classification).

A. Control Flow Graph

MAGIC relies on state-of-the-art tools, such as IDA Pro [28], to extract CFGs from malware code. In a CFG, a vertex represents a basic block, which contains a straight sequence of code or assembly instructions without any control flow transition except at its exit. Two vertices (u, v) are connected by a directed edge u → v if either the last instruction in u falls through to the first line of code in v, or there is a jump instruction in u that is destined to some instruction (e.g., a jump target) in v. The implementation details on how to build the CFG from disassembled execution code will be given in Section IV-A.

B. Attributed Control Flow Graph

The CFG representation of a software program is generic for the purpose of malware classification in several ways. First, this type of representation transcends the specific programming languages in which the programs are written or the hardware platforms for which the programs are developed. Although other low-level representations such as hexadecimal byte sequences have similar properties, a CFG explicitly expresses the execution logic of a program using a graph data structure. Hence, the semantics of a malware program is embodied by not only the characteristics of the code in individual basic blocks but also their structural dependencies defined by the edges connecting these basic blocks.

To convert CFGs to structures that are amenable to machine learning, we define attributes at each vertex that summarize code characteristics as numerical values. Initially the attributes computed at a vertex do not contain any structural information, which means that their values are independently collected from the corresponding basic block. Table I lists the attributes implemented in our prototype system, although more attributes can be conveniently added to further improve malware classification performance.

TABLE I
BLOCK-LEVEL ATTRIBUTES USED IN MAGIC

From Code Sequence:    # Numeric Constants, # Transfer Instructions, # Call Instructions,
                       # Arithmetic Instructions, # Compare Instructions, # Mov Instructions,
                       # Termination Instructions, # Data Declaration Instructions,
                       # Total Instructions
From Vertex Structure: # Offspring (i.e., Degree), # Instructions in the Vertex

As the raw attributes in an attributed CFG (ACFG) contain little structural information, and the number of vertices in an ACFG varies with the individual program from which the CFG is derived, for the purpose of malware classification it is necessary to aggregate these attributes over all the vertices in the ACFG in an organic manner depending on its graph structure. The task of such attribute aggregation is accomplished with DGCNN, which shall be explained next.
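To make the attributes in Table I concrete, the following is a minimal Python sketch of how a block-level attribute vector of this kind could be computed from one disassembled basic block. The opcode groupings, the helper name block_attributes, and the input format are illustrative assumptions made for this presentation; they are not taken from the MAGIC codebase.

# Illustrative sketch (not the MAGIC code): a Table I-style attribute vector
# for a single basic block of disassembled instructions.
TRANSFER  = {"jmp", "jz", "jnz", "je", "jne", "ja", "jb", "jg", "jl"}   # assumed opcode groups
CALL      = {"call"}
ARITH     = {"add", "sub", "mul", "imul", "div", "idiv", "inc", "dec"}
COMPARE   = {"cmp", "test"}
MOV       = {"mov", "movzx", "movsx", "lea"}
TERMINATE = {"ret", "retn", "leave", "hlt"}
DATA_DECL = {"db", "dw", "dd", "dq"}

def block_attributes(block, out_degree):
    """block: list of (opcode, operand_string) pairs; out_degree: number of offspring in the CFG."""
    groups = [TRANSFER, CALL, ARITH, COMPARE, MOV, TERMINATE, DATA_DECL]
    counts = [sum(op in g for op, _ in block) for g in groups]
    numeric = sum(tok.isdigit() or tok.startswith("0x")
                  for _, operands in block
                  for tok in operands.replace(",", " ").split())
    # nine code-sequence attributes followed by the two vertex-structure attributes
    return [numeric, *counts, len(block), out_degree, len(block)]

# toy usage: a two-instruction block whose vertex has out-degree 2
vec = block_attributes([("mov", "eax, [esp+4]"), ("jnz", "loc_401A2C")], out_degree=2)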
C. Deep Graph Convolution Neural Network

Unlike image or text-based data, graph-based data are of variable sizes and are thus not naturally ordered tensors.


To address this challenge, we use a state-of-the-art deep neural network that can automatically learn discriminative latent features from malware data abstracted as ACFGs. Particularly, we use the deep graph convolution neural network to transform unordered graph data of varying sizes into tensors of fixed size and order. The transformation algorithm first recursively propagates the weighted attributes of each vertex through the neighborhood defined by the graph structure. Next, it sorts the vertices in the order of their feature descriptors. After the sorting step, graphs of varying sizes are embedded into fixed-size vectors, which are amenable to ML-based classification. In the next section, we shall elaborate on these operations as well as our extensions based on formal mathematical descriptions.

III. ALGORITHM DESCRIPTION

Our work on applying DGCNN to malware classification in MAGIC has been inspired by the deep learning model proposed in [27]. In this section, we first introduce how DGCNN aggregates attributes through the neighborhood defined by the graph structure. We then discuss how to extend the existing DGCNN model with our own modifications. To explain the rationale of the DGCNN-based malware classification algorithm clearly, we walk through an example graph with five vertices as shown in Figure 2.

A. Primer on DGCNN

1) Notations: We denote the adjacency matrix of a graph G = (V, E) of n vertices as A ∈ Z^{n×n}. Note that G is a directed graph, and A is not necessarily symmetric. To allow the attributes of a vertex to be propagated back to the vertex itself, we define the augmented adjacency matrix Ã = A + I. Accordingly, the augmented diagonal degree matrix of G is defined as D̃, where D̃_{i,i} = Σ_j Ã_{i,j}. We assume that each vertex is associated with a c-dimension attribute vector. Therefore, we use X ∈ R^{n×c} to denote the attribute matrix for all the vertices in the graph. Alternatively, we can also treat X as the concatenation of c attribute channels of the graph. For the sample graph g in Figure 2, we display the corresponding augmented adjacency matrix Ã, the augmented diagonal degree matrix D̃, and the attribute matrix X with two attribute channels F1 and F2. Given à and X for graph G, the DGCNN-based algorithm performs three sequential stages to obtain its tensor representation for malware classification. Note that D̃ can be calculated from Ã.

Fig. 2. A sample graph g used in Section III to illustrate how the extended DGCNN works in MAGIC. Assuming the vertices have two attribute channels, we show g's augmented adjacency matrix Ã, augmented diagonal degree matrix D̃, as well as attribute matrix X.

2) Graph Convolution Layer(s): In the first stage, a graph convolution technique propagates each vertex's attributes to its neighborhood based on the structural connectivity. To aggregate multi-scale substructural attributes, multiple graph convolution layers are stacked, which can be defined recursively as follows:

    Z^{t+1} = f(D̃^{-1} Ã Z^t W^t)    (1)

where Z^0 = X. The t-th layer takes input Z^t ∈ R^{n×c_t}, mapping c_t feature channels into c_{t+1} feature channels with the graph convolution parameter W^t ∈ R^{c_t×c_{t+1}}. The newly obtained channels of each vertex are then propagated to both its neighboring vertices and itself, first by multiplying with the augmented adjacency matrix Ã, and then normalizing row-wise using the augmented diagonal degree matrix D̃. This key step enables a vertex to pass its own attributes through the graph in a breadth-first-search fashion. Define F = Z^t · W^t and O = Ã · F, where

    O[i][j] = Σ_{k=1}^{n} Ã[i][k] × F[k][j]    (2)

∀ 1 ≤ i ≤ n, 1 ≤ j ≤ c_{t+1}. In other words, the j-th feature channel of vertex i is computed as a linear combination of the j-th feature channels of all its neighbors. The layer finally outputs the element-wise activation using a nonlinear function f. At the end of h graph convolution layers, DGCNN concatenates each layer's output Z^t, denoted as Z^{1:h} = [Z^1, Z^2, ..., Z^h]. For the sample graph g, Figure 3 shows how the two sequential graph convolution layers transform the initial attribute matrix Z^0 = X into Z^1 and Z^2, both of which together form Z^{1:2}.

Fig. 3. After applying h = 2 graph convolution layers, the sample graph g is transformed to Z^{1:h}. We assume the weight parameters in the two graph convolution layers are W^1 = [[1, 0, 1], [0, 1, 0]] and W^2 = [[0, 1, −2, 2], [1, 1, 7, −2], [1, 0, −1, 4]]. For simplicity, numbers are shown with 2-digit precision, and we perform the element-wise ReLU nonlinear activation f(x) = max(x, 0) in both graph convolution layers.
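As a minimal PyTorch sketch of the propagation rule in Equation (1), one graph convolution layer can be written in a few lines. This is an illustration composed for this presentation rather than an excerpt from the MAGIC codebase, and the toy adjacency and attribute matrices below are arbitrary assumptions; they do not reproduce the sample graph g of Figure 2.

import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One layer of Eq. (1): Z_{t+1} = f(D~^{-1} A~ Z_t W_t), with f = ReLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.W = nn.Linear(c_in, c_out, bias=False)      # graph convolution parameter W_t

    def forward(self, A, Z):
        A_aug = A + torch.eye(A.size(0))                  # augmented adjacency A~ = A + I
        D_inv = torch.diag(1.0 / A_aug.sum(dim=1))        # inverse augmented degree matrix D~^{-1}
        return torch.relu(D_inv @ A_aug @ self.W(Z))      # propagate, row-normalize, activate

# toy usage: a directed five-vertex graph, 2 attribute channels mapped to 3 channels
A = torch.tensor([[0., 1., 1., 0., 0.],
                  [0., 0., 0., 1., 0.],
                  [0., 0., 0., 1., 0.],
                  [0., 0., 0., 0., 1.],
                  [0., 0., 0., 0., 0.]])
X = torch.rand(5, 2)                                      # attribute matrix X = Z_0
Z1 = GraphConvLayer(2, 3)(A, X)                           # shape: (5, 3)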
3) SortPooling Layer: Intuitively, Z^{1:h} has n rows and Σ_{t=1}^{h} c_t columns, which correspond to the feature descriptor of each vertex at different scales. The second stage, namely the sortpooling layer, leverages the feature descriptors to sort the vertices. Vertices in different graphs will be put in similar positions as long as they have similar weighted feature descriptors. The sortpooling layer starts with the last layer because Z^h is approximately equivalent to the most refined continuous colors as in the Weisfeiler-Lehman graph kernels [29]. More specifically, vertices are first sorted by the last channel of the last layer in decreasing order. If there are ties on the last layer's output Z^h, sorting continues using the second-to-last layer's output Z^{h−1}, and the procedure repeats until all ties are broken. The sortpooling layer further truncates or pads the sorted tensor along the first dimension so that it outputs Z^{sp} of size k by Σ_{t=1}^{h} c_t. Hence, the sortpooling process unifies the size of the feature descriptors for all graphs. Following our sample graph g, we visualize this process in Figure 4. The rows of Z^h are first sorted using only the values in the last column. The last two rows (i.e., the yellow and red rows) are discarded from the sorted matrix as n − k = 2 > 0.

Fig. 4. Given the graph convolution result Z^{1:h} for the sample graph g in Figure 3, the sortpooling layer with k = 3 sorts the feature descriptors (based on only the last feature channel in this example) and then truncates the two 'smallest' rows.
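The sortpooling step described above admits a short tensor-level sketch (again an illustration under simplifying assumptions rather than the MAGIC code): vertices are sorted by the last channel only, and the result is truncated or zero-padded to k rows, with the tie-breaking over earlier channels omitted for brevity.

import torch

def sort_pooling(Z_cat, k):
    """Z_cat: (n, sum of c_t) concatenated outputs Z^{1:h}; returns a (k, sum of c_t) tensor."""
    order = torch.argsort(Z_cat[:, -1], descending=True)   # sort vertices by the last channel
    Z_sorted = Z_cat[order]
    n, c = Z_sorted.shape
    if n >= k:
        return Z_sorted[:k]                                 # truncate the 'smallest' rows
    return torch.cat([Z_sorted, torch.zeros(k - n, c)])     # zero-pad small graphs to k rows

# e.g., a five-vertex graph with 7 accumulated channels reduced to k = 3 rows
Z_sp = sort_pooling(torch.rand(5, 7), k=3)                  # shape: (3, 7)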
4) Remaining Layer: In the last stage, the authors of the original DGCNN [27] append a one-dimension convolution (Conv1D) layer of kernel size Σ_{t=1}^{h} c_t and stride size Σ_{t=1}^{h} c_t. If F is the number of filters in the last one-dimension convolution layer, the sort pooling output Z^{sp} will be reduced to a one-dimension vector of size k × F, which is then fed into a fully connected one-layer perceptron for graph classification.

B. WeightedVertices Layer

In the first extension to DGCNN, we observe that the Conv1D layer following the sortpooling layer can alternatively be of kernel size k, stride size k, and a single channel. Mathematically, a single-channel Conv1D layer can be represented as a row of parameters W ∈ R^{1×k}. Its output E ∈ R^{1×Σ_{t=1}^{h} c_t}, when fed with the transposed Z^{sp}, will be equivalent to

    E = f(W × Z^{sp})    (3)

This is because

    E_c = f(Σ_{i=1}^{k} W_i × Z^{sp}_{i,c})    (4)

where 1 ≤ c ≤ Σ_{t=1}^{h} c_t, and f is an element-wise nonlinear activation function. Inspired by the graph embedding idea in [30], our Conv1D layer treats each row Z^{sp}_i of the sort pooling result as the embedding of one of the vertices kept by the sortpooling layer.

Equivalently, Equation (3) computes E, the embedding of the graph, obtained through a weighted summation of vertex embeddings [30]. For our sample graph g, its "embedding" is computed in Figure 5, where we assume the weight vector W = [0.4, 0.1, 0.5]. In reality, W is updated by gradient descent during the process of minimizing the classification loss.
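The equivalence between the single-channel Conv1D view and the weighted-sum view of Equations (3) and (4) can be verified numerically in a few lines of PyTorch. The snippet below is a self-contained check written for this presentation (not the MAGIC implementation); it reuses the assumed weight vector W = [0.4, 0.1, 0.5] of Figure 5 with a random Z^{sp}.

import torch
import torch.nn as nn

k, c_total = 3, 7
Z_sp = torch.rand(k, c_total)                       # sortpooling output: k vertex embeddings
W = torch.tensor([0.4, 0.1, 0.5])                   # assumed weight vector, as in Figure 5

# Eq. (3): graph embedding as a weighted summation of vertex embeddings
E_direct = torch.relu(W @ Z_sp)                     # shape: (c_total,)

# the same computation as a single-channel Conv1D with kernel size k and stride k
conv = nn.Conv1d(1, 1, kernel_size=k, stride=k, bias=False)
with torch.no_grad():
    conv.weight.copy_(W.view(1, 1, k))
E_conv = torch.relu(conv(Z_sp.t().reshape(1, 1, -1))).view(-1)

assert torch.allclose(E_direct, E_conv, atol=1e-6)  # the two views agree, per Eq. (4)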

For ease of presentation, in the remainder of the paper we refer to this special Conv1D layer after the sortpooling layer as the WeightedVertices layer. We replace the original Conv1D layer with the WeightedVertices layer because the latter leverages the graph embedding idea to make the output of the sortpooling layer compatible with the malware classifier.

C. AdaptiveMaxPooling: An Alternative to Sortpooling

The intuition behind sorting from the deeper layer is to treat its output as more refined WL colors [29], [31]. The inner sorting inside the channels of a fixed layer output is, however, less reasonable. Besides, the Conv1D addendum only aggregates the feature descriptors per vertex and per convolution channel separately.

Our second extension is to apply adaptive max pooling (AMP) on the concatenated graph convolution layer output Z^{1:h}. Given a set of two-dimension inputs of various sizes {x_i | x_i ∈ R^{h_i×w_i}}, the AMP layer divides each input x_i into an H × W grid with a sub-window size of approximately h_i/H by w_i/W, and then automatically chooses kernel sizes as well as convolution strides for each x_i. Inside each sub-window and each channel, only the maximum element is kept in order to form the set of identical-dimension outputs {y_i | y_i ∈ R^{H×W}}. The way in which AMP works for our sample graph g is illustrated in Figure 6. Since the dimension of the graph convolution output Z^{1:h}_g for g is 5 × 7, AMP uses a max pooling kernel of size 3 × 3. To show how AMP works for inputs of different dimension sizes, in Figure 6 we also feed Z^{1:h}_{g′}, the graph convolution output for another graph g′ (not shown here), to the 3 × 3 AMP layer. In this case, the kernel size is adaptively adjusted to 2 × 3.

We have two motivations for using AMP at the end of the graph convolution layers. In addition to unifying the convolution layer output Z^{1:h}, AMP empowers us to aggregate Z^{1:h} across the dimensions of both feature channels and graph vertices simultaneously, which enables us to capture informative features that vary only by location. This is easily accomplished by applying a two-dimension convolution (Conv2D) layer with an arbitrary number of filters before the AMP layer. The output of the AMP layer is further fed into a multiple-Conv2D-layer neural network, inspired by VGG [32], to predict the probability distribution over the malware families to which the input CFG should belong.
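PyTorch already ships an adaptive max pooling operator with this behavior: it chooses kernel sizes and strides per input so that differently sized graph convolution outputs are reduced to one common H × W grid. The snippet below is a small sketch mirroring the two inputs of Figure 6; note that the operator picks its own windows, which may differ from the exact kernels and strides drawn in the figure.

import torch
import torch.nn as nn

amp = nn.AdaptiveMaxPool2d((3, 3))          # fixed H x W output regardless of input size

Z_g  = torch.rand(1, 1, 5, 7)               # Z^{1:h} of the sample graph g: 5 vertices, 7 channels
Z_g2 = torch.rand(1, 1, 4, 7)               # Z^{1:h} of the imaginary four-vertex graph g'

print(amp(Z_g).shape)                       # torch.Size([1, 1, 3, 3])
print(amp(Z_g2).shape)                      # torch.Size([1, 1, 3, 3])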
IV. IMPLEMENTATION

In this section, we discuss a few implementation details, which include how we derive CFGs from disassembled code, what kind of loss functions are used in model training, and the open source MAGIC project.

A. Details in Building CFG

To build a control flow graph from disassembled code in possibly different formats, we first pre-process the input files so that the resulting program P is a one-to-one mapping from sorted addresses to assembly instructions, e.g., P : Z+ → I. We then perform a two-pass iteration over P. Instructions inst ∈ I are associated with a couple of tags, i.e., {start, branchTo, fallThrough, return}, used by the second pass for creating code blocks and connecting blocks. To adapt to (potentially) hundreds of types of instructions, the first pass applies the visitor pattern to implement if-else-free instruction tagging. As an example, Algorithm 1 details the tagging operations when visiting a conditional jump instruction cj. This procedure relies on a helper function findDstAddr(inst) to extract the destination address of a jump instruction inst. For a conditional jump instruction, it branches to the jump target (P[dstAddr], handled by lines 2–3) and falls through to the next instruction (P[cj.addr + cj.size], handled by lines 4–5).

Algorithm 1 visitConditionalJump(cj)
1: dstAddr ← findDstAddr(cj)
2: cj.branchTo ← dstAddr
3: P[dstAddr].start ← true
4: cj.fallThrough ← true
5: P[cj.addr + cj.size].start ← true

The second pass creates code blocks (or vertices) and connects blocks on the fly. The skeleton of the procedure is illustrated in Algorithm 2, which assumes two trivial helper functions. Firstly, getBlockAtAddr(addr) returns the block starting at addr if it already exists; otherwise it creates a new block starting at addr. The second one, getNextInst(P, inst), returns the instruction next to inst if it exists; otherwise, None is returned. With both helpers, Algorithm 2 works in three steps. The first if statement creates a new block if inst was marked as start in the first pass. The second step connects block to nextBlock if the last instruction in block falls through to the next instruction, which happens to be the start of nextBlock. The final step creates an edge (and potentially a new block) for any branching operation, e.g., a jump or a call.
B. Loss Functions Used in Model Training

Another technical advantage of MAGIC is its support for end-to-end deep neural network training. Regardless of how we change the layer configurations, i.e., whether we use the sort pooling layer or the adaptive max pooling layer, the model's output is always the prediction for the observed input. Therefore, the training procedure always minimizes the mean negative logarithmic loss, which is defined as

    L = − Σ_{i=1}^{N} Σ_{c=1}^{C} y_{i,c} log(p_{i,c})    (5)

where N is the number of observations in the dataset, C is the number of malware families in the dataset, y_{i,c} is 1 if the i-th sample belongs to malware family c, and p_{i,c} is the predicted probability that the i-th sample is in family c in the output of the model. We adopt the Adam optimization algorithm [33] implemented in PyTorch [34] to auto-generate the gradients of the model parameters and update the parameters in the DGCNN (e.g., W^t, 1 ≤ t ≤ h, in the graph convolution layers) accordingly.
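Under the loss in Equation (5), one training pass reduces to a standard PyTorch loop. The fragment below is a hedged sketch of that idea; the model, data loader, and hyperparameter values are placeholders rather than MAGIC's actual configuration, and NLLLoss applied to log-probabilities realizes the negative log-likelihood of Equation (5).

import torch
import torch.nn as nn

def train_epoch(model, loader, optimizer):
    """One epoch of minimizing Eq. (5); `model` is assumed to output log-probabilities over the C families."""
    criterion = nn.NLLLoss()                       # mean negative log-likelihood loss
    for A, X, labels in loader:                    # batches of ACFGs and family labels (placeholder loader)
        optimizer.zero_grad()
        log_probs = model(A, X)                    # shape: (batch, C), entries log p_{i,c}
        loss = criterion(log_probs, labels)
        loss.backward()                            # autograd produces gradients for every W_t
        optimizer.step()                           # Adam update of the DGCNN parameters

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)   # placeholder values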

Fig. 5. The WeightedVertices layer aggregates sample graph g's vertex embeddings, e.g., the sortpooling output Z^{sp} in Figure 4, into the graph embedding E. We again choose ReLU as the nonlinear activation function f for simplicity.

Fig. 6. An example of a 3 × 3 adaptive max pooling layer with two two-dimension inputs of different sizes. The top-left matrix Z^{1:h}_g represents the graph convolution output of the sample graph g in Figure 2. The bottom-left matrix Z^{1:h}_{g′} represents the graph convolution output of another imaginary graph g′ with four vertices. For Z^{1:h}_g of size 5 × 7, the adaptive max pooling kernel size is 3 × 3 (shown as red shadow). For Z^{1:h}_{g′} of size 4 × 7, the adaptive max pooling kernel size is 2 × 3 (shown as red shadow). For both inputs, padding = 0 and stride = 2 × 1.

Algorithm 2 CfgBuilder::connectBlocks()
1: for all inst in program P do
2:     if inst.start then
3:         currBlock ← getBlockAtAddr(inst.addr)
4:     end if
5:     nextBlock ← currBlock
6:
7:     nextInst ← getNextInst(P, inst)
8:     if nextInst ≠ None then
9:         if inst.fallThrough and nextInst.start then
10:            nextBlock ← getBlockAtAddr(nextInst.addr)
11:            add nextBlock to currBlock's edge list
12:        end if
13:    end if
14:
15:    if inst.branchTo ≠ None then
16:        block ← getBlockAtAddr(inst.branchTo)
17:        add block to currBlock's edge list
18:    end if
19:
20:    add inst to currBlock
21:    currBlock ← nextBlock
22: end for
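For readers who prefer executable code over pseudocode, the following condensed Python sketch captures the same two-pass idea behind Algorithms 1 and 2. It handles only jump, return, and fall-through cases, assumes a simple instruction object with addr, size, opcode, and target fields, and is an illustration written for this description rather than an excerpt from MAGIC.

import networkx as nx

def build_cfg(P):
    """P: list of instruction objects sorted by address."""
    by_addr = {inst.addr: inst for inst in P}

    # first pass: tag block starts, branch targets, and fall-through points (cf. Algorithm 1)
    for inst in P:
        if inst.opcode.startswith("j"):                      # unconditional and conditional jumps
            inst.branch_to = inst.target
            by_addr[inst.target].start = True
            inst.fall_through = inst.opcode != "jmp"         # conditional jumps also fall through
            nxt = by_addr.get(inst.addr + inst.size)
            if nxt is not None:
                nxt.start = True
        elif inst.opcode in ("ret", "retn"):
            inst.fall_through = False                        # returns end the block with no fall-through

    # second pass: create blocks (vertices) and connect them on the fly (cf. Algorithm 2)
    cfg, curr = nx.DiGraph(), None
    for inst in P:
        if getattr(inst, "start", False) or curr is None:
            curr = inst.addr
            cfg.add_node(curr, instructions=[])
        cfg.nodes[curr]["instructions"].append(inst)
        if getattr(inst, "branch_to", None) is not None:     # branch edge
            cfg.add_edge(curr, inst.branch_to)
        nxt = by_addr.get(inst.addr + inst.size)
        if nxt is not None and getattr(nxt, "start", False) and getattr(inst, "fall_through", True):
            cfg.add_edge(curr, nxt.addr)                     # fall-through edge into the next block
    return cfg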
C. Open Source MAGIC Project

For malware classification tasks, MAGIC runs either in the training mode or in the prediction mode. In the training stage, we repeatedly activate only the first half of the pipeline to obtain a large amount of labeled CFGs. Then, a DGCNN and its classifier are trained using stochastic gradient descent on the labeled CFGs in a batch mode. When the training finishes, MAGIC takes the CFGs of unknown binary executables as inputs and makes predictions.

We have implemented a prototype system of MAGIC with approximately 4,000 lines of Python code. The implementation can be divided into two independent parts. The first part generates ACFGs from either assembly code or control flow graphs. Due to the necessity of processing a large number of assembly programs, MAGIC can generate multiple ACFGs in parallel using Python's multi-threading library. The second part handles the training, tuning, and evaluation of the extended DGCNN, which is built upon, but heavily rewritten from, Muhan Zhang's PyTorch implementation [35]. For example, we developed the adaptive pooling layer and the WeightedVertices layer. Besides, MAGIC supports automatic and exhaustive hyper-parameter tuning, cross-validation, and training and prediction using GPUs.

We will make MAGIC's codebase publicly available on GitHub in the near future.

V. EXPERIMENTAL EVALUATION

We evaluate the performance of MAGIC using two large malware datasets, each with more than 10,000 samples, and present our experimental results in this section.

A. Malware Datasets

The first dataset, which we refer to as the MSKCFG dataset, includes the CFGs derived from the malware files used in the 2015 Microsoft Malware Classification Challenge hosted by Kaggle [36]. The dataset contains samples that fall into nine families: {Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY, Gatak}. Figure 7 presents the number of samples in each of these nine malware families in the dataset. In the competition, Kaggle provided 10,868 labeled malware samples as the training dataset, for each of which two files were given. The first file contains the raw binary content in a hexadecimal representation (referred to as the .byte file in the following discussion). The second file is the corresponding assembly code of the binary code, which was generated with the IDA Pro tool [28] (referred to as the .asm file in the following discussion). The correctness of the .asm file is not guaranteed, because PE headers were erased from the raw malware files for sterility before they were disassembled, and sophisticated binary packing techniques may also prevent reverse engineering tools from disassembling the malware correctly [37]. In our experiments, only the .asm files were used for malware classification. We generated a total of 10,868 ACFGs from the training .asm files, which took approximately 17 hours to finish, or 5.8 seconds per malware instance on average, using a commodity desktop equipped with an Intel Core i7-6850K CPU and 64 GB of memory.

Another dataset, which we refer to as YANCFG, includes the CFGs of 16,351 binary executable files which were used in [8]. Similar to the MSKCFG dataset, the PE headers were not available to us for malware classification from the second dataset. All the CFGs were labeled with a majority voting scheme based on the detection results of five major AV scanners returned by the VirusTotal online malware analysis service [38]. All the CFGs belong to one of 13 classes, i.e., 12 distinct malware families plus a benign class: {Bagle, Benign, Bifrose, Hupigon, Koobface, Ldpinch, Lmir, Rbot, Sdbot, Swizzor, Vundo, Zbot, Zlob}. Figure 8 depicts the number of samples for each of these classes in the dataset. These CFGs were further converted to their corresponding ACFGs by MAGIC within 6.8 hours using the same desktop machine mentioned above.

We did not merge the two malware datasets in our experiments for the following reasons. Firstly, the YANCFG dataset carries pre-processed CFGs, while we developed our own parser to extract CFGs from the malware assembly code in the Microsoft dataset (see Section IV-A). The CFG extracted from the MSKCFG dataset by our own parser has a different low-level feature representation from that of the CFGs pre-given in YANCFG, so they cannot be applied to one model. Secondly, testing MAGIC on datasets collected from independent sources also allows us to gain insights into its generality when applied in different operational environments.

B. Model Training and Evaluation

As the malware families are different in the two datasets, we need to create two different MAGIC instances to classify their malware samples separately. However, MAGIC uses the same way to train DGCNN and tune its hyperparameters for both datasets. The first step is hyperparameter tuning. Table II lists the hyperparameters used in both the deep neural network itself (e.g., the sizes of the graph convolution layers) and the algorithm for training the model (e.g., batch size and learning rate). To determine the optimal values of these hyperparameters, we exhaustively search all 208 hyperparameter settings defined by the value ranges listed in Table II. In particular, 64 DGCNN models use adaptive pooling, 96 DGCNN models use sort pooling and the Conv1D layer, and 48 DGCNN models use sort pooling and the WeightedVertices layer. We apply the five-fold cross-validation technique to evaluate the performance of a model under a specific hyperparameter setting. To conduct five-fold cross validation, the dataset is split into five equal-size subsets. In each fold of the cross validation, four subsets (80%) of the data are used for training a brand new model initialized randomly, and the remaining subset (20% of the data), different in each fold, is used to evaluate the resultant model. In this way, the training process never sees the testing samples used for performance evaluation. We train each model for 100 epochs and record the negative log-likelihood validation loss after every epoch.

The validation loss collected after each epoch is used to find the hyperparameters that can mitigate the overfitting issue. Once the validation loss increases for two consecutive epochs, we decrease the learning rate by a factor of ten to prevent the model from overfitting the training dataset. For a particular model, when all five training-validation folds are finished, we compute each epoch's validation loss by averaging the five validation losses over the five folds. The minimum validation loss over the 100 epochs is treated as the score of this model and is further used as the criterion for comparison across different hyperparameter settings. In other words, after the five-fold cross-validation for all the 208 model training instances, we choose the best model with the minimum average validation loss. Besides the average validation negative log-likelihood loss, we also measure its precision, recall, and F1 score averaged over the five validation sets (together referred to as the cross-validation scores), and then compare the best model's performance against those of previous works. We used four GeForce GTX 1080 Ti graphics cards to train and run the 208 variants of DGCNN. Training a particular model is always done on a single GPU, but the evaluation procedure actually takes up all four GPUs because our MAGIC implementation supports parallel model training on multiple GPUs. The last two columns in Table II describe the best models chosen by MAGIC for the MSKCFG and YANCFG datasets, respectively.
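The hyperparameter search loop described above can be sketched as follows; this is a simplified illustration with placeholder model methods (fit and validation_loss are assumed names, not MAGIC's API). Each hyperparameter setting is scored by the minimum, over epochs, of the validation loss averaged across the five folds, and the learning rate is divided by ten once the validation loss rises for two consecutive epochs.

import numpy as np
from sklearn.model_selection import KFold

def score_setting(dataset, labels, make_model, epochs=100, folds=5):
    """Return the cross-validation score of one hyperparameter setting."""
    fold_losses = []                                          # one validation-loss curve per fold
    for train_idx, val_idx in KFold(n_splits=folds, shuffle=True).split(dataset):
        model, losses, lr, rising = make_model(), [], 1e-3, 0
        for _ in range(epochs):
            model.fit(dataset[train_idx], labels[train_idx], lr=lr)                   # placeholder API
            losses.append(model.validation_loss(dataset[val_idx], labels[val_idx]))   # placeholder API
            if len(losses) > 1 and losses[-1] > losses[-2]:
                rising += 1
                if rising >= 2:                               # two consecutive increases:
                    lr, rising = lr / 10.0, 0                 # decay the learning rate by 10x
            else:
                rising = 0
        fold_losses.append(losses)
    return float(np.min(np.mean(fold_losses, axis=0)))        # best epoch of the fold-averaged curve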

In the following, we report the cross-validation scores for the two datasets.

TABLE II
HYPERPARAMETERS AND SEARCH RANGES DURING TUNING

Hyperparameter                   | Choice or Value Range                                     | Best Model for MSKCFG | Best Model for YANCFG
Pooling Type                     | [Adaptive Pooling, Sort Pooling]                          | Adaptive Pooling      | Adaptive Pooling
Pooling Ratio                    | [0.2, 0.64]                                               | 0.64                  | 0.2
Graph Convolution Size           | [(32, 32, 32, 1)^1, (32, 32, 32, 32), (128, 64, 32, 32)]  | (128, 64, 32, 32)     | (32, 32, 32, 32)
Remaining Layer^2                | [1D Convolution Layer, WeightedVertices Layer]            | Not Applicable        | Not Applicable
2D Convolution Channels^3        | [16, 32]                                                  | 16                    | 16
1D Convolution Channel Pairs^4   | [(16, 32)]                                                | Not Applicable        | Not Applicable
1D Convolution Kernel Size^5     | [5, 7]                                                    | Not Applicable        | Not Applicable
Dropout Rate                     | [0.1, 0.5]                                                | 0.1                   | 0.5
Batch Size                       | [10, 40]                                                  | 10                    | 40
Weight L2 Regularization Factor  | [0.0001, 0.0005]                                          | 0.0001                | 0.0005

^1 Only for sort pooling.
^2 Applicable only when the pooling type is set to sort pooling.
^3 Applicable only when the pooling type is set to adaptive pooling.
^4 Applicable only when the pooling type is set to sort pooling and the remaining layer to 1D convolution.
^5 Applicable only when the pooling type is set to sort pooling and the remaining layer to 1D convolution.

Fig. 7. Malware Family Distribution in MSKCFG Dataset.

Fig. 8. Malware Family Distribution in YANCFG Dataset.

C. Results on the MSKCFG Dataset

The best cross-validation scores (precision, recall and F1) for the MSKCFG dataset are shown in Figure 9, and the exact score values are listed in Table III. The standard variances are not listed because the scores' variations among the five cross-validation folds are negligible (< 0.004). For all nine malware families, our best model has achieved good validation scores, with precisions higher than 0.96, recalls higher than 0.96, and F1-scores higher than 0.97.

Fig. 9. Cross-Validation Scores of MAGIC on the MSKCFG Dataset.

TABLE III
PERFORMANCES OF MAGIC ON THE MSKCFG DATASET.

Family         | Precision | Recall   | F1
Ramnit         | 0.962378  | 0.991289 | 0.976615
Lollipop       | 0.995960  | 0.997550 | 0.996754
Kelihos Ver3   | 1.000000  | 1.000000 | 1.000000
Vundo          | 0.981975  | 1.000000 | 0.990895
Simda          | 0.990476  | 1.000000 | 0.994987
Tracur         | 1.000000  | 0.987013 | 0.993463
Kelihos Ver1   | 1.000000  | 0.982493 | 0.991156
Obfuscator.ACY | 0.995593  | 0.962293 | 0.978655
Gatak          | 0.999775  | 0.996841 | 0.998304

Since Microsoft released the competition dataset in 2015, many researchers have used the dataset (completely or partially) to evaluate their techniques for malware detection and classification [9]–[15], [39].

We surveyed the works mentioned above and found that three of them cannot be directly compared with our results because they used different metrics. The works in [39] and [12] used the Microsoft dataset in the context of malware detection, where all the samples contained within it were treated as malicious and then merged with a number of benign programs to construct a new dataset for malware detection. As a result, their methods and performance metrics (two-class AUC, F1 score or accuracy) are not directly comparable to approaches aimed at classifying malware samples into the corresponding families (e.g., [9]–[11], [13]–[15]), this work included. Note that, without loss of generality, benign software can be treated as a special family. The work in [10] did not adopt the cross-validation methodology. Instead, the authors manually split the training dataset into 75% training data and 25% holdout testing data, and reported the mean square error, accuracy and confusion matrix over both the training and testing data. The holdout set is not the test set provided by Microsoft. Therefore, we did not compare our work with the evaluation results reported in [39], [12] and [10].

Among the other five papers on malware classification, both [11] and [14] conducted ten-fold cross validation over the Microsoft dataset, but only reported the overall accuracy. [15] also performed a ten-fold cross validation but reported both the overall accuracy and the logarithmic loss. Both [13] and [9] performed a five-fold cross validation and reported both the overall accuracy and the logarithmic loss. Since the Microsoft dataset is not balanced across malware families, we compare MAGIC with the five previous works that reported not only the overall accuracy but also the mean logarithmic loss, and the results are shown in Table IV.

The methods listed in Table IV can be classified into either ensemble-learning or single-model based approaches. [13] extracts more than 1,800 features and uses a gradient-boosting based classifier, and it achieves the best log-loss (0.0197) and accuracy (99.42%) using the XGBoost classifier. [11] achieves the second best accuracy (99.3%) by ensembling multiple random forest methods, each of which already ensembles multiple decision trees. The DGCNN-based technique used by MAGIC achieves highly competitive results. In fact, its logarithmic loss (0.0543) is the second best, and its accuracy (99.25%) is the third best, only 0.05 percentage points less than the second best one (99.3%, reported by [11]). [9] adopts a deep-learning based hybrid approach. It relies on a single deep autoencoder to perform automatic feature learning, and then uses a gradient-boosting based classifier to make the prediction. As an alternative work that also applies a deep neural network, our DGCNN-based approach outperforms the work in [9] by 27.40% in terms of logarithmic loss and 1.5% in terms of classification accuracy.

D. Results on the YANCFG Dataset

MAGIC's best cross validation scores on the YANCFG dataset are depicted in Figure 10, and the exact values of these scores are listed in Table V. We observe in Figure 10 that MAGIC achieves F1-scores higher than 0.9 for nine of the 13 binary families, including {Bagle, Benign, Bifrose, Hupigon, Koobface, Swizzor, Vundo, Zbot, Zlob}. The classification performances on both the Koobface and Swizzor families are perfect, with a precision of 1.0 and a recall of 1.0. Regarding the other five families with F1 scores lower than 0.8, they all have relatively small populations in the YANCFG dataset. Our classifier suffers relatively low recalls (around 0.5) for both the Ldpinch and Sdbot families. For the Ldpinch, Rbot and Sdbot families, their precision scores (between 0.64 and 0.70) are not as good as those of the other ten families (more than 0.8).

In order to further assess MAGIC, we compared our results to the F1 scores obtained in [8]. That work ensembles a group of individual SVM (Support Vector Machine)-based classifiers (referred to as ESVC hereafter). Our work does not use the raw hexadecimal bytes, the PE headers, or the execution traces in the original dataset. We plot the comparison results in Figure 11 as the relative and absolute amounts of improvement over ESVC achieved by MAGIC. Note that the improvement statistics for the Benign family are not shown in Figure 11 because the F1 score for the benign samples is not reported in [8]. Positive values in Figure 11 mean factual improvement, while negative values mean degradation. Close to the bottom of the figure, we observe that the only family over which MAGIC performs visibly poorer than ESVC is Rbot, with an approximate performance degradation of 0.07 relatively and 0.05 absolutely. For Hupigon, the degradation is nearly invisible (less than 0.01 both relatively and absolutely), and both approaches achieve F1 scores higher than 0.94. On the other hand, MAGIC outperforms ESVC for the other ten families. Moreover, the amount of absolute improvement is greater than or equal to 0.2 for each of the Bagle, Koobface, Ldpinch and Lmir families. Lastly, it is noted that both approaches performed relatively poorly on the Ldpinch and Lmir malware families. Still, the DGCNN-based approach used by MAGIC improves the work in [8] by 70% and 35% in terms of the F1-score for Ldpinch and Lmir, respectively.

E. Discussion

We break down the major execution overhead of MAGIC into three parts: feature extraction time, classifier training time, and malware prediction time. Building the ACFG for an MSKCFG sample with MAGIC takes less than 6 seconds on average. We gathered the training and testing running times over 20 runs to evaluate MAGIC. The mean and standard deviation of the classifier training time per instance over the 20 runs are approximately 29.69±4.90 milliseconds, while the mean and standard deviation of the malware prediction time per instance are only 11.33±1.35 milliseconds. Our measurements of the execution overhead of MAGIC suggest that it is actionable for online malware classification in practice.

By design, MAGIC is aimed at striking a balance between generality and performance. XGBoost [13] achieves impressive performance (99.42% CV accuracy for MSKCFG), but relies on various handcrafted features (more than 1,800 from the MSKCFG dataset) and time-consuming feature selection techniques (e.g., forward stepwise selection). In contrast, MAGIC achieves a similar performance (99.25%) with only a dozen easy-to-extract attributes embedded within the malware's CFG structures.

TABLE IV
CROSS VALIDATION METRIC COMPARISON ON THE MICROSOFT DATASET.

Approach Brief Description                          | Mean Logarithmic Loss | Accuracy (%)
MAGIC                                               | 0.0543                | 99.25
XGBoost with Heavy Feature Engineering [13]         | 0.0197                | 99.42
Deep Autoencoder based XGBoost [9]                  | 0.0748                | 98.20
Strand Gene Sequence Classifier [15]                | 0.2228                | 97.41
Ensemble of Multiple Random Forest Classifiers [11] | Not Reported          | 99.30
Random Forest with Feature Engineering [14]         | Not Reported          | 99.21

TABLE V
PERFORMANCE OF MAGIC ON THE YANCFG DATASET.

Family   | Precision | Recall   | F1 Score
Bagle    | 0.863636  | 0.950000 | 0.904762
Benign   | 0.954128  | 0.962963 | 0.958525
Bifrose  | 0.930380  | 0.901840 | 0.915888
Hupigon  | 0.935287  | 0.945679 | 0.940454
Koobface | 1.000000  | 1.000000 | 1.000000
Ldpinch  | 0.692308  | 0.514286 | 0.590164
Lmir     | 0.833333  | 0.731707 | 0.779220
Rbot     | 0.641221  | 0.763636 | 0.697095
Sdbot    | 0.700000  | 0.488372 | 0.575342
Swizzor  | 0.995708  | 0.995708 | 0.995708
Vundo    | 0.990859  | 0.981884 | 0.986351
Zbot     | 0.941799  | 0.936842 | 0.939314
Zlob     | 0.967254  | 0.992248 | 0.979592

Fig. 10. Cross-Validation Scores of MAGIC on the YANCFG Dataset.

Fig. 11. F1 Score Comparison between MAGIC and ESVC [8] on the YANCFG Dataset. Improvement on the classification accuracy of benign samples is not shown because it is not reported in [8].

The work in [8] sequentially integrates SVM-based malware classifiers trained from heterogeneous features, but its use of dynamic programming to search for an optimal malware classifier with a bounded false positive rate increases the model training time significantly.

The YANCFG and MSKCFG datasets are the two largest labeled malware datasets that we could obtain to evaluate our work. It is possible that malware development trends after the collection of these two datasets introduce new challenges to the malware classification problem. We plan to test our models with the latest malware samples in our future work.

VI. RELATED WORKS

We introduce related works on deep learning based malware classification and deep neural networks for graph data.

A. Deep Learning Based Malware Detection

As an important problem in cybersecurity, malware detection and classification have drawn the attention of many researchers from both the cybersecurity and data mining communities [40], [41]. There has been a recent trend of applying deep learning techniques to malware defense tasks, and these works largely fall into two categories. In the first category, which adopts a feature-centric approach, researchers reuse or extend features developed in previous works but extract them from newer datasets collected from customized, private, and specific environments [9], [10], [42]–[47]. For example, the work in [43] focused on malware collected from Android devices and that in [44] targets malware on cloud platforms. Off-the-shelf deep learning models have been used as tunable black boxes. Compared with classic machine learning algorithms like decision trees, random forests, and gradient boosting, deep learning models fed with the same training datasets may result in better detection performance [43], [45] and faster execution [42], because of their advantages in big data analysis [45] and the possibility of using parallel computing hardware (i.e., GPUs). Moreover, the works in [9] and [10] have focused on the automation of feature learning using unsupervised deep learning techniques.

In the second category, which adopts a model-centric approach, researchers are motivated by finding specific but superior deep learning architectures for malware defense tasks. For example, to replicate the success of convolutional neural networks in the application domains of computer vision and image classification, researchers have proposed methods to transform the byte sequences in binary malware executables into gray-scale or color images, which are amenable to existing deep learning-based image classification techniques [48], [49]. Other researchers have explored how deep sequence models, such as LSTM, GRU, and the attention mechanism, can be applied to the sequences of system calls or API calls transformed from malware programs [46], [47].

Our work takes a hybrid approach that intersects with the works in both categories. First, our work improves the accuracy of malware classification using not only the features that can be explicitly expressed with numeric values (i.e., attributes extracted from basic blocks) but also those that are inherent within the structure of the program (i.e., control flow graphs). On the other hand, deep learning models are not used as black boxes in our work, as we have proposed modifications to the standard DGCNN that are better tailored to the malware classification problem.

B. Deep Learning Models for Graph-Represented Data

There are two parallel lines of research on deep learning algorithms for graph-represented data. In the first setting [50]–[52], a single graph is given and the task is to infer the unknown labels of individual vertices, or the unknown types of connectivity between vertices. Though this problem has wide applications in social networks and recommendation systems, it does not align well with our goal in this paper. Instead, our work fits into the second setting, where, given a group of labeled graphs with different structures and sizes, the task is to predict the labels of future unknown graphs [26], [27], [53]. In this setting, both the works in [53] and [27] have mentioned their connections with the classic Weisfeiler-Lehman subtree kernel [29] or the Weisfeiler-Lehman algorithm [31]. In contrast, the work in [53] introduces a recurrent-neuron based graph kernel and then stacks multiple graph kernel neural layers into a deep network. Similar to the sort pooling layer discussed in [27], the work in [26] generalizes the convolution operator and enables it to handle arbitrary graphs of different sizes. In our work, we propose to enhance the architectures introduced in [27] with both the WeightedVertices layer and the adaptive max pooling layer for the malware classification task.

VII. CONCLUSION

In this work, we have applied the deep graph convolution neural network to the malware classification problem. Different from existing machine learning-based malware detection approaches that commonly rely on handcrafted features and ensemble models, this work proposes and evaluates an end-to-end malware classification pipeline with two distinguishing features. Firstly, our malware classification system works directly on CFG-represented malware programs, making it deployable in a variety of operational environments. Secondly, we extend the state-of-the-art graph convolutional neural network to aggregate attributes collected from individual basic blocks through the neighborhood defined by the graph structures and thus embed them into vectors that are amenable to machine learning-based classification. Our experimental evaluation conducted on two large malware datasets has shown that our proposed method achieves classification performance comparable to that of state-of-the-art methods applied on handcrafted features.

We envision that MAGIC would be deployed in a cloud, as a typical end user does not have enough labeled malware samples to train good classification models. A user can upload suspicious files to the cloud, which further trains appropriate MAGIC parameters to classify programs newly seen in that user's operational environment. In this way, even if a particular user does not have labeled programs to train a specific neural network, they can still benefit from MAGIC, which learns from many other users that do have labeled datasets.

ACKNOWLEDGMENT

We would like to thank the anonymous DSN reviewers and our shepherd Long Wang for their valuable feedback. We also thank Ping Liu for the insightful discussions and Muhan Zhang for the bug fixes in the PyTorch version of DGCNN. This work is partly sponsored by the Air Force Office of Scientific Research (AFOSR) under Grant YIP FA9550-17-1-0240, the National Science Foundation (NSF) under Grant CNS-1618631, and the Maryland Procurement Office under Contract No. H98230-18-D-0007. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of AFOSR, NSF, and the Maryland Procurement Office.

REFERENCES

[1] "Malware Statistics & Trends Report by AV-TEST," https://www.av-test.org/en/statistics/malware, accessed: 2018-11-30.
[2] "Executive Summary - 2018 Internet Security Threat Report," https://www.symantec.com/content/dam/symantec/docs/reports/istr-23-executive-summary-en.pdf, accessed: 2018-11-30.
[3] "Crypto Mining Attacks Soar in First Half of 2018," https://www.coindesk.com/crypto-mining-attacks-soar-in-first-half-of-2018, accessed: 2018-11-30.
[4] J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, no. Dec, pp. 2721–2744, 2006.
[5] K. Rieck, P. Trinius, C. Willems, and T. Holz, "Automatic analysis of malware behavior using machine learning," Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[6] R. Perdisci, A. Lanzi, and W. Lee, "McBoost: Boosting scalability in malware collection and analysis using statistical classification of executables," in Computer Security Applications Conference, 2008. ACSAC 2008. Annual. IEEE, 2008, pp. 301–310.
[7] B. Anderson, D. Quist, J. Neil, C. Storlie, and T. Lane, "Graph-based malware detection using dynamic analysis," Journal in Computer Virology, vol. 7, no. 4, pp. 247–258, 2011.
[8] G. Yan, "Be sensitive to your errors: Chaining Neyman-Pearson criteria for automated malware classification," in Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, ser. ASIA CCS '15. New York, NY, USA: ACM, 2015, pp. 121–132.
CFG-represented malware programs, making it deployable in ser. ASIA CCS ’15. New York, NY, USA: ACM, 2015, pp. 121–132.

[9] M. Yousefi-Azar, V. Varadharajan, L. Hamey, and U. Tupakula, “Autoencoder-based feature learning for cyber security applications,” in 2017 International Joint Conference on Neural Networks (IJCNN), May 2017, pp. 3854–3861.
[10] T. M. Kebede, O. Djaneye-Boundjou, B. N. Narayanan, A. Ralescu, and D. Kapp, “Classification of malware programs using autoencoders based deep learning architecture and its application to the Microsoft malware classification challenge dataset,” in 2017 IEEE National Aerospace and Electronics Conference (NAECON), June 2017, pp. 70–75.
[11] M. Hassen and P. K. Chan, “Scalable function call graph-based malware classification,” in Proceedings of the Seventh ACM Conference on Data and Application Security and Privacy, ser. CODASPY ’17. New York, NY, USA: ACM, 2017, pp. 239–248.
[12] D. Yuxin and Z. Siyi, “Malware detection based on deep learning algorithm,” Neural Computing and Applications, Jul. 2017.
[13] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature extraction, selection and fusion for effective malware family classification,” in Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, ser. CODASPY ’16. New York, NY, USA: ACM, 2016, pp. 183–194.
[14] M. Hassen, M. M. Carvalho, and P. K. Chan, “Malware classification using static analysis based features,” in 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Nov. 2017, pp. 1–7.
[15] J. Drew, T. Moore, and M. Hahsler, “Polymorphic malware detection using sequence classification methods,” in 2016 IEEE Security and Privacy Workshops (SPW), May 2016, pp. 81–87.
[16] G. Yan, N. Brown, and D. Kong, “Exploring discriminatory features for automated malware classification,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2013, pp. 41–61.
[17] Q. Zhang, D. S. Reeves, P. Ning, and S. P. Iyer, “Analyzing network traffic to detect self-decrypting exploit code,” in Proceedings of the 2nd ACM Symposium on Information, Computer and Communications Security. ACM, 2007, pp. 4–12.
[18] M. Sharif, A. Lanzi, J. Giffin, and W. Lee, “Automatic reverse engineering of malware emulators,” in 2009 30th IEEE Symposium on Security and Privacy. IEEE, 2009, pp. 94–109.
[19] H. Shacham, “The geometry of innocent flesh on the bone: Return-into-libc without function calls (on the x86),” in Proceedings of the 14th ACM Conference on Computer and Communications Security. ACM, 2007, pp. 552–561.
[20] D. Bilar, “Opcodes as predictor for malware,” International Journal of Electronic Security and Digital Forensics, vol. 1, no. 2, pp. 156–168, 2007.
[21] D. Kong and G. Yan, “Discriminant malware distance learning on structural information for automated malware classification,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 1357–1365.
[22] X. Hu, T.-C. Chiueh, and K. G. Shin, “Large-scale malware indexing using function-call graphs,” in Proceedings of the 16th ACM Conference on Computer and Communications Security. ACM, 2009, pp. 611–620.
[23] S. Cesare and Y. Xiang, “Classification of malware using structured control flow,” in Proceedings of the Eighth Australasian Symposium on Parallel and Distributed Computing - Volume 107. Australian Computer Society, Inc., 2010, pp. 61–70.
[24] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, “Convolutional networks on graphs for learning molecular fingerprints,” in Advances in Neural Information Processing Systems, 2015, pp. 2224–2232.
[25] P. D. Dobson and A. J. Doig, “Distinguishing enzyme structures from non-enzymes without alignments,” Journal of Molecular Biology, vol. 330, no. 4, pp. 771–783, 2003.
[26] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” 2017. [Online]. Available: http://arxiv.org/abs/1704.02901
[27] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[28] “IDA Pro,” https://www.hex-rays.com/.
[29] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman graph kernels,” Journal of Machine Learning Research, vol. 12, pp. 2539–2561, Sep. 2011.
[30] X. Xu, C. Liu, Q. Feng, H. Yin, L. Song, and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 363–376.
[31] B. J. Weisfeiler and A. Leman, “A reduction of a graph to a canonical form and an algebra arising during this reduction,” Nauchno-Technicheskaja Informatsia, pp. 12–16, 1968.
[32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
[33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[34] “PyTorch,” https://www.pytorch.org, accessed: 2018-11-30.
[35] “PyTorch DGCNN,” https://github.com/muhanzhang/pytorch_DGCNN, accessed: 2018-11-30.
[36] R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, “Microsoft malware classification challenge,” 2018. [Online]. Available: http://arxiv.org/abs/1802.10135
[37] B. Cheng, J. Ming, J. Fu, G. Peng, T. Chen, X. Zhang, and J.-Y. Marion, “Towards paving the way for large-scale Windows malware analysis: Generic binary unpacking with orders-of-magnitude performance boost,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’18. New York, NY, USA: ACM, 2018, pp. 395–411.
[38] “VirusTotal,” https://www.virustotal.com, accessed: 2018-11-30.
[39] J. Yan, Y. Qi, and Q. Rao, “Detecting malware with an ensemble method based on deep neural network,” Security and Communication Networks, vol. 2018, Mar. 2018.
[40] A. Souri and R. Hosseini, “A state-of-the-art survey of malware detection approaches using data mining techniques,” Human-centric Computing and Information Sciences, vol. 8, no. 1, p. 3, Jan. 2018.
[41] N. Idika and A. P. Mathur, “A survey of malware detection techniques,” Purdue University, vol. 48, 2007.
[42] M. Rhode, P. Burnap, and K. Jones, “Early stage malware prediction using recurrent neural networks,” 2017. [Online]. Available: http://arxiv.org/abs/1708.03513
[43] D. Zhu, H. Jin, Y. Yang, D. Wu, and W. Chen, “DeepFlow: Deep learning-based malware detection by mining Android application for abnormal usage of sensitive data,” in 2017 IEEE Symposium on Computers and Communications (ISCC), July 2017, pp. 438–443.
[44] Y. Ye, L. Chen, S. Hou, W. Hardy, and X. Li, “DeepAM: A heterogeneous deep learning framework for intelligent malware detection,” Knowledge and Information Systems, vol. 54, no. 2, pp. 265–285, Feb. 2018.
[45] G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu, “Large-scale malware classification using random projections and neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 3422–3426.
[46] G. Kim, H. Yi, J. Lee, Y. Paek, and S. Yoon, “LSTM-based system-call language modeling and robust ensemble method for designing host-based intrusion detection systems,” 2016. [Online]. Available: http://arxiv.org/abs/1611.01726
[47] B. Athiwaratkun and J. W. Stokes, “Malware classification with LSTM and GRU language models and a character-level CNN,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2482–2486.
[48] H. Huang, C. Yu, and H. Kao, “R2-D2: ColoR-inspired Convolutional NeuRal Network (CNN)-based AndroiD Malware Detections,” 2017. [Online]. Available: http://arxiv.org/abs/1705.04448
[49] D. Gibert, “Convolutional neural networks for malware classification,” M.S. thesis, Dept. of Computer Science, Escola Politècnica de Catalunya, 2016.
[50] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 855–864.
[51] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1067–1077.
[52] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2016. [Online]. Available: http://arxiv.org/abs/1609.02907
[53] T. Lei, W. Jin, R. Barzilay, and T. S. Jaakkola, “Deriving neural architectures from sequence and graph kernels,” 2017. [Online]. Available: http://arxiv.org/abs/1705.09037
