Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification

The document discusses exploring the use of function call graph vectorization and statistical features extracted from portable executable files to improve machine learning based malware classification. The proposed model uses both function call graph vectorization features and six types of statistical features. Experimental results on a malware dataset show the combined approach improves classification accuracy compared to only using function call graph features.

Received February 6, 2020, accepted February 29, 2020, date of publication March 4, 2020, date of current version March 13, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2978335

Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification
YIPIN ZHANG¹, XIAOLIN CHANG¹ (Member, IEEE), YUZHOU LIN¹, JELENA MIŠIĆ² (Fellow, IEEE), AND VOJISLAV B. MIŠIĆ² (Senior Member, IEEE)
¹Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing 100044, China
²Computer Science Department, Ryerson University, Toronto, ON M5B 2K3, Canada
Corresponding author: Xiaolin Chang ([email protected])
The work of Yipin Zhang, Xiaolin Chang, and Yuzhou Lin was supported in part by the National Natural Science Foundation of China
under Grant U1836105, and in part by the National Key Laboratory of Science and Technology on Information System Security. The work
of Jelena Mišić and Vojislav B. Mišić was supported in part by the Natural Sciences and Engineering Research Council of
Canada (NSERC) through Discovery Grants.

ABSTRACT Over the last few years, malware propagation on PC platforms, especially on Windows OS, has become increasingly severe. To counter the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to automate classification. Recently, the function call graph (FCG) vectorization (FCGV) representation was explored as an input feature to achieve higher ML classification accuracy, but the FCGV representation loses some critical features of PE files because of the hashing technique it uses. This paper aims to further improve the classification accuracy of FCGV-based ML models by applying both graph and non-graph features. We propose an FCGV-SF based Random Forest classification model, which applies both FCGV features (graph features) and statistical features (SF, non-graph features) extracted from disassembled PE files. Six types of effective non-graph features are chosen for our integrated vector, namely metadata, symbol, operation code, register, section and data definition. We evaluate our model on a dataset provided by Microsoft and hosted on Kaggle, and the experimental results indicate that the classification accuracy increases from 0.9851 to 0.9957 compared with the existing model based on FCGV only.

INDEX TERMS Function call graph, machine learning, malware classification, Portable Executable,
statistical features.

The associate editor coordinating the review of this manuscript and approving it for publication was Francesco Mercaldo.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

44652 VOLUME 8, 2020
Y. Zhang et al.: Exploring FCGV and File SF in Malicious PE File Classification

I. INTRODUCTION

Over the last few years, malware propagation on PC platforms, especially on Windows OS, has become increasingly severe, causing various threats to system security and data privacy. The AV-TEST statistical report [1] indicated that over 30 million malicious Portable Executable (PE) files were registered in the first half of 2019 alone. This flood of malicious PE files results from the modification and obfuscation of malware by attackers, which makes similar malware samples look distinct from one another [2] and thereby evade detection.

Two separate steps are usually performed on each malicious PE file, namely malware detection and malware classification. First, malware analysis techniques determine whether an executable program contains any malicious content. After detection, a classification mechanism assigns the executable program labeled as malware to the most similar family for further analysis. In this paper, we focus on malware classification techniques. To extract features for malware classification, both static and dynamic analysis techniques have been explored for analyzing malicious PE files. Static analysis examines the code of malware samples without executing them. In static analysis, content-based features such as instruction opcodes [3], API sequences [4], [5] and function call graphs
(FCGs) [6], [7] are typically extracted from disassembled malicious PE files as the original features for analysis. Static analysis can easily capture syntactic and semantic information for in-depth analysis, but it is susceptible to code obfuscation techniques, e.g., compression and polymorphic/metamorphic transformation [8]. Dynamic analysis usually executes malware samples in a virtual environment monitored by a debugger [12] to observe their behavioral information, such as network activities [9], system calls [10], file operations and registry modification records [11]. Code obfuscation has less effect on dynamic analysis, but executing malware consumes much more time and many more resources than static analysis. Both static and dynamic analysis techniques thus have their own strengths and weaknesses.

The large number of PE malware variants poses great challenges to human experts who analyze malware manually. This situation creates a pressing need for effective and efficient automated malware classification techniques. Using machine learning (ML) in malware classification can make a significant contribution to resisting the malware epidemic. ML classifiers used for malicious PE file classification typically take a single numerical feature vector representing each file as input, with one or more class labels assigned to each file during training. By performing static and/or dynamic analysis on each PE sample, two types of features can be extracted from malware binaries, namely non-graph features and graph features. Recently, function call graph (FCG) vectorization (FCGV), which is a kind of graph feature, was explored to achieve higher ML classification accuracy [13], but the FCGV representation loses some critical features of PE files because of the hashing technique it uses. Meanwhile, non-graph features have also been applied to malware classification [14]. Each type of feature represents its own perspective on malicious PE files and has its own merits and limitations. Hence, it is necessary to create an integrated feature vector that contains more comprehensive information about PE binaries. However, there is no prior work on integrating FCGV features and non-graph features for designing ML malware classifiers.

In this paper, we propose an FCGV-SF based Random Forest classification model (denoted as the FCGV-SF model in the following), which applies both FCGV features (graph features) and statistical features (SF, non-graph features) extracted from disassembled PE files. Statistical features reflect the high-level statistical characteristics of PE binaries and are more concise and representative. Six types of effective statistical features [14] are chosen to build our integrated vector, namely metadata, symbol, operation code, register, section and data definition. To the best of our knowledge, we are the first to apply both FCGV features and non-graph features to malware classification. Compared with prior malware classification work based on FCGV only or on non-graph statistical features only, our proposed model preserves more vital information from disassembled PE files. We use the data provided on Kaggle [2] for the Microsoft Malware Challenge to evaluate our model, and the classification accuracy increases from 0.9851 to 0.9957 compared with the existing model based on FCGV only.

The rest of this paper is divided into four sections as follows. Section II discusses related work. Section III presents the details of our FCGV-SF model using the new integrated vector. Section IV presents experimental results, and Section V concludes this study and describes future work.

II. RELATED WORK

Past years have witnessed various ML-based approaches, most of which depend on features extracted from malware binaries using static and/or dynamic analysis. These features can be organized into two groups: graph features and non-graph features. We discuss the existing classification approaches from the perspectives of graph and non-graph features in the following.

A. GRAPH FEATURE BASED APPROACHES

There are usually three main types of graph information extracted from malware samples: FCGs, system call dependency graphs and control flow graphs. An FCG is a directed graph representation constructed from code, in which the vertices represent functions and the edges correspond to the caller-callee relations between functions [20]. A system call dependency graph is a directed graph that is usually determined by dynamic taint analysis. In a system call dependency graph, a vertex corresponds to a system call and an edge represents a data dependency between system calls. In [21], Allen defined a control flow graph as a "directed graph where basic code blocks are represented by vertices and control flow paths are represented by edges". A basic control block was described as "a linear program instructions sequence which has one entry point (the first instruction executed) and one exit point (the last instruction executed)". Graph-based features have been increasingly used in many studies to cluster and classify malware. Their most significant advantage is that they preserve information about the interactions between different parts of the malicious code.

This section only discusses graph features extracted by static analysis, namely FCGs. FCGs are usually built from disassembled binaries produced by static analysis. Various studies have extracted FCG features for malware classification and clustering. After creating FCGs, a measure is needed to evaluate the similarity between two FCGs, such as the approximate graph edit distance (GED). In [22] and [23], the Simulated Annealing Algorithm [24] was employed to approximate GED. On the other hand, Hu et al. [25] used the Hungarian Algorithm to approximate GED. Hassen and Chan [13] developed a function clustering-based FCGV representation using a hashing technique to approximate GED, which achieved remarkable performance as well as improved classification accuracy.

Note that GED is not the only way to measure graph similarity. For example, the normalized number of common edges between two graphs was used as another similarity measure in [26]. Dullien and Rolles [27] computed graph similarity through fixed points and propagations.
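
To make the edge-based similarity idea concrete, the following minimal Python sketch represents an FCG as a set of directed caller-callee edges and scores two graphs by their shared edges. It is an illustration only, not the method of [26]: the helper names `build_fcg` and `common_edge_similarity` are hypothetical, the normalization (shared edges divided by all distinct edges) is an assumption, and the sketch assumes that functions in different samples can be matched by a common label, which real disassembled malware does not guarantee.

```python
from typing import Iterable, Set, Tuple

# A directed FCG edge: (caller, callee). Labels are assumed to be
# comparable across samples, which is a simplification.
Edge = Tuple[str, str]


def build_fcg(calls: Iterable[Edge]) -> Set[Edge]:
    """Represent an FCG as the set of its directed caller->callee edges."""
    return {(caller, callee) for caller, callee in calls}


def common_edge_similarity(g1: Set[Edge], g2: Set[Edge]) -> float:
    """Shared edges divided by all distinct edges (assumed normalization)."""
    all_edges = g1 | g2
    if not all_edges:
        # Two empty graphs are treated as identical by convention.
        return 1.0
    return len(g1 & g2) / len(all_edges)


# Hypothetical call relations for two samples of the same family.
sample_a = build_fcg([
    ("start", "decrypt"),
    ("decrypt", "write_file"),
    ("start", "connect"),
])
sample_b = build_fcg([
    ("start", "decrypt"),
    ("decrypt", "write_file"),
    ("start", "sleep"),
])

# Two of the four distinct edges are shared.
print(common_edge_similarity(sample_a, sample_b))  # 0.5
```

Storing edges in a set keeps the comparison linear in the number of edges, whereas exact GED is far more expensive to compute, which is why the works cited above resort to approximations.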