Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification
13, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2978335
ABSTRACT Over the last few years, malware propagation on PC platforms, especially on Windows
OS, has become increasingly severe. To resist the large number of malware variants, machine learning
(ML) classifiers for malicious Portable Executable (PE) files have been proposed to achieve automated
classification. Recently, function call graph (FCG) vectorization (FCGV) representation was explored as
the input feature to achieve higher ML classification accuracy, but FCGV representation loses some critical
features of PE files due to the hash technique. This paper aims to further improve the classification accuracy
of FCGV-based ML model by applying both graph and non-graph features. We propose an FCGV-SF based
Random Forest classification model, which applies both FCGV features (graph features) and statistical
features (SF, non-graph features) extracted from disassembled PE files. Six types of effective non-graph
features are chosen for our integrated vector, namely, metadata, symbol, operation code, register, section
and data definition. We evaluate our model on a dataset provided by Microsoft hosted at Kaggle, and the
experimental results indicate that the classification accuracy increases from 0.9851 to 0.9957 compared with
the existing model based on FCGV only.
INDEX TERMS Function call graph, machine learning, malware classification, Portable Executable,
statistical features.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
44652 VOLUME 8, 2020
Y. Zhang et al.: Exploring FCGV and File SF in Malicious PE File Classification
(FCGs) [6], [7] are typically extracted from disassembled malicious PE files as the original features for analysis. Static analysis can easily capture syntactic and semantic information for in-depth analysis, but it is susceptible to code obfuscation techniques, e.g., compression and polymorphic/metamorphic transformation [8]. Dynamic analysis usually executes malware samples in a virtual environment which is monitored by a debugger [12] to observe their behavioral information, such as network activities [9], system calls [10], file operations and registry modification records [11]. Code obfuscation technologies have less effect on dynamic analysis, but malware execution consumes much more time and many more resources than static analysis. Both static and dynamic analysis techniques have their own strengths and weaknesses.
The large number of PE malware variants poses great challenges to human experts in manually analyzing all of this malware. This situation exposes an imperative need for effective and efficient automated malware classification techniques. Using machine learning (ML) in malware classification can make a significant contribution to resisting the malware epidemic. ML classifiers used for malicious PE file classification typically employ a single numerical feature vector representation of each file as input and mark one or more class labels for each file during training. By performing static and/or dynamic analysis on each PE sample, two types of features can be extracted from malware binaries, namely, non-graph features and graph features. Recently, function call graph (FCG) vectorization (FCGV), which is a kind of graph feature, was explored to achieve higher ML classification accuracy [13], but the FCGV representation loses some critical features of PE files due to the hash technique. Meanwhile, non-graph features have been applied for malware classification [14]. Each type of feature represents its own perspective on malicious PE files and has its own merits and limitations. Hence, it is necessary to create an integrated feature vector which contains more comprehensive information about PE binaries. However, there is no work on the integration of FCGV features and non-graph features for designing ML malware classifiers.
In this paper, we propose an FCGV-SF based Random Forest classification model (denoted as the FCGV-SF model in the following) which applies both FCGV features (graph features) and statistical features (SF, non-graph features) extracted from disassembled PE files. Statistical features reflect the high-level statistical characteristics of PE binaries, which are more concise and representative. Six types of effective statistical features [14] are chosen to build our integrated vector, namely metadata, symbol, operation code, register, section and data definition. To the best of our knowledge, we are the first to apply both FCGV features and non-graph features for malware classification. Compared with prior malware classification work based on FCGV only or non-graph statistical features only, our proposed model preserves more vital information in disassembled PE files. We use the data provided on Kaggle [2] for the Microsoft Malware Challenge to evaluate our model, and the classification accuracy increases from 0.9851 to 0.9957 by comparison with the existing model based on FCGV only.
The rest of this paper is divided into four sections as follows. Section II discusses related work. Section III presents the details of our FCGV-SF model using the new integrated vector. Section IV presents experimental results and Section V concludes this study and describes future work.
II. RELATED WORK
The past years have witnessed various ML-based approaches, most of which depend on features extracted from malware binaries by static and/or dynamic analysis. These features can be organized into two groups: graph features and non-graph features. We discuss the existing classification approaches from the aspects of graph and non-graph features in the following.
A. GRAPH FEATURE BASED APPROACHES
There are usually three main types of graph information from malware samples: FCGs, system-call dependency graphs and control flow graphs. An FCG is a directed graph representation constructed from code, where the vertices specify functions and the edges correspond to the caller-callee relations between functions (vertices) [20]. A system call dependency graph is a directed graph that is usually determined by dynamic taint analysis. In a system call dependency graph, a vertex corresponds to a system call and an edge represents a data dependency between system calls. In [21], Allen defined a control flow graph as a “directed graph where basic code blocks are represented by vertices and control flow paths are represented by edges”. A basic control block was described as “a linear program instructions sequence which has one entry point (the first instruction executed) and one exit point (the last instruction executed)”. Graph-based features have been increasingly used in research to cluster and classify malware. The most significant advantage of such features is that they preserve the interaction information between different parts of the malicious code.
This section only discusses the graph features extracted by static analysis, namely FCGs. FCGs are usually built from disassembled binaries produced by static analysis. Various studies have extracted FCG features for malware classification and clustering. After creating FCGs, we need a measure to evaluate the similarity between two FCGs, such as approximate graph edit distance (GED). In [22] and [23], the Simulated Annealing Algorithm [24] was employed to approximate GED. On the other hand, Hu et al. [25] used the Hungarian Algorithm to approximate GED. Hassen and Chan [13] developed a function clustering-based FCGV representation using a hash technique to approximate GED, which achieved remarkable performance as well as improved classification accuracy.
Note that GED is not the only way to measure graph similarity. For example, as another measure of similarity, the normalized common edge number between two graphs was used in [26]. Dullien and Rolles [27] computed graph similarity through fixed points and propagations.
They defined a fixed point between a pair of graphs. A fixed point represents two nodes (one each from two different graphs) which can be easily determined to represent the same item in both graphs. Their method started from an initial fixed point. By considering the adjacent nodes, it propagated to more fixed points. In addition, Kong and Yan [28] extracted new features from FCGs and used these features to approximate the similarity between two graphs.
All the aforementioned works applied only FCG features. FCGs have the advantage of preserving more complete structural information in binaries; however, loss of some structural information is inevitable during the extraction. This representation fails to extract comprehensive features of code, which leads to suboptimal classification accuracy. This highlights the necessity of integrating non-graph features with FCGs. Hence, our model explores both FCG and non-graph features for classification.
B. NON-GRAPH FEATURE BASED APPROACHES
This section presents research work based on common non-graph features, which can be extracted from a malicious PE file, often with less preprocessing than graph features.
Ahmadi et al. [14] provided a constructive set of non-graph features. They extracted several customized statistical features which reflected the essence of maliciousness in disassembled PE files. For example, symbols like “−, +, ∗, ], [, ?, @” are typical of code that has been designed to evade detection. Jung et al. [15] explored the application of function lengths in bytes, which were taken from disassembled malware samples and represented as a histogram with a predefined number of bins. They then used these frequency values of function lengths as features for malware classification. In [16] and [17], another set of features based on printable strings was extracted from malware binaries using static analysis. Besides, the authors in [17] and [18] used the program import table to create a binary feature vector for malware classification. The program import table in a PE file provides information for the program so that it can import functions from external libraries while executing.
Instruction n-grams constructed from instruction sequences are extracted from a disassembled malicious PE file [18], [19]. In this case, instruction mnemonics or opcodes, excluding the operands, can represent the instructions. Hu et al. [19] emphasized that the high dimensionality of instruction n-grams was one of the challenges when using them. Even for 2-gram features, the number of 2-grams can possibly reach tens of thousands. To address this problem, a hashing trick was used to reduce dimensionality. They employed a uniformly distributed hash function on the large feature space to hash this high-dimensional space into a smaller dimension. In [17], binary file byte sequence n-grams were used as features for malware classification. Unlike n-gram sequences based on instruction mnemonics or opcodes, these features do not require disassembled files. These byte sequences are extracted from all sections of the original PE binaries, which differ from the opcode sequences that only focus on the code segments.
All these works applied only non-graph features. Although non-graph features are mostly easy to extract, it is still hard to retain comprehensive information from binaries. As a result, we combine non-graph features with FCG features to achieve more accurate classification results.
III. FCGV-SF BASED RANDOM FOREST CLASSIFICATION MODEL
This section first presents the overview of our FCGV-SF model, then details the procedure of FCG vectorization and statistical feature extraction respectively.
A. FCGV-SF OVERVIEW
Selecting features which represent malware samples is one of the most challenging issues for the classification task. This paper combines various features (namely, FCGV representation and non-graph statistical features) to generate a novel integrated feature vector in order to attain better classification accuracy.
We choose the features extracted from FCGs as the first part because they preserve more complete structural information of code, compared with n-gram features. They include the information of functions in a malicious PE file, and more importantly contain the interaction information between functions. However, most of the existing research that employed FCG-based features for malware classification relied on computationally intensive techniques to estimate graph similarity. As a result, these works suffered from large performance overhead and weak scalability. Hence, we use the FCGV technique [13] to reach more accurate results with less time cost. The details are discussed in Subsection III.B.
Note that non-graph features, compared with graph features, are easily extracted and they are also helpful in improving classification accuracy. Meanwhile, the FCGV representation can be easily combined with other non-graph features. Among the various non-graph features, we investigate a set of statistical features [14], namely intuitive statistical characteristics of PE files, such as file size and the number of lines in the file. The main attraction of statistical features is that they represent global characteristics of PE files, which are concise and representative as well. To avoid unnecessary performance overheads, we devise a simple, yet efficient feature extraction module without using more complex features based on n-grams or sequences. We eventually choose six types of statistical features, including metadata, symbol, operation code, register, section and data definition, as discussed in Subsection III.C.
Fig. 1 presents a high-level view of our classification model. As it shows, the first step of our model is the extraction of FCG representations from disassembled malicious PE binaries. Once an FCG is extracted, we use function clustering techniques for vectorizing this FCG to an FCGV vector in a low-dimensional feature space, which is defined
as v_FCGV. Meanwhile, we collect six types of predefined statistical features to build a non-graph SF vector v_SF. Then we concatenate v_FCGV and v_SF to create an integrated feature vector v. At last, malware classification is performed on v through the ML model based on the Random Forest Algorithm, which predicts the malware family that the sample belongs to. We discuss the Random Forest classifier in Subsection III.D.
FIGURE 1. The overview of FCGV-SF model.
B. FCG VECTORIZATION
The procedure of converting an FCG to FCGV representation is presented in Fig. 2. Three modules are involved in FCG vectorization: FCG Extraction, Function Clustering and Vector Extraction. Our model starts by extracting FCGs from disassembled binaries. Vertices in FCGs are functions represented in terms of their instruction opcode sequences. Then the hash technique for clustering functions reconstructs an FCG labeled with function cluster-ids. The final step is to perform vector extraction on this FCG to obtain the FCGV representation. Details of these three modules are discussed in the following.
FIGURE 2. The procedure of FCG vectorization.
1) FCG EXTRACTION
FCG Extraction is the first module of the FCGV technique, which extracts FCG representations from disassembled binaries of malicious PE files. In order to obtain the functions and their caller-callee relations, we use regular expressions to match instruction opcodes in binaries. During the process of FCG Extraction, vertices which represent external functions are labeled with their function names. Note that the names of local functions, chosen by malware writers, generally do not reflect what the functions implement with their instruction sequences. Therefore, we defer labeling the vertices which represent local functions. In this FCG, vertices corresponding to local functions also contain a set of instruction opcode sequences extracted from these functions. Besides, the caller-callee relations between functions are defined as directed and unweighted edges. When FCG
2) FUNCTION CLUSTERING
The second module of vectorization is Function Clustering, which needs a proper measure to identify similar functions. GED is one effective way to estimate similarity; however, it has O(n^2) time complexity in the number of instructions. To address this problem, the Jaccard Index can be employed to approximate GED, but it still costs much time. Fortunately, locality-sensitive hashing (LSH) is an algorithmic technique that preserves relative distances between items while hashing similar input items into low-dimensional versions. Hence, it can effectively reduce the dimensionality of high-dimensional data. A set of LSH functions called Minhash [30] can be used to efficiently approximate the Jaccard Index, which further simplifies the measurement of similarity between functions. The procedure of Minhash is as follows. First we apply a random permutation to the set. Then the index value of the first element is computed. Following the rationale of Minhash, this value is the hash of the set. It can be proven that the probability that two sets have the same Minhash value equals the Jaccard similarity between the two sets. Therefore, for a local function, we perform Minhash signature and secondary hashing on its n-gram instruction opcode set to compute a positive cluster-id. For an external function, we directly hash the function name to a negative cluster-id. Then we get an FCG whose functions are labeled with cluster-ids.
3) VECTOR EXTRACTION
In the third module, we extract the FCGV representation from the FCG labeled with function cluster-ids. Vertex weights and edge weights constitute this vector representation. In this labeled FCG, the vertex weight is the number of functions (vertices) in each cluster. The edge weight corresponds to the number of times that a caller-callee relation is found from a function (vertex) in one cluster to a function (vertex) of another cluster or a function (vertex) within the same cluster. We create an FCGV representation of an FCG by concatenating the weights of vertices and edges.
Due to function clustering, two graphs that are slightly different may have exactly the same FCGV representation. In the area of malware classification, there are various obfuscation techniques, such as changing function calling patterns. The FCGV representation has the merit of being more resilient to the negative effects of these techniques. The reason is that it does not express small changes in the graph structure as long as no variation is observed in the edge and vertex frequencies. Therefore, FCGV representations make malware classification less susceptible to malicious changes such as function call reordering. However, the original FCG inevitably loses a few details of the graph structure after being processed
by the hash technique. Hence, we combine this FCGV representation with statistical features to make up for the loss.
C. STATISTICAL FEATURES EXTRACTION
FCGV representations only use code segments in binaries. Furthermore, they are simplified through the Minhash technique, which means they do not preserve all the critical information from disassembled files. As a result, this leads to a loss of expressiveness and suboptimal classification accuracy. In this subsection, we consider combining statistical features with FCGV representations.
We then face the challenge of how to choose proper statistical features for the integrated vector. On the one hand, we need to ensure that the FCGV vector and the SF vector are of equal importance, which means we should make the dimensions of v_FCGV and v_SF basically identical. On the other hand, we choose the effective and representative statistical features according to the experiments in [14].
We notice that the dimension of API (one set of statistical features) is too high and MISC (another set of statistical features) includes too many items that only have auxiliary effects. Therefore, we abandon these two sets. In Table 1, we provide details on what features we have chosen and why they have been selected. Extracting these six types of features from disassembled files, we combine them with the FCGV vector presented in Subsection III.B to create a new feature vector as the input of our ML model. Not only does this vector contain structural information in the code, but it also has intuitive statistical characteristics of binaries. The more effective features it can preserve, the higher the accuracy it may lead to.
TABLE 1. Six types of statistical features.
D. RANDOM FOREST CLASSIFIER
Random Forest [29] is an ML algorithm using ensemble learning whose base classifiers are decision trees. Ensemble learning techniques generally combine predictions from multiple classifiers to improve prediction accuracy. The Random Forest Algorithm starts by building T decision trees. We randomly select training samples for each tree from the original training dataset D to get a new training set D_t which has the same size as the original D. This random selection is based on a certain distribution. Then we train an individual decision tree on D_t without pruning. For the split of any given tree node, we only consider F features which are randomly selected rather than all available features. Finally, the majority vote policy is adopted among all the trees to predict the class label of the sample.
Using the Random Forest Algorithm for malware classification brings us various advantages [31] as follows.
• Random Forest tends to have a better prediction performance on a large-scale set of features because it allows us to construct individual trees whose decorrelation improves the classification accuracy.
• Random Forest has a fast training speed, since for each tree node, it considers a randomly selected subset of features which is much smaller than our entire feature space. Hence, it results in a training time reduction.
• Random Forest can prevent overfitting to a certain extent, since the training samples for each tree and the features for the split of each tree node are randomly selected. Besides, because of the Law of Large Numbers, Random Forest does not overfit as more trees are added [29].
• Random Forest reaches better or equal prediction accuracy compared with other techniques, such as decision tree, logistic regression, and backpropagation artificial neural networks [32].
In our FCGV-SF model, the integrated vector v consists of two parts, v_FCGV and v_SF, whose dimensions depend on the optimal parameters chosen in FCG vectorization. The integrated vector v is given as input to the Random Forest classifier and the output is a predicted label that represents which family the sample belongs to.
IV. EXPERIMENTAL EVALUATION
This section evaluates the capability of our FCGV-SF model. We first present the dataset and then determine the optimal values of parameters, namely the n-gram length, the number of clusters and the number of decision trees. Finally, we show the improvement of our FCGV-SF model compared with the previous FCGV-only model.
A. DATASET
Microsoft Malware Classification Challenge (BIG 2015) dataset [2] hosted at Kaggle is used for evaluating our
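To make two of the parameters named in this section concrete (the n-gram length and the number of clusters), the function clustering and vector extraction steps of Subsection III.B can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of [13] or of this paper: the helper names (`stable_hash`, `opcode_ngrams`, `minhash_signature`, `cluster_id`, `fcgv`), the use of salted BLAKE2b hashes in place of random permutations, the signature length of 16, and the sparse `Counter` output (instead of a fixed-length concatenated vector) are all choices made for the example, and the toy functions and calls are invented.

```python
import hashlib
from collections import Counter


def stable_hash(text, salt=0):
    """Deterministic 64-bit hash (the built-in hash() of str is salted per process)."""
    digest = hashlib.blake2b(f"{salt}:{text}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def opcode_ngrams(opcodes, n):
    """Set of opcode n-grams of one function."""
    return {" ".join(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def minhash_signature(items, num_hashes=16):
    """One minimum per salted hash function; salting stands in for random permutations."""
    return tuple(min(stable_hash(x, salt) for x in items) for salt in range(num_hashes))


def cluster_id(name, opcodes, n, num_clusters):
    """Positive id for a local function, negative id for an external one (opcodes is None)."""
    if opcodes is None:
        return -(stable_hash(name) % num_clusters + 1)
    # Fall back to the whole sequence when the function is shorter than n opcodes.
    grams = opcode_ngrams(opcodes, n) or {" ".join(opcodes)}
    signature = minhash_signature(grams)
    # Secondary hashing: map the whole signature to one of num_clusters ids.
    return stable_hash(repr(signature)) % num_clusters + 1


def fcgv(functions, calls, n=2, num_clusters=8):
    """functions: name -> opcode list (None if external); calls: (caller, callee) pairs."""
    ids = {name: cluster_id(name, ops, n, num_clusters) for name, ops in functions.items()}
    vertex_weights = Counter(ids.values())                      # functions per cluster
    edge_weights = Counter((ids[a], ids[b]) for a, b in calls)  # calls between clusters
    return vertex_weights, edge_weights


# Toy FCG (invented): two local functions share an opcode sequence.
functions = {
    "sub_401000": ["push", "mov", "call", "ret"],
    "sub_401020": ["push", "mov", "call", "ret"],
    "sub_401040": ["xor", "add", "jmp"],
    "CreateFileA": None,
}
calls = [
    ("sub_401000", "CreateFileA"),
    ("sub_401020", "CreateFileA"),
    ("sub_401040", "sub_401000"),
]
vertex_weights, edge_weights = fcgv(functions, calls)
```

In this sketch, functions with identical opcode n-gram sets always receive the same cluster-id, so the two calls into `CreateFileA` collapse into a single edge of weight 2. A larger number of clusters separates more functions at the cost of a longer vector, which is the kind of trade-off the parameter selection in this section addresses.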
model is a novel feature vector which concatenates the FCGV representation with non-graph statistical features. We select the optimal parameters (namely the n-gram length, the number of clusters and the number of decision trees), which can achieve a balance between time cost and accuracy for vectorizing an FCG to FCGV representation. Experiments are carried out to verify the effectiveness of our model by comparing with FCGV-only classification results. Our proposed model is able to capture more information about malicious PE files by integrating more representative features, which leads to a higher classification accuracy.
Note that the proposed model is evaluated on a malicious PE file dataset constructed from static analysis. It can only extract features in malicious code, but a number of functions in binaries are only used for obfuscation and never run during execution. Hence, features are less comprehensive without analysis based on malware behaviors. Besides, this paper only focuses on malware classification. However, malware detection is also a challenging and important task. Therefore, we next plan to evaluate our model on a dataset which is constructed by both static and dynamic analysis and which contains both benign and malicious PE files.
REFERENCES
[1] AV-TEST. (Jul. 2019). Malware Statistics & Trends Report. [Online]. Available: https://www.av-test.org/en/statistics/malware/
[2] Kaggle. (Apr. 2015). Microsoft Malware Classification Challenge (BIG). [Online]. Available: https://www.kaggle.com/c/malware-classification
[3] B. Kolosnjaji, G. Eraisha, G. Webster, A. Zarras, and C. Eckert, “Empowering convolutional networks for malware classification and analysis,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), May 2017, pp. 3838–3845.
[4] A. Pektaş and T. Acarman, “Classification of malware families based on runtime behaviors,” J. Inf. Secur. Appl., vol. 37, pp. 91–100, Dec. 2017.
[5] A. Pektaş and T. Acarman, “Malware classification based on API calls and behaviour analysis,” IET Inf. Secur., vol. 12, no. 2, pp. 107–117, Mar. 2018.
[6] H. Jiang, T. Turki, and J. T. L. Wang, “DLGraph: Malware detection using deep learning and graph embedding,” in Proc. 17th IEEE Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2018, pp. 1029–1033.
[7] H.-T. Nguyen, Q.-D. Ngo, and V.-H. Le, “A novel graph-based approach for IoT botnet detection,” Int. J. Inf. Secur., to be published.
[8] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis for malware detection,” in Proc. 23rd Annu. Comput. Secur. Appl. Conf. (ACSAC), Dec. 2007, pp. 421–430.
[9] J. Jiang, Q. Yin, Z. Shi, and M. Li, “Comprehensive behavior profiling model for malware classification,” in Proc. IEEE Symp. Comput. Commun. (ISCC), Jun. 2018, pp. 129–135.
[10] T. Lee, B. Choi, Y. Shin, and J. Kwak, “Automatic malware mutant detection and group classification based on the n-gram and clustering coefficient,” J. Supercomput., vol. 74, no. 8, pp. 3489–3503, Dec. 2015.
[11] G. Cabau, M. Buhu, and C. P. Oprisa, “Malware classification based on dynamic behavior,” in Proc. 18th Int. Symp. Symbolic Numeric Algorithms Sci. Comput. (SYNASC), Sep. 2016, pp. 315–318.
[12] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware-analysis techniques and tools,” ACM Comput. Surv., vol. 44, no. 2, pp. 1–42, Feb. 2012.
[13] M. Hassen and P. K. Chan, “Scalable function call graph-based malware classification,” in Proc. 7th ACM Conf. Data Appl. Secur. Privacy (CODASPY), 2017, pp. 239–248.
[14] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature extraction, selection and fusion for effective malware family classification,” in Proc. 6th ACM Conf. Data Appl. Secur. Privacy (CODASPY), 2016, pp. 183–194.
[15] B. Jung, T. Kim, and E. G. Im, “Malware classification using byte sequence information,” in Proc. Conf. Res. Adapt. Convergent Syst. (RACS), 2018, pp. 143–148.
[16] H. Xue, S. Sun, G. Venkataramani, and T. Lan, “Machine learning-based analysis of program binaries: A comprehensive study,” IEEE Access, vol. 7, pp. 65889–65912, 2019.
[17] E. Raff, R. Zak, R. Cox, J. Sylvester, P. Yacci, R. Ward, A. Tracy, M. McLean, and C. Nicholas, “An investigation of byte n-gram features for malware classification,” J. Comput. Virol. Hacking Techn., vol. 14, no. 1, pp. 1–20, Sep. 2016.
[18] G. Yan, N. Brown, and D. Kong, “Exploring discriminatory features for automated malware classification,” in Proc. Int. Conf. Detection Intrusions Malware, Vulnerability Assessment. Berlin, Germany: Springer, Jul. 2013, pp. 41–61.
[19] X. Hu, K. G. Shin, S. Bhatkar, and K. Griffin, “MutantX-S: Scalable malware clustering based on static features,” in Proc. USENIX Annu. Tech. Conf., 2013, pp. 187–198.
[20] B. G. Ryder, “Constructing the call graph of a program,” IEEE Trans. Softw. Eng., vol. SE-5, no. 3, pp. 216–226, May 1979.
[21] F. E. Allen, “Control flow analysis,” ACM SIGPLAN Notices, vol. 5, no. 7, pp. 1–19, Jul. 1970.
[22] J. Kinable and O. Kostakis, “Malware classification based on call graph clustering,” J. Comput. Virol., vol. 7, no. 4, pp. 233–245, Feb. 2011.
[23] O. Kostakis, J. Kinable, H. Mahmoudi, and K. Mustonen, “Improved call graph comparison using simulated annealing,” in Proc. ACM Symp. Appl. Comput. (SAC), 2011, pp. 1516–1523.
[24] L. Xu and E. Oja, “Improved simulated annealing, Boltzmann machine, and attributed graph matching,” in Proc. Eur. Assoc. Signal Process. Workshop. Berlin, Germany: Springer, Feb. 1990, pp. 151–160.
[25] X. Hu, T.-C. Chiueh, and K. G. Shin, “Large-scale malware indexing using function-call graphs,” in Proc. 16th ACM Conf. Comput. Commun. Secur. (CCS), 2009, pp. 611–620.
[26] M. Xu, L. Wu, S. Qi, J. Xu, H. Zhang, Y. Ren, and N. Zheng, “A similarity metric method of obfuscated malware using function-call graph,” J. Comput. Virol. Hacking Techn., vol. 9, no. 1, pp. 35–47, Jan. 2013.
[27] T. Dullien and R. Rolles, “Graph-based comparison of executable objects (English version),” in Proc. SSTIC, 2005, vol. 5, no. 1, p. 3.
[28] D. Kong and G. Yan, “Discriminant malware distance learning on structural information for automated malware classification,” in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2013, pp. 1357–1365.
[29] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[30] A. Z. Broder, “On the resemblance and containment of documents,” in Proc. Compress. Complex. SEQUENCES, Jun. 1997, pp. 21–29.
[31] X. Gao, C. Shan, C. Hu, Z. Niu, and Z. Liu, “An adaptive ensemble machine learning model for intrusion detection,” IEEE Access, vol. 7, pp. 82512–82521, 2019.
[32] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, “Do we need hundreds of classifiers to solve real world classification problems?” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3133–3181, 2014.
[33] H. S. Anderson and P. Roth, “EMBER: An open dataset for training static PE malware machine learning models,” 2018, arXiv:1804.04637. [Online]. Available: http://arxiv.org/abs/1804.04637
YIPIN ZHANG is currently pursuing the master’s degree with the Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University. Her research interests include network security and machine learning.
XIAOLIN CHANG (Member, IEEE) is currently a Professor with the School of Computer and Information Technology, Beijing Jiaotong University. Her current research interests include edge/cloud computing, network security, and security and privacy in machine learning.
YUZHOU LIN is currently pursuing the master’s degree with the Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University. His research interests include malware detection and machine learning.
JELENA MIŠIĆ (Fellow, IEEE) is currently a Professor of computer science with Ryerson University, Toronto, ON, Canada. She has published four books, over 125 articles in archival journals, and close to 190 articles at international conferences in the areas of computer networks and security. She is a member of ACM. She serves on the Editorial Board of the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, the IEEE INTERNET OF THINGS JOURNAL, the IEEE Network, Computer Networks, and Ad hoc Networks.
VOJISLAV B. MIŠIĆ (Senior Member, IEEE) is currently a Professor of computer science with Ryerson University, Toronto, ON, Canada. His research interests include performance evaluation of wireless networks and systems and software engineering. He is a member of ACM. He serves on the Editorial Boards of the IEEE TRANSACTIONS ON CLOUD COMPUTING, Ad Hoc Networks, Peer-to-Peer Networks and Applications, and the International Journal of Parallel, Emergent and Distributed Systems.