Zhu 2015
Abstract: To mitigate the security problems brought by Android malware, various approaches have been proposed, such as behavior-based malware detection and data-mining-based malware detection. In this paper, we put forward a novel Android malware detection model using data mining techniques. We design an algorithm with two steps. The first step models Android application code as a graph structure, which we call the API control flow graph. The second step computes the API sequences that fulfill a minimum intra-family support in each malware family, because malware in the same family usually shares similar behavior patterns. Finally, a supervised learning method is used to build our malware detection model with the API sequences as input features. We evaluate this model with 1200 applications, half of them malicious and half benign, and find it effective in identifying Android malware, including unknown malware.

Keywords: Android malware; malware family; data mining; feature selection

I. INTRODUCTION

With the rising number of Android devices held by users and applications developed by developers, applications with malicious purposes have begun to appear in this environment, causing losses of varying degree to Android users. These malicious applications can be used to damage the device's system, steal users' private information, and even cause financial loss. Unfortunately, the problem keeps growing day by day. Statistics show that more than 75 million new malware samples were detected, along with 200,000 new strains of malware each day [1]. More research on security detection for Android apps is therefore needed to protect Android users from malware.

In this paper, we propose a novel scheme that uses a data-mining-based malware detection method with fine-grained features to detect Android malware. We use API sequences that reflect malicious behavior as features. To obtain these malicious-behavior API sequences automatically rather than defining them manually, we extract, inside each Android malware family, the API sequences that fulfill the minimum intra-family support. Finally, several machine learning algorithms are applied in this setting to obtain the best model for detecting Android malware.

In summary, our paper makes the following contributions to the detection of Android malware:
• Automatic generation of Android malicious behavior: Behavior-based malware detection methods use manually defined behavior models to compare with the behavior of the application under inspection. In our paper, we automatically generate Android malicious behavior, represented as API sequences, by extracting the frequent API sequences in each Android malware family; these sequences are used as the features of the detection model.
• Effective detection results: Our experiments show that the model is effective in detecting malware and reaches a high detection accuracy. In addition, we test the influence of two important parameters of our model, the minimum intra-family support and the minimum length of API sequences, in order to build the best detection model.

The remainder of this paper proceeds as follows. Section II describes related work on Android malware detection. Section III provides the detailed scheme of our work. Section IV describes our experiments, and Section V concludes.

II. RELATED WORK

Depending on whether the application is run during detection, malware detection approaches fall into static analysis and dynamic analysis. On the Android platform, both static and dynamic analysis schemes have been proposed by researchers in recent years. Because our work uses static analysis to detect Android malware, we review static-analysis related work in this section.

For static analysis, static information such as the API calls made or the permissions requested is first extracted from another representation of the application code, such as the AST (Abstract Syntax Tree) or the CFG (Control Flow Graph). This static information is then used as features for behavior-based malware detection or data-mining-based malware detection. Many researchers used behavior based malware detection method
Zhi Guan is corresponding author researchers used behavior based malware detection method
to extract the features inside each malware family. Because there are similar behavior patterns within each malware family, our proposal tries to mine the common API sequences in each malware family. First, we process each single malware's API CFG generated in the previous step and generate the sequence set of this malware. Then the algorithm merges the generated sequences of all malware in one malware family and computes the frequent sequences occurring in this family as the features.

Our algorithm contains three steps. The first step is traversing and recording all key paths in the graph. Here, we define a key path of a graph as a path with the following properties:
• No node occurs more than once in the key path.
• No path in the graph is a prefix of this key path.

To get all key paths in the graph, the algorithm traverses every node in the graph and maintains the path information of every visited node. If a path cannot be extended any further, it is one of the key paths of this graph; the algorithm then backtracks, updating the path information of the visited nodes on this key path, until it finds another unvisited node. In the end, all key paths are recorded by this process.

The second step is extracting every subsequence of the key paths and merging them into a duplicate-free set. Here, all subsequences of every key path are computed and put into the subsequence set. After these two steps, we obtain all API subsequences of one malware from its API CFG.

The last step is choosing all subsequences that fulfill the intra-family support threshold. We count the occurrences of every subsequence and compare this count with the minimum support. In detail, the subsequences generated by all malware in a malware family are first merged into one collection. Then a sorting algorithm is used to sort the elements of this collection. Finally, the sorted elements are visited from the first one to count the occurrences of each sequence, and the API sequences fulfilling the minimum support are picked as the result. In this way, our proposal extracts frequent API sequences from the malware in a malware family, which are used as the input features of the detection model.

D. Learning-based Detection

In our paper, we choose a supervised learning method, with labeled Android applications as training examples, to build our model for detecting Android malware. Malware detection can be viewed as a classification problem with two classes, benign and malicious apps. Specifically, we try the SVM, Naive Bayes and Decision Tree algorithms to train our model. The input features of the model are the frequent API sequences mined in the previous procedure. Formally, we define $a_i$ to be the $i$-th app to be detected and $s_{ij}$ to be the binary value indicating whether this app contains the $j$-th API sequence in its program. Each training example is thus represented as $a_i = (s_{i1}, s_{i2}, \ldots, s_{in})$. The output of this model is a binary number, −1 or 1.

IV. EXPERIMENT

A. Dataset

In our work, we use a supervised learning method to train the model, so we have to collect enough training examples for the experiment. Besides, our proposal needs a feature selection procedure for each Android malware family. Therefore, the malware in our dataset is manually classified into different Android malware families. Specifically, we select 600 malware samples from 30 well-known malware families (20 samples per family, including FakeInstaller, GingerMaster and Geinimi) and 600 benign applications collected from Android Market.

B. Evaluation Method

In our experiment, we divide the data set evenly into 5 sets (formally denoted $S_1, S_2, S_3, S_4, S_5$), with 6 malware families and 120 benign apps in each set. 5-fold cross validation is used to evaluate the model quality. The detailed procedure is as follows.
• Step 1: Randomly choose 4 of the sets $S_1$ to $S_5$ as training examples. For the malware in these sets, we generate the frequent API sequences as the input features. The training examples are then represented as vectors by checking whether each input feature exists in each training example's API CFG. The model is built with the SVM, Naive Bayes and ID3 algorithms.
• Step 2: Use the remaining set as test apps. These apps are likewise represented as vectors based on the input features.
• Step 3: Repeat Steps 1 and 2 five times and calculate the average accuracy.

We run the experiment in this way because the evaluation of our model may not be accurate if we train the model on part of the applications from every malware family. Some malware families must be left out of the training phase so that the effectiveness in detecting unknown malware can be tested.

C. Experiment Results

In this section, we evaluate the model with 1,200 Android apps. We measure the area under the ROC curve (AUC) of detection models with different parameters to estimate how well our method detects Android malware.

First, we try different machine learning algorithms to train our detection model. We use the WEKA tool [10], which contains implementations of several machine learning algorithms, for our experiment. After 5-fold cross validation, we present in Table I the AUC of the three models learned by the SVM, Naive Bayes and ID3 algorithms with 100% support in the malware family and minimum sequence length 1. From our data, we find that all three models can obtain
an accurate prediction on malware detection. Furthermore, the SVM algorithm yields the best model for this dataset. In the following, we use SVM as the learning algorithm to test the influence of different parameters in our experiment.

Table I
QUALITY OF MODELS WITH DIFFERENT LEARNING ALGORITHMS

Algorithm      AUC
ID3            0.915
Naive Bayes    0.872
SVM            0.924

The minimum support for each malware family and the minimum length of API sequences are two parameters that can be tuned in our experiment to show their influence. Because a malware family contains variants of the original malware that add some behaviors, we measure the influence of the intra-family support. For the minimum support in a malware family, 60%, 70%, 80%, 90% and 100% are used in our model. Table II (a) shows that the model obtains the best quality when the support in the malware family is 90%. For supports below this value, the features may contain many redundant sequences that are not suitable as features reflecting software behavior. The 100% support condition, in turn, contains fewer features than the 90% condition and may not cover enough software behavior patterns for detecting unknown malware. Table II (b) shows the results of models with different minimum lengths of API sequences: choosing API sequences whose length is at least 2 obtains the best quality on our dataset.

Table II
INFLUENCE OF DIFFERENT PARAMETERS ON OUR MODEL UNDER THE SVM ALGORITHM

(a) Quality of models with different minimum support

Min Support    AUC
60%            0.903
70%            0.889
80%            0.909
90%            0.937
100%           0.924

(b) Quality of models with different minimum sequence length

Min Sequence Length    AUC
1                      0.924
2                      0.930
3                      0.890

V. CONCLUSION

(600 malware from 30 malware families and 600 benign apps). The results show that all three models make accurate predictions in malware detection, with the SVM algorithm obtaining the best results among them. Besides, we measure the influence of two parameters, the minimum support in a malware family and the minimum API sequence length. We find that the model with 0.9 support and minimum length 2 performs better than the other settings.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Project Code 61170263.

REFERENCES

[1] PandaLabs. [Online]. Available: http://www.pandasecurity.com/

[2] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel, “Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2014, p. 29.

[3] Z. Yang, M. Yang, Y. Zhang, G. Gu, P. Ning, and X. S. Wang, “Appintent: Analyzing sensitive data transmission in android for privacy leakage detection,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 1043–1054.

[4] A. P. Fuchs, A. Chaudhuri, and J. S. Foster, “Scandroid: Automated security certification of android applications,” Manuscript, Univ. of Maryland, http://www.cs.umd.edu/avik/projects/scandroidascaa, vol. 2, no. 3, 2009.

[5] J. Newsome and D. Song, “Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software,” 2005.

[6] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri, “A study of android application security,” in USENIX security symposium, vol. 2, 2011, p. 2.

[7] Y. Aafer, W. Du, and H. Yin, “Droidapiminer: Mining api-level features for robust malware detection in android,” in Security and Privacy in Communication Networks. Springer, 2013, pp. 86–103.
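
As a supplementary illustration of the three mining steps described in the paper (key-path traversal, subsequence extraction, and intra-family support filtering), the following Python sketch may be helpful. It is not the authors' implementation and rests on several assumptions: a key path is read as a maximal path without repeated nodes that starts at a node with no predecessor, subsequences are taken to be contiguous, and support is the percentage of apps in a family that contain a sequence. All function names and the toy graphs are hypothetical, and a `Counter` stands in for the sort-and-count pass described in the text.

```python
from collections import Counter


def key_paths(graph):
    """Enumerate key paths of an API CFG given as {node: [successors]}.

    Assumption: a key path starts at a node without predecessors, never
    repeats a node, and ends when it cannot be extended any further.
    """
    preds = {s for succs in graph.values() for s in succs}
    # Fall back to every node if the graph has no predecessor-free node.
    entries = [n for n in graph if n not in preds] or list(graph)
    paths = []

    def dfs(node, path):
        path.append(node)
        nxt = [s for s in graph.get(node, ()) if s not in path]
        if not nxt:                      # cannot move forward: record key path
            paths.append(tuple(path))
        for s in nxt:
            dfs(s, path)
        path.pop()                       # backtrack

    for entry in entries:
        dfs(entry, [])
    return paths


def subsequences(path, min_len=1):
    """All contiguous subsequences of a key path with length >= min_len."""
    n = len(path)
    return {path[i:j] for i in range(n) for j in range(i + min_len, n + 1)}


def app_sequences(graph, min_len=1):
    """Duplicate-free set of API subsequences of one app."""
    seqs = set()
    for path in key_paths(graph):
        seqs |= subsequences(path, min_len)
    return seqs


def frequent_sequences(family_graphs, min_support_pct, min_len=1):
    """Sequences contained in at least min_support_pct percent of the family."""
    counts = Counter()
    for graph in family_graphs:
        counts.update(app_sequences(graph, min_len))  # one vote per app
    n = len(family_graphs)
    # Integer comparison avoids floating-point rounding at the threshold.
    return {s for s, c in counts.items() if c * 100 >= min_support_pct * n}


# Toy family of two apps (hypothetical API CFGs).
family = [
    {"getDeviceId": ["openConnection"],
     "openConnection": ["sendTextMessage"],
     "sendTextMessage": []},
    {"getDeviceId": ["openConnection"], "openConnection": []},
]
print(frequent_sequences(family, 100, min_len=2))
# {('getDeviceId', 'openConnection')}
```

The enumeration of simple paths can grow exponentially on dense graphs, so a sketch like this is only suitable for small examples.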
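
The feature representation and the family-wise 5-fold evaluation described in the paper can also be sketched in Python. This is only an illustration under stated assumptions: the helper names (`to_vector`, `family_folds`, `cv_splits`) are hypothetical, whole families are assigned to folds by a seeded shuffle, and the training of the classifiers themselves (done with WEKA in the paper) is left out.

```python
import random


def to_vector(app_seqs, features):
    """Binary vector: entry j is 1 if the app contains the j-th feature sequence."""
    return [1 if f in app_seqs else 0 for f in features]


def family_folds(families, benign, k=5, seed=0):
    """Split whole malware families and benign apps into k folds.

    Keeping each family inside a single fold means the test fold only
    contains families that were never seen during training.
    """
    rng = random.Random(seed)
    fams = sorted(families)
    rng.shuffle(fams)
    apps = list(benign)
    rng.shuffle(apps)
    return [{"families": fams[i::k], "benign": apps[i::k]} for i in range(k)]


def cv_splits(folds):
    """Yield (training folds, test fold), each fold serving once as the test set."""
    for i, test in enumerate(folds):
        train = [f for j, f in enumerate(folds) if j != i]
        yield train, test


# 30 families and 600 benign apps, as in the dataset described in the paper.
folds = family_folds([f"family_{i}" for i in range(30)], range(600))
for train, test in cv_splits(folds):
    train_families = {name for fold in train for name in fold["families"]}
    # No test family appears in the training folds.
    assert not train_families & set(test["families"])
```

In such a setup, the frequent API sequences would be mined only from the malware in the training folds, and both training and test apps would then be encoded with `to_vector` over that feature list.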