
UIC-ATC-ScalCom-CBDCom-IoP 2015

API Sequences based Malware Detection for Android

Jiawei Zhu∗†‡ , Zhengang Wu§ , Zhi Guan∗†‡ , Zhong Chen∗†‡


∗ Institute of Software, School of EECS, Peking University, Beijing, China
[email protected], [email protected], [email protected]
† MoE Key Lab of High Confidence Software Technologies (PKU), Beijing, China
‡ MoE Key Lab of Network and Software Security Assurance (PKU), Beijing, China
§ China Academy of Information and Communications Technology, Beijing, China
[email protected]

Zhi Guan is the corresponding author.

978-1-4673-7211-4/15 $31.00 © 2015 IEEE
DOI 10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.135

Abstract: To mitigate the security problems brought by Android malware, various approaches have been proposed, such as behavior based malware detection and data mining based malware detection. In this paper, we put forward a novel Android malware detection model using data mining techniques. We design an algorithm with two steps. The first step models Android application code as a graph structure, which we call the API control flow graph. The second step calculates the API sequences that fulfill a minimum intra-family support in each malware family, because malware in the same family usually shares similar behavior patterns. Finally, a supervised learning method is used to build our malware detection model with API sequences as input features. We evaluate this model with 1200 applications, half of them malicious and half benign, and find it effective in identifying Android malware, including unknown malware.

Keywords: Android malware; malware family; data mining; feature selection

I. INTRODUCTION
With the rising number of Android devices held by users and applications developed by developers, various applications with malicious purposes have begun to appear in this environment, causing losses of different degrees to Android users. These malicious applications can be used to damage the device's system, steal users' private information and even cause loss of property. Unfortunately, the situation keeps worsening day by day. Statistics showed that there were more than 75 million new malware samples detected and 200,000 new strains of malware each day [1]. More research on detecting insecure Android apps should therefore be done to protect Android users from malware.
In this paper, we propose a novel scheme which uses a data mining based malware detection method with fine grained features to detect Android malware. We use API sequences which reflect malicious behavior as features. To obtain these malicious behavior API sequences automatically rather than define them manually, we extract the API sequences inside each Android malware family that fulfill the minimum intra-family support. Finally, several machine learning algorithms are applied to obtain the best model for detecting Android malware.
In summary, our paper makes the following contributions to the detection of Android malware:
• Automatic generation of Android malicious behavior: As mentioned before, behavior based malware detection methods use manually defined behavior models to compare with the behavior of the application to be detected. In our paper, we automatically generate Android malicious behavior, represented as API sequences, by extracting the frequent API sequences in each Android malware family. These sequences are used as features of the detecting model.
• Effective detection results: Our experiment shows that our model is effective in detecting malware and can reach a high detection accuracy. Besides, the influence of two important arguments in our model, the intra-family minimum support and the minimum length of API sequences, is tested to build the best detecting model.
The rest of this paper proceeds as follows. Section II describes the related work about Android malware detection. Section III provides the detailed scheme of our work. Section IV describes the process of our experiment and Section V concludes.

II. RELATED WORK
Depending on whether the application is run in order to detect malware, static analysis and dynamic analysis are the two approaches to malware detection. On the Android platform, both static analysis and dynamic analysis schemes have been proposed by researchers in recent years. Because our work uses static analysis to detect Android malware, we describe the static analysis related work in this section.
For static analysis, static information such as API calls or requested permissions is first extracted from another representation of the application code, such as an AST (Abstract Syntax Tree) or a CFG (Control Flow Graph). This static information is then used as features for behavior based malware detection or data mining based malware detection. Many researchers used behavior based malware detection method
as mentioned above [2], [6], [3], [4]. In this case, taint analysis [5] was the most common technique for performing behavior analysis. In detail, FlowDroid [2] was designed to detect the flow of private personal information with taint analysis. Enck et al. [6] designed the Android decompiler ded, which decompiles Android applications to the original Java class files so that they can be handled by Java static analysis tools such as Fortify SCA. Some researchers also pay attention to data mining based malware detection methods, which run faster than behavior based malware detection methods. The detection of malware can be turned into a classification problem if enough labeled apps are available. DroidApiMiner [7] used single API calls as features with SVM, kNN, Random Forest and Naive Bayes to build the detection model. Drebin [8] utilized more features, such as API calls, permissions and static strings, to build the model. DroidMiner [9] uses a similar idea to automatically mine API sequences which can represent software behavior patterns, but we introduce a different algorithm to generate frequent sequences and run more detailed experiments on the machine learning process.

III. METHODOLOGY
A. Overview
In this paper, we put forward a data mining based Android malware detection scheme with behavior API sequences as features of the detecting model. The overview of our proposal is shown in Figure 1.

Figure 1. Proposal Overview

B. API CFG Generation
Control flow analysis is done on the decompiled Android code. The code is analyzed from its entry points, that is, the components which are defined in the Android manifest file. Unlike control flow analysis on Java code, Android code contains several entry points. Besides, the communication between components of Android applications should also be handled to generate an accurate CFG. For this problem, several papers have provided methods, such as [2].
After obtaining the CFG of the application file, we only want to focus on the sensitive API statement nodes and ignore other nodes, because our goal is to generate API call sequences. Therefore, before mining the frequent API call sequences of each malware family, we model the application code as an API control flow graph. Here, we define the API CFG as follows:
API Control Flow Graph
• Node: Each node represents a sensitive API call in Android OS.
• Edge: Each edge represents the control flow in the program.
To generate this graph, we design an algorithm to transform the CFG into the API CFG, shown in Algorithm 1. To deal correctly with user-defined methods, we use an iterative algorithm. For each user-defined method in the program, if it calls no user-defined methods, or the API CFGs of all user-defined methods it calls have already been obtained, then this method can be handled with the update function. For the update algorithm, DFS (Depth First Search) can be used to traverse the graph of this method (we call it method m). If a node with a sensitive API method is met, it is added to the new API CFG of m. If a node with a user-defined method um is met, the already obtained API CFG of um is inserted. Likewise, the new graph should add an edge from the last legal API node to the newly added API node.

Algorithm 1 API CFG Generation Algorithm
Require:
    Each user-defined method m's CFG CFG(m)
    Each user-defined method m's caller methods caller(m)
    User-defined method set S in which methods have not been handled yet
Ensure:
    API CFG of the whole program
repeat
    for each m ∈ S do
        flag = true;
        for each cm ∈ caller(m) do
            if cm ∈ S then
                flag = false;
                break;
        if flag is true then
            update(m);
            remove m from S;
until main method is not in set S

C. Frequent Behavior API Sequence Mining
To automatically generate behavior API call sequence set as the detecting model's feature, we design an algorithm

to extract the features inside each malware family. Because the malware in one family shares similar behavior patterns, our proposal tries to mine the common API sequences in each malware family. First, we process the API CFG of a single malware sample generated in the previous step and generate the sequence set of this sample. Then the algorithm merges the generated sequences of every sample in one malware family and calculates the frequent sequences occurring in this family as the features.
Our algorithm contains three steps. The first step is traversing and recording all key paths in the graph. Here, we define a key path of a graph as a path meeting the following properties:
• No node occurs more than once in the key path.
• The key path is not a prefix of any other such path in the graph, that is, it cannot be extended further.
To get all key paths in the graph, the algorithm traverses every node in the graph and maintains the path information of every visited node. If a path cannot be extended, it is one of the key paths of this graph; the algorithm then backtracks and updates the path information of the visited nodes in this key path until it finds another unvisited node. In the end, all key paths are recorded in this process.
The second step is extracting every subsequence of the key paths and merging them into a set without duplicates. Here, all subsequences of every key path are calculated and put into the subsequence set. After these two steps, we obtain all API subsequences of one malware sample from its API CFG.
The last step is choosing all subsequences that fulfill the intra-family support threshold. We count the occurrences of every subsequence and compare this number with the minimum support. In detail, the subsequences generated by all samples in a malware family are merged first. Then, a sorting algorithm is used to sort the merged elements. Finally, the sorted elements are visited from the first element on to count the occurrences of each sequence, and the API sequences fulfilling the minimum support are picked as the result. In this way, our proposal extracts frequent API sequences from the malware in a malware family, which are used as the input features of the detecting model.

D. Learning-based Detection
In our paper, we choose a supervised learning method with labeled Android applications as training examples to build our model for detecting Android malware. Malware detection can be viewed as a classification problem with two classes, benign and malicious apps. Specifically, we try the SVM, Naive Bayes and Decision Tree algorithms to train our model. The input features of the model are the frequent API sequences mined in the previous procedure. Formally, we define $a_i$ to be the $i$th app to be detected and $s_{ij}$ to be a binary value indicating whether the app contains the $j$th API sequence in its program. So each training example is represented as $a_i = (s_{i1}, s_{i2}, \ldots, s_{in})$. The output of this model is a binary number, −1 or 1.

IV. EXPERIMENT
A. Dataset
In our work, we use a supervised learning method to train the model, so we have to collect enough training examples to do the experiment. Besides, our proposal needs a feature selection procedure for each Android malware family. Therefore, the malware in our dataset is manually classified into different Android malware families. Specifically, we select 600 malware samples from 30 famous malware families (20 samples for each family, including FakeInstaller, GingerMaster and Geinimi) and 600 benign applications collected from Android Market.

B. Evaluation Method
In our experiment, we divide the data set evenly into 5 sets (formally defined as $S_1, S_2, S_3, S_4, S_5$) with 6 malware families and 120 benign apps in each set. 5-fold cross validation is used to evaluate the model quality. The detailed procedure is as follows.
• Step 1: Randomly choose 4 of the sets $S_1$ to $S_5$ as training examples. For the malware in these sets, we generate frequent API sequences as the input features. Then the training examples are represented in vector format by checking whether each input feature exists in the API CFG of each training example. Build the model with the SVM, Naive Bayes and ID3 algorithms.
• Step 2: Use the remaining set as test apps. These apps are also represented in vector format based on the input features.
• Step 3: Repeat Steps 1 and 2 five times and calculate the average accuracy.
We do the experiment in this way because the evaluation of our model may be inaccurate if we use part of the applications of every malware family to train the model. Some malware families must be left out of the training phase so that the effectiveness in detecting unknown malware can be tested.

C. Experiment Results
In this section, we evaluate the model with 1,200 Android apps. We measure the area under the ROC curve (AUC) of detecting models with different arguments to estimate how well our method detects Android malware.
First, we try different machine learning algorithms to train our detecting model. We use the WEKA tool [10], which contains implementations of several machine learning algorithms, to do our experiment. After 5-fold cross validation, we present in Table I the AUC of the three models learned by the SVM, Naive Bayes and ID3 algorithms with 100% support in each malware family and minimum sequence length 1. From our data, we find all these three models can obtain

an accurate prediction on malware detection. Furthermore, the SVM algorithm yields the best model for this dataset. In the following, we use SVM as the learning algorithm to test the influence of different arguments in our experiment.

Table I
QUALITY OF MODELS WITH DIFFERENT LEARNING ALGORITHMS

Algorithm      AUC
ID3            0.915
Naive Bayes    0.872
SVM            0.924

The minimum support for each malware family and the minimum length of API sequences are two arguments which can be tuned in our experiment. Because a malware family may contain variants of the original malware that show some additional behaviors, we measure the influence of the intra-family support. For the minimum support in a malware family, 60%, 70%, 80%, 90% and 100% are used in our model. Table II (a) shows that the model obtains the best quality when the support in a malware family is 90%. With a support lower than this number, the features may contain many redundant sequences which are not suitable for reflecting software behavior. The 100% support condition, in contrast, contains fewer features than the condition with 90% support, and so may not cover enough software behavior patterns for detecting unknown malware. Table II (b) shows the results of models with different minimum lengths of API sequences; choosing API sequences whose length is at least 2 obtains the best quality in our dataset.

Table II
INFLUENCE OF DIFFERENT ARGUMENTS TO OUR MODEL UNDER SVM ALGORITHM

(a) Quality of models with different minimum support

Min Support    AUC
60%            0.903
70%            0.889
80%            0.909
90%            0.937
100%           0.924

(b) Quality of models with different minimum sequence length

Min Sequence Length    AUC
1                      0.924
2                      0.930
3                      0.890

From these results, we find that the model learned by the SVM algorithm with minimum support 0.9 and minimum sequence length 2 achieves the best quality in our dataset. Judging from the AUC values we obtain, it is also an effective model for detecting Android malware.

V. CONCLUSION
In this paper, we put forward a data mining based Android malware detection model with fine grained features, which uses API sequences rather than single API calls as features. In our experiment, we evaluate our model with 1200 applications (600 malware from 30 malware families and 600 benign apps). The results show that all three models are able to predict malware accurately, and that the SVM algorithm obtains the best results among the three algorithms. Besides, we measure the influence of two arguments, the minimum support in a malware family and the minimum API sequence length. We find that the model with 0.9 support and minimum length 2 performs better than the other settings.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China under Project Code 61170263.

REFERENCES
[1] PandaLabs. [Online]. Available: http://www.pandasecurity.com/

[2] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel, “Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2014, p. 29.

[3] Z. Yang, M. Yang, Y. Zhang, G. Gu, P. Ning, and X. S. Wang, “Appintent: Analyzing sensitive data transmission in android for privacy leakage detection,” in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 1043–1054.

[4] A. P. Fuchs, A. Chaudhuri, and J. S. Foster, “Scandroid: Automated security certification of android applications,” Manuscript, Univ. of Maryland, http://www.cs.umd.edu/avik/projects/scandroidascaa, vol. 2, no. 3, 2009.

[5] J. Newsome and D. Song, “Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software,” 2005.

[6] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri, “A study of android application security.” in USENIX security symposium, vol. 2, 2011, p. 2.

[7] Y. Aafer, W. Du, and H. Yin, “Droidapiminer: Mining api-level features for robust malware detection in android,” in Security and Privacy in Communication Networks. Springer, 2013, pp. 86–103.

[8] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, K. Rieck, and C. Siemens, “Drebin: Effective and explainable detection of android malware in your pocket,” 2014.

[9] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras, “Droidminer: Automated mining and characterization of fine-grained malicious behaviors in android applications,” in Computer Security-ESORICS 2014. Springer, 2014, pp. 163–182.

[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: an update,” ACM SIGKDD explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.

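
The three mining steps of Section III-C (key paths, subsequences, intra-family support) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's code. Each app is assumed to be given as a set of API CFG edges (pairs of API names). Key paths are taken to be simple paths that cannot be extended, started from nodes without predecessors. "Subsequence" is read as a contiguous run of a key path. A Counter replaces the sort-and-count pass described in the text. The function names are hypothetical.

```python
from collections import Counter


def key_paths(edges):
    """Step 1: enumerate simple paths that cannot be extended any further."""
    succ, nodes = {}, set()
    for a, b in edges:
        succ.setdefault(a, []).append(b)
        nodes |= {a, b}
    # Start from nodes without predecessors; fall back to all nodes otherwise.
    roots = (nodes - {b for _, b in edges}) or nodes
    paths = []

    def dfs(path):
        nxt = [n for n in succ.get(path[-1], []) if n not in path]
        if not nxt:  # dead end: the current path is a key path
            paths.append(tuple(path))
        for n in nxt:
            dfs(path + [n])

    for root in sorted(roots):
        dfs([root])
    return paths


def subsequences(paths, min_len=1):
    """Step 2: collect all contiguous subsequences into a duplicate-free set."""
    out = set()
    for p in paths:
        for i in range(len(p)):
            for j in range(i + min_len, len(p) + 1):
                out.add(p[i:j])
    return out


def frequent_sequences(family, min_support=0.9, min_len=1):
    """Step 3: keep sequences found in at least min_support of the family."""
    counts = Counter()
    for edges in family:
        # Each app contributes a sequence at most once (its set is deduplicated).
        counts.update(subsequences(key_paths(edges), min_len))
    return {s for s, c in counts.items() if c / len(family) >= min_support}
```

The exhaustive path enumeration is exponential in the worst case, so this form is only meant to make the definitions concrete on small graphs.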

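
The feature encoding of Section III-D and the family-wise split of Section IV-B could be sketched as below. This is an illustrative sketch only: the helper names, the random seed and the round-robin assignment of families to sets are assumptions made here, and the classifiers themselves (trained with WEKA in the paper) are left out.

```python
import random


def to_vector(app_sequences, features):
    """Encode one app as (s_i1, ..., s_in): 1 if the j-th feature occurs."""
    return [1 if f in app_sequences else 0 for f in features]


def family_folds(families, benign_apps, k=5, seed=0):
    """Split whole families (not single samples) and benign apps into k sets."""
    rng = random.Random(seed)
    fams, apps = sorted(families), sorted(benign_apps)
    rng.shuffle(fams)
    rng.shuffle(apps)
    return [{"families": fams[i::k], "benign": apps[i::k]} for i in range(k)]


def train_test_rounds(folds):
    """Yield (training sets, held-out set) so each set is tested exactly once."""
    for i, held_out in enumerate(folds):
        yield [f for j, f in enumerate(folds) if j != i], held_out
```

With 30 family names and 600 benign app identifiers, family_folds returns five sets of 6 families and 120 benign apps each. Because entire families are held out, the malware in each test set comes from families that contributed no features and no training examples in that round, which matches the unknown malware setting described in Section IV-B.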