Machine Learning Aided Android Malware Classification

Title: Machine learning aided Android malware classification
Authors: Nikola, M, Dehghantanha, A and Kim-Kwang Raymond, C
Type: Article
DOI: http://dx.doi.org/10.1016/j.compeleceng.2017.02.013
URL: This version is available at: http://usir.salford.ac.uk/id/eprint/41554/
Published Date: 2017
Nikola Milosevic
School of Computer Science,
University of Manchester, UK
[email protected]
Ali Dehghantanha
School of Computing, Science and Engineering
University of Salford, UK
[email protected]
Abstract
The widespread adoption of Android devices and their capability to store ac-
cess significant private and confidential information have resulted in these de-
vices being targeted by malware developers. Existing Android malware analysis
techniques can be broadly categorized into static and dynamic analysis. In
this paper, we present two machine learning aided approaches for static anal-
ysis of Android malware. The first approach is based on permissions and the
other is based on source code analysis utilizing a bag-of-words representation
model. Our permission-based model is computationally inexpensive, and is im-
plemented as the OWASP Seraphimdroid Android app that can be obtained
from Google Play Store. Our evaluations of both approaches indicate an
F-measure of 95.1% for the source code-based classification model and 89%
for the permission-based classification model.
Keywords: Static malware analysis, OWASP, Seraphimdroid Android app,
Machine learning
2016 MSC: 00-01, 99-00
1. Introduction

In our increasingly connected society, the number and range of mobile devices
continue to increase. It is estimated that there will be approximately 6.1 billion
mobile device users by 2020 [1]. The wealth of private information that is stored
on, or can be accessed via, these devices has made them an attractive target for
cyber criminals [2]. Studies have also revealed that users generally do not have
anti-virus or anti-malware apps installed on their mobile devices, although the
effectiveness of such apps is also unclear or debatable [3]. Hence, mobile devices
are perceived by security professionals as among the “weakest links” in enterprise
security.
While all mobile operating systems/platforms have been targeted by mal-
ware developers, the trend is generally to focus on mobile operating systems
with a larger market share. A bigger market share [4], along with Google’s
flexible publishing policy on Android’s official application (also referred to as
app) market, Google Play, has resulted in Android users being a popular target
for malware developers. It is also known that the Android permission-based secu-
rity model provides little protection, as most users generally grant apps’ requested
permissions [5]. There have also been instances where malicious apps were suc-
cessfully uploaded to Google Play [6]. This suggests a need for more efficient
Android malware analysis tools.
Existing approaches for malware analysis can be broadly categorized into
dynamic malware analysis and static malware analysis. In static analysis, one
reviews and inspects the source code and binaries in order to find suspicious
patterns. Dynamic analysis (behavioral-based analysis) involves the execution
of the analyzed software in an isolated environment while monitoring and tracing
its behavior [7].
Early approaches to mobile malware detection were based on the detection
of anomalies in battery consumption [8]. Operating system events, such as API
calls, Input/Output requests, and resource locks, have also been used in dy-
namic malware detection approaches. For example, TaintDroid is a malware
detection system based on anomalies in the app’s data usage behavior [9]. In
[10], the authors created a system that monitored anomalies in Android Dalvik
op-codes frequencies to detect malicious apps. Several approaches utilized ma-
chine learning to classify malware based on their behaviors. For example, the
authors in [11] focused on run-time behavior and classified Android malware
into the malware families using inter-process communications in combination
with SVM. A random forest-based approach with a set of 42 vectors including
battery, CPU and memory usage as well as network behavior was also used for
Android malware detection in [12]. In [13], the authors used system calls and
regular expressions to detect data leakage, exploit execution and destructive
apps.
In order to avoid degrading mobile devices’ performance, solutions based
on distributed computing and collaborative analysis for both static and dynamic
malware analysis have also been proposed [7]. For example, M0Droid is an
Android anti-malware solution that analyzes system calls of Android apps on
the server and creates signatures which are pushed to the user devices for threat
detection [14].
Static malware analysis techniques mainly rely on manual human analysis,
which limits the speed and scalability of investigation. Different approaches to
automate the static analysis process have also been proposed. The authors of [15] suggested
transforming malware source code into Calculus for Communicating Systems
(CCS) statements and utilized formal methods for checking the software’s be-
havior. However, their approach requires human analysts to formally describe
the unwanted behavior, which could still be time-consuming. The authors in
[16] proposed a methodology to generate fingerprints of apps that capture bi-
nary and structural characteristics of the app. Machine learning techniques can
be used to automate static malware analysis process. In [17], pattern recogni-
tion techniques are used to detect malware, while other works used standard
machine learning algorithms such as perceptron, SVM, locality sensitive hashing
and decision trees to assist in malware analysis (see [18]). In [19], the authors
extracted network access function calls, process execution, string manipulation,
file manipulation and information reading, prior to applying different machine
learning algorithms to classify malicious programs. In [20], the authors ex-
tracted 100 features based on API calls, permissions, intents and related strings
of different Android apps and applied Eigen space analysis to detect malicious
programs. Sahs and Khan used Androguard to obtain permissions and control
flow graphs of Android apps and created an SVM-based machine learning model
to classify Android malware [21].
In this paper, we demonstrate the utility of employing machine learning
techniques in static analysis of Android malware. Specifically, techniques such
as manifest analysis and code analysis are utilized to detect malicious Android
apps. The contributions of this paper are twofold: (1) a computationally
inexpensive permission-based classification model, implemented in the OWASP
Seraphimdroid Android app; and (2) a source code-based classification model
utilizing a bag-of-words representation.
The structure of this paper is as follows. In the next section, we present the
research methodology used in this paper. Research results are then presented,
followed by a discussion of the findings. Finally, the paper is concluded and
several future directions are suggested.
2. Methodology
which can be learned by a machine learning algorithm. On the other hand, the
app code reflects the app’s behavior and, therefore, is a common choice for static
malware analysis. We utilize two machine learning techniques, namely: classi-
fication and clustering. As apps can be classified into malware and goodware,
the task of malware detection can be modeled as a classification problem.
Classification is a supervised machine learning technique, which can be used
to identify category or sub-population of a new observation based on labeled
data. Clustering is an unsupervised machine learning technique that is capable
of forming clusters of similar entities. Clustering algorithms are useful when
only a small portion of the dataset is labeled. The labeled examples can be utilized
to infer the class of unlabeled data. Labels obtained through clustering can be
subsequently used to retrain a classification model with more data.
In this research, we conducted four experiments, namely: permission-
based clustering, permission-based classification, source code-based clustering,
and source code-based classification. For the training and testing of our machine
learning models, we utilize the M0Droid dataset, which contains 200 malicious and
200 benign Android apps [14].
learning algorithms, including SVM, Naive Bayes, C4.5 Decision trees, JRIP
and AdaBoost. The classification algorithms we chose differ in their underlying
concepts. Support Vector Machine (SVM) is a non-probabilistic supervised machine
learning binary classification algorithm. SVM is capable of nonlinear classifica-
tion by mapping inputs into a high-dimensional feature space. C4.5 decision tree is
a statistical classifier that builds a decision tree based on information entropy.
At each node of the tree, the algorithm selects a feature and splits its set of samples
into subsets until classes can be inferred. Random forest is an ensemble classifi-
cation algorithm that combines a number of decision trees and returns the mode
130 of individual decisions by decision trees. Naive Bayes is a simple probabilistic
classifier that is based on applying Bayes theorem with strong independence
assumption between features. Bayesian network is a probabilistic graphical
model that represents a set of random variables and their inter-dependencies
in a directed acyclic graph. JRIP is a propositional rule learner that tries every
attribute with every possible value and adds the rule which results in the greatest
information gain. Logistic regression is a statistical regression model where the
dependent variable is used to estimate the probability of a binary response based on
multiple features. AdaBoost is a meta algorithm that can be used with many
other algorithms to improve their performance by combining their outputs into
a weighted sum which represents the final output.
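As a rough illustration of the permission-based classification idea, the following sketch trains a linear SVM on binary permission vectors. It uses scikit-learn's LinearSVC as a stand-in for the Weka SMO setup used in this paper, and the permission sets and labels are invented for the example.

```python
# Illustrative sketch only: the paper trains SVM (SMO) in Weka; scikit-learn's
# LinearSVC stands in here, and all permissions/labels below are invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Each app is represented by the set of permissions it requests.
apps = [
    {"INTERNET": 1, "SEND_SMS": 1, "READ_CONTACTS": 1},        # toy "malware"
    {"INTERNET": 1, "SEND_SMS": 1, "RECEIVE_BOOT_COMPLETED": 1},
    {"INTERNET": 1, "ACCESS_NETWORK_STATE": 1},                # toy "goodware"
    {"VIBRATE": 1, "ACCESS_NETWORK_STATE": 1},
]
labels = ["malware", "malware", "goodware", "goodware"]

vec = DictVectorizer(sparse=False)     # one binary column per permission
X = vec.fit_transform(apps)
clf = LinearSVC().fit(X, labels)

# Classify an unseen app by the permissions it declares in its manifest.
unseen = vec.transform([{"INTERNET": 1, "SEND_SMS": 1, "READ_CONTACTS": 1}])
prediction = clf.predict(unseen)[0]
```

The same vectorizer must be reused at prediction time so that permissions map to the same columns learned during training.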
We then used the modified Weka 3.6.6 library for Android 1 to develop the
OWASP Seraphimdroid Android app, which uses support vector machines
with sequential minimal optimization 2 .
We also apply several clustering techniques in order to evaluate the per-
formance of our unsupervised and supervised learning algorithms. Training,
testing and evaluation of our model are performed using Weka Toolkit by ap-
plying the Farthest First, Simple K-means and Expectation maximization (EM)
1 http://www.pervasive.jku.at/Teaching/lvaInfo.php?key=346&do=uebungen
2 https://www.owasp.org/index.php/OWASP_SeraphimDroid_Project
https://github.com/nikolamilosevic86/owasp-seraphimdroid
algorithms. Simple K-means is a clustering algorithm where samples are clus-
tered into n clusters, in which each sample belongs to a cluster with the nearest
mean. The Farthest First algorithm uses farthest-first traversal to find k clusters
that minimize the maximum diameter of a cluster, and Expectation maximiza-
tion (EM) assigns a probability distribution to each instance which indicates
the probability of it belonging to each of the clusters.
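The clustering step can be illustrated in the same spirit. The sketch below runs k-means on a few synthetic binary permission vectors; scikit-learn's KMeans stands in for Weka's SimpleKMeans, and the data is made up:

```python
# Illustrative sketch: k-means over synthetic binary permission vectors
# (scikit-learn's KMeans stands in for Weka's SimpleKMeans).
import numpy as np
from sklearn.cluster import KMeans

# Rows are apps, columns are binary permission indicators (made up).
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
assignments = km.labels_          # cluster index for each app
```

Apps requesting similar permission sets end up in the same cluster; as discussed later, cluster membership alone does not establish maliciousness.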
The second approach is a static analysis of the app’s source code. Malicious
codes generally use a combination of services, methods, and API calls in a way
that is not usual for non-malicious apps [11]. Machine learning algorithms are
capable of learning common combinations of malware services, API and system
calls to distinguish them from non-malicious apps.
In this approach, Android apps are first decompiled and then a text mining
classification based on the bag-of-words technique is used to train the model. The
bag-of-words technique has already shown promising results for classification of
harmful apps on personal computers [22]. Decompiling Android apps to conduct
static code analysis involves several steps. First, it is necessary to extract the
Dalvik Executable file (dex file) from the Android application package (APK
file). The second step is to transform the Dalvik Executable file into a Java
archive using the dex2jar tool 3 . Afterwards, we extract all .class files from the
Java archive and utilize Procyon Java decompiler (version 0.5.29) to decompile
.class files and create .java files. Then, we merge all Java source code files of
the same app into one large source file for further processing.
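The decompilation steps above could be scripted roughly as follows. The tool names and flags (d2j-dex2jar.sh, a Procyon decompiler jar) are assumptions about a local installation, so this sketch only constructs the commands and merges the resulting .java files rather than asserting exact CLI behavior:

```python
# Sketch of the decompilation pipeline described above. The tool names and
# flags (d2j-dex2jar.sh, procyon-decompiler.jar) are assumptions about a local
# installation; nothing external is executed here.
from pathlib import Path

def decompile_commands(apk: Path, workdir: Path) -> list:
    """Return the external-tool invocations for the APK -> Java pipeline."""
    jar = workdir / "classes.jar"
    src_dir = workdir / "src"
    return [
        # Step 1: Dalvik executable inside the APK -> Java archive (dex2jar).
        ["d2j-dex2jar.sh", str(apk), "-o", str(jar)],
        # Step 2: decompile the .class files in the jar to .java (Procyon).
        ["java", "-jar", "procyon-decompiler.jar", "-o", str(src_dir), str(jar)],
    ]

def merge_sources(src_dir: Path) -> str:
    """Concatenate all decompiled .java files into one document per app."""
    return "\n".join(p.read_text() for p in sorted(src_dir.rglob("*.java")))

commands = decompile_commands(Path("app.apk"), Path("work"))
```

In practice each command would be run with subprocess.run and checked for errors before merging.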
Since Java and natural language text have some degree of similarity, we apply
a technique used in natural language processing known as “bag-of-words”.
In this technique, the text, or Java source code in our case, is represented as a
bag or set of words which disregards the grammar or word order. The model
takes into account all words that appear in the code. Our approach considers the
3 https://github.com/pxb1988/dex2jar
whole code including import statements, method calls, function arguments, and
instructions. The source code obtained in the previous step is then tokenized
into unigrams that are used as a bag-of-words. We use several machine learn-
ing algorithms for classifications, namely: C4.5 decision trees (in Weka toolkit,
it is known as J48), Naive Bayes, Support Vector Machines with Sequential
Minimal Optimization, Random Forests, JRIP, Logistic Regression and Ad-
aBoostM1 with SVM base. We performed our training, testing and evaluation
using Weka Toolkit. For source code analysis, we also utilized ensemble learn-
ing with combinations of three and five algorithms and majority voting decision
system. Ensemble learning combines multiple machine learning algorithms over
the same input in the hope of improving classification performance. The number
of algorithms is chosen so that the system is able to unambiguously choose
the output class based on a majority of votes.
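A minimal sketch of the bag-of-words classification over decompiled source might look as follows, again with scikit-learn standing in for Weka and with toy source strings in place of real merged .java files:

```python
# Minimal bag-of-words sketch over toy "decompiled" source strings; the paper
# runs this in Weka over full merged .java files per app.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sources = [  # hypothetical merged source files, one string per app
    "import android.telephony.SmsManager; SmsManager.getDefault().sendTextMessage(premium, body)",
    "import android.telephony.SmsManager; sendTextMessage premium number loop",
    "import android.widget.TextView; view.setText(label); onClick handler",
    "import android.widget.Button; onClick startActivity intent",
]
labels = ["malware", "malware", "goodware", "goodware"]

# Tokenize into unigrams, disregarding grammar and word order (bag-of-words).
vec = CountVectorizer(token_pattern=r"[A-Za-z_]\w*")
X = vec.fit_transform(sources)
clf = LinearSVC().fit(X, labels)

unseen = vec.transform(["SmsManager sendTextMessage premium number"])
prediction = clf.predict(unseen)[0]
```

Identifiers such as class names, method calls and arguments all become tokens, so characteristic API combinations show up as distinguishing features.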
We also experiment with clustering on the source code. Clustering algo-
rithms we use include the Farthest First, Simple K-means and Expectation
maximization (EM). A flow diagram of the process is presented in Figure 1.
Figure 1: Workflow of Android file decompiling and machine learning-based malware detection
methodology
2.3. Ensemble learning
To improve the performance of our learning algorithms, our tests were per-
formed using ensemble learning with voting for both permission-based and
source code-based analysis. Ensemble methods use multiple classification al-
gorithms to obtain better performance than could be obtained from any of the
constituent algorithms individually. The final prediction is chosen as the label
that was predicted by the majority of classifiers. We also experiment with en-
sembles that contained combinations of three and five algorithms. An odd number
of algorithms allows us to unambiguously choose the class with majority voting.
For the classification algorithms, we use SVM, C4.5 decision trees, Random Tree,
Random Forests, JRIP, and Logistic Regression.
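The majority-voting scheme described above can be sketched with hard voting over three classifiers; scikit-learn's VotingClassifier stands in for the Weka ensembles, and the toy feature vectors and labels are invented:

```python
# Sketch of hard majority voting over three classifiers (scikit-learn stand-ins
# for the Weka algorithms; the toy data below is invented).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]]   # toy permission vectors
y = [1, 1, 0, 0]                                    # 1 = malware, 0 = goodware

# An odd number of voters guarantees an unambiguous majority for two classes.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("logreg", LogisticRegression()),
        ("forest", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",      # each classifier casts one vote; majority label wins
).fit(X, y)
prediction = ensemble.predict([[1, 1, 0]])[0]
```

The trade-off noted later in the results applies here too: every base classifier must be trained and queried, so prediction cost grows with the ensemble size.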
Recall is sometimes referred to as “sensitivity” or the “true positive rate”.
Given the numbers of true positives and false positives, precision
(also known as the “positive predictive rate”) is calculated as follows:
Precision = TP / (TP + FP)

The measure that combines precision and recall is known as the F-measure, given
as:

F = ((1 + β^2) ∗ Recall ∗ Precision) / (β^2 ∗ Precision + Recall),
where β indicates the relative value of precision. A value of β = 1 (which is
usually used) indicates that recall and precision are valued equally. A lower value
indicates a larger emphasis on precision and a higher value indicates a larger
emphasis on recall [23].
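As a quick numeric check of these formulas, assuming hypothetical counts of true positives, false positives and false negatives:

```python
# Worked numeric check of the precision/recall/F-measure formulas above;
# the TP/FP/FN counts are hypothetical.
def f_measure(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * recall * precision / (beta**2 * precision + recall)

# With 90 true positives, 10 false positives and 10 false negatives,
# precision = recall = 0.9, so F1 = 0.9.
f1 = f_measure(90, 10, 10)
```

Raising β above 1 shifts the score toward recall, so a model with low recall is penalized more under β = 2 than under β = 1.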
seconds to train the model. Instances were also classified faster, making
this approach suitable for real-time classification of (malicious) apps. We then
integrated this model for classification based on permissions with SVM in the
OWASP Seraphimdroid Android app 4 , which can be obtained from Google
Play Store 5 .
On the other hand, Bayesian algorithms such as Naive Bayes and Bayesian
networks have the worst performance. This could be due to the small dataset
(comprising only 387 instances) used in this study. Bayesian algorithms usu-
ally require much larger datasets than SVM to train the model with higher
accuracy. A larger dataset may also improve the SVM model performance.
The SVM algorithm outperforms Naive Bayes, Bayesian Network, JRip and Lo-
gistic regression on statistical t-test with a confidence interval of 0.05. However,
it is not significantly statistically better than decision trees and random forests.
In Table 2, we present the results of ensemble learning using majority voting.
We experimented with ensembles of three algorithms in order to determine
which algorithm(s) contribute to the best results in ensembles. The three best-
performing algorithms are SVM with SMO, Logistic Regression and Random
Forest, with an F-measure of 0.891. This is only a slight improvement compared
to using the SVM algorithm in isolation. The t-test suggested that ensemble
learning is not significantly better with a confidence interval of 0.05.
On the other hand, ensemble algorithms were much slower as more time is
needed to apply multiple machine learning algorithms (in our case, three or five)
and post-process results. Since the significance test showed that the performance
of the ensemble learning algorithm is not significantly better than the single
machine learning algorithm, there is no benefit in using these algorithms in
production.
Both results from the single classifier and ensemble method present a promis-
ing performance that can be used in anti-malware systems. This method would
4 https://github.com/nikolamilosevic86/owasp-seraphimdroid
5 https://play.google.com/store/apps/details?id=org.owasp.seraphimdroid
be able to detect unknown and new malware samples since it does not rely on
signatures, but rather on learned dangerous permission combinations. Our find-
ings echoed the findings of previous studies such as [24], which demonstrated
the potential of machine learning algorithms in achieving a high detection rate,
even on new malware samples.
There are, however, limitations with this approach. For the permission-
based approach, we reported an F-measure of 87.9% for single machine learning
algorithms. In other words, some malware samples would be undetected and
some non-malicious apps classified as malicious. In our case, 340 apps were cor-
rectly classified, while 47 were incorrectly classified. Using ensemble learning,
the number of misclassified instances was reduced to 42. Our reported perfor-
mance is higher than those reported in [25]. Also, our permission-based analysis
model is not computationally expensive, and when implemented in the OWASP
Seraphimdroid app, we were able to scan and classify all 83 installed apps on
the test device (i.e., a Nexus 5 device with Quad-core 2.3 GHz, 2 GB RAM) in
under 8 seconds.
The set of elements is clustered into a certain number of groups, which are
usually formed based on the elements’ similarity.
Table 3 presents the results of our permission-based clustering approach. In
our case, apps will be grouped according to whether they use a similar set of
permissions. However, if an app uses a similar set of permissions as some mal-
ware, it does not mean that the app is malicious. As can be seen from Table 3,
the results are not as good as those of classification. The best algorithm incorrectly clus-
tered more than 35% of the instances, while permission-based classification only
incorrectly classified around 10.5% of the instances. In our permission-based
analysis, clustering had a higher error rate than classification.
Table 4: Evaluation results of source code-based classification using single machine learning
algorithm
evaluated and had an F-score of over 90%. Therefore, source code appears to be
a viable source of information for a machine learning classification algorithm.
Also, with the machine learning-based source code analysis, it is possible to
analyze whether an Android package (apk) is malicious in less than 10 seconds,
which is significantly faster than a human analyst.
learning with voting had a slight improvement compared to the best results
from using single machine learning algorithms (the best F-measure of ensem-
ble learning was 0.956; the F-measure of SVM was 0.951) by combining SVM
with SMO, logistic regression, LogitBoost with simple regression functions as
base learners (simple logistic regression) and AdaBoostM1 with SVM as a base.
Some of the ensembles (e.g. C4.5 decision tree+random tree+random forests
or SVM with SMO+Logistic regression+Random Forest) performed worse than
SVM with SMO. Since the F-measure of C4.5 decision trees was 0.886, it had
a negative impact on the ensembles. Ensembles that contained SVM may have
misclassified some instances if the majority of algorithms voted for the wrong
class. The combination of algorithms in one case (SVM with SMO+Logistic re-
gression+Simple Logistic regression+AdaBoostM1 with SVM base) had slightly
improved classification performance (by 0.5% in F-measure), but it was not
statistically significant. Our source code analysis approach allows successful
classification of new malware in 95.1% cases with a single machine learning
algorithm.
Table 6 presents the results of source code clustering. These results were
more promising than those obtained from permission-based clustering since
the best performance of correctly clustered instances increased from 64.6% to
82.3%. The increase in performance is due to the fact that source code provides
a greater amount of data on which clustering can be based. However,
there were still 17.6% incorrectly clustered instances. Since clustering is a type
of unsupervised machine learning algorithm, it creates clusters that are based on
code similarity. This is not necessarily a good indication of the code’s malicious
behavior. The way clustering maps instances in the absence of any supervi-
sion is the main reason that its performance is worse than that of classification
algorithms. The results of non-supervised learning can be used for creating
larger labeled data sets. Classification (SVM) performed 14% better than the
best clustering method, which indicates that clustering should not be used for
detecting malware but only for expanding small datasets if necessary.
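The dataset-expansion idea could be sketched as follows: cluster a mixed labeled/unlabeled set and let each unlabeled app inherit the majority label of the labeled apps in its cluster. All vectors, labels and the cluster count here are synthetic:

```python
# Sketch of expanding a small labeled set via clustering: each unlabeled app
# inherits the majority label of the labeled apps in its cluster.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1],
])
known = {0: "malware", 3: "goodware"}   # only two of six apps are labeled

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = {}
for cluster in range(2):
    members = [i for i, c in enumerate(km.labels_) if c == cluster]
    votes = Counter(known[i] for i in members if i in known)
    majority = votes.most_common(1)[0][0] if votes else "unknown"
    for i in members:
        labels[i] = known.get(i, majority)  # keep given labels, infer the rest
```

The inferred labels are only as trustworthy as the clusters themselves, which is why this step is suited to bootstrapping training data rather than detection.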
files prior to analysis. However, detailed analysis of decompiled code does not
take more than 10 seconds per app. Practically, this method can be used to
scan and classify any apps including those on Google Play and other app stores.
Future research includes the evaluation of the proposed models using signif-
icantly bigger labeled, balanced data sets and utilizing online learning. Another
research focus is combining static and dynamic software analysis in which mul-
tiple machine learning classifiers are applied to analyze both source code and
dynamic features of apps at run-time.
References
[5] C. Chia, K.-K. R. Choo, D. Fehrenbacher, How cyber-savvy are older mo-
bile device users?, in: M. H. Au, K.-K. R. Choo (Eds.), Mobile Security
and Privacy, Syngress/Elsevier, Waltham, MA, 2017, Ch. 4, pp. 67–83.
[6] N. Viennot, E. Garcia, J. Nieh, A measurement study of Google Play, in:
ACM SIGMETRICS Performance Evaluation Review, Vol. 42, ACM, 2014,
pp. 221–233.
(CIS), 2011 Seventh International Conference on, IEEE, 2011, pp. 1011–
1015.
[22] V. A. Benjamin, H. Chen, Machine learning for attack vector identification
in malicious source code, in: Intelligence and Security Informatics (ISI),
2013 IEEE International Conference on, IEEE, 2013, pp. 21–23.
[24] J. Z. Kolter, M. A. Maloof, Learning to detect and classify malicious exe-
cutables in the wild, The Journal of Machine Learning Research 7 (2006)
2721–2744.
[25] Cyveillance, Cyveillance testing finds av vendors detect on average less than
19% of malware attacks (2010).
URL http://www.businesswire.com/news/home/20100804005348/en/
Cyveillance-Testing-Finds-AV-Vendors-Detect-Average
5. Biography of authors