Machine Learning Aided Android Malware Classification

Title: Machine learning aided Android malware classification
Authors: Nikola, M, Dehghantanha, A and Kim-Kwang Raymond, C
Type: Article
DOI: http://dx.doi.org/10.1016/j.compeleceng.2017.02.013
URL: This version is available at: http://usir.salford.ac.uk/id/eprint/41554/
Published Date: 2017
Nikola Milosevic
School of Computer Science,
University of Manchester, UK
[email protected]
Ali Dehghantanha
School of Computing, Science and Engineering
University of Salford, UK
[email protected]
Abstract
The widespread adoption of Android devices and their capability to store ac-
cess significant private and confidential information have resulted in these de-
vices being targeted by malware developers. Existing Android malware analysis
techniques can be broadly categorized into static and dynamic analysis. In
this paper, we present two machine learning aided approaches for static anal-
ysis of Android malware. The first approach is based on permissions and the
other is based on source code analysis utilizing a bag-of-words representation
model. Our permission-based model is computationally inexpensive, and is im-
plemented as the OWASP Seraphimdroid Android app that can be obtained
from Google Play Store. Our evaluations of both approaches indicate an
F-measure of 95.1% for the source code-based classification model and 89%
for the permission-based classification model.
Keywords: Static malware analysis, OWASP, Seraphimdroid Android app,
Machine learning
2016 MSC: 00-01, 99-00
1. Introduction

In our increasingly connected society, the number and range of mobile devices
continue to increase. It is estimated that there will be approximately 6.1 billion
mobile device users by 2020 [1]. The wealth of private information that is stored
on, or can be accessed via, these devices has made them an attractive target for
cyber criminals [2]. Studies have also revealed that users generally do not have
anti-virus or anti-malware apps installed on their mobile devices, although the
effectiveness of such apps is also unclear or debatable [3]. Hence, mobile devices
are perceived by security professionals as among the “weakest links” in enterprise
security.
While all mobile operating systems/platforms have been targeted by mal-
ware developers, the trend is generally to focus on mobile operating systems
with a larger market share. A bigger market share [4], along with Google’s
flexible publishing policy on Android’s official application (also referred to as
app) market, Google Play, has resulted in Android users being a popular target
for malware developers. It is also known that the Android permission-based secu-
rity model provides little protection, as most users generally grant apps’ requested
permissions [5]. There have also been instances where malicious apps were suc-
cessfully uploaded to Google Play [6]. This suggests a need for more efficient
Android malware analysis tools.
Existing approaches for malware analysis can be broadly categorized into
dynamic malware analysis and static malware analysis. In static analysis, one
reviews and inspects the source code and binaries in order to find suspicious
patterns. Dynamic analysis (behavioral-based analysis) involves the execution
of the analyzed software in an isolated environment while monitoring and tracing
its behavior [7].
Early approaches to mobile malware detection were based on the detection
of anomalies in battery consumption [8]. Operating system events, such as API
calls, Input/Output requests, and resource locks, have also been used in dy-
namic malware detection approaches. For example, TaintDroid is a malware
detection system based on anomalies in the app’s data usage behavior [9]. In
[10], the authors created a system that monitored anomalies in Android Dalvik
op-codes frequencies to detect malicious apps. Several approaches utilized ma-
chine learning to classify malware based on their behaviors. For example, the
authors in [11] focused on run-time behavior and classified Android malware
into the malware families using inter-process communications in combination
with SVM. A random forest-based approach with a set of 42 vectors including
battery, CPU and memory usage as well as network behavior was also used for
Android malware detection in [12]. In [13], the authors used system calls and
regular expressions to detect data leakage, exploit execution and destructive
apps.
In order to avoid degrading mobile devices’ performance, solutions based
on distributed computing and collaborative analysis for both static and dynamic
malware analysis have also been proposed [7]. For example, M0Droid is an
Android anti-malware solution that analyzes system calls of Android apps on
the server and creates signatures which are pushed to the user devices for threat
detection [14].
Static malware analysis techniques mainly rely on manual human analysis,
which limits the speed and scalability of investigation. Different approaches to
automate the static analysis process have also been proposed. The authors of [15] suggested
transforming malware source code into Calculus for Communicating Systems
(CCS) statements and utilized formal methods for checking the software’s be-
havior. However, their approach requires human analysts to formally describe
the unwanted behavior, which could still be time-consuming. The authors in
[16] proposed a methodology to generate fingerprints of apps that capture bi-
nary and structural characteristics of the app. Machine learning techniques can
be used to automate static malware analysis process. In [17], pattern recogni-
tion techniques are used to detect malware, while other works used standard
machine learning algorithms such as perceptron, SVM, locality sensitive hashing
and decision trees to assist in malware analysis (see [18]). In [19], the authors
extracted network access function calls, process execution, string manipulation,
file manipulation and information reading, prior to applying different machine
learning algorithms to classify malicious programs. In [20], the authors ex-
tracted 100 features based on API calls, permissions, intents and related strings
of different Android apps and applied Eigen space analysis to detect malicious
programs. Sahs and Khan used Androguard to obtain permissions and control
flow graphs of Android apps and created an SVM-based machine learning model
to classify Android malware [21].
In this paper, we demonstrate the utility of employing machine learning
techniques in static analysis of Android malware. Specifically, techniques such
as manifest analysis and code analysis are utilized to detect malicious Android
apps. The contributions of this paper are twofold: (1) a computationally
inexpensive permission-based classification model, implemented in the OWASP
Seraphimdroid Android app; and (2) a source code-based classification model
utilizing a bag-of-words representation.
The structure of this paper is as follows. In the next section, we present the
research methodology used in this paper. Research results are then presented,
followed by a discussion of the findings. Finally, the paper is concluded and
several future directions are suggested.
2. Methodology
which can be learned by a machine learning algorithm. On the other hand, the
app code reflects the app’s behavior and, therefore, is a common choice for static
malware analysis. We utilize two machine learning techniques, namely: classi-
fication and clustering. As apps can be classified into malware and goodware,
the task of malware detection can be modeled as a classification problem.
Classification is a supervised machine learning technique, which can be used
to identify category or sub-population of a new observation based on labeled
data. Clustering is an unsupervised machine learning technique that is capable
of forming clusters of similar entities. Clustering algorithms are useful when
only a small portion of the dataset is labeled. The labeled examples can be utilized
to infer the class of unlabeled data. Labels obtained through clustering can be
subsequently used to retrain a classification model with more data.
In this research, we conducted four experiments, namely: permission-
based clustering, permission-based classification, source code-based clustering,
and source code-based classification. For the training and testing of our machine
learning models, we utilize the M0Droid dataset, which contains 200 malicious and
200 benign Android apps [14].
learning algorithms, including SVM, Naive Bayes, C4.5 Decision trees, JRIP
and AdaBoost. The classification algorithms we chose differ in their underlying
concepts. Support Vector Machine (SVM) is a non-probabilistic supervised machine
learning binary classification algorithm. SVM is capable of nonlinear classifica-
tion by mapping inputs into a high-dimensional feature space. C4.5 decision tree is
a statistical classifier that builds a decision tree based on information entropy.
At each node of the tree, the algorithm selects a feature and splits its set of samples
into subsets until classes can be inferred. Random forest is an ensemble classifi-
cation algorithm that combines a number of decision trees and returns the mode
130 of individual decisions by decision trees. Naive Bayes is a simple probabilistic
classifier that is based on applying Bayes theorem with strong independence
assumption between features. Bayesian network is a probabilistic graphical
model that represents a set of random variables and their inter-dependencies
in a directed acyclic graph. JRIP is a propositional rule learner that tries every
attribute with every possible value and adds the rule which results in the greatest
information gain. Logistic regression is a statistical regression model where the
dependent variable is used to estimate the probability of a binary response based on
multiple features. AdaBoost is a meta algorithm that can be used with many
other algorithms to improve their performance by combining their outputs into
a weighted sum which represents the final output.
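As a rough illustration of the permission-based classification idea, the following sketch trains a linear SVM on binary permission vectors. It uses scikit-learn's LinearSVC as a stand-in for the Weka SMO setup used in this paper, and the permission sets and labels are invented for the example.

```python
# Illustrative sketch only: the paper trains SVM (SMO) in Weka; scikit-learn's
# LinearSVC stands in here, and all permissions/labels below are invented.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# Each app is represented by the set of permissions it requests.
apps = [
    {"INTERNET": 1, "SEND_SMS": 1, "READ_CONTACTS": 1},        # toy "malware"
    {"INTERNET": 1, "SEND_SMS": 1, "RECEIVE_BOOT_COMPLETED": 1},
    {"INTERNET": 1, "ACCESS_NETWORK_STATE": 1},                # toy "goodware"
    {"VIBRATE": 1, "ACCESS_NETWORK_STATE": 1},
]
labels = ["malware", "malware", "goodware", "goodware"]

vec = DictVectorizer(sparse=False)     # one binary column per permission
X = vec.fit_transform(apps)
clf = LinearSVC().fit(X, labels)

# Classify an unseen app by the permissions it declares in its manifest.
unseen = vec.transform([{"INTERNET": 1, "SEND_SMS": 1, "READ_CONTACTS": 1}])
prediction = clf.predict(unseen)[0]
```

The same vectorizer must be reused at prediction time so that permissions map to the same columns learned during training.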
We then used the modified Weka 3.6.6 library for Android 1 to develop the
OWASP Seraphimdroid Android app, which uses support vector machines
with sequential minimal optimization 2 .
We also apply several clustering techniques in order to evaluate the per-
formance of our unsupervised and supervised learning algorithms. Training,
testing and evaluation of our model are performed using Weka Toolkit by ap-
plying the Farthest First, Simple K-means and Expectation maximization (EM)
1 http://www.pervasive.jku.at/Teaching/lvaInfo.php?key=346&do=uebungen
2 https://www.owasp.org/index.php/OWASP_SeraphimDroid_Project
https://github.com/nikolamilosevic86/owasp-seraphimdroid
algorithms. Simple K-means is a clustering algorithm where samples are clus-
tered into n clusters, in which each sample belongs to a cluster with the nearest
mean. The Farthest First algorithm uses farthest-first traversal to find k clusters
that minimize the maximum diameter of a cluster, and Expectation maximiza-
tion (EM) assigns a probability distribution to each instance which indicates
the probability of it belonging to each of the clusters.
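The clustering step can be illustrated in the same spirit. The sketch below runs k-means on a few synthetic binary permission vectors; scikit-learn's KMeans stands in for Weka's SimpleKMeans, and the data is made up:

```python
# Illustrative sketch: k-means over synthetic binary permission vectors
# (scikit-learn's KMeans stands in for Weka's SimpleKMeans).
import numpy as np
from sklearn.cluster import KMeans

# Rows are apps, columns are binary permission indicators (made up).
X = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
assignments = km.labels_          # cluster index for each app
```

Apps requesting similar permission sets end up in the same cluster; as discussed later, cluster membership alone does not establish maliciousness.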
The second approach is a static analysis of the app’s source code. Malicious
codes generally use a combination of services, methods, and API calls in a way
that is not usual for non-malicious apps [11]. Machine learning algorithms are
capable of learning common combinations of malware services, API and system
calls to distinguish them from non-malicious apps.
In this approach, Android apps are first decompiled and then a text mining
classification based on the bag-of-words technique is used to train the model. The
bag-of-words technique has already shown promising results for classification of
harmful apps on personal computers [22]. Decompiling Android apps to conduct
static code analysis involves several steps. First, it is necessary to extract the
Dalvik Executable file (dex file) from the Android application package (APK
file). The second step is to transform the Dalvik Executable file into a Java
archive using the dex2jar tool 3 . Afterwards, we extract all .class files from the
Java archive and utilize Procyon Java decompiler (version 0.5.29) to decompile
.class files and create .java files. Then, we merge all Java source code files of
the same app into one large source file for further processing.
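The decompilation steps above could be scripted roughly as follows. The tool names and flags (d2j-dex2jar.sh, a Procyon decompiler jar) are assumptions about a local installation, so this sketch only constructs the commands and merges the resulting .java files rather than asserting exact CLI behavior:

```python
# Sketch of the decompilation pipeline described above. The tool names and
# flags (d2j-dex2jar.sh, procyon-decompiler.jar) are assumptions about a local
# installation; nothing external is executed here.
from pathlib import Path

def decompile_commands(apk: Path, workdir: Path) -> list:
    """Return the external-tool invocations for the APK -> Java pipeline."""
    jar = workdir / "classes.jar"
    src_dir = workdir / "src"
    return [
        # Step 1: Dalvik executable inside the APK -> Java archive (dex2jar).
        ["d2j-dex2jar.sh", str(apk), "-o", str(jar)],
        # Step 2: decompile the .class files in the jar to .java (Procyon).
        ["java", "-jar", "procyon-decompiler.jar", "-o", str(src_dir), str(jar)],
    ]

def merge_sources(src_dir: Path) -> str:
    """Concatenate all decompiled .java files into one document per app."""
    return "\n".join(p.read_text() for p in sorted(src_dir.rglob("*.java")))

commands = decompile_commands(Path("app.apk"), Path("work"))
```

In practice each command would be run with subprocess.run and checked for errors before merging.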
Since Java and natural language text have some degree of similarity, we apply
a technique used in natural language processing known as “bag-of-words”.
In this technique, the text, or Java source code in our case, is represented as a
bag or set of words which disregards the grammar or word order. The model
takes into account all words that appear in the code. Our approach considers the
3 https://github.com/pxb1988/dex2jar
whole code including import statements, method calls, function arguments, and
instructions. The source code obtained in the previous step is then tokenized
into unigrams that are used as a bag-of-words. We use several machine learn-
ing algorithms for classifications, namely: C4.5 decision trees (in Weka toolkit,
it is known as J48), Naive Bayes, Support Vector Machines with Sequential
Minimal Optimization, Random Forests, JRIP, Logistic Regression and Ad-
aBoostM1 with SVM base. We performed our training, testing and evaluation
using Weka Toolkit. For source code analysis, we also utilized ensemble learn-
ing with combinations of three and five algorithms and majority voting decision
system. Ensemble learning combines multiple machine learning algorithms over
the same input in the hope of improving classification performance. The number
of algorithms is chosen so that the system is able to unambiguously choose
the output class based on a majority of votes.
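A minimal sketch of the bag-of-words classification over decompiled source might look as follows, again with scikit-learn standing in for Weka and with toy source strings in place of real merged .java files:

```python
# Minimal bag-of-words sketch over toy "decompiled" source strings; the paper
# runs this in Weka over full merged .java files per app.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

sources = [  # hypothetical merged source files, one string per app
    "import android.telephony.SmsManager; SmsManager.getDefault().sendTextMessage(premium, body)",
    "import android.telephony.SmsManager; sendTextMessage premium number loop",
    "import android.widget.TextView; view.setText(label); onClick handler",
    "import android.widget.Button; onClick startActivity intent",
]
labels = ["malware", "malware", "goodware", "goodware"]

# Tokenize into unigrams, disregarding grammar and word order (bag-of-words).
vec = CountVectorizer(token_pattern=r"[A-Za-z_]\w*")
X = vec.fit_transform(sources)
clf = LinearSVC().fit(X, labels)

unseen = vec.transform(["SmsManager sendTextMessage premium number"])
prediction = clf.predict(unseen)[0]
```

Identifiers such as class names, method calls and arguments all become tokens, so characteristic API combinations show up as distinguishing features.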
We also experiment with clustering on the source code. Clustering algo-
rithms we use include the Farthest First, Simple K-means and Expectation
maximization (EM). A flow diagram of the process is presented in Figure 1.
Figure 1: Workflow of Android file decompiling and machine learning-based malware detection
methodology
2.3. Ensemble learning
To improve the performance of our learning algorithms, our tests were per-
formed using ensemble learning with voting for both permission-based and
source code-based analysis. Ensemble methods use multiple classification al-
gorithms to obtain better performance than could be obtained from any of the
constituent algorithms individually. The final prediction is chosen as the label
that was predicted by the majority of classifiers. We also experiment with en-
sembles that contained combinations of three and five algorithms. An odd number
of algorithms allows us to unambiguously choose the class with majority voting.
For the classification algorithms, we use SVM, C4.5 decision trees, Random Tree,
Random Forests, JRIP, and Logistic Regression.
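The majority-voting scheme described above can be sketched with hard voting over three classifiers; scikit-learn's VotingClassifier stands in for the Weka ensembles, and the toy feature vectors and labels are invented:

```python
# Sketch of hard majority voting over three classifiers (scikit-learn stand-ins
# for the Weka algorithms; the toy data below is invented).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]]   # toy permission vectors
y = [1, 1, 0, 0]                                    # 1 = malware, 0 = goodware

# An odd number of voters guarantees an unambiguous majority for two classes.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC()),
        ("logreg", LogisticRegression()),
        ("forest", RandomForestClassifier(random_state=0)),
    ],
    voting="hard",      # each classifier casts one vote; majority label wins
).fit(X, y)
prediction = ensemble.predict([[1, 1, 0]])[0]
```

The trade-off noted later in the results applies here too: every base classifier must be trained and queried, so prediction cost grows with the ensemble size.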
Recall is sometimes referred to as “sensitivity” or the “true positive rate”.
Given the numbers of true positives and false positives, precision
(also known as the “positive predictive rate”) is calculated as follows:
Precision = TP / (TP + FP)

The measure that combines precision and recall is known as the F-measure, given
as:

F = ((1 + β^2) ∗ Recall ∗ Precision) / (β^2 ∗ Precision + Recall),
where β indicates the relative value of precision. A value of β = 1 (which is
usually used) indicates that recall and precision are valued equally. A lower value
indicates a larger emphasis on precision and a higher value indicates a larger
emphasis on recall [23].
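As a quick numeric check of these formulas, assuming hypothetical counts of true positives, false positives and false negatives:

```python
# Worked numeric check of the precision/recall/F-measure formulas above;
# the TP/FP/FN counts are hypothetical.
def f_measure(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * recall * precision / (beta**2 * precision + recall)

# With 90 true positives, 10 false positives and 10 false negatives,
# precision = recall = 0.9, so F1 = 0.9.
f1 = f_measure(90, 10, 10)
```

Raising β above 1 shifts the score toward recall, so a model with low recall is penalized more under β = 2 than under β = 1.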
seconds to train the model. Instances were also classified faster, making
this approach suitable for real-time classification of (malicious) apps. We then
integrated this model for classification based on permissions with SVM in the
OWASP Seraphimdroid Android app 4 , which can be obtained from Google
Play Store 5 .
On the other hand, Bayesian algorithms such as Naive Bayes and Bayesian
networks have the worst performance. This could be due to the small dataset
(comprising only 387 instances) used in this study. Bayesian algorithms usu-
ally require much larger datasets than SVM to train the model with higher
accuracy. A larger dataset may also improve the SVM model performance.
The SVM algorithm outperforms Naive Bayes, Bayesian Network, JRip and Lo-
gistic regression on statistical t-test with a confidence interval of 0.05. However,
it is not significantly statistically better than decision trees and random forests.
In Table 2, we present the results of ensemble learning using majority voting.
We experimented with ensembles of three algorithms in order to determine
which algorithm(s) contribute to the best results in ensembles. The three best-
performing algorithms are SVM with SMO, Logistic Regression and Random
Forest, with an F-measure of 0.891. This is only a slight improvement compared
to using the SVM algorithm in isolation. The t-test suggested that ensemble
learning is not significantly better with a confidence interval of 0.05.
On the other hand, ensemble algorithms were much slower as more time is
needed to apply multiple machine learning algorithms (in our case, three or five)
and post-process results. Since the significance test showed that the performance
of the ensemble learning algorithm is not significantly better than the single
machine learning algorithm, there is no benefit in using these algorithms in
production.
Both results from the single classifier and ensemble method present a promis-
ing performance that can be used in anti-malware systems. This method would
4 https://github.com/nikolamilosevic86/owasp-seraphimdroid
5 https://play.google.com/store/apps/details?id=org.owasp.seraphimdroid
be able to detect unknown and new malware samples since it does not rely on
signatures, but rather on learned dangerous permission combinations. Our find-
ings echoed the findings of previous studies such as [24], which demonstrated
the potential of machine learning algorithms in achieving a high detection rate,
even on new malware samples.
There are, however, limitations with this approach. For the permission-
based approach, we reported an F-measure of 87.9% for single machine learning
algorithms. In other words, some malware samples would be undetected and
some non-malicious apps classified as malicious. In our case, 340 apps were cor-
rectly classified, while 47 were incorrectly classified. Using ensemble learning,
the number of misclassified instances was reduced to 42. Our reported perfor-
mance is higher than those reported in [25]. Also, our permission-based analysis
model is not computationally expensive, and when implemented in the OWASP
Seraphimdroid app, we were able to scan and classify all 83 installed apps on
the test device (i.e., a Nexus 5 device with Quad-core 2.3 GHz, 2 GB RAM) in
under 8 seconds.
The set of elements is clustered into a certain number of groups, which are
usually formed based on the elements’ similarity.
Table 3 presents the results of our permission-based clustering approach. In
our case, apps will be grouped according to whether they use a similar set of
permissions. However, if an app uses a similar set of permissions as some mal-
ware, it does not mean that the app is malicious. As can be seen from Table 3,
the results are not as good as those of classification. The best algorithm incorrectly clus-
tered more than 35% of the instances, while permission-based classification only
incorrectly classified around 10.5% of the instances. In our permission-based
analysis, clustering had a higher error rate than classification.
Table 4: Evaluation results of source code-based classification using single machine learning
algorithm
evaluated and had an F-score of over 90%. Therefore, source code appears to be
a viable source of information for a machine learning classification algorithm.
Also, with the machine learning-based source code analysis, it is possible to
analyze whether an Android package (apk) is malicious in less than 10 seconds,
which is significantly faster than a human analyst.
learning with voting had a slight improvement compared to the best results
from using single machine learning algorithms (the best F-measure of ensem-
ble learning was 0.956; the F-measure of SVM was 0.951) by combining SVM
with SMO, logistic regression, LogitBoost with simple regression functions as
base learners (simple logistic regression) and AdaBoostM1 with SVM as a base.
Some of the ensembles (e.g. C4.5 decision tree+random tree+random forests
or SVM with SMO+Logistic regression+Random Forest) performed worse than
SVM with SMO. Since the F-measure of C4.5 decision trees was 0.886, it had
a negative impact on the ensembles. Ensembles that contained SVM may have
misclassified some instances if the majority of algorithms voted for the wrong
class. The combination of algorithms in one case (SVM with SMO+Logistic re-
gression+Simple Logistic regression+AdaBoostM1 with SVM base) had slightly
improved classification performance (by 0.5% in F-measure), but it was not
statistically significant. Our source code analysis approach allows successful
classification of new malware in 95.1% cases with a single machine learning
algorithm.
Table 6 presents the results of source code clustering. These results were
more promising than those obtained from permission-based clustering since
the best performance of correctly clustered instances increased from 64.6% to
82.3%. The increase in performance is due to the fact that source code provides
a greater amount of data on which clustering can be based. However,
there were still 17.6% incorrectly clustered instances. Since clustering is a type
of unsupervised machine learning algorithm, it creates clusters that are based on
code similarity. This is not necessarily a good indication of the code’s malicious
behavior. The way clustering maps instances in the absence of any supervi-
sion is the main reason that its performance is worse than that of classification
algorithms. The results of non-supervised learning can be used for creating
larger labeled data sets. Classification (SVM) performed 14% better than the
best clustering method, which indicates that clustering should not be used for
detecting malware but only for expanding small datasets if necessary.
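The dataset-expansion idea could be sketched as follows: cluster a mixed labeled/unlabeled set and let each unlabeled app inherit the majority label of the labeled apps in its cluster. All vectors, labels and the cluster count here are synthetic:

```python
# Sketch of expanding a small labeled set via clustering: each unlabeled app
# inherits the majority label of the labeled apps in its cluster.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1],
])
known = {0: "malware", 3: "goodware"}   # only two of six apps are labeled

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = {}
for cluster in range(2):
    members = [i for i, c in enumerate(km.labels_) if c == cluster]
    votes = Counter(known[i] for i in members if i in known)
    majority = votes.most_common(1)[0][0] if votes else "unknown"
    for i in members:
        labels[i] = known.get(i, majority)  # keep given labels, infer the rest
```

The inferred labels are only as trustworthy as the clusters themselves, which is why this step is suited to bootstrapping training data rather than detection.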
files prior to analysis. However, detailed analysis of decompiled code does not
take more than 10 seconds per app. Practically, this method can be used to
scan and classify any apps including those on Google Play and other app stores.
Future research includes the evaluation of the proposed models using signif-
icantly bigger labeled, balanced data sets and utilizing online learning. Another
research focus is combining static and dynamic software analysis in which mul-
tiple machine learning classifiers are applied to analyze both source code and
dynamic features of apps at run-time.
References
[5] C. Chia, K.-K. R. Choo, D. Fehrenbacher, How cyber-savvy are older mo-
bile device users?, in: M. H. Au, K.-K. R. Choo (Eds.), Mobile Security
and Privacy, Syngress/Elsevier, Waltham, MA, 2017, Ch. 4, pp. 67–83.
[6] N. Viennot, E. Garcia, J. Nieh, A measurement study of Google Play, in:
ACM SIGMETRICS Performance Evaluation Review, Vol. 42, ACM, 2014,
pp. 221–233.
(CIS), 2011 Seventh International Conference on, IEEE, 2011, pp. 1011–
1015.
[22] V. A. Benjamin, H. Chen, Machine learning for attack vector identification
in malicious source code, in: Intelligence and Security Informatics (ISI),
2013 IEEE International Conference on, IEEE, 2013, pp. 21–23.
[24] J. Z. Kolter, M. A. Maloof, Learning to detect and classify malicious exe-
cutables in the wild, The Journal of Machine Learning Research 7 (2006)
2721–2744.
[25] Cyveillance, Cyveillance testing finds av vendors detect on average less than
19% of malware attacks (2010).
URL http://www.businesswire.com/news/home/20100804005348/en/
Cyveillance-Testing-Finds-AV-Vendors-Detect-Average
5. Biography of authors