Experimental Comparison of Features and Classifiers for Android Malware Detection
10-2020
Mariano CECCATO
University of Verona
Wei MINN
Singapore Management University, [email protected]
Citation
SHAR, Lwin Khin; DEMISSIE, Biniam Fisseha; CECCATO, Mariano; and MINN, Wei. Experimental
comparison of features and classifiers for Android malware detection. (2020). MOBILESoft 2020:
Proceedings of the 7th IEEE/ACM International Conference on Mobile Software Engineering and Systems,
Seoul, South Korea, October 5-6. 50-60. Research Collection School Of Information Systems.
Available at: https://ink.library.smu.edu.sg/sis_research/5115
Experimental Comparison of Features and Classifiers for
Android Malware Detection
Lwin Khin Shar Biniam Fisseha Demissie
Singapore Management University Fondazione Bruno Kessler
[email protected] [email protected]
MamaDroid [30] extract API calls from call graphs. The majority of the approaches have relied on static analysis for feature extraction. Those approaches that apply dynamic analysis [15, 41] have mainly focused on features at native level API calls (system calls). Typically, static analysis-based features cover more information since static analysis can reason with the whole program code whereas dynamic analysis-based features are limited to the code that is executed. On the other hand, static analysis may have issues dealing with complex code such as obfuscated code, and modern malware is usually crafted with obfuscated code [19]. Hence, in general, static analysis and dynamic analysis complement each other.

Once these features are extracted using program analyses, these approaches typically use machine learning classifiers to train on the features and build a malware detection model. For instance, Support Vector Machines (SVM), K-Nearest Neighbours, and Random Forest were used in [21, 30]; AdaBoost, Naive Bayes, Decision Tree, and SVM were used in [20].

In parallel, other studies [23, 27, 41, 45] have focused on the use of deep learning classifiers, such as Convolutional Neural Network and Recurrent Neural Network, instead of conventional machine learning classifiers, to build malware detectors. Deep learning classifiers use several layers to study various levels of representations and extract higher-level features from the given lower-level ones. Hence, in general they have a built-in feature selection process and are better at learning complex patterns.

In view of the proposals of different types of classifiers and the use of different types of features and different analyses for feature extraction, there is a need for a comprehensive evaluation of the effectiveness of the current state-of-the-art in malware detection on a common benchmark. This study aims to evaluate the malware detection accuracy of various classifiers without fine tuning, when learnt on different types of features from a common benchmark, using both static and dynamic program analyses.

We use 4572 benign samples and 2399 malware samples. Benign samples were randomly collected from the Androzoo repository [2] and were released from 2017 to 2019. 1208 malware samples were collected from the Androzoo repository [2], also from 2017 to 2019, and 1191 malware samples are from the Drebin repository [5]. We extract static features from the call graph of Android package (apk) codes and dynamic features by executing the app in an Android emulator using our in-house intent-fuzzer combined with Android's Monkey testing framework [4].

Specifically, we make the following contributions in this paper.
• We evaluate several conventional machine learning classifiers and deep learning classifiers. More specifically, we assess seven machine learning classifiers, namely K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Tree (DT), Random Forest (RF), AdaBoost (AB), Naive Bayes (NB), and Logistic Regression (LR). We assess four deep learning classifiers, namely Simple Artificial Neural Network (sANN), Complex Artificial Neural Network (cANN), Convolutional Neural Network (CNN), and Recurrent Neural Network with long short term memory (RNN);
• We compare the malware detection accuracy of using the features that characterize the sequence of API calls and that of using the features that characterize the use of API calls;
• We compare the malware detection accuracy of using the static analysis-based features, the dynamic analysis-based features, and the combined set of static and dynamic analysis-based features. To the best of our knowledge, we have not observed a hybrid approach that utilizes both analyses to extract features on API calls at method and class levels;
• We compare the cost, in terms of training time requirement, of using different types of features.

We make the dataset and the scripts used in our experiments available [35] so that researchers can replicate or extend our experiments.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the methodology and gives an overview of malware detection; it explains the data collection and features extraction processes, and the machine learning and deep learning classifiers we use. Section 4 presents the evaluation studies and discusses the experimental results. Section 5 provides the concluding remarks and proposals for future studies.

2 RELATED WORK
Naway and Li [29] reviewed the use of deep learning in combination with program analysis for Android malware detection. However, their contribution was a literature survey (focusing on the differences between key concepts of different DL classifiers and different feature extraction techniques) rather than an empirical study like ours. Experimental comparisons that contrast different types of features and classifiers to detect Android malware are available in the literature. However, these approaches usually compare their single proposed method against other recent approaches. Conversely, our study aims at comparing different types of features and classifiers that have been used by the research community, on a common benchmark.

Static analysis-based features. Several approaches rely on static analysis to extract features from the app such as requested permissions [5, 9, 16, 20, 26, 34, 36], the sequence of API calls [11, 23, 27, 30, 37], the use of API calls [5, 9, 21, 36, 47, 50], or the frequency of API calls [1, 11, 18, 19]. Our study also investigates and compares the performance of using API use features and API sequence features. However, our study is not limited to features extracted with static analysis; it also covers features extracted with dynamic analysis. Additionally, we study the performance of combining the features obtained from the two analyses, and we evaluate these features across several classifiers.

Like our study, some approaches [21, 30] extract static features from call graphs. Other approaches rely, instead, on data dependency graphs [37, 50] or control flow graphs [12]. In future, we plan to evaluate the difference between using different kinds of graphs.

Considering that analysis at method level led to millions of features, resulting in long training time and high memory consumption, some approaches [21, 30, 46] abstracted features at class, package, family, or entity levels to save memory and time. Our study adopts both views, by evaluating features at both the method level and the class level.

Dynamic analysis-based features. Dynamic analysis-based approaches such as [15, 41] have mainly focused on features at native level API calls (system calls). Narudin et al. [28] evaluated the performance of five ML classifiers on network features (API calls that involve network communication) extracted with dynamic analysis.

Experimental Comparison of Features and Classifiers for Android Malware Detection
MOBILESoft ’20, October 5–6, 2020, Seoul, Republic of Korea

Hybrid analysis-based features. Few approaches [3, 25, 48] apply both static analysis and dynamic analysis techniques. However, these approaches focused on extracting specific features that are generally considered to be dangerous, such as sending SMS and [...] considered in those approaches. This implies that our test generator has to be more comprehensive to cover more program behaviours.

[Figure 1 (workflow diagram): Benign and malware samples are passed to Static Analysis (generate call graphs) and Dynamic Analysis (generate test inputs using Monkey test; generate test inputs using intent-fuzzing), which produce call graphs and execution traces. Static Features Extraction (extract method sequences from call graphs; extract class signatures) and Dynamic Features Extraction (extract method sequences from execution traces; extract class signatures) follow, and the extracted features are combined with labels.]
3 METHODOLOGY
This section presents the overall workflow of the experiments. Figure 1 illustrates the workflow, which consists of three stages. The first stage is program analysis, which extracts call graphs and execution traces from benign and malware samples. The second stage is feature extraction, in which six datasets are extracted from call graphs and execution traces and then labeled: static method sequence features, dynamic method sequence features, hybrid method sequence features, static class features, dynamic class features, and hybrid class features. In the last stage, conventional machine learning classifiers (denoted as ML classifiers) and deep learning classifiers (denoted as DL classifiers) are trained and tested on the labeled datasets to produce the evaluation results. The following subsections discuss each stage in detail.

3.1 Program Analysis
In this phase, we perform static and dynamic analysis on the given Android application packages (apks).

For static analysis, we use FlowDroid [6] with its default settings to extract call graphs from an apk. FlowDroid is based on Soot [38]. First, given an apk, Soot converts it into an intermediate representation called Jimple, and FlowDroid performs flow analysis on the Jimple code. The analysis is flow- and context-sensitive. FlowDroid also has an optional feature for handling reflection. We opted to use this feature since Android malware increasingly makes use of reflection to avoid detection. However, like other static analysis tools, FlowDroid shares the inherent limitations of static analysis: it can only resolve reflective calls when the arguments used in the call are all string constants. Dynamic analysis can overcome this limitation. In addition, FlowDroid handles common native calls, i.e., using some heuristics, it can track data flow across some commonly used native calls.

Dynamic analysis is performed in two phases. The first phase analyzes the call graph of the app to extract paths from public entry points (i.e., inter-component communication interfaces) to the leaf nodes. As in the static analysis phase, we generate the call graph of the app using Soot with the FlowDroid plugin for Android. The call graph is then traversed forward in a depth-first search manner, starting from the root node until a leaf node is reached. The output of this step is the set of paths from the roots (entry points of each component) to the different leaf nodes (method calls without outgoing edges).

Once the list of paths is available, the next step is to instantiate an inter-component communication message (intent) fuzzer to generate inputs that execute the paths. To this end, we first instrument the app to collect method execution traces and install the app on an Android emulator. We then run our intent fuzzer with statically collected values (such as static strings) from the app as seeds (initial values).

The generated inputs are Intent messages that are sent to the app under test via the Android Debug Bridge (ADB). Our goal is to maximize coverage and collect as many traces as possible. The traces are also used to guide the test generation.

While this step exercises code parts that involve inter-component (inter-app) communications, it does not address user interactions such as UI inputs.

To complement the first phase, the second phase uses Google's Android Monkey tool [4]. Monkey comes with the Android SDK and is used to randomly generate input events such as tap, input text, or toggle WiFi in an attempt to trigger abnormal app behaviors.
The combined test generator covers app behaviors in a more comprehensive way. While the Monkey tool covers GUI-related features, our fuzzer focuses on exercising inter-component (inter-app) interactions.

3.2 Features Extraction
From the call graphs and the execution traces generated in the previous phase, we extract six datasets as explained in the following.

Extracting from the sequences of API calls. Three datasets are extracted from the sequences of API calls at method level: one from call graphs, one from execution traces, and one from combining the former two. Given a call graph, we traverse the graph in a depth-first search manner and extract methods as we traverse (hence, sequence). If there is a loop, the method is traversed only once. Note that we only extract the methods from Android framework classes, Java classes, and standard Org classes (org.apache, org.xml, etc.). This is because it is common for malware to be obfuscated to circumvent malware detectors. The obfuscation often involves renaming of library and custom (user-defined) methods and classes. Hence, a malware detector will not be resilient to obfuscation if it is trained on library and custom methods and classes. A previous study has shown that a simple renaming obfuscation method can prevent popular anti-malware products from detecting the transformed malware samples [33]. Hence, we skip methods that are not from the above-mentioned standard packages as we traverse the call graph. Similarly, we extract methods from the execution traces. However, since execution traces are already sequences, depth-first search is not necessary. An excerpt of an extracted sequence is shown in Figure 2.

android.webkit.WebSettings: void setPluginsEnabled(boolean)
android.webkit.WebView: void setVisibility(int)
android.os.Handler: void <init>()
java.lang.Boolean: java.lang.Boolean valueOf(boolean)
android.webkit.WebView: void loadUrl(java.lang.String)
Figure 2: An excerpt of a sequence of API calls from a sample

Next, we discretize the sequence of method calls extracted above so that it can be processed by machine learning and deep learning classifiers. More precisely, we replace each unique method with an identifier, resulting in a sequence of numbers. We build a dictionary that maps each method call to its identifier. During the testing or deployment phase, we may encounter unknown API calls. To address this, (1) we consider a large dictionary that covers nearly 2.9 million unique methods from standard libraries and (2) we replace all unknown API calls with a fixed identifier.

The length of the sequences varies from one app to another. Thus, it is necessary to unify the length of the sequences. Since we have two types of method sequences, from call graphs and from execution traces, we chose two different uniform sequence lengths. Initially, we extracted the whole sequences. We then took the median length of sequences from call graphs as the uniform sequence length, denoted as L_cg, for call graph-based method sequence features, and the median length of sequences from execution traces as the uniform sequence length, denoted as L_tr, for execution trace-based method sequence features.¹ If the length of a given sequence is less than L, we pad the sequence with zeros; if the length is longer than L, we trim it to L, from the right. Hence, for each app, we end up with a sequence of numbers, which is a feature vector. Each number in the sequence corresponds to the categorical value of a feature. The number of features is the uniform sequence length L.

¹ L_cg = 85000, L_tr = 20000.

As a result, we obtain one dataset from call graphs that characterizes the sequence of API calls at method level, denoted as static-sequence features. Likewise, we obtain one dataset from execution traces, denoted as dynamic-sequence features. We also concatenate the two sets of features into one dataset, denoted as hybrid-sequence features. In general, we will denote them as sequence features. Figure 3 shows a sample dataset containing the sequence features.

         seq0   seq1   ...  seqn   label
benign1  74921  567    ...  84111  0
benign2  12901  4490   ...  3923   0
mal1     23712  6812   ...  0      1
mal2     23     63011  ...  0      1
Figure 3: Example of the sequence of API calls features

Extracting from the uses of API calls. Three datasets are extracted from the uses of API calls at class level: one from call graphs, one from execution traces, and one from combining the former two. The rationale for choosing class level features instead of method level features is to reduce the number of features, as in the approaches of [19, 21, 30]. Method level features would result in millions of features and yet the classifiers may not achieve a better accuracy, since the feature vectors of the samples would be sparse.²

² We did a preliminary study on the use of method level and class level features on a randomly selected sample set containing 50 benign and 50 malware samples and observed that the classifiers achieved similar results.

The extraction process is the same for both call graphs and execution traces. We initially build a database that stores unique classes. Again, for obfuscation resiliency, we only consider the Android framework, Java, and standard Org classes as explained above. We currently maintain 134,558 classes. Given call graphs or execution traces, we scan the files and extract the class signatures (sequence does not matter in this case). Each unique class in our database corresponds to a feature. The value of a feature is 1 if the corresponding class is found in the given call graph or execution trace; otherwise, it is 0.

         WebView  Picture  SQLiteDatabase  label
benign1  1        0        0               0
benign2  0        0        1               0
mal1     1        1        0               1
mal2     0        1        1               1
Figure 4: Example of the use of API calls features

As a result, we obtain one dataset from call graphs that characterizes the use of API calls at class level, denoted as static-use features. Likewise, we obtain one dataset from execution traces, denoted as dynamic-use features. We also concatenate the two sets of
features into one dataset, denoted as hybrid-use features. In general, we will denote them as use features.

Figure 4 shows a sample dataset containing the use features.

3.3 Classifiers
In the last phase, classifiers are trained and tested on each dataset extracted in phase 2. The following briefly describes the classifiers used in our evaluations.

3.3.1 Conventional Machine Learning (ML) Classifiers. We evaluate seven ML classifiers:

K-Nearest Neighbours, KNN is one of the simplest classification techniques and requires little or no prior knowledge of the data distribution. The predicted class of a test sample is set to the most common true class among its nearest training instances [24]. In our experiments, the KNN classifier uses three neighbours.

Linear Support Vector Machines, SVM determines a hyperplane that separates both classes with maximal margin, given vectors of two classes as training data. One of these classes is associated with malware, whereas the other class corresponds to benign instances. An unknown/new instance is classified by mapping it to the vector space and determining whether it falls on the malicious or benign side of the hyperplane [13]. SVM is widely used in malware classification tasks as it produces an explainable detection model.

Decision Trees, DT builds a rule-based model that predicts the class of a target variable by learning decision rules inferred from the given set of features. The depth of the tree can be customized to fit the model. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data. Deep trees may not generalize well (the overfitting problem), and thus it is usually necessary to limit the maximum depth of the tree. There are a few variants of decision tree such as ID3, C4.5, C5.0, and CART. We use CART [8].

Random Forest, RF is an ensemble of classifiers using many decision tree models [7]. A different subset of training data is selected with replacement to train each tree. The remaining training data serves to estimate the error and variable importance. RF has been shown to be a highly accurate classifier for malware detection [17]. In our experiments, we used 10 classifiers to form an ensemble.

AdaBoost, AB is also an ensemble of classifiers. It fits a sequence of weak models (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote to produce the final classification.

Naive Bayes, NB classifier applies Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable [49]. This assumption allows NB to learn the model extremely fast.

Logistic Regression, LR is a statistical model that uses a logistic function to model the probability of a certain class such as pass/fail, malware/benign, etc.

We used the scikit-learn Python tool [32] to run the above classifiers. We used the tool's default settings, such as K=3 for KNN, number of estimators = 10 for RF, etc., without any fine tuning. Use features are fed into the classifiers as they are. Sequence features are fed into the classifiers as categorical features.

3.3.2 Deep Learning (DL) Classifiers. Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher level features from the raw input features. Deep learning classifiers typically comprise an input layer, one or more hidden layers, and an output layer. In our context, the input layer accepts vectors of features, either use features or sequence features (Section 3.2). Each vector represents an app. The output layer is the binary classification (benign/malware) of the given app.

This study uses the following deep learning classifiers:

Artificial Neural Network, ANN is a common deep learning method, which comprises an input layer, one or more hidden, fully-connected (linear) layers, and an output layer. We used two different configurations of ANN in our experiments, which differ in the number of hidden layers and the number of neurons in each layer. The first ANN is a simple ANN, denoted as sANN, which consists of two linear layers, with each layer containing 256 neurons. The second ANN is a more complex ANN, denoted as cANN. It consists of three linear layers with 512, 256, and 128 neurons, respectively. At the end of these layers, cANN also has a dropout layer with p=0.5 to avoid overfitting [39].

Convolutional Neural Network, CNN typically comprises three types of layers (convolutional layer, pooling layer, and linear layer) between the input layer and the output layer. The convolutional layer utilizes the convolution procedure to accomplish weight sharing. The pooling layer progressively reduces the dimension of the feature map and thus reduces the number of parameters and the amount of computation. It can be applied by an average pooling procedure or a max pooling procedure. Thereafter, one or more linear layers and the output layer, typically a Softmax function, are placed on top for classification and recognition. In our experiments, we built the CNN classifier with the following sequence of layers: the input layer, a convolutional layer followed by a max pooling layer, another convolutional layer followed by a max pooling layer, one linear layer, a dropout layer with p=0.5, and finally the output layer with Softmax function.

Recurrent Neural Network, RNN is suitable for handling sequential data. It has memory units, which retain the information of previous inputs or the state of hidden layers, and its output depends on previous inputs. It can also have a special layer called LSTM, which avoids the error vanishing problem by fixing the weight of hidden layers to avoid error decay and by retaining not all information of the input but only selected information that is required for future outputs. RNN has shown good results in various fields that use sequential data, such as language processing or speech recognition [14]. In our experiments, we built the RNN with LSTM units. Our RNN classifier consists of the input layer, one LSTM layer, one linear layer, and the output layer with Softmax function. We also use dropout with p=0.5.

We built the above DL classifiers by using PyTorch's libraries [31]. As the activation function for linear layers and convolutional layers, we use the rectified linear unit (ReLU) function. We use 30 epochs for training the DL classifiers. The scripts are written in Python.

Table 1 shows a summary of a general comparison among the classifiers we used, based on the documentation from scikit-learn [32] and PyTorch [31], and the references [28, 29].
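The two feature encodings of Section 3.2, which form the input of every classifier described above, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' scripts: the function names, the choice of 1 as the fixed identifier for unknown API calls, and the reading of "trim from the right" as "keep the first L calls" are all assumptions made for illustration; only the zero padding and the 0/1 use values come from the text.

```python
from typing import Dict, Iterable, List

PAD_ID = 0      # from the paper: short sequences are padded with zeros
UNKNOWN_ID = 1  # assumption: a single reserved identifier for unseen API calls


def build_dictionary(sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Map each unique method signature to an integer identifier."""
    dictionary: Dict[str, int] = {}
    for seq in sequences:
        for method in seq:
            if method not in dictionary:
                # Identifiers start after the two reserved values.
                dictionary[method] = len(dictionary) + 2
    return dictionary


def encode_sequence(seq: List[str], dictionary: Dict[str, int], length: int) -> List[int]:
    """Discretize a method sequence and force it to the uniform length."""
    ids = [dictionary.get(method, UNKNOWN_ID) for method in seq]
    ids = ids[:length]  # assumed reading of "trim it to L, from the right"
    return ids + [PAD_ID] * (length - len(ids))


def encode_use(classes_seen: Iterable[str], known_classes: List[str]) -> List[int]:
    """Binary use features: 1 if a known class occurs in the app, else 0."""
    seen = set(classes_seen)
    return [1 if cls in seen else 0 for cls in known_classes]


# Hypothetical toy data; real signatures look like those in Figure 2.
training_sequences = [["WebView.loadUrl", "Handler.<init>"], ["WebView.loadUrl"]]
method_ids = build_dictionary(training_sequences)
print(encode_sequence(["Handler.<init>", "Foo.bar", "WebView.loadUrl"], method_ids, 2))
print(encode_sequence(["WebView.loadUrl"], method_ids, 3))
print(encode_use(["android.webkit.WebView"],
                 ["android.webkit.WebView", "android.graphics.Picture"]))
```

A hybrid feature vector would then simply be the concatenation of the static and the dynamic vector of the same app.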
Table 1: Pros and cons of the classifiers [28, 29, 31, 32]
Class | Classifier | Pros | Cons
Statistics | Naive Bayes | very fast classifier; suitable for getting quick classification results | unable to learn complex relationships among features
Statistics | K-Nearest Neighbours | typically more robust than other statistics-based classifiers for small k value | large memory and computation time for training
Statistics | Linear SVM | efficient; easy to analyze output | only directly applicable for binary classification problems; large memory and computation time for training
Statistics | Logistic Regression | can learn relatively complex relationships among features | unpredictable performance as the learning process may fail to converge (failure of the likelihood maximization algorithm)
Rules | Random Forest | randomization typically helps achieve good performance | output is hard to analyze
Rules | Decision Trees | fast and scalable classifier; easy to analyze output | less effective when learning features with continuous values
Rules | AdaBoost | built-in feature selection capability, which reduces dimensionality and computation time | sensitive to noisy data and outliers
Deep Learning | Simple/Complex Artificial Neural Network | parallelization of learning process and typically achieves good performance | consume large memory and computation time for both training and classification, compared to typical ML models
Deep Learning | Convolutional Neural Network | fewer neuron connections needed compared to a standard ANN, i.e., faster learning process; can be varied to suit the need of a particular classification problem | fine tuning is usually needed to discover a complete hierarchy of features; it also needs a big dataset
Deep Learning | Recurrent Neural Network | modeling time dependencies; able to remember serial events | learning process suffers from vanishing gradient problem; fine tuning to suit a given classification problem is usually needed to avoid this problem
4 EVALUATION
This section presents the experimental comparison results of features and classifiers for Android malware detection. Specifically, we investigate the following research questions:
• RQ1: Which type of features (the use of API calls or the sequence of API calls) achieves better malware detection accuracy?
• RQ2: Which type of features (statically extracted, dynamically extracted, or a combination of both) achieves better malware detection accuracy?
• RQ3: Do deep learning classifiers achieve better malware detection accuracy than conventional machine learning classifiers?
• RQ4: What are the training costs for different types of features?

4.1 Experiment Design
Dataset. Initially we had 20k benign samples collected from the Androzoo repository [2], released from 2017 to 2019. We also had 7757 malware samples: 5500 samples from the Drebin repository [5] and 2257 samples from the Androzoo repository [2], the latter from 2017 to 2019. However, as we evaluate the use of both static and dynamic analysis-based features, we had to keep only those samples that can be analyzed by both static analysis and dynamic analysis tools. When we used the FlowDroid [6] tool to extract call graphs, some of the apps caused exceptions. Our intent-fuzzing test generation tool also caused time-outs and crashes for some of the apps during the dynamic analysis. Therefore, we were not able to extract features for those cases. Note that these are the limitations of the underlying program analysis tools; for future work, we plan to investigate these issues and address them. Nevertheless, the objective of this experiment is to compare features and classifiers and not to assess the feature collection components.

As a result, after taking the intersection of the apps that can be analyzed by both the static and dynamic tools, we ended up with 4572 benign samples and 2399 malware samples: 1208 from the Androzoo repository [2] and 1191 from the Drebin repository [5]. Note that several of the malware samples from Drebin are obfuscated and the malware samples from Androzoo are recent.

In comparison, Table 2 shows the sizes of the datasets used by Android malware detection approaches in related work. Note that these studies only apply either static or dynamic analysis and evaluate a few classifiers, whereas we evaluate 11 classifiers and 6 different types of features. Our dataset size is comparable to the sizes used in some recent studies such as [37, 46].

Table 2: Sizes of dataset used in some of the malware detection approaches
Reference            #Benign  #Malware
Droid-sec [48]       250      250
DroidSift [50]       13500    2200
Drebin [5]           123453   5560
Narudin et al. [28]  20       1000
Maldozer [22]        37627    33066
RevealDroid [19]     24679    30203
Shen et al. [37]     3899     3899
EnMobile [46]        1717     4897
MaMadroid [30]       8447     35493
DaDiDroid [21]       43262    20431

Metrics. To assess the accuracy of the classifiers, we use the standard metrics, Recall (probability of detection, Pd), Precision (Pr), and F-measure (F), which are typically used for evaluating malware detection accuracy [19, 30]. Recall is computed as Pd = tp/(tp + fn); Precision is computed as Pr = tp/(tp + fp); and F-measure is computed as F = 2 * (Pr * Pd)/(Pr + Pd), where tp, fp, and fn denote the numbers of true positives, false positives, and false negatives, respectively.
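The three metrics defined above can be computed directly from the confusion counts. The following is a small illustrative sketch; the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than something the paper specifies.

```python
def recall(tp: int, fn: int) -> float:
    """Pd = tp / (tp + fn); 0.0 if there are no positive samples (assumption)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Pr = tp / (tp + fp); 0.0 if nothing was predicted positive (assumption)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(pr: float, pd: float) -> float:
    """F = 2 * (Pr * Pd) / (Pr + Pd), the harmonic mean of precision and recall."""
    return 2 * pr * pd / (pr + pd) if (pr + pd) else 0.0


# Hypothetical counts: 90 malware detected, 10 missed, 30 benign apps flagged.
pd_value = recall(tp=90, fn=10)
pr_value = precision(tp=90, fp=30)
print(pd_value, pr_value, round(f_measure(pr_value, pd_value), 3))
```

Because F is a harmonic mean, a classifier with very high recall but low precision (for example, one that flags almost every app as malware) still receives a low F-measure.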
Experimental Comparison of Features and Classifiers for Android Malware Detection
MOBILESoft ’20, October 5–6, 2020, Seoul, Republic of Korea
Feature type
Use of API calls
Table 3: Results on using static features that characterize the sequence of API calls at method level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.578  0.109    0.839  0.019    0.683  0.077
SVM         0.588  0.081    0.838  0.015    0.690  0.055
DT          0.885  0.047    0.852  0.039    0.868  0.025
RF          0.920  0.019    0.905  0.027    0.913  0.010
AB          0.903  0.024    0.869  0.027    0.885  0.010
NB          0.989  0.010    0.441  0.027    0.610  0.026
LR          0.843  0.031    0.888  0.036    0.865  0.021
sANN        0.865  0.082    0.837  0.238    0.843  0.105
cANN        0.868  0.188    0.422  0.591    0.496  0.406
CNN         0.667  0.127    0.461  0.196    0.542  0.186
RNN         0.365  0.254    0.014  0.018    0.027  0.034

Table 6: Results on using static features that characterize the use of API calls at class level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.617  0.035    0.907  0.041    0.734  0.011
SVM         0.922  0.011    0.863  0.029    0.892  0.01
DT          0.908  0.008    0.833  0.003    0.869  0.002
RF          0.924  0.020    0.879  0.044    0.901  0.028
AB          0.887  0.010    0.847  0.002    0.867  0.006
NB          0.968  0.004    0.675  0.006    0.795  0.005
LR          0.930  0.023    0.872  0.043    0.900  0.012
sANN        0.865  0.023    0.923  0.046    0.893  0.028
cANN        0.865  0.027    0.920  0.045    0.891  0.022
CNN         0.785  0.036    0.810  0.045    0.797  0.032
RNN         0.878  0.023    0.895  0.035    0.884  0.022
Table 4: Results on using dynamic features that characterize the sequence of API calls at method level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.651  0.123    0.782  0.106    0.707  0.058
SVM         0.633  0.198    0.785  0.079    0.696  0.130
DT          0.563  0.652    0.722  0.348    0.521  0.397
RF          0.550  0.239    0.691  0.169    0.605  0.171
AB          0.493  0.217    0.698  0.215    0.563  0.121
NB          0.991  0.010    0.376  0.016    0.546  0.016
LR          0.622  0.193    0.777  0.091    0.687  0.140
sANN        0.653  0.024    0.646  0.039    0.649  0.025
cANN        0.699  0.136    0.393  0.094    0.500  0.085
CNN         0.700  0.031    0.828  0.083    0.758  0.035
RNN         0.560  0.147    0.225  0.167    0.317  0.179

Table 7: Results on using dynamic features that characterize the use of API calls at class level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.627  0.253    0.82   0.115    0.706  0.202
SVM         0.839  0.099    0.828  0.095    0.832  0.078
DT          0.660  0.376    0.678  0.270    0.637  0.195
RF          0.807  0.059    0.830  0.139    0.817  0.084
AB          0.723  0.247    0.789  0.096    0.750  0.178
NB          0.857  0.340    0.613  0.146    0.710  0.209
LR          0.848  0.093    0.831  0.117    0.838  0.088
sANN        0.853  0.046    0.887  0.046    0.869  0.016
cANN        0.842  0.031    0.895  0.027    0.868  0.022
CNN         0.638  0.043    0.618  0.062    0.627  0.037
RNN         0.852  0.016    0.869  0.062    0.860  0.027
Table 5: Results on using hybrid features that characterize the sequence of API calls at method level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.521  0.170    0.803  0.073    0.626  0.102
SVM         0.657  0.133    0.699  0.104    0.672  0.022
DT          0.862  0.086    0.807  0.076    0.832  0.043
RF          0.877  0.081    0.892  0.089    0.883  0.037
AB          0.887  0.088    0.843  0.039    0.863  0.024
NB          0.984  0.024    0.449  0.055    0.616  0.051
LR          0.807  0.148    0.861  0.112    0.828  0.034
sANN        0.899  0.057    0.742  0.243    0.805  0.127
cANN        0.923  0.085    0.135  0.137    0.230  0.199
CNN         0.714  0.134    0.121  0.060    0.204  0.078
RNN         0.527  0.092    0.106  0.018    0.176  0.028

Table 8: Results on using hybrid features that characterize the use of API calls at class level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.911  0.078    0.833  0.133    0.868  0.064
SVM         0.916  0.056    0.861  0.135    0.885  0.066
DT          0.880  0.138    0.802  0.147    0.833  0.066
RF          0.905  0.074    0.847  0.138    0.872  0.060
AB          0.827  0.229    0.827  0.156    0.820  0.143
NB          0.863  0.272    0.706  0.150    0.773  0.185
LR          0.925  0.067    0.869  0.125    0.894  0.054
sANN        0.873  0.039    0.932  0.045    0.901  0.010
cANN        0.862  0.031    0.935  0.027    0.897  0.016
CNN         0.799  0.023    0.825  0.030    0.812  0.018
RNN         0.879  0.035    0.888  0.037    0.884  0.014
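The per-classifier figures that the discussion quotes as "averaging across all types of features" (for example, Pd = 0.942 and Pr = 0.543 for Naive Bayes, and Pd = 0.831 and Pr = 0.841 for Random Forest) appear to be plain means of the corresponding rows in Tables 3 to 8. The following sketch recomputes them under that assumption; the dictionary layout and the helper name `average_recall_precision` are illustrative only, and the values are copied from the NB and RF rows of the six tables.

```python
from statistics import mean

# (Pd, Pr) pairs copied from Tables 3, 4, 5, 6, 7 and 8, in that order.
ROWS = {
    "NB": [(0.989, 0.441), (0.991, 0.376), (0.984, 0.449),
           (0.968, 0.675), (0.857, 0.613), (0.863, 0.706)],
    "RF": [(0.920, 0.905), (0.550, 0.691), (0.877, 0.892),
           (0.924, 0.879), (0.807, 0.830), (0.905, 0.847)],
}


def average_recall_precision(rows):
    """Mean recall and mean precision over the six feature types."""
    return mean(pd for pd, _ in rows), mean(pr for _, pr in rows)


for name, rows in ROWS.items():
    pd_avg, pr_avg = average_recall_precision(rows)
    print(f"{name}: mean Pd = {pd_avg:.3f}, mean Pr = {pr_avg:.3f}")

# Mean recall times the 2399 malware samples gives the detected count.
nb_pd, _ = average_recall_precision(ROWS["NB"])
print("NB detected malware (approx.):", round(nb_pd * 2399))
```

The printed means should agree with the quoted averages up to rounding in the third decimal place, and the last line reproduces the "2260 out of 2399" figure given for Naive Bayes.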
interaction events and inputs. Even though static analysis is generally weak against code obfuscation and our dataset contains several malware with obfuscated code, this did not have a significant effect on static analysis because we only used features that represent standard (not user defined) classes and methods to mitigate renaming obfuscation (see Section 3.2).
Regarding the use of hybrid features, we found that it improved the accuracy for only some of the classifiers and had a negative effect on other classifiers. For example, KNN’s F-measure when trained with static-use features is 0.734 and it improved to 0.868 when static-use and dynamic-use features are combined. The same goes for all four DL classifiers when trained with use features. But the result of all other classifiers decreased. With respect to sequence features, only NB’s F-measure was improved when static-sequence and dynamic-sequence features are combined. For all the other cases, the F-measure decreased. Therefore, we note that the usefulness of combining static analysis and dynamic analysis is contextual. Since the performances of classifiers using dynamic analysis-based features are generally poor, those features may have polluted the classifiers’ learning. Hence, we note that for combining statically- and dynamically-extracted information, data preprocessing, such as feature selection to remove redundant or irrelevant features, should be applied.

art of test generation for Android apps to improve coverage. We find that the limitation of static analysis can be mitigated by focusing on the standard methods and classes. For using hybrid analysis,

the recall and precision grid of the classifiers, averaged across all types of features. Instead of F-measures, we will discuss here the performance of classifiers based on recall and precision. This is because in some contexts, e.g., in highly security-critical systems, a higher recall at the expense of some precision loss is more desirable whereas in some other contexts, a higher precision could be more desirable.
As shown in Figure 6, overall, averaging across all types of features, in terms of both recall and precision, the ML classifiers achieved better scores than the DL classifiers. The ML classifiers achieved the median and mean recall of 0.86 and 0.80, respectively, and achieved the median and mean precision of 0.83 and 0.78, respectively. The DL classifiers achieved the median and mean recall of 0.85 and 0.77, respectively, and achieved the median and mean precision of 0.82 and 0.64, respectively.
One possible reason why the ML classifiers generally perform better than the DL classifiers may be the same fine tuning issue as discussed in Section 4.2.1. We used the readily-available ML classifiers from the scikit-learn tool [32] with its built-in settings, which may have been optimal for malware detection. By contrast, since there are many different ways to build DL classifiers, PyTorch [31] only provides abstract neural network classes on which custom DL classifiers with specific configurations of DL layers are usually built. As such, without fine tuning the parameters such as the number of layers, the sequence of different types of layers, the number of neurons, dropout value, training epochs, etc., the DL classifiers may not perform well.
On the other hand, the DL classifiers did achieve better results than the ML classifiers when trained with dynamic features, on both API sequence and API usage. As shown in Table 4, CNN achieved Pd = 0.7 and Pr = 0.828, which are better than all other classifiers trained with dynamic-sequence features, except NB that has Pd = 0.991 but with Pr = 0.376. Similarly, as shown in Table 7, sANN, cANN, and RNN achieved better results than the ML classifiers.
One other interesting finding is that the simplest classifier, Naive Bayes, achieved Pd = 0.942, averaging across all types of features. It is the highest recall among all the classifiers; that is, it detected 2260 out of 2399 malware from our dataset. On the other hand, it only achieved Pr = 0.543 on average; that is, it produced one false alarm for every two malware reports. When detecting malware is of utmost importance, Naive Bayes can be a good option. When better precision is more desirable, RF can be considered, which achieved Pd = 0.831 and Pr = 0.841, averaging across all types of features. It is the most balanced classifier and thus can be considered as the best, at least in our experiments.

[Figure 6 plot: recall (y-axis) versus precision (x-axis) for each classifier, in three panels (Static, Dynamic, Hybrid); the legend distinguishes classifier and technology (Deep learning, Machine learning).]

Classifier type   Pd (median)  Pd (mean)  Pd (sd)
Machine learning  0.86         0.80       0.15
Deep learning     0.85         0.77       0.14

Classifier type   Pr (median)  Pr (mean)  Pr (sd)
Machine learning  0.83         0.78       0.12
Deep learning     0.82         0.64       0.32

Figure 6: Recall and precision of classifiers

4.2.4 RQ4: Training Costs. As the Android platform is constantly evolving, a malware detector may often need to be re-trained to learn the new characteristics of Android. Slow training and analysis of Android apps could allow malware to remain undetected long enough to cause undesirable effects on end users. Hence it is important to assess the training cost of classifiers when using different types of features.
In Figure 7, we plot the time taken by the classifiers when training with different types of features. The figure illustrates the average time taken for training the classifiers on each type of features. In our five-fold cross-validation setting, the training time is computed as the time taken to train on the four folds of the dataset. We first compare the training time required for static, dynamic, and hybrid
cases. We then compare the training time required for API usage features versus API sequence features.
As shown in Figure 7, intuitively, since the hybrid case has more features, classifiers took the longest to train with hybrid features. Comparing the static and dynamic cases, classifiers using the static-based features took longer than those using the dynamic-based features. On average, the training cost of using static-based features (1141 seconds) is 1.78 times that of using dynamic-based features (641 seconds).
We can also observe from Figure 7 that API use features are much faster to train with than API sequence features. On average, the training cost of using API sequence features (1485 seconds) is about twice that of using API use features (719 seconds).
Hence, overall it can be concluded that while dynamic-based features are less accurate, they are much faster to train with, compared to static-based features. API use features are both faster and simpler (no fine tuning required to achieve good accuracy) to train with, compared to API sequence features.

[Figure 7 plot: training time (x-axis, 0 to 10000) for each feature type (Static, Hybrid, Dynamic; Use of API calls, Sequence of API calls).]

Feature type                     median  mean     sd
Static - Use of API calls        395.01  580.36   614.93
Static - Sequence of API calls   532.42  1700.96  2624.58
Dynamic - Use of API calls       100.45  243.83   379.08
Dynamic - Sequence of API calls  369.50  1038.47  1514.59
Hybrid - Use of API calls        440.87  1333.75  2095.63
Hybrid - Sequence of API calls   716.40  1715.60  3018.26

Figure 7: Training time (in seconds) for different types of features

4.3 Limitations
The main limitation of this work is that our study excludes fine-tuning the parameters or data preprocessing, except specifying API sequence features as categorical. Tuning the parameters on the eleven classifiers and the six types of features we used would require a huge amount of time and resources. Therefore, this study reports the malware detection accuracy of baseline classifiers, without optimization. Hence, researchers should consider the results regarding the performances of classifiers as one data point, a starting point for further exploration of optimized classifiers. This is the subject of our future work. In particular, based on our current results, we observed that this limitation hurts the deep learning classifiers more. This could be the main focus of our future plan. In addition, it would also be interesting to investigate whether applying data preprocessing such as feature selection would result in better performance for hybrid features.
Our dataset is imbalanced. Our dataset’s benign-to-malware ratio is 1.9 to 1, which may, in theory, affect precision and recall. However, while it is challenging to approximate the actual ratio of benign apps versus malware apps in the wild, it is more likely that they are not balanced. Some studies have chosen imbalanced datasets [5, 21] and some have chosen balanced datasets [23, 37]. Certain Android markets have been known to have a benign-to-malware ratio of 1.5 to 1 [10]. Hence, our dataset could reflect the reality better. We plan to investigate the implication of different dataset ratios in the future.
Our analysis does not consider native calls, although FlowDroid, the underlying static analysis tool we use, handles common native calls using some heuristics. It is a challenging task to extract features that characterize native calls using static analysis. Dynamic runtime analysis approaches such as [15, 41] could be used. Our study also did not consider API calls frequency. A recent study [30] found that their proposed malware detection model is less accurate when trained on API calls frequency features instead of API sequence features. We plan to include this evaluation in our future work.

5 CONCLUSION
In this work, we evaluated six different types of features and eleven classifiers. The features characterize the use of API calls at class level and the sequence of API calls at method level. Both static analysis and dynamic analysis are used. The classifiers include both conventional machine learning and deep learning models. To assess the accuracy, recall, precision, and mainly F-measure were used. We also discussed the training costs. The experiments were conducted on a common benchmark, containing 4572 benign samples and 2399 malware samples.
Our results show that compared to the features which characterize the sequence of API calls, the features which characterize the use of API calls are faster and simpler to train with and produce classifiers with better accuracies in general. Static analysis-based features characterize more program behaviours compared to dynamic analysis-based features. Hence, they produced classifiers with better accuracies, but they came with a training cost which is 1.78 times longer on average. Overall, the best F-measure (0.913) was achieved by an ML classifier, the Random Forest classifier, which was trained with the static API sequence-based features. The second best F-measure (0.901) was achieved by a DL classifier, a simple Artificial Neural Network model, which was trained with the hybrid API usage-based features. In our future work, we plan to investigate data preprocessing, feature selection, and parameter fine tuning to produce optimal classifiers and evaluate their impacts. We also plan to evaluate frequency-based features.

ACKNOWLEDGMENT
The work of L. K. Shar is supported by the National Research Foundation Singapore, under the National Satellite of Excellence in Mobile System Security and Cloud Security (NRF2018NCR-NSOE004-0001).
REFERENCES
[1] Y. Aafer, W. Du, and H. Yin. Droidapiminer: Mining api-level features for robust malware detection in android. In International conference on security and privacy in communication systems, pages 86–103. Springer, 2013.
[2] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. Androzoo: Collecting millions of android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, pages 468–471. ACM, 2016.
[3] H. Alshahrani, H. Mansourt, S. Thorn, A. Alshehri, A. Alzahrani, and H. Fu. Ddefender: Android application threat detection using static and dynamic analysis. In 2018 IEEE International Conference on Consumer Electronics (ICCE), pages 1–6. IEEE, 2018.
[4] Android. UI/Application Exerciser Monkey. https://developer.android.com/studio/test/monkey, 2019.
[5] D. Arp, M. Spreitzenbarth, H. Gascon, K. Rieck, and C. Siemens. Drebin: Effective and explainable detection of android malware in your pocket. 2014.
[6] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 259–269, New York, NY, USA, 2014. ACM.
[7] I. Barandiaran. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell, 20(8):1–22, 1998.
[8] L. Breiman. Classification and regression trees. Routledge, 2017.
[9] P. P. Chan and W.-K. Song. Static detection of android malware by using permissions and api calls. In 2014 International Conference on Machine Learning and Cybernetics, volume 1, pages 82–87. IEEE, 2014.
[10] K. Chen, P. Wang, Y. Lee, X. Wang, N. Zhang, H. Huang, W. Zou, and P. Liu. Finding unknown malice in 10 seconds: Mass vetting for new threats at the google-play scale. In 24th USENIX Security Symposium (USENIX Security 15), pages 659–674, 2015.
[11] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu. Stormdroid: A streaminglized machine learning-based system for detecting android malware. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 377–388, 2016.
[12] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant. Semantics-aware malware detection. In 2005 IEEE Symposium on Security and Privacy (S&P’05), pages 32–46. IEEE, 2005.
[13] N. Cristianini, J. Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.
[14] L. Deng, D. Yu, et al. Deep learning: methods and applications. Foundations and Trends® in Signal Processing, 7(3–4):197–387, 2014.
[15] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra. Madam: a multi-level anomaly detector for android malware. In International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, pages 240–253. Springer, 2012.
[16] W. Enck, M. Ongtang, and P. McDaniel. On lightweight mobile phone application certification. In Proceedings of the 16th ACM conference on Computer and communications security, pages 235–245. ACM, 2009.
[17] M. Eskandari and S. Hashemi. A graph mining approach for detecting unknown malwares. Journal of Visual Languages & Computing, 23(3):154–162, 2012.
[18] M. Fan, J. Liu, X. Luo, K. Chen, T. Chen, Z. Tian, X. Zhang, Q. Zheng, and T. Liu. Frequent subgraph based familial classification of android malware. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pages 24–35. IEEE, 2016.
[19] J. Garcia, M. Hammad, and S. Malek. Lightweight, obfuscation-resilient detection and family identification of android malware. ACM Transactions on Software Engineering and Methodology (TOSEM), 26(3):11, 2018.
[20] C.-Y. Huang, Y.-T. Tsai, and C.-H. Hsu. Performance evaluation on permission-based detection for android malware. In Advances in Intelligent Systems and Applications-Volume 2, pages 111–120. Springer, 2013.
[21] M. Ikram, P. Beaume, and M. A. Kaafar. Dadidroid: An obfuscation resilient tool for detecting android malware via weighted directed call graph modelling. arXiv preprint arXiv:1905.09136, 2019.
[22] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb. Android malware detection using deep learning on api method sequences. arXiv preprint arXiv:1712.08996, 2017.
[23] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb. Maldozer: Automatic framework for android malware detection using deep learning. Digital Investigation, 24:S48–S59, 2018.
[24] Y. Liao and V. R. Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Computers & security, 21(5):439–448, 2002.
[25] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratantonio, V. Van Der Veen, and C. Platzer. Andrubis–1,000,000 apps later: A view on current android malware behaviors. In 2014 third international workshop on building analysis datasets and gathering experience returns for security (BADGERS), pages 3–17. IEEE, 2014.
[26] X. Liu and J. Liu. A two-layered permission-based android malware detection scheme. In 2014 2nd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, pages 142–148. IEEE, 2014.
[27] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé, et al. Deep android malware detection. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pages 301–308. ACM, 2017.
[28] F. A. Narudin, A. Feizollah, N. B. Anuar, and A. Gani. Evaluation of machine learning classifiers for mobile malware detection. Soft Computing, 20(1):343–357, 2016.
[29] A. Naway and Y. Li. A review on the use of deep learning in android malware detection. arXiv preprint arXiv:1812.10360, 2018.
[30] L. Onwuzurike, E. Mariconti, P. Andriotis, E. D. Cristofaro, G. Ross, and G. Stringhini. Mamadroid: Detecting android malware by building markov chains of behavioral models (extended version). ACM Transactions on Privacy and Security (TOPS), 22(2):14, 2019.
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[33] V. Rastogi, Y. Chen, and X. Jiang. Droidchameleon: evaluating android anti-malware against transformation attacks. In Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, pages 329–334, 2013.
[34] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. G. Bringas, and G. Álvarez. Puma: Permission usage to detect malware in android. In International Joint Conference CISIS’12-ICEUTE 12-SOCO 12 Special Sessions, pages 289–298. Springer, 2013.
[35] L. K. Shar. Experimental comparison of features and machine learning classifiers for android malware detection. https://github.com/sharlwinkhin/msoft20, 2020.
[36] A. Sharma and S. K. Dash. Mining api calls and permissions for android malware detection. In International Conference on Cryptology and Network Security, pages 191–205. Springer, 2014.
[37] F. Shen, J. Del Vecchio, A. Mohaisen, S. Y. Ko, and L. Ziarek. Android malware detection using complex-flows. IEEE Transactions on Mobile Computing, 18(6):1231–1245, 2018.
[38] Soot. Soot - a java optimization framework, https://github.com/sable/soot. 2018.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[40] Symantec. Internet Security Threat Report. https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf, 2019.
[41] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi. Malware detection with deep neural network using process behavior. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), volume 2, pages 577–582. IEEE, 2016.
[42] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi. Malware detection with deep neural network using process behavior. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), volume 2, pages 577–582. IEEE, 2016.
[43] R. Vinayakumar, K. Soman, P. Poornachandran, and S. Sachin Kumar. Detecting android malware using long short-term memory (lstm). Journal of Intelligent & Fuzzy Systems, 34(3):1277–1288, 2018.
[44] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
[45] K. Xu, Y. Li, R. H. Deng, and K. Chen. Deeprefiner: Multi-layer android malware detection system applying deep neural networks. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pages 473–487. IEEE, 2018.
[46] W. Yang, M. Prasad, and T. Xie. Enmobile: Entity-based characterization and analysis of mobile malware. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pages 384–394. IEEE, 2018.
[47] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy android malware detection using ensemble learning. IET Information Security, 9(6):313–320, 2015.
[48] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue. Droid-sec: deep learning in android malware detection. In ACM SIGCOMM Computer Communication Review, volume 44, pages 371–372. ACM, 2014.
[49] H. Zhang. The optimality of naive bayes. AA, 1(2):3, 2004.
[50] M. Zhang, Y. Duan, H. Yin, and Z. Zhao. Semantics-aware android malware classification using weighted contextual api dependency graphs. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1105–1116, 2014.