Singapore Management University
Institutional Knowledge at Singapore Management University
Research Collection School Of Information Systems, School of Information Systems
10-2020

Experimental comparison of features and classifiers for Android malware detection

Lwin Khin SHAR
Singapore Management University, [email protected]

Biniam Fisseha DEMISSIE
Fondazione Bruno Kessler

Mariano CECCATO
University of Verona

Wei MINN
Singapore Management University, [email protected]

Follow this and additional works at: https://ink.library.smu.edu.sg/sis_research
Part of the Software Engineering Commons

Citation
SHAR, Lwin Khin; DEMISSIE, Biniam Fisseha; CECCATO, Mariano; and MINN, Wei. Experimental comparison of features and classifiers for Android malware detection. (2020). MOBILESoft 2020: Proceedings of the 7th IEEE/ACM International Conference on Mobile Software Engineering and Systems, Seoul, South Korea, October 5-6. 50-60. Research Collection School Of Information Systems.
Available at: https://ink.library.smu.edu.sg/sis_research/5115

This Conference Proceeding Article is brought to you for free and open access by the School of Information Systems at Institutional Knowledge at Singapore Management University. It has been accepted for inclusion in Research Collection School Of Information Systems by an authorized administrator of Institutional Knowledge at Singapore Management University. For more information, please email [email protected].
Experimental Comparison of Features and Classifiers for Android Malware Detection

Lwin Khin Shar, Singapore Management University, [email protected]
Biniam Fisseha Demissie, Fondazione Bruno Kessler, [email protected]
Mariano Ceccato, University of Verona, [email protected]
Wei Minn, Singapore Management University, [email protected]
ABSTRACT
The Android platform has dominated the smartphone market for years and, consequently, has gained a lot of attention from attackers. Malicious apps (malware) pose a serious threat to the security and privacy of Android smartphone users. Available approaches to detect mobile malware based on machine learning rely on features extracted with static analysis or dynamic analysis techniques. Different types of machine learning classifiers (such as support vector machines and random forests) and deep learning classifiers (based on deep neural networks) are then trained on the extracted features to produce models that can be used to detect mobile malware. The usually analyzed features include permissions requested/used, frequency of API calls, use of API calls, and sequence of API calls. The API calls are analyzed at various granularity levels such as method, class, package, and family.

In view of the proposals of different types of classifiers, different types of features, and different underlying analyses used for feature extraction, there is a need for a comprehensive evaluation of the effectiveness of the current state-of-the-art studies in malware detection on a common benchmark. In this work, we provide a baseline comparison of several conventional machine learning classifiers and deep learning classifiers, without fine tuning. We also provide an evaluation of different types of features that characterize the use of API calls at class level and the sequence of API calls at method level. Features have been extracted from a common benchmark of 4572 benign samples and 2399 malware samples, using both static analysis and dynamic analysis.

Among other interesting findings, we observed that classifiers trained on the use of API calls generally perform better than those trained on the sequence of API calls. Classifiers trained on static analysis-based features perform better than those trained on dynamic analysis-based features. Deep learning classifiers, despite their sophistication, are not necessarily better than conventional classifiers, especially when they are not optimized. However, deep learning classifiers do perform better than conventional classifiers when trained on dynamic analysis-based features.

KEYWORDS
Malware detection, machine learning, deep learning, Android

ACM Reference Format:
Lwin Khin Shar, Biniam Fisseha Demissie, Mariano Ceccato, and Wei Minn. 2020. Experimental Comparison of Features and Classifiers for Android Malware Detection. In IEEE/ACM 7th International Conference on Mobile Software Engineering and Systems (MOBILESoft '20), October 5–6, 2020, Seoul, Republic of Korea. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3387905.3388596

1 INTRODUCTION
The Android platform has dominated the smartphone market for years. With currently more than two billion devices running Android, it is the most popular end-user operating system in the world. Its market dominance and open-source nature have also made it interesting for attackers. Symantec [40] reported that in 2018 it detected an average of 10,573 mobile malware per day; found that one in 36 mobile devices had high-risk apps installed; and found that one in 14.5 apps accesses high-risk user data. Hence, Android malware detection is currently an active area of research.

A number of approaches have been proposed by the research community to detect Android malware. Many approaches have built malware detection models based on permissions requested/used [5, 9, 16, 20, 26, 34, 36]. However, since benign apps also often request permissions classified as dangerous for legitimate reasons, permission-based approaches can be prone to false positives [16]. More recent approaches have built detection models based on the sequence of API calls [23, 30, 41], the use of API calls [5, 9, 36, 47], or the frequency of API calls [1, 19].

API calls can be extracted at various granularity levels such as method, class, package, and family. Since there are millions of unique methods in Android, some approaches [19, 21, 30] that are based on the use or the frequency of API calls have proposed to abstract API calls at class, package, and/or family levels. This reduces the number of features significantly and yet produces comparable or even better results [19, 21, 30].

To extract these features, in general, two types of techniques are used: static analysis [5, 9, 19, 21, 30, 46] and dynamic analysis [15, 41]. For instance, Drebin [5] extracts permissions and API calls by scanning manifest files and disassembled code.
DaDiDroid [21] and MamaDroid [30] extract API calls from call graphs. The majority of the approaches have relied on static analysis for feature extraction. Those approaches that apply dynamic analysis [15, 41] have mainly focused on features at the native API level (system calls). Typically, static analysis-based features cover more information, since static analysis can reason about the whole program code, whereas dynamic analysis-based features are limited to the code that is executed. On the other hand, static analysis may have issues dealing with complex code such as obfuscated code, and modern malware is usually crafted with obfuscated code [19]. Hence, in general, static analysis and dynamic analysis complement each other.

Once these features are extracted using program analyses, these approaches typically use machine learning classifiers to train on the features and build a malware detection model. For instance, Support Vector Machines (SVM), K-Nearest Neighbours, and Random Forest were used in [21, 30]; AdaBoost, Naive Bayes, Decision Tree, and SVM were used in [20].

In parallel, other studies [23, 27, 41, 45] have focused on the use of deep learning classifiers, such as Convolutional Neural Networks and Recurrent Neural Networks, instead of conventional machine learning classifiers, to build malware detectors. Deep learning classifiers use several layers to learn various levels of representation and extract higher-level features from the given lower-level ones. Hence, in general, they have a built-in feature selection process and are better at learning complex patterns.

In view of the different types of classifiers, the different types of features, and the different analyses used for feature extraction, there is a need for a comprehensive evaluation of the effectiveness of the current state of the art in malware detection on a common benchmark. This study aims to evaluate the malware detection accuracy of various classifiers, without fine tuning, when trained on different types of features from a common benchmark, using both static and dynamic program analyses.

We use 4572 benign samples and 2399 malware samples. Benign samples were randomly collected from the Androzoo repository [2] and were released from 2017 to 2019. 1208 malware samples were collected from the Androzoo repository [2], also from 2017 to 2019, and 1191 malware samples are from the Drebin repository [5]. We extract static features from the call graph of the Android package (apk) code, and dynamic features by executing the app in an Android emulator using our in-house intent fuzzer combined with Android's Monkey testing framework [4].

Specifically, we make the following contributions in this paper:
• We evaluate several conventional machine learning classifiers and deep learning classifiers. More specifically, we assess seven machine learning classifiers, namely K-Nearest Neighbours (KNN), Support Vector Machines (SVM), Decision Tree (DT), Random Forest (RF), AdaBoost (AB), Naive Bayes (NB), and Logistic Regression (LR). We assess four deep learning classifiers, namely Simple Artificial Neural Network (sANN), Complex Artificial Neural Network (cANN), Convolutional Neural Network (CNN), and Recurrent Neural Network with long short-term memory (RNN);
• We compare the malware detection accuracy of the features that characterize the sequence of API calls and the features that characterize the use of API calls;
• We compare the malware detection accuracy of the static analysis-based features, the dynamic analysis-based features, and the combined set of static and dynamic analysis-based features. To the best of our knowledge, we have not observed a hybrid approach that utilizes both analyses to extract features on API calls at method and class levels;
• We compare the cost, in terms of training time, of using different types of features.

We make the dataset and the scripts used in our experiments available [35] so that researchers can replicate or extend our experiments.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the methodology and gives an overview of malware detection; it explains the data collection and feature extraction processes, and the machine learning and deep learning classifiers we use. Section 4 presents the evaluation studies and discusses the experimental results. Section 5 provides the concluding remarks and proposals for future studies.

2 RELATED WORK
Naway and Li [29] reviewed the use of deep learning in combination with program analysis for Android malware detection. However, their contribution was a literature survey (focusing on the differences between key concepts of different DL classifiers and different feature extraction techniques) rather than an empirical study like ours. Experimental comparisons are available in the literature, contrasting different types of features and classifiers to detect Android malware. However, these approaches usually compare a single proposed method against other recent approaches. Conversely, our study aims at comparing different types of features and classifiers that have been used by the research community, on a common benchmark.

Static analysis-based features. Several approaches rely on static analysis to extract features from the app, such as requested permissions [5, 9, 16, 20, 26, 34, 36], the sequence of API calls [11, 23, 27, 30, 37], the use of API calls [5, 9, 21, 36, 47, 50], or the frequency of API calls [1, 11, 18, 19]. Our study also investigates and compares the performance of API use features and API sequence features. However, our study is not limited to features extracted with static analysis; it also covers dynamic analysis. Additionally, we study the performance of combining the features obtained from the two analyses, and we evaluate these features across several classifiers.

Like our study, some approaches [21, 30] extract static features from call graphs. Other approaches rely, instead, on data dependency graphs [37, 50] or control flow graphs [12]. In future, we plan to evaluate the difference between using different kinds of graphs.

Considering that analysis at the method level leads to millions of features, resulting in long training times and high memory consumption, some approaches [21, 30, 46] abstracted features at class, package, family, or entity levels to save memory and time. Our study adopts both views, by evaluating features both at the method level and at the class level.

Dynamic analysis-based features. Dynamic analysis-based approaches such as [15, 41] have mainly focused on features at the native API level (system calls). Narudin et al. [28] evaluated the performance of five ML classifiers on network features (API calls that involve network communication) extracted with dynamic analysis.
In contrast to these approaches, we consider all the standard APIs, and we evaluate both ML and DL classifiers.

Hybrid analysis-based features. A few approaches [3, 25, 48] apply both static analysis and dynamic analysis techniques. However, these approaches focused on extracting specific features that are generally considered to be dangerous, such as sending SMS and connecting to the Internet. By contrast, we do not discriminate features, and we consider a more complete set of features, which are not considered in those approaches. This implies that our test generator has to be more comprehensive, to cover more program behaviours. While most dynamic analysis approaches have largely used the Monkey (UI) test generator [29], our approach employs a combination of the Monkey test generator and intent fuzzing to also cover component interactions.

Deep learning vs machine learning. Recently, deep learning for Android malware detection has been endorsed [23, 27, 43, 45]. Droid-sec [48] compared a deep belief network classifier against conventional ML classifiers such as NB, SVM, and LR. But their study excludes Random Forest, and their dataset was limited to only 250 malware and 250 benign samples. Their results showed that the DL classifier is more accurate. On the other hand, MaMaDroid [30] found that conventional ML classifiers like Random Forest and K-Nearest Neighbours perform better.

[Figure 1: The workflow of the experiments. Stage 1, program analysis: static analysis generates call graphs; dynamic analysis generates execution traces using Monkey tests and intent fuzzing. Stage 2, feature extraction: method sequences and class signatures are extracted from call graphs and execution traces, labeled, and concatenated into static-sequence, dynamic-sequence, hybrid-sequence, static-use, dynamic-use, and hybrid-use feature sets. Stage 3, classifier training and testing: conventional ML and DL classifiers produce malware detection models and evaluation results.]

3 METHODOLOGY
This section presents the overall workflow of the experiments. Figure 1 illustrates the workflow, which consists of three stages. The first stage is program analysis, which extracts call graphs and execution traces from benign and malware samples. The second stage is feature extraction, in which six datasets are extracted from the call graphs and execution traces, namely static method sequence features, dynamic method sequence features, hybrid method sequence features, static class features, dynamic class features, and hybrid class features; the datasets are then labeled. In the last stage, conventional machine learning classifiers (denoted as ML classifiers) and deep learning classifiers (denoted as DL classifiers) are trained and tested on the labeled datasets to produce the evaluation results. The following subsections discuss each stage in detail.

3.1 Program Analysis
In this phase, we perform static and dynamic analysis on the given Android application packages (apks).

For static analysis, we use FlowDroid [6] with its default settings to extract call graphs from an apk. FlowDroid is based on Soot [38]. Firstly, given an apk, Soot converts it into an intermediate representation called Jimple, and FlowDroid performs flow analysis on the Jimple code. The analysis is flow- and context-sensitive. FlowDroid also has an optional feature for handling reflection. We opted to use this feature since Android malware increasingly makes use of reflection to avoid detection. However, like other static analysis tools, FlowDroid shares the inherent limitations of static analysis: it can only resolve reflective calls when the arguments used in the call are all string constants. Dynamic analysis can overcome this limitation. In addition, FlowDroid handles common native calls, i.e., using some heuristics, it can track data flow across some commonly used native calls.

Dynamic analysis is performed in two phases. The first phase analyzes the call graph of the app to extract paths from public entry points (i.e., inter-component communication interfaces) to the leaf nodes. Similar to the static analysis phase, we generate the call graph of the app using Soot with the FlowDroid plugin for Android. The call graph is then traversed forward in a depth-first search manner, starting from the root node until a leaf node is reached. The output of this step is paths from the roots (entry points of each component) to the different leaf nodes (method calls without outgoing edges).

Once the list of paths is available, the next step is to instantiate an inter-component communication message (intent) fuzzer to generate inputs that execute the paths. To this end, we first instrument the app to collect method execution traces and install the app on an Android emulator. We then run our intent fuzzer with statically collected values (such as static strings) from the app as seeds (initial values).

The generated inputs are Intent messages that are sent to the app under test via the Android Debug Bridge (ADB). Our goal is to maximize coverage and collect as many traces as possible. The traces are also used to guide the test generation. While this step exercises code parts that involve inter-component (inter-app) communication, it does not address user interactions such as UI inputs.

In order to complement the first phase, we instantiate a second phase that uses Google's Android Monkey tool [4]. Monkey comes with the Android SDK and is used to randomly generate input events such as taps, text input, or toggling WiFi, in an attempt to trigger abnormal app behaviors.
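To make the dynamic analysis step concrete, the following is a minimal Python sketch of how fuzzed intents and Monkey events can be delivered to an app on an emulator over ADB. The package name, component, and extras are hypothetical, and the actual in-house fuzzer is considerably more elaborate (it is seeded with statically collected values and guided by the collected traces).

import subprocess

def adb(*args):
    # Run an adb shell command against the connected emulator and return its output.
    return subprocess.run(["adb", "shell", *args], capture_output=True, text=True).stdout

def send_intent(package, component, action, extras=None):
    # Deliver one fuzzed Intent to an exported component via the activity manager.
    cmd = ["am", "start", "-n", f"{package}/{component}", "-a", action]
    for key, value in (extras or {}).items():
        cmd += ["--es", key, value]        # string extra; other types use --ei, --ez, etc.
    return adb(*cmd)

def run_monkey(package, events=500):
    # Complement intent fuzzing with random UI events from Android's Monkey tool.
    return adb("monkey", "-p", package, "--throttle", "200", "-v", str(events))

# Example (hypothetical app under test):
# send_intent("com.example.app", ".MainActivity", "android.intent.action.VIEW", {"url": "http://test"})
# run_monkey("com.example.app")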
The combined test generator covers app behaviors in a more comprehensive way: while the Monkey tool covers GUI-related features, our fuzzer focuses on exercising inter-component (inter-app) interactions.

3.2 Features Extraction
From the call graphs and the execution traces generated in the previous phase, we extract six datasets, as explained in the following.

Extracting from the sequences of API calls. Three datasets are extracted from the sequences of API calls at the method level: one from call graphs, one from execution traces, and one from combining the former two. Given a call graph, we traverse the graph in a depth-first search manner and extract methods as we traverse (hence, a sequence). If there is a loop, the method is traversed only once. Note that we only extract the methods from Android framework classes, Java classes, and standard Org classes (org.apache, org.xml, etc.). This is because it is common for malware to be obfuscated to circumvent malware detectors. The obfuscation often involves renaming of library and custom (user-defined) methods and classes. Hence, a malware detector will not be resilient to obfuscation if it is trained on library and custom methods and classes. A previous study has shown that a simple renaming obfuscation method can prevent popular anti-malware products from detecting the transformed malware samples [33]. Hence, we skip methods that are not from the above-mentioned standard packages as we traverse the call graph. Similarly, we extract methods from the execution traces. However, since execution traces are already sequences, depth-first search is not necessary. An excerpt of an extracted sequence is shown in Figure 2.

android.webkit.WebSettings: void setPluginsEnabled(boolean)
android.webkit.WebView: void setVisibility(int)
android.os.Handler: void <init>()
java.lang.Boolean: java.lang.Boolean valueOf(boolean)
android.webkit.WebView: void loadUrl(java.lang.String)

Figure 2: An excerpt of a sequence of API calls from a sample

Next, we discretize the extracted sequence of method calls so that it can be processed by machine learning and deep learning classifiers. More precisely, we replace each unique method with an identifier, resulting in a sequence of numbers. We build a dictionary that maps each method call to its identifier. During the testing or deployment phase, we may encounter unknown API calls. To address this, (1) we consider a large dictionary that covers nearly 2.9 million unique methods from standard libraries, and (2) we replace all unknown API calls with a fixed identifier.

The length of the sequences varies from one app to another. Thus, it is necessary to unify the length of the sequences. Since we have two types of method sequences (from call graphs and from execution traces), we chose two different uniform sequence lengths. Initially, we extracted the whole sequences. We then took the median length of the sequences from call graphs, denoted as Lcg, as the uniform sequence size for call graph-based method sequence features, and the median length of the sequences from execution traces, denoted as Ltr, for execution trace-based method sequence features (Lcg = 85000, Ltr = 20000). If the length of a given sequence is less than L, we pad the sequence with zeros; if the length is longer than L, we trim it to L, from the right. Hence, for each app, we end up with a sequence of numbers which is a feature vector. Each number in the sequence corresponds to the categorical value of a feature. The number of features is the uniform sequence length L.

As a result, we obtain one dataset from call graphs that characterizes the sequence of API calls at the method level, denoted as static-sequence features. Likewise, we obtain one dataset from execution traces, denoted as dynamic-sequence features. We also concatenate the two sets of features into one dataset, denoted as hybrid-sequence features. In general, we will denote them as sequence features. Figure 3 shows a sample dataset containing the sequence features.

          seq0    seq1    ...   seqn    label
benign1   74921   567     ...   84111   0
benign2   12901   4490    ...   3923    0
mal1      23712   6812    ...   0       1
mal2      23      63011   ...   0       1

Figure 3: Example of the sequence of API calls features

Extracting from the uses of API calls. Three datasets are extracted from the uses of API calls at the class level: one from call graphs, one from execution traces, and one from combining the former two. The rationale for choosing class-level features instead of method-level features is to reduce the number of features, as in the approaches of [19, 21, 30]. Method-level features would result in millions of features, and yet the classifiers may not achieve better accuracy since the feature vectors of the samples would be sparse. (In a preliminary study on a randomly selected sample set containing 50 benign and 50 malware samples, we observed that the classifiers achieved similar results with method-level and class-level features.) The extraction process is the same for both call graphs and execution traces. We initially build a database that stores unique classes. Again, for obfuscation resiliency, we only consider the Android framework, Java, and standard Org classes, as explained above. We currently maintain 134,558 classes. Given call graphs or execution traces, we scan the files and extract the class signatures (sequence does not matter in this case). Each unique class in our database corresponds to a feature. The value of a feature is 1 if the corresponding class is found in the given call graph or execution trace; otherwise, it is 0.

As a result, we obtain one dataset from call graphs which characterizes the use of API calls at the class level, denoted as static-use features. Likewise, we obtain one dataset from execution traces, denoted as dynamic-use features. We also concatenate the two sets of features into one dataset, denoted as hybrid-use features. In general, we will denote them as use features. Figure 4 shows a sample dataset containing the use features.

          WebView   Picture   SQLiteDatabase   label
benign1   1         0         0                0
benign2   0         0         1                0
mal1      1         1         0                1
mal2      0         1         1                1

Figure 4: Example of the use of API calls features
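The following is a minimal sketch of the sequence encoding described above: standard-package filtering, method-to-identifier mapping with a reserved identifier for unknown calls, and zero-padding or right-trimming to a uniform length L. The prefix list, identifier values, and toy dictionary are illustrative, not the authors' actual implementation.

STANDARD_PREFIXES = ("android.", "java.", "javax.", "org.apache.", "org.xml.")
UNKNOWN_ID = 1   # reserved identifier for API calls not in the dictionary
PAD_ID = 0       # padding value

def is_standard(signature):
    # Keep only Android framework, Java and standard Org methods (obfuscation resilience).
    return signature.startswith(STANDARD_PREFIXES)

def encode_sequence(method_calls, dictionary, length):
    # Map a raw call sequence to a fixed-length vector of method identifiers.
    ids = [dictionary.get(m, UNKNOWN_ID) for m in method_calls if is_standard(m)]
    ids = ids[:length]                               # trim from the right if too long
    return ids + [PAD_ID] * (length - len(ids))      # pad with zeros if too short

# Example with a toy dictionary; the real one maps ~2.9M standard methods.
dictionary = {"android.os.Handler: void <init>()": 2,
              "android.webkit.WebView: void loadUrl(java.lang.String)": 3}
vector = encode_sequence(
    ["android.os.Handler: void <init>()",
     "com.obfuscated.a: void b()",                   # custom method, skipped
     "android.webkit.WebView: void loadUrl(java.lang.String)"],
    dictionary, length=8)
print(vector)   # [2, 3, 0, 0, 0, 0, 0, 0]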
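Similarly, a sketch of the class-level use encoding described in Section 3.2: one binary feature per standard class in the class database, set to 1 when the class appears among the app's observed API calls, with the hybrid dataset obtained by concatenating the static and dynamic vectors. The class database here is a toy subset of the 134,558 classes maintained by the authors.

STANDARD_PREFIXES = ("android.", "java.", "javax.", "org.")

def extract_classes(method_calls):
    # Collect the declaring classes of the standard API calls observed for one app.
    return {sig.split(":")[0] for sig in method_calls if sig.startswith(STANDARD_PREFIXES)}

def use_vector(observed_classes, class_db):
    # Binary vector over the class database: 1 if the class is used, else 0.
    return [1 if cls in observed_classes else 0 for cls in class_db]

class_db = ["android.webkit.WebView", "android.graphics.Picture",
            "android.database.sqlite.SQLiteDatabase"]
static_vec  = use_vector(extract_classes(
    ["android.webkit.WebView: void loadUrl(java.lang.String)"]), class_db)   # from the call graph
dynamic_vec = use_vector(extract_classes(
    ["android.database.sqlite.SQLiteDatabase: void close()"]), class_db)     # from execution traces
hybrid_vec  = static_vec + dynamic_vec   # concatenation used for the hybrid-use dataset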

features into one dataset, denoted as hybrid-use features. In general, 3.3.2 Deep Learning (DL) Classifiers. Deep learning is a class
we will denote them as use features. of machine learning algorithms that uses multiple layers to pro-
Figure 4 shows a sample dataset containing the use features. gressively extract higher level features from the raw input features.
Deep learning classifiers typically comprise an input layer, one or
3.3 Classifiers more hidden layers, and an output layer. In our context, the input
In the last phase, classifiers are trained and tested on each dataset layer accepts vectors of features — use features or sequence features
extracted in phase 2. The following briefly describes the classifiers (Section 3.2). Each vector represents an app. The output layer is the
used in our evaluations. binary classification (benign/malware) of the given app.
This study uses the following deep learning classifiers:
3.3.1 Conventional Machine Learning (ML) Classifiers. We eval- Artificial Neural Network, ANN is a common deep learning method,
uate seven ML classifiers: which comprises of an input layer, one or more hidden, fully-
K-Nearest Neighbours, KNN is one of the simplest classification connected (linear) layers, and an output layer. We used two different
techniques, with less or no prior knowledge of data distribution. configurations of ANN — different number of hidden layers and
The predicted test sample class is set equal to the true class among different number of neurons in each layer — in our experiments.
the nearest training instances [24]. In our experiments, three neigh- The first ANN is a simple ANN, denoted as sANN, which consists
bours comprised the KNN setting to perform the classifier. of two linear layers, with each layer containing 256 neurons. The
Linear Support Vector Machines, SVM determines a hyperplane second ANN is a more complex ANN, denoted as cANN. It consists
that separates both classes with maximal margin, given vectors of three linear layers — with 512, 256, 128 neurons, respectively. At
of two classes as training data. One of these classes is associated the end of these layers, cANN also has a dropout layer with p=0.5
with malware, whereas the other class corresponds to benign in- to avoid overfitting [39].
stances. An unknown/new instance is classified by mapping it to the Convolutional Neural Network, CNN typically comprises three
vector space and determining whether it falls on the malicious or types of layers — convolutional layer, pooling layer, and linear layer
benign side of the hyperplane [13]. SVM is widely used in malware — between the input layer and the output layer. The convolutional
classification task as it produces explainable detection model. layer utilizes the convolution procedure to accomplish the weight
Decision Trees, DT builds a rule-based model that predicts the sharing. The pooling layer progressively reduce the dimension of
class of a target variable by learning decision rules inferred from the feature map and thus, reduce the amount of parameters and
the given set of features. The depth of the tree can be customized to computation. It can be applied by an average pooling procedure or
fit the model. The deeper the tree, the more complex the decision a max pooling procedure. Thereafter, one or more linear layers and
rules and the fitter the model. Deep trees may not generalize the the output layer, typically a SoftMax function, are placed on the
data well (overfitting problem) and thus, usually it is necessary to top layer for classification and recognition. In our experiments, we
limit the maximum depth of the tree. There are a few variants of built the CNN classifier with the following sequence of layers – the
decision tree such as ID3, C4.5, C5.0, and CART. We use CART [8]. input layer, a convolutional layer followed by a max pooling layer,
Random Forest, RF is an ensemble of classifiers using many deci- another convolutional layer, followed by a max pooling layer, and
sion tree models [7]. A different subset of training data is selected one linear layer, a dropout layer with p=0.5, and finally the output
with a replacement to train each tree. The remaining training data layer with Softmax function.
serves to estimate the error and variable importance. RF has been Recurrent Neural Network, RNN is suitable for handling sequen-
proved to be highly accurate classifier for malware detection [17]. tial data. It has memory units, which retain the information of
In our experiments, we used 10 classifiers to form an ensemble. previous inputs or the state of hidden layers and its output depends
AdaBoost, AB is also an ensemble of classifiers. It fits a sequence on previous inputs. It can also have a special layer called LSTM,
of weak models (i.e., models that are only slightly better than ran- which avoids the error vanishing problem by fixing weight of hid-
dom guessing, such as small decision trees) on repeatedly modified den layers to avoid error decay and retaining not all information
versions of the data. The predictions from all of them are then of input but only selected information which is required for future
combined through a weighted majority vote to produce the final outputs. RNN has shown good results in various fields which use
classification. sequential data such as language processing or speech recogni-
Naive Bayes, NB classifier applies Bayes’ theorem with the “naive” tion [14]. In our experiments, we built the RNN with LSTM units.
assumption of conditional independence between every pair of Our RNN classifier consists of the input layer, one LSTM layer, one
features given the value of the class variable [49]. This assumption linear layer, and the output layer with Softmax function. We also
allows NB to learn the model extremely fast. use dropout with p=0.5.
Logistic Regression, LR is a statistical model that uses a logistic We built the above DL classifiers by using Pytorch’s libraries [31].
function to model the probability of a certain class such as pass/fail, As activation function for linear layers and convolutional layers,
malware/benign, etc. we use rectified linear unit (ReLU) function. We use 30 epochs for
We used scikit-learn Python tool [32] to run the above classifiers. training the DL classifiers. The scripts are written in Python.
We used the tool’s default settings, such as K=3 for KNN, number Table 1 shows a summary of general comparison among the clas-
of estimators = 10 for RF, etc, without any fine tuning. Use features sifiers we used, based on the documentations from scikit-learn [32]
are fed into the classifiers as they are. Sequence features are fed into and Pytorch [31], and the references from [28, 29].
the classifiers as categorical features.
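For reference, the seven ML classifiers can be instantiated in scikit-learn roughly as follows. K=3 for KNN and 10 estimators for RF are the settings stated above; the exact SVM and Naive Bayes variants are not specified in the paper, so LinearSVC and GaussianNB are assumptions here.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# The seven conventional ML classifiers, with the settings stated in the paper
# and library defaults otherwise (no fine tuning).
ml_classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": LinearSVC(),
    "DT":  DecisionTreeClassifier(),        # CART implementation
    "RF":  RandomForestClassifier(n_estimators=10),
    "AB":  AdaBoostClassifier(),
    "NB":  GaussianNB(),
    "LR":  LogisticRegression(),
}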
Table 1: Pros and cons of the classifiers [28, 29, 31, 32]

Statistics-based classifiers:
• Naive Bayes. Pros: very fast classifier; suitable for getting quick classification results. Cons: unable to learn complex relationships among features.
• K-Nearest Neighbours. Pros: typically more robust than other statistics-based classifiers for small k values. Cons: large memory and computation time for training.
• Linear SVM. Pros: efficient; easy to analyze output. Cons: only directly applicable to binary classification problems; large memory and computation time for training.
• Logistic Regression. Pros: can learn relatively complex relationships among features. Cons: unpredictable performance, as the learning process may fail to converge (failure of the likelihood maximization algorithm).

Rule-based classifiers:
• Random Forest. Pros: randomization typically helps achieve good performance. Cons: output is hard to analyze.
• Decision Trees. Pros: fast and scalable classifier; easy to analyze output. Cons: less effective when learning features with continuous values.
• AdaBoost. Pros: built-in feature selection capability, which reduces dimensionality and computation time. Cons: sensitive to noisy data and outliers.

Deep learning classifiers:
• Simple/Complex Artificial Neural Network. Pros: parallelization of the learning process; typically achieves good performance. Cons: consumes large memory and computation time for both training and classification, compared to typical ML models.
• Convolutional Neural Network. Pros: fewer neuron connections needed compared to a standard ANN, i.e., a faster learning process; can be varied to suit a particular classification problem. Cons: fine tuning is usually needed to discover a complete hierarchy of features; it also needs a big dataset.
• Recurrent Neural Network. Pros: models time dependencies; able to remember serial events. Cons: the learning process suffers from the vanishing gradient problem; fine tuning to suit a given classification problem is usually needed to avoid this problem.
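As a concrete illustration of the two ANN configurations described in Section 3.3.2, a minimal PyTorch sketch follows: two 256-neuron linear layers for sANN, and 512/256/128 linear layers with dropout p=0.5 for cANN, with ReLU activations. The two-unit output layer and any details beyond those stated in the paper are assumptions.

import torch.nn as nn

def sann(num_features):
    # Simple ANN: two fully connected hidden layers of 256 neurons each.
    return nn.Sequential(
        nn.Linear(num_features, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 2),                 # benign / malware
    )

def cann(num_features):
    # Complex ANN: 512/256/128 hidden layers with dropout p=0.5 before the output.
    return nn.Sequential(
        nn.Linear(num_features, 512), nn.ReLU(),
        nn.Linear(512, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(128, 2),
    )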

4 EVALUATION
This section presents the experimental comparison results of features and classifiers for Android malware detection. Specifically, we investigate the following research questions:
• RQ1: Which type of features, the use of API calls or the sequence of API calls, achieves better malware detection accuracy?
• RQ2: Which type of features, statically extracted, dynamically extracted, or a combination of both, achieves better malware detection accuracy?
• RQ3: Do deep learning classifiers achieve better malware detection accuracy than conventional machine learning classifiers?
• RQ4: What are the training costs for different types of features?

4.1 Experiment Design
Dataset. Initially, we had 20k benign samples collected from the Androzoo repository [2], released from 2017 to 2019. We also had 7757 malware samples: 5500 samples from the Drebin repository [5] and 2257 samples from the Androzoo repository [2], from 2017 to 2019. However, as we evaluate both static and dynamic analysis-based features, we had to keep only those samples that could be analyzed by both the static analysis and the dynamic analysis tools. When we used the FlowDroid [6] tool to extract call graphs, some of the apps caused exceptions, and our intent-fuzzing test generation tool also caused time-outs and crashes for some of the apps during the dynamic analysis. Therefore, we were not able to extract features for those cases. Note that these are limitations of the underlying program analysis tools; for future work, we plan to investigate these issues and address them. Nevertheless, the objective of this experiment is to compare features and classifiers, not to assess the feature collection components.

As a result, after taking the intersection of the apps that can be analyzed by both static and dynamic tools, we ended up with 4572 benign samples and 2399 malware samples: 1208 from the Androzoo repository [2] and 1191 from the Drebin repository [5]. Note that several of the malware samples from Drebin are obfuscated and the malware samples from Androzoo are recent.

For comparison, Table 2 shows the sizes of the datasets used by Android malware detection approaches in related work. Note, however, that these studies only apply either static or dynamic analysis and evaluate a few classifiers, whereas we evaluate 11 classifiers and 6 different types of features. Our dataset size is comparable to the sizes used in some recent studies such as [37, 46].

Table 2: Sizes of the datasets used in some of the malware detection approaches

Reference            #Benign   #Malware
Droid-sec [48]       250       250
DroidSift [50]       13500     2200
Drebin [5]           123453    5560
Narudin et al. [28]  20        1000
Maldozer [22]        37627     33066
RevealDroid [19]     24679     30203
Shen et al. [37]     3899      3899
EnMobile [46]        1717      4897
MaMaDroid [30]       8447      35493
DaDiDroid [21]       43262     20431

Metrics. To assess the accuracy of the classifiers, we use the standard metrics Recall (probability of detection, Pd), Precision (Pr), and F-measure (F), which are typically used for evaluating malware detection accuracy [19, 30]. Recall is computed as Pd = tp/(tp + fn); Precision is computed as Pr = tp/(tp + fp); and F-measure is computed as F = 2*(Pr*Pd)/(Pr + Pd).
Given that we have six datasets, our assessment includes six experiments. We use stratified cross-validation, a standard statistical analysis method [44], to evaluate the performance. Ten-fold cross-validation is widely used [19, 21, 30, 46], but given that we are evaluating several classifiers and features, we instead used five-fold cross-validation, which was also used in [23, 42]. The data is randomly divided into five sets. A classifier is trained on four sets and then tested on the remaining set. This process is repeated five times, each time testing on a different set. The order of the training and test sets is randomized; this design overcomes ordering effects and is important to avoid a malignant increase in performance caused by a certain ordering of the training and test data. Isolating a test set from the training set also conforms to the hold-out test design, which is important to evaluate the classifier's capability to predict new malware [44]. The mean and the standard deviation over the five trials are computed for the evaluation.

The experiments were performed on a Linux machine with a 40-core Intel E5-2640 CPU at 2.40GHz and 330GB of RAM. It took us about a month to extract call graphs and execution traces from all of the 27k plus samples, and about two weeks to extract the six datasets from the final benchmark set, which contains 6971 samples in total.

4.2 Result Comparisons
Table 3, Table 4, and Table 5 show the results of classifiers using static-sequence features, dynamic-sequence features, and hybrid-sequence features, respectively. Table 6, Table 7, and Table 8 show the results of classifiers using static-use features, dynamic-use features, and hybrid-use features, respectively. The columns 'Pd', 'Pr', and 'F' report the mean recall, the mean precision, and the mean F-measure across the cross-validation folds. The columns 'Pd (sd)', 'Pr (sd)', and 'F (sd)' report the corresponding standard deviations.

4.2.1 RQ1: Use of API calls vs Sequence of API calls. As shown in Tables 3, 4, 5, 6, 7, and 8, the F-measures of classifiers using use features are statistically better than those of classifiers using sequence features (according to the Wilcoxon signed-ranks test). The standard deviations of the F-measures for the classifiers using use features are generally quite low, with a maximum standard deviation of 0.209 (NB using dynamic API usage features), whereas the standard deviations of the F-measures for the classifiers using sequence features are statistically higher (according to the Wilcoxon signed-ranks test), with a maximum standard deviation of 0.406 (cANN using static-sequence features).

In Figure 5, we plot the F-measures achieved by classifiers using static-sequence, dynamic-sequence, hybrid-sequence, static-use, dynamic-use, and hybrid-use features, respectively. The figure clearly demonstrates that features which characterize the use of API calls generally produce better results than features which characterize the sequence of API calls, across all types of program analyses.

But there are a few exceptions. The best classifier in terms of F-measure is Random Forest, F = 0.913 (Table 3), which is actually trained with static-sequence features. AdaBoost also achieved better F-measure scores when trained with static-sequence and hybrid-sequence features. On the other hand, the second best classifier is sANN, F = 0.901 (Table 8), which is trained with hybrid-use features.

[Figure 5: F-measures of different types of features (box plots). Summary statistics:]
Feature type                      median   mean   sd
Static - Use of API calls         0.88     0.86   0.06
Static - Sequence of API calls    0.69     0.67   0.26
Dynamic - Use of API calls        0.82     0.77   0.09
Dynamic - Sequence of API calls   0.60     0.60   0.12
Hybrid - Use of API calls         0.87     0.86   0.04
Hybrid - Sequence of API calls    0.67     0.61   0.28

In general, the results can be considered good, given that we use features that are not based on custom (user-defined) methods and classes. Approaches that use such features could see a malignant increase in performance because they may then learn the discriminative features used by the malware; on the other hand, those approaches may later suffer from class and method renaming obfuscation strategies [5].

Our observation from a few randomly selected malware samples is that API sequence features contain more semantic information, as they capture the sequence of API calls involved in a malware activity, whereas API use features capture malware patterns in a simpler manner, based on which APIs are used in a malware activity.

On the other hand, based on the results, we believe that sequence features are generally harder to train with. Parameters of the classifiers, such as the maximum depth in Decision Tree or Random Forest, the K parameter in KNN, and the number of neurons and memory units in deep learning classifiers, need to be fine tuned to improve the performance. We randomly sampled 100 apps and fine tuned the parameters of some of the classifiers for training with API sequence features, and we observed that the accuracy did improve, especially for the CNN and RNN deep learning classifiers. Note that the task of optimizing several classifiers on different types of features took a lot of iterations and a lot of resources in training time and computation. Therefore, in this study, we used the default settings for the ML classifiers and did not fine tune the DL classifiers at all. We leave the task of optimizing the classifiers and systematically evaluating the performance of optimal classifiers as future work.

Overall, we conclude that API use features produce better results on average. That is, classifiers trained with use features would detect Android malware with good accuracy and they work out of the box. On the other hand, since API sequence features provide more semantic information, if time and effort can be spent on fine tuning the classifier, sequence features could be a better choice.
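A sketch of the five-fold stratified cross-validation protocol described in Section 4.1, using scikit-learn; the random seed and scoring helpers are illustrative.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate(clf, X, y, folds=5, seed=0):
    # Train on four folds, test on the held-out fold, repeat five times; report mean and sd.
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores.append((recall_score(y[test_idx], pred),
                       precision_score(y[test_idx], pred),
                       f1_score(y[test_idx], pred)))
    scores = np.array(scores)
    return scores.mean(axis=0), scores.std(axis=0)   # (Pd, Pr, F) mean and sd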
Table 3: Results on using static features that characterize the sequence of API calls at method level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.578  0.109    0.839  0.019    0.683  0.077
SVM         0.588  0.081    0.838  0.015    0.690  0.055
DT          0.885  0.047    0.852  0.039    0.868  0.025
RF          0.920  0.019    0.905  0.027    0.913  0.010
AB          0.903  0.024    0.869  0.027    0.885  0.010
NB          0.989  0.010    0.441  0.027    0.610  0.026
LR          0.843  0.031    0.888  0.036    0.865  0.021
sANN        0.865  0.082    0.837  0.238    0.843  0.105
cANN        0.868  0.188    0.422  0.591    0.496  0.406
CNN         0.667  0.127    0.461  0.196    0.542  0.186
RNN         0.365  0.254    0.014  0.018    0.027  0.034

Table 4: Results on using dynamic features that characterize the sequence of API calls at method level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.651  0.123    0.782  0.106    0.707  0.058
SVM         0.633  0.198    0.785  0.079    0.696  0.130
DT          0.563  0.652    0.722  0.348    0.521  0.397
RF          0.550  0.239    0.691  0.169    0.605  0.171
AB          0.493  0.217    0.698  0.215    0.563  0.121
NB          0.991  0.010    0.376  0.016    0.546  0.016
LR          0.622  0.193    0.777  0.091    0.687  0.140
sANN        0.653  0.024    0.646  0.039    0.649  0.025
cANN        0.699  0.136    0.393  0.094    0.500  0.085
CNN         0.700  0.031    0.828  0.083    0.758  0.035
RNN         0.560  0.147    0.225  0.167    0.317  0.179

Table 5: Results on using hybrid features that characterize the sequence of API calls at method level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.521  0.170    0.803  0.073    0.626  0.102
SVM         0.657  0.133    0.699  0.104    0.672  0.022
DT          0.862  0.086    0.807  0.076    0.832  0.043
RF          0.877  0.081    0.892  0.089    0.883  0.037
AB          0.887  0.088    0.843  0.039    0.863  0.024
NB          0.984  0.024    0.449  0.055    0.616  0.051
LR          0.807  0.148    0.861  0.112    0.828  0.034
sANN        0.899  0.057    0.742  0.243    0.805  0.127
cANN        0.923  0.085    0.135  0.137    0.230  0.199
CNN         0.714  0.134    0.121  0.060    0.204  0.078
RNN         0.527  0.092    0.106  0.018    0.176  0.028

Table 6: Results on using static features that characterize the use of API calls at class level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.617  0.035    0.907  0.041    0.734  0.011
SVM         0.922  0.011    0.863  0.029    0.892  0.01
DT          0.908  0.008    0.833  0.003    0.869  0.002
RF          0.924  0.020    0.879  0.044    0.901  0.028
AB          0.887  0.010    0.847  0.002    0.867  0.006
NB          0.968  0.004    0.675  0.006    0.795  0.005
LR          0.930  0.023    0.872  0.043    0.900  0.012
sANN        0.865  0.023    0.923  0.046    0.893  0.028
cANN        0.865  0.027    0.920  0.045    0.891  0.022
CNN         0.785  0.036    0.810  0.045    0.797  0.032
RNN         0.878  0.023    0.895  0.035    0.884  0.022

Table 7: Results on using dynamic features that characterize the use of API calls at class level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.627  0.253    0.82   0.115    0.706  0.202
SVM         0.839  0.099    0.828  0.095    0.832  0.078
DT          0.660  0.376    0.678  0.270    0.637  0.195
RF          0.807  0.059    0.830  0.139    0.817  0.084
AB          0.723  0.247    0.789  0.096    0.750  0.178
NB          0.857  0.340    0.613  0.146    0.710  0.209
LR          0.848  0.093    0.831  0.117    0.838  0.088
sANN        0.853  0.046    0.887  0.046    0.869  0.016
cANN        0.842  0.031    0.895  0.027    0.868  0.022
CNN         0.638  0.043    0.618  0.062    0.627  0.037
RNN         0.852  0.016    0.869  0.062    0.860  0.027

Table 8: Results on using hybrid features that characterize the use of API calls at class level
Classifier  Pd     Pd (sd)  Pr     Pr (sd)  F      F (sd)
KNN         0.911  0.078    0.833  0.133    0.868  0.064
SVM         0.916  0.056    0.861  0.135    0.885  0.066
DT          0.880  0.138    0.802  0.147    0.833  0.066
RF          0.905  0.074    0.847  0.138    0.872  0.060
AB          0.827  0.229    0.827  0.156    0.820  0.143
NB          0.863  0.272    0.706  0.150    0.773  0.185
LR          0.925  0.067    0.869  0.125    0.894  0.054
sANN        0.873  0.039    0.932  0.045    0.901  0.010
cANN        0.862  0.031    0.935  0.027    0.897  0.016
CNN         0.799  0.023    0.825  0.030    0.812  0.018
RNN         0.879  0.035    0.888  0.037    0.884  0.014
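Section 4.2.1 reports a Wilcoxon signed-ranks test over per-classifier F-measures. A sketch of such a comparison for the static-analysis case, using the mean F values copied from Tables 3 and 6, is shown below; the exact pairing and software used by the authors are not stated, so this is only one plausible realisation.

from scipy.stats import wilcoxon

# Mean F-measures per classifier (KNN, SVM, DT, RF, AB, NB, LR, sANN, cANN, CNN, RNN)
f_static_seq = [0.683, 0.690, 0.868, 0.913, 0.885, 0.610, 0.865, 0.843, 0.496, 0.542, 0.027]
f_static_use = [0.734, 0.892, 0.869, 0.901, 0.867, 0.795, 0.900, 0.893, 0.891, 0.797, 0.884]

stat, p = wilcoxon(f_static_use, f_static_seq)
print(f"Wilcoxon statistic={stat}, p-value={p:.3f}")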

4.2.2 RQ2: Static vs Dynamic vs Hybrid. Referring to Figure 5, we can also observe that the static analysis-based features significantly outperform the dynamic analysis-based features. The mean and median of classifiers using static-use features are better than those of classifiers using dynamic-use features. The same goes for static-sequence features versus dynamic-sequence features. The overall standard deviation for dynamic-based features is slightly better than that of static-based features.

We looked at our data files and found that the sizes of the execution traces are much smaller than the sizes of the call graphs. Our analysis of all the execution traces showed that dynamic analysis was only able to cover 19357 distinct standard classes and 163292 distinct standard methods. By contrast, our analysis of all the call graphs showed that static analysis covered 134558 distinct classes and 2898245 distinct methods. Therefore, the static analysis-based features characterize more program behaviours and were more informative for the classifiers than the dynamic analysis-based features.

Note that, to cover program behaviors as much as possible, we used an intent fuzzer that handles inter-component (inter-app) interactions, complemented with a GUI test generator that handles user interaction events and inputs.
Even though static analysis is generally weak against code obfuscation and our dataset contains several malware samples with obfuscated code, this did not have a significant effect on static analysis, because we only used features that represent standard (not user-defined) classes and methods to mitigate renaming obfuscation (see Section 3.2).

Regarding the use of hybrid features, we found that they improved the accuracy for only some of the classifiers and had a negative effect on the others. For example, KNN's F-measure when trained with static-use features is 0.734, and it improved to 0.868 when static-use and dynamic-use features are combined. The same goes for all four DL classifiers when trained with use features. But the results of all the other classifiers decreased. With respect to sequence features, only NB's F-measure improved when static-sequence and dynamic-sequence features are combined; in all the other cases, the F-measure decreased. Therefore, we note that the usefulness of combining static analysis and dynamic analysis is contextual. Since the performance of classifiers using dynamic analysis-based features is generally poor, those features may have polluted the classifiers' learning. Hence, we note that when combining statically- and dynamically-extracted information, data preprocessing, such as feature selection to remove redundant or irrelevant features, should be applied.

Overall, we conclude that static analysis is more desirable than dynamic analysis, unless we can further improve the state of the art of test generation for Android apps to improve coverage. We find that the limitation of static analysis can be mitigated by focusing on the standard methods and classes. For hybrid analysis, data preprocessing should be considered.
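One possible realisation of the data preprocessing recommended above for hybrid features is sketched below: variance filtering followed by univariate selection with chi-squared scores, which suits binary use features. This is not part of the paper's pipeline; the number of selected features is an arbitrary placeholder.

import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

def preprocess_hybrid(X_static, X_dynamic, y, k=5000):
    # Concatenate static and dynamic use features, then drop constant and weakly
    # associated features before training. One possible realisation of the data
    # preprocessing recommended in Section 4.2.2; not the authors' actual pipeline.
    X = np.hstack([X_static, X_dynamic])
    X = VarianceThreshold(threshold=0.0).fit_transform(X)   # remove constant columns
    return SelectKBest(chi2, k=min(k, X.shape[1])).fit_transform(X, y)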
4.2.3 RQ3: ML Classifiers vs DL Classifiers. In Figure 6, we plot the recall and precision of the classifiers, averaged across all types of features. Instead of F-measures, we discuss here the performance of the classifiers based on recall and precision. This is because in some contexts, e.g., in highly security-critical systems, a higher recall at the expense of some precision loss is more desirable, whereas in other contexts a higher precision could be more desirable.

As shown in Figure 6, overall, averaging across all types of features, the ML classifiers achieved better scores than the DL classifiers in terms of both recall and precision. The ML classifiers achieved a median and mean recall of 0.86 and 0.80, respectively, and a median and mean precision of 0.83 and 0.78, respectively. The DL classifiers achieved a median and mean recall of 0.85 and 0.77, respectively, and a median and mean precision of 0.82 and 0.64, respectively.

One possible reason why the ML classifiers generally perform better than the DL classifiers may be the same fine tuning issue discussed in Section 4.2.1. We used the readily available ML classifiers from the scikit-learn tool [32] with its built-in settings, which may have been optimal for malware detection. By contrast, since there are many different ways to build DL classifiers, PyTorch [31] only provides abstract neural network classes on which custom DL classifiers with specific configurations of DL layers are usually built. As such, without fine tuning parameters such as the number of layers, the sequence of different types of layers, the number of neurons, the dropout value, and the training epochs, the DL classifiers may not perform well.

On the other hand, the DL classifiers did achieve better results than the ML classifiers when trained with dynamic features, for both API sequence and API use features. As shown in Table 4, CNN achieved Pd = 0.7 and Pr = 0.828, which are better than all other classifiers trained with dynamic-sequence features, except NB, which has Pd = 0.991 but Pr = 0.376. Similarly, as shown in Table 7, sANN, cANN, and RNN achieved better results than the ML classifiers.

One other interesting finding is that the simplest classifier, Naive Bayes, achieved Pd = 0.942, averaging across all types of features. This is the highest recall among all the classifiers; that is, it detected 2260 out of the 2399 malware samples in our dataset. On the other hand, it only achieved Pr = 0.543 on average; that is, it produced one false alarm for every two malware reports. When detecting malware is of utmost importance, Naive Bayes can be a good option. When better precision is more desirable, RF can be considered, which achieved Pd = 0.831 and Pr = 0.841, averaging across all types of features. It is the most balanced classifier and thus can be considered the best, at least in our experiments.

[Figure 6: Recall and precision of classifiers (scatter plots per feature type and classifier). Summary statistics:]
Classifier type    Pd (median)  Pd (mean)  Pd (sd)
Machine learning   0.86         0.80       0.15
Deep learning      0.85         0.77       0.14
Classifier type    Pr (median)  Pr (mean)  Pr (sd)
Machine learning   0.83         0.78       0.12
Deep learning      0.82         0.64       0.32

4.2.4 RQ4: Training Costs. As the Android platform is constantly evolving, a malware detector may often need to be re-trained to learn the new characteristics of Android. Slow training and analysis of Android apps could allow malware to remain undetected long enough to cause undesirable effects on end users. Hence, it is important to assess the training cost of classifiers using different types of features.

In Figure 7, we plot the time taken by the classifiers when training with different types of features. The figure illustrates the average time taken to train the classifiers on each type of features. In our five-fold cross-validation setting, the training time is computed as the time taken to train on the four training folds of the dataset. We first compare the training time required for the static, dynamic, and hybrid cases; we then compare the training time required for API use features versus API sequence features.
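How the training time was instrumented is not detailed in the paper; a minimal way to measure the per-fold fitting time is:

import time

def training_time(clf, X_train, y_train):
    # Wall-clock time (seconds) to fit a classifier on the four training folds.
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    return time.perf_counter() - start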
4.2.4 RQ4: Training Costs. As the Android platform is constantly evolving, a malware detector may often need to be re-trained to learn the new characteristics of Android. A slow training and analysis of Android apps could allow malware to remain undetected long enough to cause undesirable effects on end users. Hence, it is important to assess the training cost of the classifiers when using different types of features.

In Figure 7, we plot the time taken by the classifiers when training with different types of features. The figure reports the average time taken to train the classifiers on each type of features. In our five-fold cross-validation setting, the training time is computed as the time taken to train on the four training folds of the dataset. We first compare the training time required for the static, dynamic, and hybrid cases. We then compare the training time required for API usage features versus API sequence features.
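The measurement itself is simple; the following sketch is illustrative only (it assumes scikit-learn's KFold and a Random Forest classifier, and the function name and parameters are not taken from our actual scripts). It records the time spent fitting a classifier on the four training folds in each round of five-fold cross-validation:

import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def average_training_time(X, y, n_splits=5):
    """X, y: numpy arrays of feature vectors and labels.
    Returns the average time (in seconds) spent fitting on the training folds."""
    times = []
    for train_idx, _test_idx in KFold(n_splits=n_splits, shuffle=True,
                                      random_state=42).split(X):
        clf = RandomForestClassifier()           # a fresh classifier per fold
        start = time.perf_counter()
        clf.fit(X[train_idx], y[train_idx])      # training on 4 of the 5 folds
        times.append(time.perf_counter() - start)
    return np.mean(times)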
As shown in Figure 7, intuitively, since there are more features in the hybrid case, the classifiers took the longest to train with hybrid features. Comparing the static and dynamic cases, classifiers using the static-based features took longer to train than those using the dynamic-based features. On average, the training cost of using static-based features (1141 seconds) is about 1.78 times that of using dynamic-based features (641 seconds).

We can also observe from Figure 7 that API use features are much faster to train with than API sequence features. On average, the training cost of using API sequence features (1485 seconds) is about twice that of using API use features (719 seconds).

Hence, overall, it can be concluded that while dynamic-based features are less accurate, they are much faster to train with than static-based features. API use features are both faster and simpler (no fine-tuning required to achieve good accuracy) to train with than API sequence features.

[Figure 7 plots the training time in seconds (x-axis, 0 to 10000) for each combination of analysis type (Static, Dynamic, Hybrid) and feature type (Use of API calls, Sequence of API calls). The summary statistics reported with the figure are:]

Feature type                        median    mean      sd
Static - Use of API calls           395.01    580.36    614.93
Static - Sequence of API calls      532.42    1700.96   2624.58
Dynamic - Use of API calls          100.45    243.83    379.08
Dynamic - Sequence of API calls     369.50    1038.47   1514.59
Hybrid - Use of API calls           440.87    1333.75   2095.63
Hybrid - Sequence of API calls      716.40    1715.60   3018.26

Figure 7: Training time (in seconds) for different types of features
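These averages and ratios follow directly from the mean training times reported with Figure 7:

  static:        (580.36 + 1700.96) / 2 = 1140.66 ≈ 1141 s
  dynamic:       (243.83 + 1038.47) / 2 = 641.15  ≈ 641 s,   so 1140.66 / 641.15 ≈ 1.78
  API sequence:  (1700.96 + 1038.47 + 1715.60) / 3 = 1485.01 ≈ 1485 s
  API use:       (580.36 + 243.83 + 1333.75) / 3   = 719.31  ≈ 719 s,  so 1485.01 / 719.31 ≈ 2.07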
4.3 Limitations
The main limitation of this work is that our study excludes parameter fine-tuning and data preprocessing, except for specifying the API sequence features as categorical. Tuning the parameters of the eleven classifiers across the six types of features we used would require a huge amount of time and resources. Therefore, this study reports the malware detection accuracy of baseline classifiers, without optimization. Hence, researchers should consider the reported classifier performances as one data point, a starting point for further exploration of optimized classifiers. This is the subject of our future work. In particular, based on our current results, we observed that this limitation hurts the deep learning classifiers more; addressing it could be the main focus of our future plan. In addition, it would also be interesting to investigate whether applying data preprocessing, such as feature selection, would result in better performance for hybrid features.
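For illustration, API call sequences can be treated as categorical inputs by mapping each distinct API call to an integer index and feeding the index sequence to an embedding layer. The sketch below is purely illustrative; the API names, vocabulary, sequence length, and embedding dimension are hypothetical and not necessarily those used in our pipeline:

import torch
import torch.nn as nn

# Illustrative vocabulary: each distinct API call gets an integer index (0 = padding).
api_vocab = {"<pad>": 0,
             "TelephonyManager.getDeviceId": 1,
             "SmsManager.sendTextMessage": 2,
             "URL.openConnection": 3}

def encode_sequence(api_calls, vocab, max_len=10):
    """Map an ordered list of API calls to a fixed-length tensor of category indices."""
    idx = [vocab.get(call, 0) for call in api_calls][:max_len]
    idx += [0] * (max_len - len(idx))            # pad to a fixed length
    return torch.tensor(idx)

# The categorical indices can then feed an embedding layer of a DL model.
embedding = nn.Embedding(num_embeddings=len(api_vocab), embedding_dim=8)
sample = encode_sequence(["TelephonyManager.getDeviceId",
                          "SmsManager.sendTextMessage"], api_vocab)
vectors = embedding(sample)                      # shape: (10, 8)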
Our dataset is imbalanced: its benign-to-malware ratio is 1.9 to 1, which may, in theory, affect precision and recall. However, while it is challenging to estimate the actual ratio of benign to malware apps in the wild, it is more likely that they are not balanced. Some studies have chosen imbalanced datasets [5, 21] and some have chosen balanced datasets [23, 37]. Certain Android markets have been known to have a benign-to-malware ratio of 1.5 to 1 [10]. Hence, our dataset may reflect reality better. We plan to investigate the implications of different dataset ratios in future work.

Our analysis does not consider native calls, although FlowDroid, the underlying static analysis tool we use, handles common native calls using some heuristics. It is challenging to extract features that characterize native calls using static analysis; dynamic runtime analysis approaches such as [15, 41] could be used instead. Our study also did not consider API call frequency. A recent study [30] found that their proposed malware detection model is less accurate when trained on API call frequency features instead of API sequence features. We plan to include this evaluation in our future work.

5 CONCLUSION
In this work, we evaluated six different types of features and eleven classifiers. The features characterize the use of API calls at the class level and the sequence of API calls at the method level. Both static analysis and dynamic analysis were used. The classifiers include both conventional machine learning and deep learning models. To assess detection accuracy, we used recall, precision, and mainly the F-measure. We also discussed the training costs. The experiments were conducted on a common benchmark containing 4572 benign samples and 2399 malware samples.

Our results show that, compared to the features which characterize the sequence of API calls, the features which characterize the use of API calls are faster and simpler to train with and, in general, produce classifiers with better accuracies. Static analysis-based features characterize more program behaviours than dynamic analysis-based features. Hence, they produced classifiers with better accuracies, but at a training cost that is, on average, 1.78 times that of the dynamic analysis-based features. Overall, the best F-measure (0.913) was achieved by an ML classifier, the Random Forest classifier, which was trained with the static API sequence-based features. The second-best F-measure (0.901) was achieved by a DL classifier, a simple Artificial Neural Network model, which was trained with the hybrid API usage-based features. In our future work, we plan to investigate data preprocessing, feature selection, and parameter fine-tuning to produce optimal classifiers and evaluate their impacts. We also plan to evaluate frequency-based features.

ACKNOWLEDGMENT
The work of L. K. Shar is supported by the National Research Foundation Singapore, under the National Satellite of Excellence in Mobile System Security and Cloud Security (NRF2018NCR-NSOE004-0001).

REFERENCES
[1] Y. Aafer, W. Du, and H. Yin. Droidapiminer: Mining api-level features for robust malware detection in android. In International conference on security and privacy in communication systems, pages 86–103. Springer, 2013.
[2] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon. Androzoo: Collecting millions of android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, pages 468–471. ACM, 2016.
[3] H. Alshahrani, H. Mansourt, S. Thorn, A. Alshehri, A. Alzahrani, and H. Fu. Ddefender: Android application threat detection using static and dynamic analysis. In 2018 IEEE International Conference on Consumer Electronics (ICCE), pages 1–6. IEEE, 2018.
[4] Android. UI/Application Exerciser Monkey. https://developer.android.com/studio/test/monkey, 2019.
[5] D. Arp, M. Spreitzenbarth, H. Gascon, K. Rieck, and C. Siemens. Drebin: Effective and explainable detection of android malware in your pocket. 2014.
[6] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel. Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 259–269, New York, NY, USA, 2014. ACM.
[7] I. Barandiaran. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell, 20(8):1–22, 1998.
[8] L. Breiman. Classification and regression trees. Routledge, 2017.
[9] P. P. Chan and W.-K. Song. Static detection of android malware by using permissions and api calls. In 2014 International Conference on Machine Learning and Cybernetics, volume 1, pages 82–87. IEEE, 2014.
[10] K. Chen, P. Wang, Y. Lee, X. Wang, N. Zhang, H. Huang, W. Zou, and P. Liu. Finding unknown malice in 10 seconds: Mass vetting for new threats at the google-play scale. In 24th USENIX Security Symposium (USENIX Security 15), pages 659–674, 2015.
[11] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu. Stormdroid: A streaminglized machine learning-based system for detecting android malware. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 377–388, 2016.
[12] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant. Semantics-aware malware detection. In 2005 IEEE Symposium on Security and Privacy (S&P'05), pages 32–46. IEEE, 2005.
[13] N. Cristianini, J. Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.
[14] L. Deng, D. Yu, et al. Deep learning: methods and applications. Foundations and Trends® in Signal Processing, 7(3–4):197–387, 2014.
[15] G. Dini, F. Martinelli, A. Saracino, and D. Sgandurra. Madam: a multi-level anomaly detector for android malware. In International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, pages 240–253. Springer, 2012.
[16] W. Enck, M. Ongtang, and P. McDaniel. On lightweight mobile phone application certification. In Proceedings of the 16th ACM conference on Computer and communications security, pages 235–245. ACM, 2009.
[17] M. Eskandari and S. Hashemi. A graph mining approach for detecting unknown malwares. Journal of Visual Languages & Computing, 23(3):154–162, 2012.
[18] M. Fan, J. Liu, X. Luo, K. Chen, T. Chen, Z. Tian, X. Zhang, Q. Zheng, and T. Liu. Frequent subgraph based familial classification of android malware. In 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), pages 24–35. IEEE, 2016.
[19] J. Garcia, M. Hammad, and S. Malek. Lightweight, obfuscation-resilient detection and family identification of android malware. ACM Transactions on Software Engineering and Methodology (TOSEM), 26(3):11, 2018.
[20] C.-Y. Huang, Y.-T. Tsai, and C.-H. Hsu. Performance evaluation on permission-based detection for android malware. In Advances in Intelligent Systems and Applications-Volume 2, pages 111–120. Springer, 2013.
[21] M. Ikram, P. Beaume, and M. A. Kaafar. Dadidroid: An obfuscation resilient tool for detecting android malware via weighted directed call graph modelling. arXiv preprint arXiv:1905.09136, 2019.
[22] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb. Android malware detection using deep learning on api method sequences. arXiv preprint arXiv:1712.08996, 2017.
[23] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb. Maldozer: Automatic framework for android malware detection using deep learning. Digital Investigation, 24:S48–S59, 2018.
[24] Y. Liao and V. R. Vemuri. Use of k-nearest neighbor classifier for intrusion detection. Computers & security, 21(5):439–448, 2002.
[25] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratantonio, V. Van Der Veen, and C. Platzer. Andrubis–1,000,000 apps later: A view on current android malware behaviors. In 2014 third international workshop on building analysis datasets and gathering experience returns for security (BADGERS), pages 3–17. IEEE, 2014.
[26] X. Liu and J. Liu. A two-layered permission-based android malware detection scheme. In 2014 2nd IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, pages 142–148. IEEE, 2014.
[27] N. McLaughlin, J. Martinez del Rincon, B. Kang, S. Yerima, P. Miller, S. Sezer, Y. Safaei, E. Trickel, Z. Zhao, A. Doupé, et al. Deep android malware detection. In Proceedings of the Seventh ACM on Conference on Data and Application Security and Privacy, pages 301–308. ACM, 2017.
[28] F. A. Narudin, A. Feizollah, N. B. Anuar, and A. Gani. Evaluation of machine learning classifiers for mobile malware detection. Soft Computing, 20(1):343–357, 2016.
[29] A. Naway and Y. Li. A review on the use of deep learning in android malware detection. arXiv preprint arXiv:1812.10360, 2018.
[30] L. Onwuzurike, E. Mariconti, P. Andriotis, E. D. Cristofaro, G. Ross, and G. Stringhini. Mamadroid: Detecting android malware by building markov chains of behavioral models (extended version). ACM Transactions on Privacy and Security (TOPS), 22(2):14, 2019.
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[33] V. Rastogi, Y. Chen, and X. Jiang. Droidchameleon: evaluating android anti-malware against transformation attacks. In Proceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, pages 329–334, 2013.
[34] B. Sanz, I. Santos, C. Laorden, X. Ugarte-Pedrero, P. G. Bringas, and G. Álvarez. Puma: Permission usage to detect malware in android. In International Joint Conference CISIS'12-ICEUTE 12-SOCO 12 Special Sessions, pages 289–298. Springer, 2013.
[35] L. K. Shar. Experimental comparison of features and machine learning classifiers for android malware detection. https://github.com/sharlwinkhin/msoft20, 2020.
[36] A. Sharma and S. K. Dash. Mining api calls and permissions for android malware detection. In International Conference on Cryptology and Network Security, pages 191–205. Springer, 2014.
[37] F. Shen, J. Del Vecchio, A. Mohaisen, S. Y. Ko, and L. Ziarek. Android malware detection using complex-flows. IEEE Transactions on Mobile Computing, 18(6):1231–1245, 2018.
[38] Soot. Soot - a java optimization framework, https://github.com/sable/soot. 2018.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
[40] Symantec. Internet Security Threat Report. https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf, 2019.
[41] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi. Malware detection with deep neural network using process behavior. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), volume 2, pages 577–582. IEEE, 2016.
[42] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi. Malware detection with deep neural network using process behavior. In 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), volume 2, pages 577–582. IEEE, 2016.
[43] R. Vinayakumar, K. Soman, P. Poornachandran, and S. Sachin Kumar. Detecting android malware using long short-term memory (lstm). Journal of Intelligent & Fuzzy Systems, 34(3):1277–1288, 2018.
[44] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
[45] K. Xu, Y. Li, R. H. Deng, and K. Chen. Deeprefiner: Multi-layer android malware detection system applying deep neural networks. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pages 473–487. IEEE, 2018.
[46] W. Yang, M. Prasad, and T. Xie. Enmobile: Entity-based characterization and analysis of mobile malware. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), pages 384–394. IEEE, 2018.
[47] S. Y. Yerima, S. Sezer, and I. Muttik. High accuracy android malware detection using ensemble learning. IET Information Security, 9(6):313–320, 2015.
[48] Z. Yuan, Y. Lu, Z. Wang, and Y. Xue. Droid-sec: deep learning in android malware detection. In ACM SIGCOMM Computer Communication Review, volume 44, pages 371–372. ACM, 2014.
[49] H. Zhang. The optimality of naive bayes. AA, 1(2):3, 2004.
[50] M. Zhang, Y. Duan, H. Yin, and Z. Zhao. Semantics-aware android malware classification using weighted contextual api dependency graphs. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1105–1116, 2014.
