A Performance-Sensitive Malware Detection System Using Deep Learning on Mobile Devices

The document presents MobiTive, a performance-sensitive Android malware detection system that utilizes deep learning on mobile devices to provide real-time protection against malware threats. It addresses the limitations of server-side detection by implementing a pre-installed solution that leverages customized deep neural networks and a novel feature extraction method from binary code. MobiTive demonstrates high accuracy (96.78%) and efficiency (less than 3 seconds for detection) across various mobile devices, making it a practical solution for enhancing mobile security.


IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 16, 2021 1563

A Performance-Sensitive Malware Detection System Using Deep Learning on Mobile Devices

Ruitao Feng, Sen Chen, Member, IEEE, Xiaofei Xie, Guozhu Meng, Shang-Wei Lin, and Yang Liu, Senior Member, IEEE

Abstract— Currently, Android malware detection is mostly performed on server side against the increasing number of malware. Powerful computing resource provides more exhaustive protection for app markets than maintaining detection by a single user. However, apart from the applications (apps) provided by the official market (i.e., Google Play Store), apps from unofficial markets and third-party resources are always causing serious security threats to end-users. Meanwhile, it is a time-consuming task if the app is downloaded first and then uploaded to the server side for detection, because the network transmission has a lot of overhead. In addition, the uploading process also suffers from the security threats of attackers. Consequently, a last line of defense on mobile devices is necessary and much-needed. In this paper, we propose an effective Android malware detection system, MobiTive, leveraging customized deep neural networks to provide a real-time and responsive detection environment on mobile devices. MobiTive is a pre-installed solution rather than an app scanning and monitoring engine using after installation, which is more practical and secure. Although a deep learning-based approach can be maintained on server side efficiently for malware detection, original deep learning models cannot be directly deployed and executed on mobile devices due to various performance limitations, such as computation power, memory size, and energy. Therefore, we evaluate and investigate the following key points: (1) the performance of different feature extraction methods based on source code or binary code; (2) the performance of different feature type selections for deep learning on mobile devices; (3) the detection accuracy of different deep neural networks on mobile devices; (4) the real-time detection performance and accuracy on different mobile devices; (5) the potential based on the evolution trend of mobile devices’ specifications; and finally we further propose a practical solution (MobiTive) to detect Android malware on mobile devices.

Index Terms— Android malware, malware detection, deep neural network, mobile platform, performance.

Manuscript received April 5, 2020; revised August 4, 2020; accepted August 31, 2020. Date of publication September 23, 2020; date of current version December 11, 2020. This work was supported in part by the Singapore Ministry of Education Academic Research Fund Tier 1 under Award 2018-T1-002-069, in part by the National Research Foundation, Prime Ministers Office, Singapore through its National Cybersecurity Research and Development Program under Award RF2018 NCR-NCR005-0001, in part by the Singapore National Research Foundation through NCR under Award NSOE003-0001, in part by the NRF Investigatorship under Grant NRFI06-2020-0022, in part by the National Research Foundation, Prime Ministers Office, Singapore through NCR under Award NRF2018NCR-NSOE004-0001, in part by the National Natural Science Foundation of China under Grant 61902395, and in part by the NVIDIA AI Tech Center (NVAITC). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Debdeep Mukhopadhyay. (Corresponding author: Sen Chen.)

Ruitao Feng, Xiaofei Xie, Shang-Wei Lin, and Yang Liu are with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Sen Chen is with the College of Intelligence and Computing, Tianjin University, Tianjin 300350, China, and also with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Guozhu Meng is with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100864, China (e-mail: mengguozhu@iie.ac.cn).

Digital Object Identifier 10.1109/TIFS.2020.3025436

I. INTRODUCTION

WITH the currently increasing number of Android devices and applications (apps), plenty of Android users are benefited from that. The security and privacy concerns are also increasingly becoming the focus point to various mobile users and stakeholders. For example, more and more users store their personal data in mobile devices [1], [2] through various popular apps such as shopping, banking, and social apps. Consequently, since the last decade, attackers shift their attention to mobile apps. That makes Android malware undoubtedly become one of the most important security threats in this security field [3], [4].

Therefore, how to detect Android malware becomes a severe problem. End-users always expect a secure environment which is maintained by the app markets. In other words, they consider their app sources are all trustable and secure enough. It is not surprising that the demands of Android malware detection approaches have been proposed, such as signature-based approaches [5], [6], behavior-based approaches [7], [8], data-flow analysis-based approaches [9], [10]. We note that machine learning-based approach [11]–[18] is one of the most promising techniques in detecting Android malware. With the available big data and hardware evolution over the past decade, deep learning has achieved tremendous success in many cutting-edge domains, including Android malware detection. Actually, all of the above protecting solutions are mostly on server side for app markets. However, when a new Android malware family is reported, not all the app markets are able to respond in a responsive time. The current analysis workflow always follows analyzing malicious behaviors within apps, building the detection models with the generated features and then performing the detection on the entire apps. Since the number of the real-world Android apps is extremely large, e.g., there are more than 3 million Android apps on Google Play Store, it is a time-consuming task to perform the complete detection with that large number of apps. Moreover, the apps from unofficial markets and third-party resources like XDA [19] are more vulnerable in the wild. The security of these kinds of apps is indeed unpredictable and uncontrollable. The traditional server-side based malware detection surely has unignorable drawbacks when detecting such apps, because
1556-6013 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Northeastern University. Downloaded on January 18,2021 at 13:38:12 UTC from IEEE Xplore. Restrictions apply.

(1) it is a time-consuming task to upload the apps to server before the installation, especially for large apps; (2) the uploading process via the Internet is not secure. For example, attackers may modify the malware during the uploading period such that an incorrect “benign” result is returned. As a result, the users will install the malware. Hence, a last line of defense on mobile devices is necessary and much-needed. To address the severe problem, we intend to conduct Android malware detection on mobile devices instead of server side.

Actually, machine learning-based approaches have achieved better performance compared with other approaches in Android malware detection [11], [13], [17], [20], [21]. In this paper, we intend to deploy the trained deep learning (DL) models from server-side to mobile devices. While a computationally intensive deep learning software could be executed efficiently on server-side with the GPU support, such deep learning models usually cannot be directly deployed and executed on other platforms supported by small mobile devices due to various computation resource limitations such as the computation power, memory size, and energy. In our previous work [23], we leverage TensorFlow Lite [24] to migrate the deep learning models. We proposed a convolutional neural network (CNN)-based Android malware detection system on mobile platform, which leveraged three kinds of features from decompiled Android apps according to the performance-based feature selection mechanism. We have substantially extended our previous work from the following aspects:

• In the conference version [23], we only focused on the performance of different feature types extracted from decompiled files such as smali files. To reach the best performance on mobile devices, we take the installation mechanism in the Android operation system into account. Specifically, we analyze and extract two types of features (i.e., manifest properties and API calls) from Dalvik binary files directly instead of the decompiled files.
• Meanwhile, to enrich the malicious behavior coverage of our selected features, we perform an empirical analysis to understand the existing malicious behaviors, most of which are collected from industrial malware analysis reports (e.g., Symantec Threats [25]). According to the understanding, we further update the feature inputs with the matching results between text-based behavior descriptions and code level features (details on our website [26]).
• To figure out the potential detection accuracy promotion of different deep neural networks, we not only apply our new extracted features with CNN models, but also present six more kinds of recurrent neural networks models (e.g., LSTM and GRU). Finally, we customized one RNN model to adopt the device-based detection scenario. Moreover, we further compare with four other existing Android malware detection approaches to demonstrate the effectiveness and efficiency of our approach.
• To investigate the effectiveness of our system on multi-class classification task, we demonstrate the result on classifying 701,300 Android malware into 21 families with our system.
• To peek into the average usability and best practice for our new system, we evaluate our system on six real mobile devices from different manufacturers such as Google, Huawei, and Samsung, which released between 2015 and 2019. Meanwhile, we conduct a run-time performance evaluation with other device-end solution such as dynamic behavior analysis to demonstrate the effectiveness of our approach. We also investigate the development trend of Android mobile phones to further understand the system usability.

According to the evaluation metrics of accuracy and time cost from different features and neural networks, we propose an effective and efficient Android malware detection system on mobile devices, named MobiTive. MobiTive leverages (1) a newly-proposed feature extraction method from binary code; (2) a performance-based feature type selection mechanism; (3) a novel feature updating method through malicious behavior mining and understanding; (4) a customized deep neural network for classification. So that, MobiTive can provide a real-time and fast responsive environment on mobile devices.

In our comprehensive experiments, (1) we first divide the feature preparation procedure into two steps, which are raw data extraction and feature extraction, and evaluate the performance (time cost) separately to decide the feature selection. (2) With the selected features, we then provide an accuracy comparison between different feature categories. (3) The behavior-based feature updating method performs around 1%∼5% accuracy increase. (4) We provide a comprehensive comparison between seven different neural networks (e.g., CNN, LSTM, and GRU) to show the potential improvement of our customized DL models on network definition. (5) We further evaluate the performance and accuracy of MobiTive on different real mobile devices by using our customized RNN model and compare with dynamic device-end solutions. (6) In the last part of our experiments, we perform an analysis of the performance trend on mobile devices from three different aspects and integrate the results to provide a strong evidence on the potential of MobiTive in practice. Specifically, MobiTive achieves a relatively higher classification accuracy (i.e., 96.78% accuracy) on real testing data in the wild and mobile devices with relatively lower overhead (i.e., less than 3 seconds on average for one app).

In summary, we make the following main contributions.

• We propose MobiTive, a device-end solution to protect mobile devices from malware threats in real-time efficiently by leveraging customized deep neural networks and binary features. This research work aims to detect malware directly on mobile devices as a pre-installed and run-time solution rather than detecting them on common servers or monitoring them after installation.
• We propose a new feature extraction method from binary code, as well as a feature updating method based on the understanding of malicious behaviors. Due to the high performance demand of mobile devices, we evaluate the different performance (time cost) and accuracy with various feature types and neural networks, and further provide a comparison against four existing Android malware detection approaches. Besides, we also investigate the accuracy on multi-class classification task.
• We investigate the different performance on multiple devices from different manufacturers, and further provide insights of the current quality and potential for our approach according to the feature extraction and prediction time cost on six real mobile devices. Meanwhile, an additional comparison on run-time efficiency and discussion on effectiveness is provided to show the advantages against dynamic malware detection system based on behavior analysis.

FENG et al.: PERFORMANCE-SENSITIVE MALWARE DETECTION SYSTEM USING DL ON MOBILE DEVICES 1565

II. PRELIMINARIES

In this section, we briefly introduce the structure of Android apps and Dalvik executable, the existing Android security mechanisms, and the migration/quantization procedure of trained DL models on PC/Server side.

A. Android Apps and Dalvik Executable

To execute the code of Android apps, developers compile their source code and other components into an Android application package (APK). APK is a compressed application file for Android platform, which contains a manifest file (i.e., AndroidManifest.xml), Dex files, resources, assets, etc. The manifest file contains the meta-data for Android apps, which defines the package name and application ID, app components (e.g., Intent filters, activities, and services), permissions, device compatibility (e.g., uses-feature and uses-sdk). Files with the .dex extension are Dalvik executable files, which are converted from Java bytecode into an alternative instruction set and can be executed on the Dalvik virtual machine in Android OS. A Dalvik executable file contains 21 kinds of contents, which can be divided into metadata and program information, etc. The metadata information of a Dalvik executable file is provided by the header, checksum, signature, etc. These are followed by the size and offset values of program information (e.g., class definition identifiers, method identifiers, type identifiers, string identifiers) and the map offset, which provides concrete mappings between static information (e.g., strings and method names).

B. Security Mechanisms

The existing security mechanisms can be mainly divided into three categories, which are application market, Android OS platform, and device-end aspects in practice.

From the aspect of Android market, the official market (i.e., Google Play Store) provides a security verification when an APK is uploaded. For instance, Google provides protection backed by its machine learning algorithm. Some high-quality third-party markets also present security check for the uploaded applications. For example, ApkMirror [27] not only provides the signature verification, but also performs a protection service provided by GuardSquare. However, most of the current security check services provided by third-party markets are very simple and limited. Some of them only contain a signature verification, which can be bypassed easily. Therefore, the users, who download applications from the third-party markets, have to install and use them at their own security risk.

On the mobile devices, many antivirus applications are available. The most famous security applications, like Avast and AVG, mainly provide their antivirus services by monitoring the privacy-sensitive components (e.g., run-time permission requests), and scanning the signature of suspicious apps with their local or on-cloud virus database. Besides the protection from outside, Android OS also provides some strong built-in security mechanisms, like application sandbox, etc. Application sandbox mechanism provides an independent execution environment for every application. Hence, the attack from an application can only work on its own requested components. For instance, if Bluetooth permissions and related activities are not requested in the application, the attack can never access the functions provided by Bluetooth.

C. Deep Learning Model Migration and Quantization

After a DL model finishes the training process and is ready to deploy to a target device, it often goes through either quantization, or platform migration, or both, before being deployed to end-user applications. This is because the training phase requires a vast amount of computation and energy resources. As the model size and the complexity of the tasks grow, more data are needed to train the network till reaching optimality, which could spend days, if not weeks, in training on high-performance GPU clusters. On the other hand, the deployment of the DNN models is usually faced with the resource-constrained environment with limited computation, power, etc. Due to environment differences of a target platform (e.g., mobile phones, green energy embedded systems) and training platform (e.g., often equipped with GPUs), a DL model often goes through a customization phase to cater to specific software and hardware constraints of a target platform. Quantization reduces the precision of a DL model so as to improve the computation efficiency, reduce memory consumption and storage size, which has become a common practice when migrating a large DL model trained on the cloud system to a mobile or IoT device with low computation power.

Recently, the rapid development of system-on-chip (SoC) acceleration (e.g., Qualcomm Snapdragon, Kirin 970, Samsung Exynos9) for AI applications provides the hardware support and foundation for universal deployment across platforms, especially on mobile devices and edge computing devices. Some lightweight solutions are proposed for mobile platforms such as CoreML [28], TensorFlow Lite [24], Caffe2 Mobile [29] and PyTorch Android [30]. This offers an opportunity to deploy the DL-based malware detection task on a mobile device directly.

III. APPROACH

A. Overview of MobiTive

To achieve our target, we propose MobiTive, whose functionality could be divided into two main parts (i.e., parts of server side and mobile side), as shown in Fig. 1. The first part of our system contains feature preparation, DL model training, model migration and quantization. The second part is the deployment phase on mobile devices by using the migrated/quantized models.

In our previous work [23], we involved multiple features (i.e., manifest properties, API calls, and opcode sequences) extracted from decompiled apps. In this paper, to improve the performance of MobiTive, we propose a new feature extraction method. Instead of decompiling APK into source code, like smali code, we extract and vectorize the manifest properties and API calls from binary code directly (step ①). We combine a performance-based feature selection mechanism and behavior-based feature updating method to generate the feature dictionary (step ②). With the customized deep neural networks and extracted feature vectors (steps ③ and ④), the first part provides a trained DL model and a feature dictionary for the second part (step ⑤). To make the model adaptive to mobile devices, we then migrate the pre-built DL model to a TensorFlow Lite model. Also, a quantization phase [31], which is a general technique to reduce model size while also providing lower latency with little degradation in accuracy, is presented as a performance optimization for the mobile devices (step ⑥).
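As a rough illustration of the quantization idea referred to in this section (lowering numeric precision to shrink a model), the sketch below applies 8-bit affine quantization to a short list of weights in plain Python. It is not the TensorFlow Lite procedure used by MobiTive; the function names `quantize` and `dequantize` and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize(weights, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1] (affine scheme)."""
    qmax = (1 << num_bits) - 1
    lo, hi = min(weights), max(weights)
    # Fallback scale for a constant list; it only avoids a division by zero.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # Round to the nearest integer step and clip into the representable range.
    q = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(v - zero_point) * scale for v in q]


# Hypothetical sample weights, for illustration only.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
```

Each weight is then stored in one byte rather than the four bytes of a 32-bit float, at the cost of a reconstruction error on the order of one quantization step (`scale`).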


Fig. 1. Overview of MobiTive.

Fig. 1 shows that the second part loads the quantized DL model and feature dictionary into mobile devices. After that, when an application is downloaded from the official market or a third-party market, MobiTive can extract feature vectors from it and deliver the result to MobiTive (steps Ⓐ→Ⓒ). After predicting with the loaded DL model, we obtain a certain level of confidence based on predictive output to know whether the downloaded Android app is malware or not (steps Ⓓ→Ⓔ).

B. Feature Preparation

To determine the features used in MobiTive, we perform a comparison of the extracting performance for most commonly-used features in previous malware detection approaches [11], [13], [17], [32], [33]. Based on the performance-based feature selection method, manifest properties and API calls are selected in our device-end scenario (Feature Selection). Also, to update the feature dictionary and improve its representativeness, we propose a behavior-based method based on industrial malware reports (Feature Updating). To get the features from APK, we unzip the package instead of decompiling it to reduce time cost. Among the unzipped binary files, we can extract the features from raw data (Feature Extraction).

1) Feature Selection: Manifest properties such as used permissions, intents, and hardware features are widely-used features to detect Android malware [11], [13], [17]. AndroidManifest.xml file can be easily decoded from APK file through existing tools, which benefits the feature extraction procedure. It belongs to a lightweight feature type, which would be adopted by a performance-sensitive system like MobiTive. In terms of the usefulness, API calls are more representative and important feature types because almost all malicious behaviors would be demonstrated by API callings. Apart from the individual API call, sequence-based features such as API call sequences and opcode sequences may contain more semantics. However, the extraction procedure of these two feature types takes a lot of time due to analyzing source code or smali code. A novel feature extraction method for API calls is much-needed due to the energy limitation of mobile devices. Besides the above two widely-used feature types, we also evaluate other potential structural features by their performance behaviors, such as inter-procedural control-flow graph (ICFG) and call graph (CG). ICFG not only provides the control-flow graph but also contains the inter-procedural relations between the components within apps. CG represents the calling relations between different methods. By evaluating the performance (time cost) of all these potential feature types based on the two different extraction steps, which are raw data and feature vector extraction (steps ① and ③), we select unzipping APK to extract raw data first and select API calls and manifest properties as our feature types due to the better extraction performance compared with others (details in §IV-B).

To get the feature vectors (step ③), we build a feature dictionary (step ②) according to the two types of features selected by performance comparison. Specifically, we build the manifest property dictionary by following the official Android documentation. As shown in Table I, the manifest properties contain 613 features in total, including 324 used permissions, 213 intents, and 76 hardware features. In terms of the API call dictionary, we conduct a data-driven analysis to determine the feature lists. Specifically, by parsing the API calls from more than 60k real-world Android apps collected from Google Play Store and malware, we collect 2,989,011 unique API calls in total. We summarize three rules to reduce the size of API calls through manual analysis. Firstly, we remove the obfuscated API calls. Secondly, we delete the API calls that are not related to potential malicious behaviors, such as View loading API. Last, we remove the third-party API calls, because these API calls are customized within an app and may rarely appear in other apps. As shown in Table I, after pruning, the number of selected API calls is only 1,509. The details of feature lists can be found on our website [26]. We build a feature dictionary based on the 1,509 API calls and 613 manifest properties for matching on the features of permission, intent, hardware, and API calls (step ②).

TABLE I. SELECTED FEATURES

2) Feature Updating: The quality of machine learning-based detection approaches highly depends on the selected features, which means that a more comprehensive feature coverage of malicious behaviors benefits MobiTive more. To enrich the feature coverage of malicious behaviors,


TABLE II TABLE IV
D EEP N EURAL N ETWORK A RCHITECTURE : GRU AND LSTM D EEP N EURAL N ETWORK A RCHITECTURE :
B IDIRECTIONAL GRU AND LSTM

TABLE III
D EEP N EURAL N ETWORK A RCHITECTURE : TABLE V
S TACKED GRU AND LSTM D EEP N EURAL N ETWORK A RCHITECTURE : CNN

we collect hundreds of industrial malware reports from


Symantec Threats [25]. With the collected text-based reports,
we perform a manual analysis and summarize 23 kinds of
basic potential malicious behaviors as a supplement for the
selected features. Note that, the malware reports detail the
malicious behaviors and the core code level features, including
both API calls and manifest properties. Also, a behavior-based
feature understanding and verification by three co-authors are we present seven widely-used networks to train the classifier,
performed to ensure the manual results. As a result, except with the input feature vectors generated by step . 3 As shown
the features in the original feature dictionary, there are 46 new in Table II, III and IV, we customize the RNN models
API calls and 12 new manifest properties in total, which are to adopt the device-end scenario and improve performance.
updated for a new feature dictionary. We also extend the new For simple RNNs in Table II, the first computational layer
API calls with their package name. For example, if a new API is a LSTM/GRU layer with 128 neural units. After the
call has package name as “android/net/Uri”, we extract all the computation, the dimension of input tensor will reduce to
API calls under this package. As shown in Table I, there are 128 from (1, 2,915). Then, there will be a dropout layer,
781 API calls extended according to the 46 new API calls. the dropout rate is 0.5. At last, the result is passed to a softmax
Finally, we supplement our feature dictionary and update to classifier function to get the final training result. For stacked
2,290 API calls and 625 manifest properties. The new feature RNNs in Table III, there will two stacked LSTM/GRU with
dictionary is used to get the feature vector of each app for dropout layers instead. For bidirectional RNNs in Table IV,
model training. we apply a bidirectional LSTM/GRU layer instead of the
3) Feature Vector Extraction: As we mentioned in feature original LSTM/GRU layer.
selection, the traditional feature vector extraction methods take a lot of time due to the cost of decompiling and extracting from source code such as Java code and smali code. To improve the extraction performance, we propose a novel feature vector extraction method that works on binary code instead of source code. Specifically, by analyzing the inner architecture of the Dalvik binary file (classes.dex), we find that there exists an API table, which is used to match the executable symbols and API strings. We extract API calls by parsing the API table in the classes.dex file based on the address and offset defined in the metadata. Meanwhile, to get access to the information in the binary-format AndroidManifest.xml, we first generate a standard output with an XML decoder, Axmldec [34]. By analyzing the decoded manifest file, the manifest properties can be extracted.

C. DL Model Construction
1) DL Model Training: To discover the potential accuracy improvement and usability for different deep neural networks,

Moreover, we build the convolutional neural network (CNN) with reference to the conference version [23]. As shown in Table V, the first layer of the CNN model is a Zero Padding layer. The input feature vectors need to be fitted to the training part. Hence, we add two padding dimensions, which carry no meaning, to the end of the input since the kernel size of our convolutional layer is 3. Then, the resulting vector is reshaped to a matrix whose horizontal dimension is 3 and sent to the next layer. The second layer is the convolution layer with a kernel of size 3, which receives the embedded matrix as its input and applies the convolution filter to produce activation maps for each batch. Before delivering the batches to the hidden layer, a global max pooling is used after activation to reduce the dimensions. Finally, the vector is passed to a hidden fully connected layer, which is a multi-layer perceptron, for classification. To detect the relations within the result vector, we construct two sub-layers in the hidden layer, each of which contains a Rectified Linear Unit activation function. At last, the result from the hidden layer is passed to a softmax classifier function to get the final training result.
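As a rough illustration of the binary-level API extraction described in this section (parsing the API table of classes.dex through the offsets stored in its metadata), the sketch below reads the method table of a DEX buffer directly. It is not the Dex2jar-based parser used in MobiTive: the header offsets follow the public DEX format, while the function names and the `Lclass;->method` output convention are assumptions made for illustration, and MUTF-8 strings are decoded as plain UTF-8 for simplicity.

```python
import struct


def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 integer at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def read_string(buf, string_ids_off, idx):
    """Resolve a string index through the string_ids table."""
    (data_off,) = struct.unpack_from("<I", buf, string_ids_off + 4 * idx)
    _, pos = read_uleb128(buf, data_off)      # skip the UTF-16 length prefix
    end = buf.index(b"\x00", pos)             # string data is NUL-terminated
    return buf[pos:end].decode("utf-8", errors="replace")


def extract_api_calls(dex):
    """Return one 'Lclass;->method' string per entry of the method_ids table."""
    (string_ids_off,) = struct.unpack_from("<I", dex, 0x3C)
    (type_ids_off,) = struct.unpack_from("<I", dex, 0x44)
    method_ids_size, method_ids_off = struct.unpack_from("<II", dex, 0x58)
    calls = []
    for i in range(method_ids_size):
        # method_id_item: class_idx (u16), proto_idx (u16), name_idx (u32)
        class_idx, _proto_idx, name_idx = struct.unpack_from(
            "<HHI", dex, method_ids_off + 8 * i)
        # type_id_item: descriptor_idx (u32) pointing into string_ids
        (descriptor_idx,) = struct.unpack_from(
            "<I", dex, type_ids_off + 4 * class_idx)
        calls.append(read_string(dex, string_ids_off, descriptor_idx)
                     + "->" + read_string(dex, string_ids_off, name_idx))
    return calls
```

No decompilation step is involved: the method table is reached with a handful of fixed-offset reads, which is consistent with the short extraction times reported for the binary route.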

Authorized licensed use limited to: Northeastern University. Downloaded on January 18,2021 at 13:38:12 UTC from IEEE Xplore. Restrictions apply.

2) DL Model Migration and Quantization: To deploy our pre-trained DL model on mobile devices, we convert and migrate the pre-trained model to a TensorFlow Lite model, which is supported by the Android operating system (step 6). Specifically, we migrate the TensorFlow model to a mobile-readable TensorFlow Lite model with the TensorFlow Lite converter [24]. Apart from the model migration, we also quantize our pre-trained model to improve the performance on mobile devices, which does not affect the detection accuracy much. In the experiments, we measure how accuracy and time cost are affected by the model migration and quantization (details in §IV-C.3).

D. Real-Time Detection System
Before conducting real-time detection, the quantized TensorFlow Lite model and the feature dictionary should be deployed to the detection system in advance (step 7). There are three main steps before completing the prediction. The first step of MobiTive is feature preparation. When an APK file is received in step A, MobiTive first unzips it into its original files such as AndroidManifest.xml, classes.dex, and other resources. Features of API calls and manifest properties are then extracted accordingly. We implement an API parser to extract the API calls from classes.dex directly, based on our understanding of Dalvik binary code. Since the raw binary AndroidManifest.xml cannot be analyzed directly, we use a third-party decoder library, AXML [35], to get the decoded manifest file. By analyzing the decoded manifest file, the three kinds of manifest properties are extracted from the XML tags. Hence, we obtain both the manifest property vector and the API call vector in step B. Both types of features are transformed into vectors, and we concatenate them as the input of the TensorFlow Lite model (step C). With the quantized model, MobiTive performs the prediction in step D and shows the final prediction result as feedback in step E. With the prediction result, the system can raise a warning to help users block the installation of the detected malware and further save its information (e.g., name, version, checksum) in a local database. Besides the actions on local devices, reporting the malicious applications' information to the corresponding market and synchronizing the malware information to the updating server are two further options.
To deploy an update for MobiTive in practice, the service provider first needs to collect the newly detected malware and update the training dataset. After updating, a new pre-trained model can be obtained on the server. Then, the new model can be packed as a system patch and deployed to devices directly within an update. As a result, the updated system improves the effectiveness and robustness of the on-device protection based on the newly delivered model.
More implementation details: The AXML version used in MobiTive is v1.0.1. The API parser used in MobiTive on Android devices is implemented based on Dex2jar [36] (2.1-nightly-28). Unlike the original Dex2jar project, we do not decompile the Dalvik executable files (i.e., .dex files) back into .smali files or .class files. Instead, we only use the binary formatting functions in Dex2jar and collect the API calls from the decoded API table. The API parser serves as an external library file in MobiTive. Technically, the classification functionality of MobiTive on Android devices consists of 3 main parts. Different from the well-established high-level API provided in Keras (2.2.4), the basic data structure used in the computation with TensorFlow Lite (0.0.0-nightly) on Android devices is the bytebuffer. Thus, first, there is a step to convert the input vector and the model into bytebuffer format. Second, by loading the model into a TensorFlow Lite interpreter, we can feed the input bytebuffer into the interpreter and get the result matrix. At last, by applying an argmax function to the result matrix, the final prediction result can be obtained.

IV. EXPERIMENTS

In this section, our experiments are organized into three subsections based on the model deployment environments (i.e., PC/server and mobile). First, the goals of our experiments on PC/server are to investigate: (1) the extraction time of different raw data (techniques) and feature types; (2) the effectiveness of the behavior-based feature updating method; (3) the detection accuracy of different deep neural networks; (4) the comparison with other existing learning-based Android malware detection solutions; (5) the accuracy of multi-class classification on malware families. Second, based on the observed findings and obtained results, we further evaluate: (1) the performance of feature preparation on six different real devices with six different app sizes from 5MB to 50MB; (2) the efficiency of detecting with different RNN models on real devices; (3) the usability (i.e., performance and accuracy) of MobiTive on six different real mobile devices; (4) the efficiency of MobiTive compared to dynamic behavior-based run-time detection systems. In the end, we conduct a study on the hardware performance trend of Android mobile devices to provide insights into the future usability of MobiTive.

A. Experiment Environment
The experiments on the server side are run on an Ubuntu 16.04 server with two Intel Xeon E5-2699 V3 CPUs, 192GB RAM, and an NVIDIA GeForce 2080Ti GPU. To evaluate the performance and accuracy of our approach on real mobile devices, we select 6 different Android mobile devices. Among them, there are four common specification devices (Nexus 6P, Huawei Mate 10, HTC U11, and LG G6), a flagship device (Huawei P30), and a low-profile device (Samsung Galaxy J7 Pro) (detailed specifications provided on our website [26]). The implementation language of our system on the server is Python 3. To get access to the raw data and features, we use seven different existing tools, which are axmldec [34], AXML [35], ApkTool [37], AndroGuard [38], Dex2jar [36], Soot [39], and FlowDroid [9]. axmldec is a C++ project which can be used to decode a binary manifest file into a readable XML file. AXML is a library designed to parse binary Android XML files. It is written in Java and can be used in an Android app as an external library. ApkTool is a tool for reverse engineering, which can decompile the APK file and generate the resources, including the manifest, smali files, etc. AndroGuard is a Python tool, which can not only decode the resources but also disassemble bytecode to Java code. Also, with the help of AndroGuard, we can easily generate the call graph (CG) and data-flow graph for an Android app. Dex2jar is a project which contains tools to work with Android .dex and Java .class files. Soot is a Java optimization framework, which can be


used to extract the call graph (CG). FlowDroid is a static taint analysis tool for Android apps. It is applied to generate the inter-procedural control-flow graph (ICFG). Apart from the above existing tools for feature extraction, JitPack is a novel package repository for JVM and Android projects, which can build a project into ready-to-use artifacts (i.e., jar and aar). The deep neural networks and training projects are implemented with Keras [40], Numpy, Scikit-learn [41], and TensorFlow libraries [42].

B. Effectiveness of Feature Extraction, Feature Updating, Feature Category Selection, and Neural Network Selection

Fig. 2. Extraction time of different raw data types.

1) Performance Evaluation of Feature Extraction: In this experiment, we split the feature extraction time into two parts (i.e., APK→raw data→feature), following the technical procedures in the feature preparation phase, to show the performance advantages of our selected features.
a) Dataset: To mitigate the influence of app size on the time cost of feature extraction, we randomly collect 60 Android apps across 6 different sizes (i.e., 5MB, 10MB, 20MB, 30MB, 40MB, and 50MB) to provide a clear performance comparison between different extraction methods, such as feature extraction from source code and from binary code.
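A measurement of this kind can be organized as a small harness that times one extraction routine per app and averages the timings within each size group. The sketch below is only an assumed setup, not the authors' benchmarking code; `mean_time_by_size` and the `(size_mb, payload)` input shape are hypothetical names chosen for illustration.

```python
import time
from collections import defaultdict


def mean_time_by_size(apps, extract):
    """Time extract(payload) for each (size_mb, payload) pair in apps.

    Returns {size_mb: mean elapsed seconds}, ordered by size.
    """
    samples = defaultdict(list)
    for size_mb, payload in apps:
        start = time.perf_counter()
        extract(payload)                      # routine under test
        samples[size_mb].append(time.perf_counter() - start)
    return {size: sum(ts) / len(ts) for size, ts in sorted(samples.items())}
```

Passing a different `extract` callable (decompiling, unzipping, graph construction) over the same apps keeps the comparison between extraction methods like for like.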
b) Setup: In this experiment, we first evaluate the extraction time (APK→raw data) of 3 different raw data types that are widely used in existing static-analysis-based malware detection work (i.e., ICFG extracted by FlowDroid, CG extracted by Soot and AndroGuard, and decompiled files obtained by ApkTool), together with our selected extraction method (i.e., binary code obtained by unzipping).
Secondly, apart from the above raw data extraction methods, we further evaluate the extraction performance (raw data→feature) of 3 different feature types (i.e., manifest properties, API calls, and opcode sequence [22]) that are generated from two kinds of raw data (i.e., decompiled and unzipped). We do not further evaluate the graph-related features due to the large time cost of raw data extraction. For the decompiled manifest and smali files, we use an XML tag parser to extract manifest properties from the manifest file decompiled by ApkTool. To extract API calls, we obtain the result by matching the API call dictionary against the smali files directly. We extract the opcode sequences for each smali file by matching it against the opcode list [43]. For the unzipped binary manifest and Dalvik binary files, we evaluate the extraction time of 2 different feature types (i.e., manifest properties and API calls). To extract manifest properties from the binary manifest file, we apply Axmldec. We extract API calls by loading the API table directly with the offset and size defined in the metadata of the Dalvik binary file.

Fig. 3. Extraction time of different feature types.

c) Results: We present the results from 2 aspects (APK→raw data and raw data→feature) as below.
(1) Raw data extraction: Fig. 2 shows that the extraction time of ICFG and CG is too large to be acceptable for a performance-sensitive approach like MobiTive. Specifically, extracting ICFG via FlowDroid takes 196.868 seconds on 50MB apps and even 17.963 seconds on 5MB apps on average. Generating CG with Soot takes 126 seconds on 50MB apps and 57 seconds on 5MB apps, and it costs 32.13 seconds and 3.498 seconds, respectively, when prepared with AndroGuard. AndroGuard achieves a better performance than Soot on CG extraction. As for MobiTive, the detection should be performed within a responsive period compared to the app installation time; users will not accept it if the reaction time is too long. The extraction time cost of decompiling, which was applied in our previous work [23], is acceptable but also limited when compared with the average app installation time. Considering the time cost of extracting raw data by unzipping, 5MB apps take only 0.037 seconds and 50MB apps take 0.594 seconds, which is much better than the processing time of decompiling. Therefore, we decide to use unzipping as our raw data extraction method.
(2) Feature extraction: Fig. 3 shows that the time consumption of features extracted from decompiled files is much longer than that of the same features generated directly from unzipped binary files. Specifically, in terms of the time cost of API calls, extracting them from 5MB apps takes only 0.042 seconds on average. However, if we extract the API calls from decompiled smali files, it takes 2.923 seconds. For the 50MB apps, extracting the API calls costs 0.601 and 5.002 seconds, respectively. Considering the extraction time of manifest properties, 5MB and 50MB apps take 2.89 and 4.841 seconds when we extract the manifest properties from the manifest file decoded by ApkTool. When we extract them from the unzipped binary manifest file, the time is reduced to 0.041 and 0.599 seconds. Apart from manifest properties and API calls, we find that the extraction time of opcode sequence is much larger than that of the other two feature types. For 50MB apps, it takes over 6 seconds on average. Therefore, to improve the performance of MobiTive in feature extraction, we decide to use the two feature types with shorter extraction time as our model inputs.
2) Accuracy Evaluation of Behavior-Based Feature Updating Method: In this experiment, we evaluate the effectiveness of the behavior-based feature updating method presented in


§III-B.2 by comparing the results between the features used in our previous work [23] (MobiDroid) and MobiTive.

TABLE VI. USED DATASET
TABLE VII. DETECTION RESULTS OF FEATURE UPDATING
TABLE VIII. DETECTION RESULTS OF FEATURE CATEGORIES AND NETWORKS

a) Dataset: As shown in Table VI, we collect more than 70k Android apps in total as our evaluation subject. Specifically, these apps consist of 29,010 Android malware, and the others are benign apps crawled from Google Play Store. However, there might be malware on the official market. To filter the potential malware as far as possible, we upload them to VirusTotal [44], which is an online antivirus service with over 60 security scanners, for verification. The 29,010 malicious samples contain 5,560 apps downloaded from Drebin [11], 1,260 apps validated in the Genome project [3], 20,000 crawled from VirusShare, and the remaining ones used in KuafuDet [17], including 360 from the Contagio Mobile Website [45] and 1,830 from Pwnzen Infotech Inc. [46]. In summary, we collect a large-scale dataset of benign and malicious samples for the following experiments. Since our dataset comes from multiple sources, there are a lot of duplicated samples. Therefore, we perform a hash check to eliminate redundant apps among malicious and benign apps. During the data preprocessing, which has raw data decompiling and feature vector generation steps, we encounter some failed cases due to the capabilities of the API parser. The rest of the failures are caused by broken APK packages, which we also remove directly. As a result, we choose 18,000 benign and 18,000 malicious samples from our dataset to conduct the following experiments. In the training stage, we divide these 18,000 malware and 18,000 benign apps into three parts: 80% of them are configured as training data, and the other 20% are equally split into validation and testing data.
b) Setup: Because our previous work MobiDroid [23] applied three types of features (i.e., API calls, manifest properties, and opcode sequence), in this experiment we take API calls and manifest properties (i.e., 1,509 and 613), which our behavior-based feature updating method may benefit, as the features of MobiDroid to reveal the improvement on detection. Meanwhile, the updated version used in MobiTive has 2,290 API calls and 625 manifest properties. For each feature version, we apply three kinds of deep neural networks, which are presented in §III-C.1, to investigate whether the feature updating method can improve the accuracy of our system.
c) Results: In Table VII, the accuracy of the updated feature version on CNN, LSTM and GRU is 95.11%, 96.56% and 96.75%. Compared to the previous results, there is around 1%∼5% improvement after feature updating. Therefore, based on the result, we accept the updated features summarised from potential malicious behaviors as a part of our input feature set.
3) Accuracy Evaluation of Feature Category Selection and Deep Neural Network Selection: In this experiment, to find out the correlation between selected features and the effectiveness of different deep neural networks on detection accuracy, we first evaluate the effect of the two newly-updated feature categories (Table I) on detection accuracy separately. Second, we investigate the effect of the computational architecture in different deep neural networks on detection accuracy.
a) Dataset: The dataset configuration used in this experiment is the same as in §IV-B.2.a (Table VI).
b) Setup: To find out the correlation between the two selected features (i.e., manifest properties and API calls), we investigate their corresponding accuracy by accepting both single and combined feature categories as the input of the same neural network with the same training data configuration. To determine the best deep neural network, we evaluate seven widely-used neural networks by using the two combined feature categories.
c) Results: We present the results from 2 aspects (feature category selection and network selection) as below.
(1) Feature selection: As shown in Table VIII, the accuracy of the three CNN models is 79.89%, 93.17% and 95.11%. By comparing the accuracy of feature categories, we decide to use manifest properties and API calls together as an input bundle in our approach since the input with two feature types has the best result.
(2) Network selection: In general, RNN models achieve a better accuracy than CNN models. A possible reason is that an RNN has an internal state (memory), which can also take the correlation between the different feature positions into consideration. In the training stage, this internal state enables the RNN to retain potentially related feature positions over the long term and finally keep the most relevant ones. However, CNN considers every feature position individually in training. In terms of RNN models, GRU and bidirectional GRU achieve a similar accuracy (96.75% vs. 96.78%), which is better than the other RNN models' accuracy. They also have a better recall than precision (96.78% vs. 96.72% for GRU and 97.00% vs. 96.57% for bidirectional GRU). Besides, we also compare the size of the original pre-trained model with the quantized and non-quantized models. In Fig. 4, we can find that the size of the original pre-trained model reduces 3 times on RNNs and 5 times on


CNN by migrating it to the TensorFlow Lite model. In the experiment, whether the quantization configuration is enabled or not, the migrated model size for both RNN and CNN models remains unchanged.

Fig. 4. Comparison of model size changes on migration and quantization.

TABLE IX. COMPARISON OF MOBITIVE AGAINST EXISTING APPROACHES

4) Comparison Between the Existing Learning-Based Android Malware Detection Systems and MobiTive: In this experiment, we evaluate MobiTive together with several existing learning-based Android malware detection systems in two directions, effectiveness and efficiency.
a) Dataset: The dataset configuration used in this experiment is the same as in §IV-B.2.a (Table VI).
b) Setup: We briefly compare MobiTive with 4 open-source learning-based Android malware detection approaches, which apply different types of features (i.e., vector, sequence, and graph) as their inputs. There are three reasons why we select these four approaches. By conducting a study on the corresponding literature published in recent years, we first survey them on the feature types, and then select one representative work from each organized column. Further, by searching on GitHub and sending emails to the authors, we obtain the source code of these four approaches and further evaluate them with our dataset to provide a more concrete comparison. Since one of our basic concepts in this paper is balancing performance and accuracy to satisfy users' real usage, we not only evaluate the accuracy, but also compare the average feature extraction time for each approach on our dataset.
c) Results: As shown in Table IX, compared to McLaughlin et al. [33], MalDozer [47], and Apk2Vec [48], there are obvious improvements in both the accuracy and the feature extraction time cost on PC. Considering the approaches based on sequential features, the accuracy of MobiTive is higher than that of McLaughlin et al. [33] and MalDozer (96.75% vs. 94.79% and 96.25%), and the feature extraction time is almost 100 and 70 times shorter than theirs (0.051 vs. 5.065 and 3.515 seconds). Considering our previous work, MobiDroid [23], the accuracy of MobiTive is a little lower (96.75% vs. 96.87%); however, with a tiny decrease of 0.12% in accuracy, the feature extraction time of MobiTive is about 100 times shorter than that of MobiDroid (0.051 seconds vs. 5.892 seconds).
5) Accuracy Evaluation of Multi-Class Classification: In this experiment, we evaluate the effectiveness of MobiTive on predicting malware of different virus types such as Spyware and Trojan.

TABLE X. DETECTION RESULT WITH MULTI-CLASS DATASET

a) Dataset: We collect 70,130 Android applications from VirusShare as shown in Table X and classify them with VirusTotal [44], which is an online detection platform, to retrieve their types as our ground truth in multi-class classification. Finally, we have 21 virus labels for these Android malware, which fall into 5 types (i.e., Adware, Spyware, Riskware, Trojan, and File-Infector). In the training stage, we adopt the same data split as in binary classification (i.e., 80%, 10% and 10%) as the train/validate/test split.
b) Setup: To evaluate the effectiveness of classifying malware into different virus types, we train a multi-class malware classifier on our collected dataset in Table X with the determined deep neural network (i.e., GRU).
c) Results: As shown in Table X, we reach an overall accuracy of 94.45% in the multi-class malware classification task. Considering the families individually, on each of the two Riskware families, MobiTive achieves a perfect prediction, with an accuracy of 100.00%. The detection accuracy of the malware families cnzz and commplat in the File-Infector type also reaches 100.00%, and the other family, domob, has an accuracy of 93.33%. Each of the malware families in the Trojan type achieves an accuracy above 95% (i.e., 95.92%, 97.28%, 97.47%, and 100.00%). For the only spyware family, the accuracy reaches 99.74% among 391 test malware applications. For Adware families, with our selected features, MobiTive achieves an accuracy above 95.00% in detecting anydown, baiduprotect, feiwo, fictus, gappusin, and leadbolt (i.e., 100.00%, 98.35%, 95.86%, 96.53%, 96.16%, and 96.80%); however, it fails to provide a dependable prediction result on admogo, adwo, dowgin, kuguo, and kyview (i.e., 84.38%, 86.08%, 78.88%, 88.83%, 79.72%).
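Overall and per-family figures of this kind can be derived from the ground-truth and predicted labels of the test set. The sketch below is a minimal illustration under an assumption: the paper does not spell out how per-family accuracy is computed, so it is taken here as the fraction of each family's test samples that receive the correct label. The function name and the toy labels in the usage note are hypothetical.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, {label: fraction of that label predicted correctly})."""
    total = Counter(y_true)
    correct = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct[truth] += 1
    per_class = {label: correct[label] / count for label, count in total.items()}
    overall = sum(correct.values()) / len(y_true)
    return overall, per_class
```

For example, with the toy labels `["adware", "adware", "trojan"]` as ground truth and `["adware", "trojan", "trojan"]` as predictions, the adware family scores 0.5 while the trojan family scores 1.0, which shows how a high overall accuracy can coexist with weak results on individual families.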
Fig. 5. Unzipping time on different mobile devices.

Remark: To address the high latency during feature preparation, we find that extracting API calls and manifest properties from the unzipped Dalvik binary and binary manifest files costs less than 1 second. In validating the effect of our newly supplemented features, we find that the RNNs improve by over 1% in accuracy, and the accuracy of CNN increases by 5%. Meanwhile, by comparing the results on different feature categories and deep neural networks, we find that (1) the input combining two feature types has a much better result than a single feature type; (2) on average, the RNN models have a better result than CNN. GRU models have a better accuracy than the LSTM models on our dataset. Moreover, compared with 4 existing approaches, MobiTive achieves a better performance with a considerable detection accuracy. In validating the effect on multi-class classification, we find that MobiTive can efficiently handle most malware families (i.e., 17/21 obtain an accuracy larger than 95%).
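The speedup factors quoted in the comparison with existing approaches can be checked directly from the reported average feature extraction times. The short calculation below uses only the numbers given in the text (0.051 seconds for MobiTive; 5.065, 3.515, and 5.892 seconds for the baselines); the variable names are illustrative.

```python
# Average feature extraction time per app on PC, in seconds (values from the text).
mobitive_time = 0.051
baseline_times = {
    "McLaughlin et al.": 5.065,
    "MalDozer": 3.515,
    "MobiDroid": 5.892,
}

# Speedup = baseline time divided by MobiTive time.
speedups = {name: t / mobitive_time for name, t in baseline_times.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x slower than MobiTive")
```

The ratios come out at roughly 99, 69, and 116, in line with the "almost 100 and 70 times" and "about 100 times" statements.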

C. Effectiveness Evaluation of MobiTive on Mobile Devices


Fig. 6. Extraction time on different mobile devices.

1) Performance Evaluation of Feature Preparation on Real Devices: We evaluate the performance of feature preparation on real mobile devices in this experiment. The time cost of the feature preparation step includes unzipping time and feature extraction time.
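Since an APK is a ZIP archive, the unzipping part of feature preparation amounts to reading two entries from it. The sketch below shows the idea with Python's standard `zipfile` module and a simple timer. It is an illustrative stand-in for the on-device (Java) implementation, the function name is hypothetical, and it assumes a single `classes.dex` (multidex files such as `classes2.dex` are not handled).

```python
import io
import time
import zipfile

# Entries needed for feature extraction (single-dex assumption).
WANTED = ("AndroidManifest.xml", "classes.dex")


def unzip_raw_data(apk_bytes, wanted=WANTED):
    """Return ({entry name: raw bytes}, elapsed seconds) for the wanted APK entries."""
    start = time.perf_counter()
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        present = set(apk.namelist())
        raw = {name: apk.read(name) for name in wanted if name in present}
    return raw, time.perf_counter() - start
```

Only the two needed entries are decompressed, so resources that dominate the size of large apps do not have to be read in full.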
a) Dataset: The dataset configuration used in this experiment is the same as in §IV-B.1.a.
b) Setup: We first evaluate the performance of raw data extraction by investigating the time cost of unzipping applications of 6 different sizes on 6 different real Android devices. Second, with the extracted raw data (i.e., binary manifest file and Dalvik executable file), we further evaluate the performance of feature extraction by investigating the time cost of extracting the features from raw data on the devices.
c) Results: We present the results from 2 aspects (unzipping time and feature extraction time) as below.
(1) Unzipping time evaluation on real devices: Fig. 5 shows that the average unzipping time of 50MB apps on common specification devices (Huawei Mate 10, HTC U11, Nexus 6P, LG G6) lies between 1.023 and 2.586 seconds. For 5MB apps, it lies between 0.119 and 0.264 seconds. On the low-profile device (Samsung Galaxy J7 Pro), the unzipping times of 5MB and 50MB apps are 0.261 and 3.918 seconds. On the flagship device (Huawei P30), they are limited to less than 0.6 second. For 5MB apps, it only takes 0.046 seconds.
(2) Feature extraction time evaluation on real devices: Apart from the performance evaluation of unzipping time, we further evaluate the feature extraction time. To extract the API calls, we package our API call parser used on the server side into a jar with the help of JitPack. The API call parser is used to extract API calls from binary code. Since the XML decoder (axmldec) used on the server is implemented in C++, we apply AXML as a library to extract the manifest properties, which is more suitable on the mobile side. Fig. 6 shows that the average feature extraction time of 50MB apps on common specification devices lies between 2.089 and 6.216 seconds. For 5MB apps, it lies between 0.146 and 0.22 seconds. On the low-profile device (Samsung Galaxy J7 Pro), the extraction times of 5MB and 50MB apps are 0.516 and 5.452 seconds. On the flagship device (Huawei P30), they are limited to less than 0.49 second. For 5MB apps, it only takes 0.092 second, which is very fast.
2) Performance Evaluation of RNN Models on Real Device: Besides analyzing the performance of unzipping and extracting features, in this experiment, we further evaluate the efficiency of prediction with different RNN models on real devices.
a) Dataset: To make sure that the test accuracy is comparable to the results obtained on the server, the testing data used in this mobile-end experiment is the same as the testing data generated by the data split function mentioned in §IV-B.2.a (Table VI), including 1,800 malware and 1,800 benign samples. To remove the influence of the feature preparation phase from this performance evaluation of RNN models, we directly use a set of feature vectors extracted from the testing data as the input.
b) Setup: We first convert and migrate the RNN models obtained in §IV-B.3 (i.e., simple RNN LSTM/GRU, stacked LSTM/GRU, and bidirectional LSTM/GRU) to TensorFlow Lite models and further deploy them on a real device (e.g., Huawei P30). Secondly, to provide an insight into both the prediction accuracy and performance, we investigate the prediction time for each model with our dataset and organize the result together with the accuracy obtained in §IV-B.3 (Table VIII).
c) Results: As shown in Fig. 7, by comparing the prediction time of different RNN models, which is presented in the histogram, we can see that the pre-trained model with GRU has the best performance among them. Meanwhile, from the grey accuracy line in this figure, we can see that it has the second highest accuracy among them, with only a small difference compared to the accuracy of bidirectional GRU (96.75% vs. 96.78%). Considering all situations, we select the GRU model


to evaluate the performance and accuracy of our approach on the real mobile devices.

TABLE XI. RUN-TIME PERFORMANCE COMPARISON OF MOBITIVE AGAINST DYNAMIC ANDROID ANALYSIS TOOL

Fig. 7. The accuracy and prediction time of RNN models on Huawei P30.

3) Accuracy and Prediction Time on Different Real Devices: In this experiment, we evaluate the effectiveness of MobiTive on real Android mobile devices by conducting a comparison experiment on test accuracy and the total prediction time on real devices.
a) Dataset: To make sure that the test accuracy is comparable to the results obtained on the server, the testing data used in this mobile-end experiment is the same as the testing data generated by the data split function mentioned in §IV-B.2.a (Table VI), including 1,800 malware and 1,800 benign samples.
b) Setup: We first convert and migrate the GRU model obtained in §IV-B.3 to quantized/non-quantized TensorFlow Lite models and deploy them on real devices. Second, we record the average prediction time and detection accuracy by testing the quantized/non-quantized GRU models. Note that, to provide an insight into the performance of each processing phase, we record the time cost in 3 parts (i.e., raw data unzipping, feature extraction, and prediction).
c) Results: With the GRU model obtained in §IV-B.3 (accuracy: 96.75%), in Table XII, by comparing the accuracy of non-quantized and quantized models, we find that the accuracy of the quantized model is almost equal to that of the non-quantized model for RNN (GRU). However, by comparing their prediction time, we find that the performance of predicting with the quantized model is a little better than with the non-quantized model. In this experiment, the result shows that the difference in prediction time brought by the quantization technique is less than 0.01 microseconds. As a result of the current inadequate support for operators in TensorFlow Lite, the structure of the currently applied deep neural networks is relatively simple. However, with a more complicated neural network, the quantization technique will definitely provide a performance boost during the prediction phase [49].
By adding the unzipping, analyzing, and prediction time together, the total time is always acceptable for mobile users (i.e., less than 3 seconds on average, less than 1 second in best practice). By comparing the specifications of the devices used in our experiment with the summarized devices' specifications (details on our website [26]), we find that the performance benchmark results of most newly released devices are better

we can see that the detection time only costs less than 1% of the total time on common specification devices. Thus, reducing the time cost in feature preparation will bring a considerable performance improvement for our detection system. It is also a strong motivation for us. Additionally, as a result of the installation mechanism, the downloaded Android APK is always unzipped by the Android operating system. Thus, there is a common step between our approach and the installation procedure, which is extracting the same raw data from the target APK. If we can deploy our approach on the Android operating system framework directly, the time cost of the unzipping step in our approach will be saved. It is a new research point of this field in the future.
4) Performance Comparison Between MobiTive and Dynamic Run-Time Detection: In this experiment, we evaluate and discuss the efficiency of MobiTive against another tool, which is based on dynamic behavior analysis.
a) Setup: To evaluate the run-time performance of MobiTive against another tool that applies dynamic behavior analysis as the baseline technique, we select Inspeckage [50], which is a state-of-the-art tool developed to offer dynamic analysis for Android applications, as the target system in this experiment. We investigate the run-time performance cost of both Inspeckage and MobiTive on three aspects (i.e., CPU, memory, energy) with the help of Android Profiler [51].
b) Results: In Table XI, the result shows that when detecting a single application successfully in the same limited time period, the CPU usage and energy consumption of Inspeckage look better than those of MobiTive (i.e., CPU: P40% vs. P10% and Q_M vs. Q_L), but its average memory usage is 70MB larger than ours (i.e., 60 vs. 130MB). More importantly, the differences between Inspeckage and MobiTive in their basic mechanisms mean that more factors need to be involved in this evaluation. First, the protection of MobiTive is only provided before the installation of the target application. In other words, every application only triggers MobiTive's detection once in each installation, which costs 0.46s on average. In contrast, the protection of Inspeckage is provided by monitoring and analyzing the behaviors of applications in real time; namely, it is attached to every application, whether running in the foreground or background. Meanwhile, unlike MobiTive, which can sleep after every detection, Inspeckage has to be kept running all the time if any applications are running. Thus,
than the common devices selected in our experiments. Thus, the CPU usage, memory usage and energy usage of Inspeckage
we can claim that the current mobile phones can support our will turn to be P10% × n, 130 × n and Q L × t × n, where t
off-line prediction system smoothly. is the execution time to finish one detection on an application
Composition of overall time analysis: Comparing the feature and n is the total number of protected applications running in
preparation and prediction time in common cases in Table XII. foreground/background. In conclusion, in common scenario of

Authorized licensed use limited to: Northeastern University. Downloaded on January 18,2021 at 13:38:12 UTC from IEEE Xplore. Restrictions apply.
1574 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 16, 2021

TABLE XII
A CCURACY AND P ERFORMANCE OF M OBI T IVE ON R EAL M OBILE D EVICES

Android devices’ usage, malware detection tool using dynamic


analysis, like the Inspeckage, will definitely lead to a higher
performance cost than MobiTive.
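The scaling argument above can be made concrete with a small sketch. It assumes, purely for illustration, that P10% and P40% denote roughly 10% and 40% CPU usage, takes the 60MB and 130MB memory figures quoted above, and follows the text's linear model in which the cost of Inspeckage grows with the number n of running protected applications. The names `Footprint`, `inspeckage_footprint`, and `break_even_apps` are hypothetical and belong to neither tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Resource usage of one detector instance (illustrative units)."""
    cpu_percent: float
    memory_mb: float


# Assumed readings of the figures quoted in the text (P40% / 60MB vs. P10% / 130MB).
MOBITIVE_PER_INSTALL = Footprint(cpu_percent=40.0, memory_mb=60.0)
INSPECKAGE_PER_APP = Footprint(cpu_percent=10.0, memory_mb=130.0)


def inspeckage_footprint(n_apps: int) -> Footprint:
    """Linear model from the text: per-app cost multiplied by n running apps."""
    if n_apps < 0:
        raise ValueError("n_apps must be non-negative")
    return Footprint(
        cpu_percent=INSPECKAGE_PER_APP.cpu_percent * n_apps,
        memory_mb=INSPECKAGE_PER_APP.memory_mb * n_apps,
    )


def break_even_apps() -> int:
    """Smallest n at which the modeled Inspeckage CPU usage exceeds MobiTive's."""
    n = 1
    while inspeckage_footprint(n).cpu_percent <= MOBITIVE_PER_INSTALL.cpu_percent:
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, inspeckage_footprint(n))
    print("CPU break-even at n =", break_even_apps())
```

Under these assumptions, continuous per-application monitoring overtakes MobiTive's one-time CPU cost once five protected applications are running, while its memory usage is already higher with a single application.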
Fig. 8. Benchmark score comparison among different chipsets from Exynos, Kirin, and Snapdragon.

Remark: Regarding the feature preparation and prediction performance on real devices, we find that (1) the feature preparation time is less than 4 seconds on average; (2) the prediction time of RNN models is less than 2ms on average. GRU costs 0.44ms with the best performance. Meanwhile, by comparing the results on six mobile devices, we find that MobiTive costs (1) less than 3 seconds on common devices and (2) less than 0.5 second on the flagship device. Finally, by comparing the run-time performance of MobiTive and Inspeckage, we find that MobiTive can serve users with a much more efficient experience as an on-device protection system.
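The three-part timing reported in this experiment (raw data unzipping, feature extraction, and prediction) could be recorded with a harness along the following lines. This is a minimal sketch, not MobiTive's implementation: the APK is a dummy ZIP archive built in memory, and `extract_features` and `predict` are hypothetical placeholders standing in for the real feature extractor and the TensorFlow Lite model.

```python
import io
import time
import zipfile
from typing import Callable, Dict, List, Tuple

# Entries read during the unzipping phase (an APK is a ZIP archive).
RAW_ENTRIES = ("AndroidManifest.xml", "classes.dex")


def build_dummy_apk() -> bytes:
    """Build a tiny in-memory ZIP that stands in for a real APK (assumption)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("AndroidManifest.xml", b"<manifest/>")
        archive.writestr("classes.dex", bytes(4096))
    return buffer.getvalue()


def unzip(apk_bytes: bytes) -> Dict[str, bytes]:
    """Phase 1: read the raw manifest and bytecode out of the archive."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        return {name: archive.read(name) for name in RAW_ENTRIES}


def extract_features(raw: Dict[str, bytes]) -> List[int]:
    """Phase 2 placeholder: a real extractor would parse manifest and API calls."""
    return [len(raw[name]) for name in RAW_ENTRIES]


def predict(features: List[int]) -> float:
    """Phase 3 placeholder: a real system would run the on-device model here."""
    return 0.0


def timed(phase: Callable, argument) -> Tuple[object, float]:
    """Run one phase and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = phase(argument)
    return result, time.perf_counter() - start


def profile_detection(apk_bytes: bytes) -> Dict[str, float]:
    """Record the time cost of each phase and their sum, in seconds."""
    timings: Dict[str, float] = {}
    raw, timings["unzip"] = timed(unzip, apk_bytes)
    features, timings["feature_extraction"] = timed(extract_features, raw)
    _, timings["prediction"] = timed(predict, features)
    timings["total"] = sum(timings.values())
    return timings


if __name__ == "__main__":
    for phase, seconds in profile_detection(build_dummy_apk()).items():
        print(f"{phase}: {seconds * 1000:.3f} ms")
```

Keeping the three phases as separate functions makes it possible to report where the time goes, which is how the remark above separates the feature preparation time from the prediction time.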

D. Analysis of Hardware Performance Evolution Trend of Android Mobile Devices

In this section, we conduct a study from three different aspects to investigate the hardware performance evolution trend of Android mobile devices.

To provide insights into the current and future usability of MobiTive, we study 45 widely-used chipsets, which were released between 2016 and 2019. They are collected from three well-known brands: Exynos, Kirin, and Snapdragon. We select 4–5 chipsets from each brand and compare them with 3 different kinds of benchmark test scores (i.e., Geekbench 4.4 64 Bit Single-Core score, Geekbench 4.4 64 Bit Multi-Core score, and Octane V2 total score). As shown in Fig. 8, we present the results along a timeline to reveal the fast evolution trend of the chipsets. From the polylines, which refer to the different score results, we can see that the performance of the chipsets has doubled during the past 5 years. We detail the full specifications of the chipsets on our website [26]. Besides the analysis of chipsets, we also investigate the clock frequency and RAM size of 167 Android mobile devices released in 2019, and present them on our website [26]. As shown in Fig. 9, the clock frequencies of newly released Android devices mostly fall within 2000 ∼ 2500MHz and 2500 ∼ 3000MHz, which correspond to common devices and flagship devices respectively. As shown in Fig. 10, the RAM sizes of newly released Android devices are mostly larger than 3GB. The mainstream RAM sizes are 4GB, 6GB, and 8GB, which account for around 72% of the whole specification data. By comparing the hardware performance evaluation results of real devices on chipsets and RAM with the six devices used in our experiments, we can tell that most current Android mobile phones can support MobiTive smoothly and achieve a responsive detection.

Fig. 9. Top 21 chipsets assembled in Android mobile phones released in 2019.

Fig. 10. RAM size of Android mobile devices released in 2019.

Remark: By collecting and analyzing the specifications of chipsets and devices, we find that the evolution trend of chipsets will provide better performance for MobiTive in the future. Meanwhile, the study of the specifications of devices released in 2019 shows that most new devices will perform no worse than our 4 selected common devices.

V. LIMITATIONS AND DISCUSSIONS

A. Feature Selection

As a result of the performance requirement of MobiTive, the limited selected feature categories (i.e., manifest properties


and API calls) surely will not cause large overhead when it is working on an Android device. However, the two limited feature types also provide limited information about Android malware. If a new malware family appears whose malicious behaviors cannot be represented by our selected feature types, MobiTive may not be able to detect it. In the future, we aim to add more effective feature types with low performance costs as well.

Meanwhile, it is very important to detect new malware families in practice. Actually, neither dynamic nor static methods can fully guarantee the validity of protection against new malware samples. A possible solution for detecting more new malware is to combine the two types of methods. In the future, we will try to improve the ability to detect new malware by designing a new adaptive method, which is also an open question for this community.

B. New Malware Family Detection

For any malware detection tool, there is no doubt that detecting new malware families in practice is a very important task. However, neither dynamic nor static methods can fully guarantee the validity of protection against new malware samples. For example, due to the limited training dataset, MobiTive would have a similar limitation to other static analysis based malware detection systems, which is different from the dynamic analysis approaches. Specifically, for a new malware family, the malicious features may be totally different from existing data. Consequently, as a result of the lack of knowledge, the trained classifier may not be able to make the right decision, although learning-based approaches sometimes have the ability to detect new malware variants. Therefore, in the future, we can make some efforts to improve the ability to detect new samples by combining various techniques. We will also try to improve the ability to detect new malware by designing new adaptive methods, which will also benefit the community by discovering possible techniques for solving this open question.

C. Against Adversarial Attack

Indeed, deep learning based systems (e.g., voice/image recognition) suffer from adversarial attacks [52]–[55], so maintaining the robustness of deep learning based systems has become a challenging topic. However, there are several differences between deep learning based systems for malware detection and for voice/image recognition. (1) First and most important, unlike in voice/image recognition, adversarial attacks in malware detection cannot easily break the entire functionality of the applications in practice, so the existing adversarial attacks against malware detection are always generated by manipulating the target malware application with un-triggered code snippets (e.g., dead code) instead of changing real functionalities [56]. Although it is possible to generate adversarial samples that evade the classifier and achieve a high misclassification rate, this is impractical so far, because such attacks can be easily detected by leveraging other techniques, such as static data flow analysis, to delete the features that attackers introduce by adding dead code. This is also evidenced by the lack of real adversarial malware samples in the existing research. (2) Second, different from malware detection approaches on other systems (e.g., Windows/Linux), our approach abstracts the entire Android app with a limited feature list instead of embedding the whole package program, so attackers have to manipulate their malware applications with our defined features to bypass MobiTive. In practice, attackers cannot obtain the accurate feature list easily. Meanwhile, considering that most of our selected features (i.e., manifest properties and API calls) are defined by the official/trustworthy third-party developers, it is almost impossible to bypass MobiTive as easily as deep learning based voice/image recognition systems under the restriction of maintaining the functionalities of the malware applications. All in all, adversarial attacks on deep learning based malware detection have domain-specific challenges compared with image/voice classification, which is also a new research direction and an open question.

D. Dynamic Behavior Analysis

According to our knowledge and an in-depth literature review, static analysis plays an important role in past and current cyber security research, and the number of research publications on Android malware detection based on static analysis is also larger than that based on dynamic analysis (static analysis [11], [18], [20], [32], [33], [57]–[65] vs. dynamic analysis [7], [66]–[70]). Indeed, on a specific given detection task, dynamic behavior analysis may achieve a more accurate result (e.g., lower false positives) than static analysis. However, several limitations, which restrict its applicability in specific scenarios, need to be discussed. (1) First and most important, the available scenarios of dynamic behavior analysis based malware detection systems are more limited, because the high cost in computational resources makes such systems unable to satisfy users’ requirements on performance and energy. For example, using performance counters [67] during program analysis is common in malware/bug detection tasks. However, unlike traditional Windows/Linux programs, Android applications have a more complicated HCI mechanism. In other words, generating good quality test benchmarks with good coverage of the corner cases is much more difficult than for programs on Windows/Linux. Even assuming we have the ability to obtain the benchmarks, the time cost of generating and executing them conflicts with the goal of satisfying users’ demand for efficiency. (2) Second, the detection efficiency depends highly on the coverage of the predefined behaviors. Namely, once the malicious behavior in the target malware is not specifically defined by the detection system, the security of the system is no longer guaranteed. (3) Third, different from MobiTive, a dynamic behavior analysis based system may suffer from its working mechanism (i.e., before installation vs. run-time). For example, social engineering based spyware can easily store private information on the device and trick the user into uploading it, because most users are not as professional as security researchers. In the end, given the diverse usage scenarios and targets, we think Android malware detection approaches based on dynamic behavior analysis and static analysis have their own advantages and weaknesses respectively, and both call for further research.

VI. RELATED WORK

Some techniques are proposed based on analyzing the XML files from the APK file. Huang et al. [71] classified the benign data and malware data using the permission information in the manifest and the file structure as features. Similarly,


Aung and Zaw [72] also considered the permission. Differently, they concentrate on the permission requests in the source code, not only the static information. Chin et al. [57] proposed ComDroid, which detects malware by analyzing the manifest file. There are also techniques which are based on the API [58]. Deshotels et al. [59] classified malware based on the API call frequency. Zhang et al. [60] developed DroidSIFT based on the API dependency graphs. Zhongyang et al. [61] introduced DroidAlarm, which analyzes the inter-procedural call graphs constructed by the relationship between permissions and the interface to identify attacks. Yan and Yin [7] proposed DroidScope, which generates semantic information from API call and Dalvik opcode traces. Wu et al. [62] proposed DroidMat to detect malware with API traces, intent, communication and some other life-cycle information.

Another line of malware research is conducted based on program analysis (e.g., control flow graph), which is more expensive than the XML-based and API-based approaches. However, the result tends to be more precise. Narayanan et al. [32] presented an online SVM classifier, which uses the control flow graph generated from the source code as input. Enck et al. [66] proposed TaintDroid, which is a taint analysis tool for Android apps. It detects leakages with data flow analysis on target sensitive data. Meng et al. [63] proposed a deterministic symbolic automaton (DSA) based detection system, in which DSA contains the corresponding components of the target app.

Machine learning has achieved great success in malware detection, and there also exist a lot of learning-based approaches. Arp et al. [11] proposed Drebin, which is a classifier using features from both XML files and API calls. Yuan et al. [20] provided Droid-detector, which is built on a deep belief network. Yu et al. [18] presented a malware detection system, which uses permission and API call traces as input. McLaughlin et al. [33] used a convolutional neural network for detection. The raw opcode sequences of target apps are used as the input feature. Kim et al. [64] presented a malware detection framework based on multiple neural networks. Every network has a single feature input and output score. The final detection result is a combination of all the models. Xu et al. [65] proposed DeepRefiner, which is an efficient two-layer malware detection system.

In addition, there are still some other techniques. Demontis et al. [73] proposed an algorithm to mitigate attacks like malware data manipulation. Bläsing et al. [74] introduced AASandbox, which performs detection with combined information from both static and dynamic analysis. Shabtai et al. [68] and Schmidt et al. [69] provided abnormality identification systems, which use run-time device information, such as CPU usage. Sun et al. [75] trained a machine learning based classifier, which uses the distance of keywords to detect malware. Lu et al. [76], Chan et al. [77], Lu et al. [78] and Wei et al. [79] focused on detecting vulnerable components, which may hijack the apps. Hao et al. [80] provided a malware detection system, DroidModss, which uses hash comparison to detect repacked APKs. Grace et al. [70] proposed RiskRanker, which performs detection via analyzing specific app behaviors.

Existing techniques mainly focused on detecting malware with the information from the APK or the source code on a server. However, with the rapid development of AI chips on devices, research about malware detection on the mobile side is still rare and in demand. Different from the existing techniques, MobiTive concentrates on using deep learning algorithms for malware detection according to various performance-based experiments on Android mobile devices.

Recently, Guo et al. [49] conducted an evaluation of the performance and robustness of models during the training and inference phases, which revealed potential compatibility issues on different platforms and frameworks. Thus, it is important to guarantee the quality of the malware detection system when it is deployed on mobile devices. A lot of deep learning techniques [81]–[87] have been proposed for testing deep learning frameworks. We leave the quality assurance and robustness analysis of our framework as future work.

VII. CONCLUSION

This paper presents MobiTive, a performance-sensitive Android malware detection system on mobile devices as a pre-installed solution. Owing to the effectiveness of the selected features and the efficiency of feature extraction, MobiTive can provide reliable detection accuracy and a fast, responsive (i.e., less than 3 seconds on average) detection service on mobile devices directly. To validate the efficiency and reliability, we evaluate MobiTive on six real mobile devices. To provide more insights into this work, we also make an in-depth analysis of the performance trend on over one hundred mobile phones.

REFERENCES

[1] S. Chen et al., “An empirical assessment of security risks of global Android banking apps,” in Proc. ACM/IEEE 42nd Int. Conf. Softw. Eng., Jun. 2020, pp. 1310–1322.
[2] S. Chen et al., “Are mobile banking apps secure? What can be improved?” in Proc. 26th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), 2018, pp. 797–802.
[3] Y. Zhou and X. Jiang, “Dissecting Android malware: Characterization and evolution,” in Proc. IEEE Symp. Secur. Privacy, May 2012, pp. 95–109.
[4] C. Tang et al., “A large-scale empirical study on industrial fake apps,” in Proc. IEEE/ACM 41st Int. Conf. Softw. Eng., Softw. Eng. Pract. (ICSE-SEIP), May 2019, pp. 183–192.
[5] Y. Zhou, Z. Wang, W. Zhou, and X. Jiang, “Hey, you, get off of my market: Detecting malicious apps in official and alternative Android markets,” in Proc. NDSS, 2012, pp. 50–52.
[6] W. Zhou, Y. Zhou, M. Grace, X. Jiang, and S. Zou, “Fast, scalable detection of ‘Piggybacked’ mobile applications,” in Proc. 3rd ACM Conf. Data Appl. Secur. Privacy (CODASPY), 2013, pp. 185–196.
[7] L. K. Yan and H. Yin, “Droidscope: Seamlessly reconstructing the OS and dalvik semantic views for dynamic Android malware analysis,” in Proc. USENIX Secur., 2012, pp. 569–584.
[8] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, “CopperDroid: Automatic reconstruction of Android malware behaviors,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2015, pp. 1–15.
[9] S. Arzt et al., “Flowdroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps,” in Proc. PLDI, 2014, pp. 259–269.
[10] L. Li et al., “IccTA: Detecting inter-component privacy leaks in Android apps,” in Proc. IEEE/ACM 37th IEEE Int. Conf. Softw. Eng., May 2015, pp. 280–291.
[11] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, “Drebin: Effective and explainable detection of Android malware in your pocket,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2014, pp. 23–26.
[12] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras, “Droidminer: Automated mining and characterization of fine-grained malicious behaviors in Android applications,” in Proc. ESORICS, 2014, pp. 163–182.
[13] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu, “StormDroid: A streaminglized machine learning-based system for detecting Android malware,” in Proc. 11th ACM Asia Conf. Comput. Commun. Secur. (ASIA CCS), 2016.
[14] E. Mariconti, L. Onwuzurike, P. Andriotis, E. De Cristofaro, G. Ross, and G. Stringhini, “MaMaDroid: Detecting Android malware by building Markov chains of behavioral models,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2017, pp. 1–34.


[15] S. Chen, M. Xue, and L. Xu, “Towards adversarial detection of mobile malware: Poster,” in Proc. 22nd Annu. Int. Conf. Mobile Comput. Netw., Oct. 2016, pp. 415–416.
[16] L. Fan, M. Xue, S. Chen, L. Xu, and H. Zhu, “POSTER: Accuracy vs. Time cost: Detecting Android malware through Pareto ensemble pruning,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2016, pp. 1748–1750.
[17] S. Chen et al., “Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach,” Comput. Secur., vol. 73, pp. 326–344, Mar. 2018.
[18] W. Yu, L. Ge, G. Xu, and X. Fu, “Towards neural network based malware detection on Android mobile devices,” in Cybersecurity Systems for Human Cognition Augmentation. Cham, Switzerland: Springer, 2014, doi: 10.1007/978-3-319-10374-7_7.
[19] XDA. (2020). XDA-Developers Android Forums. Accessed: Apr. 2020. [Online]. Available: https://forum.xda-developers.com
[20] Z. Yuan, Y. Lu, and Y. Xue, “Droiddetector: Android malware characterization and detection using deep learning,” Tsinghua Sci. Technol., vol. 21, no. 1, pp. 114–123, 2016.
[21] B. Wu et al., “Why an Android app is classified as malware? Towards malware classification interpretation,” 2020, arXiv:2004.11516. [Online]. Available: https://arxiv.org/abs/2004.11516
[22] R. Feng, J. Q. Lim, S. Chen, S.-W. Lin, and Y. Liu, “Seqmobile: An efficient sequence-based malware detection system using RNN on mobile devices,” in Proc. ICECCS, 2020.
[23] R. Feng et al., “MobiDroid: A performance-sensitive malware detection system on mobile platform,” in Proc. 24th Int. Conf. Eng. Complex Comput. Syst. (ICECCS), Nov. 2019, pp. 61–70.
[24] Tensorflow Lite. Accessed: Apr. 2020. [Online]. Available: https://www.tensorflow.org/lite/
[25] Symantec Security. Accessed: Apr. 2020. [Online]. Available: https://www.symantec.com/security-center/threats/
[26] Overview of MobiTive. Accessed: Apr. 2020. [Online]. Available: https://sites.google.com/view/mobitive2020
[27] Apkmirror Market. Accessed: Apr. 2020. [Online]. Available: https://www.apkmirror.com/
[28] Core ML. Accessed: Apr. 2020. [Online]. Available: https://developer.apple.com/documentation/coreml/
[29] Caffe2 Mobile. Accessed: Apr. 2020. [Online]. Available: https://caffe2.ai/docs/mobile-integration.html
[30] Pytorch Mobile. Accessed: Apr. 2020. [Online]. Available: https://pytorch.org/mobile/Android/
[31] Post-Training Quantization. Accessed: Apr. 2020. [Online]. Available: https://www.tensorflow.org/lite/performance/post_training_quantization/
[32] A. Narayanan, L. Yang, L. Chen, and L. Jinliang, “Adaptive and scalable Android malware detection through online learning,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 2484–2491.
[33] N. McLaughlin et al., “Deep Android malware detection,” in Proc. CODASPY, 2017, pp. 301–308.
[34] Axmldec. Accessed: Apr. 2020. [Online]. Available: https://github.com/ytsutano/axmldec
[35] Axml. Accessed: Apr. 2020. [Online]. Available: https://github.com/xgouchet/AXML
[36] Dex2jar. Accessed: Apr. 2020. [Online]. Available: https://github.com/pxb1988/dex2jar
[37] Apktool. A Tool for Reverse Engineering Android APK Files. Accessed: Apr. 2020. [Online]. Available: https://ibotpeaches.github.io/Apktool/
[38] Androguard. Accessed: Apr. 2020. [Online]. Available: https://github.com/androguard/
[39] Soot. Accessed: Apr. 2020. [Online]. Available: https://github.com/Sable/soot/
[40] Keras: Neural Networks API. Accessed: Apr. 2020. [Online]. Available: https://keras.io/
[41] Scikit-Learn. Accessed: Apr. 2020. [Online]. Available: https://scikit-learn.org/stable/
[42] Tensorflow. Accessed: Apr. 2020. [Online]. Available: https://www.tensorflow.org/
[43] Dalvik Opcode. Accessed: Apr. 2020. [Online]. Available: http://pallergabor.uw.hu/Androidblog/dalvik_opcodes.html
[44] Virustotal. Accessed: Apr. 2020. [Online]. Available: https://www.virustotal.com/
[45] Contagio Website. Accessed: Apr. 2020. [Online]. Available: http://contagiominidump.blogspot.com/
[46] Pwnzen Infotech. Accessed: Apr. 2020. [Online]. Available: http://www.pwnzen.com/
[47] E. B. Karbab, M. Debbabi, A. Derhab, and D. Mouheb, “MalDozer: Automatic framework for Android malware detection using deep learning,” Digit. Invest., vol. 24, pp. S48–S59, Mar. 2018.
[48] A. Narayanan, C. Soh, L. Chen, Y. Liu, and L. Wang, “apk2vec: Semi-supervised multi-view representation learning for profiling Android applications,” in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2018, pp. 357–366.
[49] Q. Guo et al., “An empirical study towards characterizing deep learning development and deployment across different frameworks and platforms,” in Proc. 34th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Nov. 2019, pp. 810–822.
[50] Inspeckage. Accessed: Apr. 2020. [Online]. Available: https://github.com/ac-pm/Inspeckage
[51] Android-Profiler. Accessed: Apr. 2020. [Online]. Available: https://developer.Android.com/studio/profile/Android-profiler
[52] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2016, pp. 582–597.
[53] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), Mar. 2016, pp. 372–387.
[54] S. Chen, M. Xue, L. Fan, L. Ma, Y. Liu, and L. Xu, “How can we craft large-scale Android malware? An automated poisoning attack,” in Proc. IEEE 1st Int. Workshop Artif. Intell. Mobile (AI4Mobile), Feb. 2019, pp. 21–24.
[55] G. Chen et al., “Who is real bob? adversarial attacks on speaker recognition systems,” Proc. S&P, 2021, pp. 1–18.
[56] X. Chen et al., “Android HIV: A study of repackaging malware for evading machine-learning detection,” IEEE Trans. Inf. Forensics Security, vol. 15, pp. 987–1001, Jul. 2019.
[57] E. Chin, A. P. Felt, K. Greenwood, and D. Wagner, “Analyzing inter-application communication in Android,” in Proc. 9th Int. Conf. Mobile Syst., Appl., Services (MobiSys), 2011, pp. 239–252.
[58] L. Li, J. Gao, T. F. Bissyandé, L. Ma, X. Xia, and J. Klein, “Characterising deprecated Android APIs,” in Proc. 15th Int. Conf. Mining Softw. Repositories (MSR), 2018, pp. 254–264.
[59] L. Deshotels, V. Notani, and A. Lakhotia, “DroidLegacy: Automated familial classification of Android malware,” in Proc. ACM SIGPLAN Program Protection Reverse Eng. Workshop (PPREW), 2014, pp. 1–12.
[60] M. Zhang, Y. Duan, H. Yin, and Z. Zhao, “Semantics-aware Android malware classification using weighted contextual API dependency graphs,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2014, pp. 1105–1116.
[61] Y. Zhongyang, Z. Xin, B. Mao, and L. Xie, “DroidAlarm: An all-sided static analysis tool for Android privilege-escalation malware,” in Proc. 8th ACM SIGSAC Symp. Inf., Comput. Commun. Secur. (ASIA CCS), 2013, pp. 353–358.
[62] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, “DroidMat: Android malware detection through manifest and API calls tracing,” in Proc. 7th Asia Joint Conf. Inf. Secur., Aug. 2012, pp. 62–69.
[63] G. Meng, Y. Xue, Z. Xu, Y. Liu, J. Zhang, and A. Narayanan, “Semantic modelling of Android malware for effective malware comprehension, detection, and classification,” in Proc. 25th Int. Symp. Softw. Test. Anal. (ISSTA), 2016, pp. 306–317.
[64] T. Kim, B. Kang, M. Rho, S. Sezer, and E. G. Im, “A multimodal deep learning method for Android malware detection using various features,” IEEE Trans. Inf. Forensics Security, vol. 14, no. 3, pp. 773–788, Aug. 2018.
[65] K. Xu, Y. Li, R. H. Deng, and K. Chen, “DeepRefiner: Multi-layer Android malware detection system applying deep neural networks,” in Proc. IEEE Eur. Symp. Secur. Privacy (EuroS&P), Apr. 2018, pp. 473–487.
[66] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B. G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth, “Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones,” ACM Trans. Comput. Syst., vol. 32, no. 2, pp. 1–29, 2014.
[67] K. Basu, P. Krishnamurthy, F. Khorrami, and R. Karri, “A theoretical study of hardware performance counters-based malware detection,” IEEE Trans. Inf. Forensics Security, vol. 15, pp. 512–525, Jun. 2019.
[68] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss, “‘Andromaly’: A behavioral malware detection framework for Android devices,” J. Intell. Inf. Syst., vol. 38, no. 1, pp. 161–190, 2012.
[69] A.-D. Schmidt, F. Peters, F. Lamour, C. Scheel, S. A. Çamtepe, and Ş. Albayrak, “Monitoring smartphones for anomaly detection,” Mobile Netw. Appl., vol. 14, no. 1, pp. 92–106, Feb. 2009.
[70] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “RiskRanker: Scalable and accurate zero-day Android malware detection,” in Proc. 10th Int. Conf. Mobile Syst., Appl., Services (MobiSys), 2012, pp. 281–294.


[71] C.-Y. Huang, Y.-T. Tsai, and C.-H. Hsu, “Performance evaluation on permission-based detection for Android malware,” in Proc. Adv. Intell. Syst. Appl., vol. 2, 2013, pp. 111–120.
[72] Z. Aung and W. Zaw, “Permission-based Android malware detection,” Int. J. Sci. Technol. Res., vol. 2, no. 3, pp. 228–234, 2013.
[73] A. Demontis et al., “Yes, machine learning can be more secure! A case study on Android malware detection,” IEEE Trans. Depend. Sec. Comput., vol. 16, no. 4, pp. 711–724, Aug. 2019.
[74] T. Bläsing, L. Batyuk, A.-D. Schmidt, S. A. Camtepe, and S. Albayrak, “An Android application sandbox system for suspicious software detection,” in Proc. 5th Int. Conf. Malicious Unwanted Softw., Oct. 2010, pp. 55–62.
[75] J. Sun, K. Yan, X. Liu, C. Yang, and Y. Fu, “Malware detection on Android smartphones using keywords vector and SVM,” in Proc. IEEE/ACIS 16th Int. Conf. Comput. Inf. Sci. (ICIS), May 2017, pp. 833–838.
[76] L. Lu, Z. Li, Z. Wu, W. Lee, and G. Jiang, “CHEX: Statically vetting Android apps for component hijacking vulnerabilities,” in Proc. ACM Conf. Comput. Commun. Secur. (CCS), 2012, pp. 229–240.
[77] P. P. F. Chan, L. C. K. Hui, and S. M. Yiu, “DroidChecker: Analyzing Android applications for capability leak,” in Proc. 5th ACM Conf. Secur. Privacy Wireless Mobile Netw. (WISEC), 2012, pp. 125–136.
[78] K. Lu et al., “Checking more and alerting less: Detecting privacy leakages via enhanced data-flow analysis and peer voting,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2015, pp. 1–15.
[79] F. Wei, S. Roy, X. Ou, and Robby, “AmAndroid: A precise and general inter-component data flow analysis framework for security vetting of Android apps,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2014, pp. 1329–1341.
[80] S. Hao, B. Liu, S. Nath, W. G. J. Halfond, and R. Govindan, “PUMA: Programmable UI-automation for large-scale dynamic analysis of mobile apps,” in Proc. 12th Annu. Int. Conf. Mobile Syst., Appl., Services (MobiSys), 2014, pp. 204–217.
[81] K. Pei, Y. Cao, J. Yang, and S. Jana, “DeepXplore: Automated whitebox testing of deep learning systems,” in Proc. 26th Symp. Operating Syst. Princ., Oct. 2017, pp. 1–18.
[82] L. Ma et al., “DeepGauge: Multi-granularity testing criteria for deep learning systems,” in Proc. 33rd ACM/IEEE Int. Conf. Automated Softw. Eng. (ASE), 2018, pp. 120–131.
[83] X. Xie et al., “DeepHunter: A coverage-guided fuzz testing framework for deep neural networks,” in Proc. 28th ACM SIGSOFT Int. Symp. Softw. Test. Anal. (ISSTA), 2019, pp. 146–157.
[84] X. Du, X. Xie, Y. Li, L. Ma, Y. Liu, and J. Zhao, “DeepStellar: Model-based quantitative analysis of stateful deep learning systems,” in Proc. 27th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng. (ESEC/FSE), 2019, pp. 477–487.
[85] L. Ma et al., “DeepMutation: Mutation testing of deep learning systems,” in Proc. IEEE 29th Int. Symp. Softw. Rel. Eng. (ISSRE), Oct. 2018, pp. 100–111.
[86] X. Zhang et al., “Towards characterizing adversarial defects of deep learning software from the lens of uncertainty,” in Proc. ACM/IEEE 42nd Int. Conf. Softw. Eng., Jun. 2020, pp. 1–13.
[87] X. Xie, L. Ma, H. Wang, Y. Li, Y. Liu, and X. Li, “DiffChaser: Detecting disagreements for deep neural networks,” in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 5772–5778.

Sen Chen (Member, IEEE) received the Ph.D. degree in computer science from the School of Computer Science and Software Engineering, East China Normal University, China, in June 2019. He was a Research Assistant with Nanyang Technological University (NTU), Singapore, from 2016 to 2019, and a Research Fellow from 2019 to 2020. He is currently a Research Assistant Professor with the School of Computer Science and Engineering, NTU. His research interests include security and software engineering.

Xiaofei Xie received the B.E., M.E., and Ph.D. degrees from Tianjin University. He is currently a Presidential Post-Doctoral Fellow with Nanyang Technological University, Singapore. He has published some top tier conference/journal papers relevant to software analysis in ISSTA, FSE, TSE, IJCAI, and CCS. His main research interests include program analysis, loop analysis, traditional software testing, and security analysis of artificial intelligence. In particular, he won two ACM SIGSOFT Distinguished Paper awards.

Guozhu Meng received the B.E. and M.E. degrees from Tianjin University, China, in 2009 and 2012, respectively, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2017. He was a Research Fellow with Nanyang Technological University and a Visiting Research Fellow with the University of Luxembourg. He is currently an Associate Professor with the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include mobile security, vulnerability detection, and big data analysis.

Shang-Wei Lin received the B.S. degree from National Chung Cheng University in 2003 and the Ph.D. degree in 2010. He started his Post-Doctoral Researcher work with NUS and SUTD from 2011 to 2015. In May 2015, he joined Nanyang Technological University (NTU) as an Assistant Professor. His research interests include formal verification, formal synthesis, embedded system design, cyberphysical systems, security systems, multi-core programming, and component-based object-oriented app frameworks for real-time embedded systems.

Ruitao Feng received the bachelor’s degree in computer science and technology from Tianjin University, Tianjin, China, in 2014. He is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He has been working as a Research Assistant with Temasek Laboratory, Nanyang Technological University, since 2014. His research interests include solving security and performance problems on mobile platforms with software engineering and machine (deep) learning methods.

Yang Liu (Senior Member, IEEE) received the bachelor’s degree (Hons.) in computing from the National University of Singapore (NUS) in 2005 and the Ph.D. degree in 2010. He started his Post-Doctoral Researcher work with NUS, MIT, and SUTD. In Fall 2012, he joined Nanyang Technological University (NTU) as a Nanyang Assistant Professor. He is currently a Professor and the Director of the Cybersecurity Laboratory at NTU. He specializes in software verification, security, and software engineering.
