This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2853121, IEEE Access


Real-Time Behavior Analysis and Identification for Android Application
Sixian Sun1, Xiao Fu1, Hao Ruan2, Xiaojiang Du3, Senior Member, IEEE, Bin Luo1, and Mohsen Guizani4, Fellow, IEEE
1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210000 China.
2 IT-FLEX team, Intel Corporation, Shanghai, 200241 China.
3 Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122 USA.
4 Department of Electrical and Computer Engineering, University of Idaho, Moscow, ID 83844 USA.
Corresponding author: Xiao Fu (e-mail: [email protected]).
This work is supported by the National Natural Science Foundation of China (61100198/F0207).

ABSTRACT The number of applications based on the Android platform is increasing rapidly. However, because the supervision and review of Android applications are inadequate, there is a reasonable chance that users will download malware. This malware can lead to information leakage, monetary loss, and other damage. At present, a variety of applications exist for detecting malware, but most of them cannot show specific malicious behaviors. Moreover, this detection software relies on a database of known viruses and thus cannot identify unknown malware. To solve these problems, we implemented a system that detects the behaviors of Android applications and identifies known or unknown malware. Our system monitors specified applications by loading a kernel module. After the detection process, the related documents are uploaded to the server, and the dynamic behaviors are reconstructed. As a result, a behavior diagram is generated. In addition, if the user needs to know whether the application is malware, the related Android package is sent to the server and analyzed. The server then calculates the results and returns them to the client.

INDEX TERMS Android malware, behavior analysis, dynamic detection, software identification

I. INTRODUCTION

With its rapid development, the Android system has become an operating system with wide influence worldwide. Although users can conveniently use diverse applications in daily life, lax review systems have allowed a large amount of malware to appear. According to statistical results [27], most malicious behaviors of malware are related to obtaining root permissions, achieving remote control, and utilizing a Command and Control (C&C) server, which results in monetary loss, disclosure of sensitive information, and other harm. Thus, the analysis of Android applications has become a hot topic. Currently, three main methods of analyzing applications exist: static, dynamic, and hybrid analysis.

Dynamic analysis detects behaviors of applications by executing them on real machines or simulators. One dynamic analysis method uses a Dalvik virtual machine (VM); however, it cannot detect behaviors in native code. A second means of implementing dynamic analysis depends on VM introspection [3] [16] [22] executed in a simulated environment such as QEMU. These simulated environments have higher-level permissions and are able to achieve a detailed dynamic analysis. However, they have several disadvantages, such as path explosion and incomplete detection results.

Static analysis extracts the required information by analyzing source code and binary files. In this method, almost all code can be covered and analyzed without the application being executed. Unfortunately, the challenges of code obfuscation [12] and dynamic code loading [15] may arise. Finally, hybrid analysis, which combines static and dynamic analysis, was introduced; however, this method is more complicated.

The results of these analysis methods cannot easily be understood, especially by normal users. Moreover, the execution of professional analysis methods is complicated. Therefore, several malware identification applications were developed for normal users [19]. Although convenient, these applications may not show specific malicious behaviors. Furthermore, some identify applications through a database of

2169-3536 © 2017 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
known malware, and therefore are unable to identify unknown malware.

In this paper, we apply hybrid analysis in our system to resolve the problems described above. We implemented an Android application to detect, analyze, and identify applications and evaluated it using many sample applications. The key contributions of this paper are summarized as follows:
1) We create a new approach for detecting real-time behaviors based on the Android kernel, which uses kernel-level monitoring mechanisms.
2) We create a new approach for identifying malware using the results of both dynamic detection and static analysis. An application can then be analyzed according to the statistical results, and a naive Bayesian algorithm determines whether it is malware. Notably, this method can identify both known and unknown malware.
3) Our approach analyzes behaviors using a data-centric technique, which differs from the traditional analysis of a single application or the entire system [1] [14]. The analysis approach presented in this paper is capable of reconstructing the behaviors of multiple applications while incurring less overhead.
4) We implement a complete system with graphical interfaces, which is more convenient to use. Moreover, we present the evaluation of our system in detail.

The rest of this paper is structured as follows. In Section II, we introduce the techniques applied in our system and give an overview of the system. The detailed design of the system is presented in Section III. Then, we present the evaluation in Section IV. In Section V, we review and compare related work. Finally, Section VI discusses limitations and future work.

II. OVERVIEW

A. TECHNIQUES
Several techniques are applied in our system to manage data storage, network transmission [23], and the service terminal. MySQL is used to store data, including the permissions and APIs of applications and the probability that selected permissions and APIs are used. Using these data, our system can identify Android applications. Moreover, MySQL's small footprint and high speed make it suitable for our system. The process of identifying applications is executed on the server. To build the server, we apply Struts 2, a Web application framework. Unlike Struts 1, Struts 2 is based on WebWork and handles requests using an interceptor mechanism, so it can completely separate the business logic from the Servlet API.

B. SYSTEM OVERVIEW
An overview of the system of real-time behavior analysis and identification for Android applications is presented in Fig. 1. The system is composed of six modules, two of which are on the client side and four on the server side. Specifically, on the client side, an Android application is installed that is composed of two modules. The first time the application is run on a device, the required files are initialized. Then, each time the application is opened, additional initialization work is

FIGURE 1. Overview of the system

completed. The system loads a kernel module to detect the behaviors of the selected applications, and all behaviors are recorded in the general log file.

Our server is composed of four modules. If required, the dynamic behaviors of an application are reconstructed, and then the behavior graph is generated. The configuration files and code of an Android application can be acquired by using a decompiler. In addition, relevant content such as permissions and APIs can be extracted from these files. Samples of malicious and benign applications are parsed, and the statistical results for permissions and APIs are acquired. After feature selection, the selected permissions and APIs are used to train a classifier. When the server receives an APK, it analyzes the APK and uses the classifier to identify the application.

III. DETAILED DESIGN

A. INITIALIZATION
The initialization module is responsible for generating resource files when the Android application is used for the first time. Specifically, some resource files need to be copied, including all Android interface definition language (AIDL) files and the loadable kernel module [24] [28], and some files need to be created, such as the uid_file and the directories for log files and behavior graphs.

Nonsystem applications are shown as a list each time the user opens our Android application. PackageManager is used to enumerate all applications, and system applications are filtered out. The detailed information of each nonsystem application includes the name of the application, the name of the application package, the id of the application, the version name, the version number, the installation date, the update date, and the icon of the application.

B. DYNAMIC BEHAVIOR DETECTION
An important task in the behavior detection module is to add a hook to the kernel to monitor system calls. A loadable kernel module can implement this task. Most operating systems, including Unix and Windows, support loadable kernel modules, and thus such a module can be used to detect the behaviors of applications without recompiling the Linux kernel or restarting the system.

In the kernel module rootkit.ko, the first step is to obtain sys_call_table, which contains all system calls. When an interrupt or an exception occurs, the kernel jumps to the exception vector table to handle it. In the Linux kernel of the Android system, a section of address space starting at 0xffff0000 holds all interrupt routines, and an instruction stored at address 0xffff0008 copies the address of the exception handler to the current instruction register. During exception handling, an instruction loads the address of sys_call_table into a register. Therefore, it is feasible to search for this loading instruction in the handler and thereby find the address of sys_call_table.

After obtaining the address of sys_call_table, specific system calls can be intercepted. The original addresses of the required system calls are first saved and then replaced with the addresses of the handling functions designed in this study. In these handling functions, the original system calls are still called to handle interrupts or exceptions. Four types of system calls are intercepted in our system: Android interprocess communication, file operations, network operations, and process management. Among these, Android interprocess communication is parsed by Binder Parser, as most of these calls depend on the Android Binder mechanism; the other system calls are parsed directly by System Call Parser. The two parsers are introduced below.

1) BINDER PARSER
The Binder framework is the standardized interprocess communication mechanism of Android. The Binder mechanism consists of four components: Client, Server, Service Manager, and Binder Driver. Client, Server, and Service Manager run in user space, whereas Binder Driver runs in kernel space. For communication, Binder Driver provides /dev/binder, a device file for communicating with user space. Client, Server, and Service Manager communicate with Binder Driver through file manipulation functions, including open and ioctl. Finally, Service Manager is used as a daemon for managing Server and providing Client with the capacity to query the interfaces of Server. AIDL is used to implement the one-to-one correspondence described above in the Android system, as it allows an application to define interfaces between Client and Server. The Android software development kit (SDK) automatically generates a Java interface file when an AIDL file has been completed.

The prototype of the ioctl system call is ioctl(unsigned int fd, unsigned int cmd, unsigned long arg). The first parameter represents a file descriptor of a binder device, the second represents the I/O control command, and the third represents the content sent from user space. If the second parameter is BINDER_WRITE_READ, the third must be a binder_write_read structure. In the binder_write_read structure, the member variable “write_buffer” records content that is transmitted from user space to Binder Driver, while the member variable “read_buffer” holds data returned from Binder Driver to user space.

Further, “write_buffer” is stored as an array, each element of which consists of a communication protocol code and a set of communication data. The communication protocol code BC_TRANSACTION represents communication data between processes; the corresponding structure is binder_transaction_data. The proposed system extracts the required information from two member variables of binder_transaction_data, described in the following.

“Data_buffer_address,” which stores information from user space, is the first member variable required. The first content

in the buffer is the request header. The following content comprises the parameters of the interface, which can be parsed by Server. Contents other than primitive information and contents in Intents are unreadable. Therefore, we directly parse the primitive information of specific functions and the information in Intents to obtain the data in the buffer.

The second member variable is “function_feature_code,” which matches the function in the AIDL file. The Android SDK generates some constants according to fixed rules when compiling an AIDL file. These constants are stored in the member variable “code,” and Server looks up the corresponding function through them. To parse “code,” it is necessary to parse the AIDL files in advance for their interfaces. In this system, the AIDL files are parsed automatically and stored in memory. In addition, some bound services such as ActivityManager and ContentProvider are services themselves, and thus we convert them into AIDL files manually.

2) SYSTEM CALL PARSER
System calls for file operations are mainly open, read, write, and close, and parameter parsing is aimed at file descriptors. To improve performance, we apply a hash table to store file descriptors that have already been parsed. Furthermore, some file operations are triggered by an application itself; these are typically operations on processes, device files, and class libraries. In this system, file operations on /proc, /dev, /vendor/lib, and /system/lib are not recorded, so that they do not interfere with the later behavior reconstruction.

For network operations, we record the transmitted data together with its source and destination. For the system calls sendto and recvfrom, attention should be paid to the first three parameters. In detail, the first parameter is the file descriptor of the socket and is used to record the IP addresses and port numbers of the source and destination. The second represents the data to be sent or received, and the third specifies the length of the data.

All system calls are eventually recorded in a general file in the format [Rootkit] (process id) (application id) system call (parameter 1, …, parameter n). In the Android system, each Android application has a unique id, which can be used to distinguish it.

C. BEHAVIOR RECONSTRUCTION
The records in log files are too complicated to be understood by users; therefore, log records are reconstructed to generate behavior graphs in the proposed system. As the behavior reconstruction module runs on the server, the log file and packages.list must be sent over the network to the server.

1) GRAPH GENERATING ALGORITHM
The code for the graph-generating algorithm is presented in Algorithm 1. In the algorithm, “uid” stands for the unique id of an Android application, “pid” stands for the id of a process, and “cid” stands for the id of a child process. The main steps of the algorithm are to create a node and add the node to the tree for each record in the log file. Therefore, the time complexity of the algorithm is O(n), where n stands for the number of records in the log file.

The details of the algorithm flow are as follows:
1) Initialize the array nodes and the hash table map.
2) Get log records from the log file one by one. If no log record remains, the system goes to Step 9.
3) Extract pid, uid, the name of the function, and the parameters from each log record, and then store them in a behavior node. If the system call is a clone call, cid also needs to be stored.
4) If uid has been stored in map, this indicates that the node of the application has been created, and the system goes to Step 5. If the node of the application has not been created, the system goes to Step 8.
5) If pid has also been stored in map, the process tree in which the current node belongs is already in the graph, and the system goes to Step 6. If pid is not in map, the system goes to Step 7.
6) If the system call is clone, insert this node (clone) after the node in map, and update map[uid][cid] to this node (clone). For other system calls, insert this node after the node in map, and then update map[uid][pid] to this node.
7) If pid does not exist, make the current behavior node a child node of the application node, and then update map[uid][pid].
8) If uid does not exist, create an application node and make it a child node of the root node. In addition, make the current behavior node a child node of the application node, and update map[uid][pid].
9) A primary behavior graph has been generated.

Algorithm 1. Graph-generating algorithm
    nodes[0] = root
    map[uid][pid] = 0
    i = 1
    for each line in log file do
        store pid, uid, function, and parameters into a node
        for clone function store child id (cid) into the node
        if uid exists then
            if pid exists then
                if function == clone then
                    let this become the child of nodes[map[uid][pid]]
                    map[uid][cid] = i
                else
                    let this become the child of nodes[map[uid][pid]]
                    map[uid][pid] = i
                end if
            else
                this node becomes the child of application node
                map[uid][pid] = i
            end if
        else
            create an application node
            this node becomes the child of application node
            map[uid][pid] = i
        end if
        i++
    end for
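
A minimal Python sketch may make the single pass of the graph-generating algorithm (Algorithm 1) concrete. It is not the authors' implementation. The regular expression is one assumed rendering of the record format [Rootkit] (process id) (application id) system call (parameters); the Node class and the dictionaries apps and last (which stand in for nodes and map) are hypothetical names; and the convention that a clone record carries the child id as its first parameter is an assumption, since the paper does not say where cid appears in a record.

```python
import re
from dataclasses import dataclass, field

# Assumed concrete form of "[Rootkit] (pid) (uid) syscall (p1, ..., pn)".
RECORD = re.compile(r"\[Rootkit\]\s*\((\d+)\)\s*\((\d+)\)\s*(\w+)\s*\((.*)\)\s*$")


@dataclass
class Node:
    label: str
    pid: int = -1
    uid: int = -1
    params: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse_record(line):
    """Return (pid, uid, function, params), or None if the line is not a record."""
    m = RECORD.match(line.strip())
    if m is None:
        return None
    pid, uid, func, raw = m.groups()
    params = [p.strip() for p in raw.split(",")] if raw.strip() else []
    return int(pid), int(uid), func, params


def build_graph(lines):
    """Build the primary behavior graph in one pass over the log records."""
    root = Node("root")
    apps = {}  # uid -> application node
    last = {}  # (uid, pid) -> most recent behavior node of that process
    for line in lines:
        rec = parse_record(line)
        if rec is None:
            continue
        pid, uid, func, params = rec
        node = Node(func, pid, uid, params)
        if uid not in apps:
            # Step 8: first record of this application.
            apps[uid] = Node(f"app:{uid}", uid=uid)
            root.children.append(apps[uid])
            apps[uid].children.append(node)
            last[(uid, pid)] = node
        elif (uid, pid) not in last:
            # Step 7: first record of this process.
            apps[uid].children.append(node)
            last[(uid, pid)] = node
        else:
            # Step 6: attach to the latest node of the process.
            last[(uid, pid)].children.append(node)
            if func == "clone":
                # Assumption: the child id (cid) is the first logged parameter.
                last[(uid, int(params[0]))] = node
            else:
                last[(uid, pid)] = node
    return root
```

Calling build_graph on the lines of a log file returns the root node; each application appears as a child of the root, and the records of a cloned process hang below the clone node that created it. Each record is handled with a constant number of dictionary operations, which matches the O(n) bound stated for the algorithm.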

2) BROADCAST-MATCHING ALGORITHM

Algorithm 2. Broadcast-matching algorithm
    Q = empty
    last = 0
    for i = 1 to nodeSize do
        if function of nodes[i] is schedule method then
            Q.enqueue(nodes[i])
        else
            if function of nodes[i] is finished method then
                for each node in Q do
                    if function of nodes[i] match function of node then
                        cnode = pnode = nodes[i]
                        last = i
                        repeat
                            cnode = pnode
                            pnode = cnode.parent
                        until pnode.pid == nodes[i].pid and pnode.time > last and pnode.time > node.time
                        cnode and nodes[i] become children of node
                        Q.remove(node)
                        break
                    end if
                end for
            end if
        end if
    end for

The primary behavior graph above still lacks readability and semantic relations. Hence, a broadcast-matching algorithm is used to improve the behavior graph. The first step is to identify the life cycles of activities in an application, and the second step is to match broadcast events. For the first step, some of the IPC functions that are detected can be matched with the callback functions of the activity life cycle in the Android system. For example, the IPC function startActivity matches the callback function onStart, and the IPC function activityStopped matches the callback function onStop. This indicates that IPC functions are able to reflect the life cycles of activities in an application correctly.

The second step identifies the life cycles of broadcasts. The Android system applies a broadcast mechanism to send system messages to applications. If an application has registered for a broadcast event, it handles the event. In the Android system, there are two approaches for registering broadcast receivers. One is static registration. Receivers and actions in applications must be defined in the AndroidManifest.xml file. Whether or not an application is active, it keeps listening for the registered broadcasts.

The second approach is dynamic registration, which calls functions to register receivers in activities. An application no longer listens for broadcasts once it is closed. In the process of sending a broadcast, when an application calls the function sendBroadcast, it in fact calls the function sendBroadcast in the class ContextWrapper. Then, the function broadcastIntent is called by the class ContextImpl, and several functions in ActivityManagerService are called to handle the broadcast.

The specific sending processes of the two methods of registering broadcast events differ. In the case of dynamic registration, the function scheduleRegisteredReceiver is called by the class BroadcastQueue, and then the function performReceive is called by the class ActivityThread. For static registration, the function scheduleReceiver is called by the class BroadcastQueue, and then the broadcast is handled directly in the class ActivityThread.

Finally, the two registration methods share the same process. The function onReceive is called, and then the broadcast is handled by the class that inherits from the class BroadcastReceiver and overrides the function onReceive. After handling the broadcast, the class BroadcastReceiver may call the function sendFinished, followed by the function finishReceiver. In the entire process above, the functions scheduleRegisteredReceiver, scheduleReceiver, and finishReceiver are IPC functions. By monitoring these three functions, we can correctly match the life cycles of the broadcast.

The broadcast-matching algorithm is shown in Algorithm 2. As the process of handling a broadcast event is sequential, the schedule methods of the broadcast life cycle are stored in a queue. The main steps of the algorithm are to identify every node in the behavior graph and to traverse to the top from a specific behavior node. Since the number of traversed nodes in the second step, which ranges from 2 to (nodeSize-1), is uncertain, the average frequency of the second step is calculated as (nodeSize+1)(nodeSize-2)/2. Therefore, the time complexity of the algorithm is O(n^2), where n stands for the number of behavior nodes in the behavior graph.

The detailed algorithm flow is as follows:
1) Get behavior nodes in the behavior graph serially according to the log file. If no behavior node remains, the system goes to Step 10.
2) Check whether the function of the current behavior node is a schedule method of the broadcast life cycle. If it is, the system goes to Step 3; otherwise, it goes to Step 4.
3) Add the behavior node to the queue, and return to Step 1.
4) Check whether the function of the current behavior node is the finishing method of the broadcast life cycle. If it is, the system goes to Step 5; otherwise, it returns to Step 1.
5) Visit the next node in the queue, and mark it as the node of the queue. If all nodes in the queue have been visited, the system returns to Step 1.
6) Check whether the function of the current behavior node matches the function of the node of the queue. If they match, the system goes to Step 7; otherwise, it returns to Step 5.
7) In the primitive behavior graph, traverse to the top from the current behavior node, which is the node of the finishing method of the broadcast life cycle. During the traversal, ensure that the times of occurrence in the log file of visited nodes are later than those of the current behavior node.
8) When the traversal finishes, make the node where the traversal stops the first function of the broadcast life

cycle, and also make the current node the last function of the broadcast life cycle. At the same time, make the two nodes the children of the node of the schedule method of the broadcast life cycle.
9) Remove the queued node from the queue, and return to Step 1.
10) The broadcast-matching algorithm is completed, and a behavior graph with abundant semantic information has been generated.

3) GRAPH-SIMPLIFYING ALGORITHM
The behavior graph generated above still contains redundant information and needs to be simplified. First, the redundant clone nodes without children are removed. Second, duplicate nodes that are called consecutively are combined into one node, and the number of repetitions is marked. Finally, in this system, each broadcast life cycle can be abstracted into a node, which contains abstract behavior information. There are three types of nodes after abstraction, as follows:
1) File operations, including the “open,” “write,” “read,” and “close” system calls, are classified as File Access.
2) Network operations, including the “sendto” and “recvfrom” system calls, are classified as Network Access.
3) The third type contains IPC calls, but they are too numerous to categorize individually. Therefore, the permission mechanism of the Android system is used. In the Android system, the permissions of an application are checked if one of the key APIs is called. For example, the system checks whether the application holds the android.permission.SEND_SMS permission if a message is sent. In practice, permission checking is completed through the function checkPermission in the class ActivityManager, and the checked permission is added to the abstract node as abstract information.

D. STATIC APPLICATION ANALYSIS
The server receives the APK of an Android application from the client and analyzes it to extract permissions and APIs [25]. In this system, APKTool, an open-source decompiler originally hosted on Google Code, is used to decompile APKs. The tool generates files such as source code, a configuration file, and resources.

In the Android system, permissions must be declared before being used in Android applications. Hence, to obtain permissions, AndroidManifest.xml, which is the configuration file, needs to be traversed.

Decompiling results in several smali folders, which contain the code of an application in the form of smali files [11] [29]. Although Android applications are developed using the Java language, the Android system uses its own Dalvik VM. Therefore, the code of an Android application is compiled to Dalvik bytecode, which APKTool disassembles into smali files, rather than to Java class files. All smali files are traversed, and every line starting with “invoke,” which represents a function call, is extracted. Then, the extracted line is parsed to obtain the APIs. Finally, to run APKTool [18] from a Java program, we obtain the runtime environment and execute command-line (CMD) commands through it.

E. APPLICATION STATISTICS AND ANALYSIS
During the preparation phase, a large number of samples of Android applications was required. As it was unrealistic to download all of the applications manually, benign applications were obtained by means of a Web crawler from the Android Market, and malware applications were downloaded from several forums such as the Kafan and phpBB forums. For this system, it was sufficient to use a simple Web crawler program to grab information from the Internet automatically.

Statistics of both the permissions and the APIs are required. Permissions are considered as an example in this paper. The first step is to count the occurrences of permissions. Thus, for each permission, four numbers need to be calculated: the number of malware applications using the permission, the number of benign applications using the permission, the number of malware applications not using the permission, and the number of benign applications not using the permission. The same process is used to collect the statistics of APIs.

The second step is to analyze the permissions and obtain characteristic attributes. In this system, a chi-square test is applied to determine whether the presence of a permission and the nature of an application owning the permission are related. The chi-square values can be calculated by using the formula for four-fold tables. In the formulas below, a stands for the number of malware applications using the permission, b stands for the number of benign applications using the permission, c stands for the number of malware applications not using the permission, d stands for the number of benign applications not using the permission, and n stands for the total number of applications (n = a + b + c + d). In the special case where the value of a, b, c, or d is less than 5 and that of n is greater than 40, the correction formula

$$\chi^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)} \tag{1}$$

is required. In addition,

$$\chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} \tag{2}$$

is used under normal conditions.

The chi-square test is applied only to permissions that are used at least 50 times. The larger the chi-square value, the stronger the correlation between the presence of a permission and the nature of an application owning the permission. Hence, permissions are ordered according to their chi-square values, and the first 80 are defined as characteristic attributes.

The last step is to train the classifier [17] with the characteristic attributes above. In this system, a naive Bayes classification algorithm is applied to identify applications. To obtain the probability of occurrence of a category, the

2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but
republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

conditional probabilities of characteristic attributes, which are parts of an item to be classified, must be calculated. Therefore, in this section, we calculate the conditional probabilities of the characteristic attributes. Using the data in the database, we can determine four probabilities for each characteristic attribute: the probability that a malware application has the attribute, the probability that a malware application does not have it, the probability that a benign application has it, and the probability that a benign application does not have it.
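The chi-square ranking described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the system's actual code: the function names `chi_square` and `rank_permissions`, the input layout (a mapping from permission name to the pair of malware and benign usage counts), and the clamping of the corrected difference at zero are all hypothetical choices made for the example. The thresholds (fewer than 5, more than 40, at least 50 uses, first 80) are taken from the text.

```python
from typing import Dict, List, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square value of a four-fold table.

    a: malware applications using the permission
    b: benign applications using the permission
    c: malware applications not using the permission
    d: benign applications not using the permission
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A whole row or column is empty, so the permission carries no signal.
        return 0.0
    diff = abs(a * d - b * c)
    if min(a, b, c, d) < 5 and n > 40:
        # Formula (1): corrected difference. Clamping at zero is an
        # assumption of this sketch, not something stated in the text.
        diff = max(diff - n / 2, 0.0)
    # Formula (2) when no correction was applied.
    return n * diff ** 2 / denom


def rank_permissions(
    counts: Dict[str, Tuple[int, int]],
    n_malware: int,
    n_benign: int,
    min_uses: int = 50,
    top_k: int = 80,
) -> List[str]:
    """Return up to top_k permissions ordered by decreasing chi-square value."""
    scored = []
    for perm, (a, b) in counts.items():
        if a + b < min_uses:
            continue  # permissions used fewer than min_uses times are skipped
        scored.append((chi_square(a, b, n_malware - a, n_benign - b), perm))
    scored.sort(reverse=True)
    return [perm for _, perm in scored[:top_k]]
```

For example, with 100 malware and 100 benign applications, a permission used by 60 malware and 5 benign applications ranks above one used by every application, because the latter does not separate the two classes at all.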

F. MALWARE APPLICATION IDENTIFICATION
In this module, a classifier using the naive Bayes classification algorithm is implemented to identify whether an application is malware or benign. First, the server receives the APK and the log file of an application from the client. Then, the APK is parsed according to Section III.D. In addition, since static analysis of an APK cannot see dynamically loaded code, we also extract the permissions and APIs recorded by dynamic behavior detection from the log file. Then, using the conditional probabilities obtained during training, the probability of the application being malware is computed, as well as the probability of the application being benign. Finally, the two probabilities are compared, and the application is assigned to the class with the larger probability.
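The training and comparison steps above can be sketched as a small Bernoulli naive Bayes classifier. This is an illustration under assumptions, not the system's implementation: the names `train_naive_bayes` and `classify`, the representation of each application as a set of attribute names, the add-one smoothing, and the use of log-probabilities are all choices made for the example and are not described in the text.

```python
import math
from typing import Dict, List, Set, Tuple

Model = Dict[str, Tuple[float, Dict[str, float]]]


def train_naive_bayes(
    apps: List[Set[str]], labels: List[str], attributes: List[str]
) -> Model:
    """Estimate P(class) and P(attribute present | class) from labeled apps."""
    model: Model = {}
    for cls in sorted(set(labels)):
        members = [feats for feats, lab in zip(apps, labels) if lab == cls]
        prior = len(members) / len(apps)
        # Add-one smoothing (an assumption of this sketch) keeps every
        # conditional probability strictly between 0 and 1.
        cond = {
            attr: (sum(attr in feats for feats in members) + 1) / (len(members) + 2)
            for attr in attributes
        }
        model[cls] = (prior, cond)
    return model


def classify(model: Model, observed: Set[str], attributes: List[str]) -> str:
    """Return the class with the larger posterior score for the observed attributes."""
    best_cls, best_score = "", -math.inf
    for cls, (prior, cond) in model.items():
        # Sum of logs instead of a product of probabilities avoids underflow.
        score = math.log(prior)
        for attr in attributes:
            p = cond[attr]
            score += math.log(p if attr in observed else 1.0 - p)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```

In this sketch, the `observed` set would be the union of the permissions and APIs extracted statically from the APK and those recorded in the dynamic log file, restricted to the characteristic attributes.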
Finally, to interpret the identification results, our system traverses the behavior graph generated from dynamic behavior detection and reconstruction and searches for malicious behaviors to present them. In this study, we focused on file operations, network operations, and IPC calls. The steps to locate malicious behaviors are as follows:
1) Obtain the root node of the complete behavior graph and traverse its children.
2) If the information of the node is related to the required system calls, go to Step 3; otherwise, go to Step 4.
3) Behaviors are stored in arrays. If no array exists for the current behavior, create one; then add the node to the array. Next, traverse the children of the node, and go to Step 2. If the node does not have children, the behavior is considered to be finished.
4) If the array exists, the behavior is considered to be finished; otherwise, traverse the children of the node, and go to Step 2.
These steps are performed only when an application is identified as malware. All of the behaviors obtained in this way are likely to be malicious behaviors.

IV. EVALUATION
In this section, we present an evaluation of our system. The evaluation process addressed two aspects. One aspect was the usability of the system, which was evaluated through the behavior analysis of applications and the overhead that the dynamic analysis incurs in the Android system. The second aspect was the accuracy of the system, which was evaluated through the accuracy of static analysis on the server. The experiments were run on the Android emulator Galaxy Nexus, with the Linux kernel Goldfish 3.4.0 and Android system 4.4.2.

A. EVALUATION OF BEHAVIOR ANALYSIS
In the experiment to evaluate the behavior analysis, GGTracker, a malware application from a research community, was chosen for analysis. GGTracker is a Trojan distributed through a malicious website that imitates Android Market, from which users are likely to download it to their Android devices. The objective of GGTracker is to subscribe a device to fee-based services without the user being aware.
After starting the detection process, we launched GGTracker and triggered many types of events manually to record dynamic behaviors. Through the behavior analysis of GGTracker, we observe that GGTracker has two main malicious behaviors. The first malicious behavior is shown in Fig. 2.

FIGURE 2. Start-up behaviors of GGTracker

The first time that GGTracker is started, it fetches the user’s phone number and relevant information and records them in a particular xml file. Then, it sends the relevant information to a remote server, ggtrack.org, which forces the user’s device to subscribe to a fee-based service. In addition, the IP address and port of the remote server shown in the behavior graph are 52.28.3.6:80. Finally, GGTracker registers a broadcast receiver to listen to the messages


received. Under normal circumstances, if users decide to subscribe to a service by using messages, the service provider sends a message to the users to confirm the fees. After receiving the message, users need to respond with specific content such as “Confirm” or “Y.” However, GGTracker is able to subscribe to fee-based services without the user being aware, as it intercepts and listens to messages.
The second malicious behavior is the interception of messages, which is presented in Fig. 3. When a mobile device has received a message, GGTracker intercepts the message and parses it. The analysis of behaviors showed that GGTracker intercepts messages sent from phone numbers including 99735, 46621, 96512, 33335, 00033335, 00036397, 36397, 55991, 55999, 56255, and 41001. Then, GGTracker sends the message to a remote server, the domain name of which is www.amaz0n-cloud.com. In particular, GGTracker may reply to the message sent by phone number 41001 with “Yes.”

FIGURE 3. Monitoring behaviors of GGTracker

B. EVALUATION OF PERFORMANCE
In the case of dynamic detection, we evaluated the performance when our system was run on Android devices. Two custom test schemes were applied. The first tested the upper limit on the performance overhead of parsing system calls, which include mainly ioctl system calls and file system calls. The second tested the performance overhead in an actual operation process. We tested the start-up time of applications to evaluate their performance overhead, as applications are disturbed less during this time. In addition, during the start-up time, operations including ActivityManager analysis, interprocess communication, display of graphical interfaces, and behaviors in the life cycle of onCreate() are executed.
For the first performance test, a program was designed to request multiple file operations or IPC calls repeatedly and then record the execution time, to test the upper limit of the performance overhead. Table Ⅰ shows that the overhead of file operations is 17%, and that of parsing IPC calls is 18%. Moreover, as Table Ⅱ shows, the overhead of our proposed method is at the low end of the ranges reported for previous methods.

TABLE I
PERFORMANCE OVERHEAD
Operation        Without proposed method   With proposed method   Overhead
                 (average time)            (average time)
IPC              11110.5 ms                13141 ms               18%
File operation   4132.1 ms                 4853.2 ms              17%

TABLE Ⅱ
COMPARISON WITH PREVIOUS METHODS
Research name            Overhead
Research in this paper   17% - 18%
TaintDroid               >= 32%
VetDroid                 >= 32%
Aurasium                 14% - 35%
CopperDroid              20% - 30%

In the second test process, an Android Debug Bridge (ADB) instruction was used to test the start-up time of Android applications. To avoid interference, we tested the start-up time of each application 10 times and then calculated the average time. Ultimately, 16 different applications were tested, of which 14 were provided by the system, and 2 were installed manually. The behaviors of applications in the start-up process differ, and therefore their overheads are not the same. According to the results, the performance overhead of the start-up process ranges from 0.20% to 10.67%, and the average start-up time increases by 5.25% with dynamic analysis.
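As a quick check of the figures in Table Ⅰ, the overhead percentage can be read as the relative increase of the average time with the proposed method over the average time without it. The short sketch below recomputes it from the table values; the helper name `overhead_percent` is hypothetical, and the interpretation of "overhead" as relative increase is an assumption that happens to be consistent with the reported 18% and 17%.

```python
def overhead_percent(baseline_ms: float, monitored_ms: float) -> float:
    """Relative increase of the monitored time over the baseline, in percent."""
    return (monitored_ms - baseline_ms) / baseline_ms * 100.0


# Average times taken from Table I.
ipc_overhead = overhead_percent(11110.5, 13141.0)
file_overhead = overhead_percent(4132.1, 4853.2)

print(f"IPC overhead: {ipc_overhead:.1f}%")
print(f"File operation overhead: {file_overhead:.1f}%")
```

Rounded to whole percentages, both values agree with the Overhead column of the table.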


C. EVALUATION OF MALWARE APPLICATION IDENTIFICATION
We evaluated the results of the malware application identification. A total of 122 malware applications and 166 benign applications were downloaded for testing. We ran every application through our system, one by one, to obtain the identification results.
Fig. 4 presents the results of our experiment. As we focused on identifying malware, we treated malware as the positive class. In the experiment, 104 malware applications and 150 benign applications were correctly identified. The accuracy of the malware application identification is therefore (104 + 150) / 288 = 88.2%, the precision is 104 / (104 + 16) = 86.7%, and the recall is 104 / 122 = 85.2%. These results indicate that the identification method is effective.

[Figure 4: pie chart. True positives: 104; false negatives: 18; true negatives: 150; false positives: 16.]
FIGURE 4. Results of identifying applications. Malware is treated as the positive class and benign applications as the negative class. For example, a “false positive” is a benign application that is identified as malware.

V. RELATED WORK
Static Analysis. Static analysis extracts the required information by analyzing source codes or binary files. It analyzes all codes without executing the application, and therefore its code coverage is high. However, the method lacks practical execution paths and relevant contextual information [2]. Moreover, it is faced with the challenges of code obfuscation and dynamic code loading.
As compared with most static analysis methods, including RiskRanker [9], our static analysis method extracts all of the permissions and APIs and then analyzes them using statistics, instead of looking up sensitive or dangerous codes. Furthermore, we use permissions and APIs from the APK together with the results of dynamic detection to address the problem of dynamic code loading.
Dynamic Analysis. Dynamic analysis is performed by executing applications and observing their behaviors while they are running. It is able to avoid the problems of code obfuscation and dynamic code loading. Dynamic analysis based on a Dalvik VM is widely used for taint analysis. TaintDroid [6], which performs taint tracking by modifying the Dalvik VM, was the original dynamic taint analysis method. Based on TaintDroid, several researchers proposed different methods [18] [26]. However, all of these methods have the same shortcoming: it is difficult to analyze native codes. To solve this problem, VMI-based research studies were presented. For example, DroidScope [22] runs the entire Android system on a QEMU VM in order to seamlessly reconstruct the semantics of the OS and the Java layer. These approaches need to be run in a simulated environment, and it is unlikely that they will be ported to real devices. They are unable to obtain real behaviors of applications and are faced with problems such as anti-forensic techniques.
In this paper, we propose a new method of dynamic analysis based on a kernel to detect the behaviors of applications. As compared with the approaches above, the most important characteristic of our method is that it can be used on real devices to yield reliable results. Furthermore, our method uses kernel-level monitoring mechanisms in order to monitor both Java codes and native codes, while dynamic analysis based on a Dalvik VM is unable to achieve this.
In addition, malware cannot find our method, and thus cannot avoid detection, because our method runs at the kernel level and owns the highest-level permission, whereas most applications can own only lower-level permissions. Although Jarvis also operates in the Linux kernel, its main goal is to bridge the semantic gap between high-level Android APIs and low-level system calls, not to analyze application behaviors. In addition, our method transforms the detection results into a behavior graph, and thus users can understand the analysis results more easily.

VI. DISCUSSION
In the study described in this paper, we implemented a system of real-time behavior analysis and identification for Android applications. The system can be improved to enhance security and accuracy in the future. Firstly, the process of identification is completed through the network, and thus it is faced with severe security issues [4] [5] [10] [21] such as information leakage. In the future, we can create an effective algorithm to select the best relay to assist secure transmission, as in [7] [8]. Moreover, we can ensure transmission security by implementing data encryption and decryption. To improve the system performance, we can also apply cache techniques [20]. If the identification results are pre-stored at the relay nodes around the user, the data can be transmitted directly from the relays to the user, instead of going through the identification process again. This is an effective way to reduce the transmission load and to speed up the transmission of identification results. Considering that a large number of users are likely to identify applications at the same time, the cache techniques [13] should be improved to deal with such a situation. However,


outdated channel state information (CSI) [13] may be a tricky problem for us.
Secondly, the limitations of the dynamic analysis method in our system are essentially the same as those of traditional dynamic analysis methods. On the one hand, dynamic analysis methods usually utilize custom automated tools to trigger events, or the users themselves trigger events. Thus, it takes a long time to analyze a large number of applications. On the other hand, there are many logic branches in an application, and this is likely to give rise to a path explosion problem. In addition, our system is designed mainly for Android system 4.4.2 and is unable to deal with parts of services automatically. Therefore, our system may be improved to enhance the accuracy in the future as follows:
1) We should attempt to apply the corresponding AIDL files in different Android systems to help with Binder parsing.
2) The dynamic analysis should be able to extract bound services from the Android system automatically, and automatically analyze these services to provide information for Binder parsing.
3) We can combine static analysis and dynamic analysis to generate more comprehensive graphs.

REFERENCES
[1] Adity and D. Kaur, “Detection and prevention of malicious node using data centric techniques,” International Journal of Emerging Trends and Technology in Computer Science, vol. 5, no. 2, pp. 95-97, Mar. 2016.
[2] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. L. Traon, D. Octeau and P. McDaniel, “FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps,” in Proc. 35th ACM SIGPLAN Conf. Programming Language Design and Implementation, Edinburgh, United Kingdom, 2014, pp. 259-269.
[3] Y. Cheng, et al., “A lightweight live memory forensic approach based on hardware virtualization,” Elsevier Information Sciences, vol. 379, pp. 23-41, Feb. 2017.
[4] X. Du and H. H. Chen, “Security in wireless sensor networks,” IEEE Wireless Communications, vol. 15, no. 4, pp. 60-66, Aug. 2008, DOI. 10.1109/MWC.2008.4599222.
[5] X. Du, M. Guizani, Y. Xiao, and H. H. Chen, “Secure and efficient time synchronization in heterogeneous sensor networks,” IEEE Trans. Vehicular Technology, vol. 57, no. 4, pp. 2387-2394, Jul. 2008, DOI. 10.1109/TVT.2007.912327.
[6] W. Enck, P. Gilbert, B. G. Chun, L. P. Cox, J. Y. Jung, P. McDaniel and A. N. Sheth, “TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones,” in Proc. 9th USENIX Conf. Operating systems design and implementation, Vancouver, BC, Canada, 2010, pp. 393-407.
[7] L. S. Fan, X. F. Lei, N. Yang, T. Q. Duong and G. K. Karagiannidis, “Secure multiple amplify-and-forward relaying with cochannel interference,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 8, pp. 1494-1505, Dec. 2016, DOI. 10.1109/JSTSP.2016.2607692.
[8] L. S. Fan, X. F. Lei, N. Yang, T. Q. Duong and G. K. Karagiannidis, “Secrecy cooperative networks with outdated relay selection over correlated fading channels,” IEEE Trans. Vehicular Technology, vol. 66, no. 8, pp. 7599-7603, Aug. 2017, DOI. 10.1109/TVT.2017.2669240.
[9] M. Grace, Y. Zhou, Q. Zhang, S. H. Zou and X. X. Jiang, “Riskranker: scalable and accurate zero-day android malware detection,” in Proc. 10th Int. Conf. Mobile systems, applications, and services, Low Wood Bay, Lake District, UK, 2012, pp. 281-294.
[10] X. Hei, X. J. Du and S. Lin, “PIPAC: Patient Infusion Pattern based Access Control Scheme for Wireless Insulin Pump System,” in Proc. of IEEE INFOCOM, Turin, Italy, 2013, pp. 3030-3038.
[11] J. Hoffmann, M. Ussath, T. Holz and M. Spreitzenbarth, “Slicing droids: program slicing for smali code,” in Proc. 28th Annual ACM Symposium on Applied Computing, Coimbra, Portugal, 2013, pp. 1844-1851.
[12] A. Kovacheva, “Efficient code obfuscation for Android,” in Proc. Int. Conf. Advances in Information Technology, Bangkok, 2013, pp. 104-119.
[13] X. Z. Lai, J. J. Xia, M. B. Tang, H. C. Zhang and J. H. Zhao, “Cache-aided multiuser cognitive relay networks with outdated channel state information,” IEEE Access, vol. 6, pp. 21879-21887, 2018.
[14] A. Lanzi, D. Balzarotti, C. Kruegel, M. Christodorescu and E. Kirda, “AccessMiner: using system-centric models for malware protection,” in Proc. 17th ACM Conf. Computer and Communications Security, Chicago, IL, USA, 2010, pp. 399-412.
[15] S. Poeplau, Y. Fratantonio, A. Bianchi, C. Kruegel and G. Vigna, “Execute this! Analyzing unsafe and malicious dynamic code loading in Android applications,” in Proc. NDSS Symposium, San Diego, California, USA, 2014.
[16] A. Reina, A. Fattori and L. Cavallaro, “A system call-centric analysis and stimulation technique to automatically reconstruct android malware behaviors,” in Proc. ACM European Workshop on Systems Security, Prague, 2013, pp. 1-6.
[17] J. Sahs and L. Khan, “A machine learning approach to Android malware detection,” in Proc. EISIC, Odense, Denmark, 2012, pp. 141-147.
[18] D. Schreckling, J. Köstler and M. Schaff, “Kynoid: real-time enforcement of fine-grained, user-defined, and data-centric security policies for android,” in Proc. 6th IFIP WG 11.2 Int. Conf. Information Security Theory and Practice: security, privacy and trust in computing systems and ambient intelligent ecosystems, Egham, UK, 2012, pp. 208-223.
[19] L. Wu, X. J. Du and J. Wu, “Effective Defense Schemes for Phishing Attacks on Mobile Computing Platforms,” IEEE Transactions on Vehicular Technology, vol. 65, pp. 6678-6691, Aug. 2016, DOI. 10.1109/TVT.2015.2472993.
[20] J. J. Xia, F. S. Zhou, X. Z. Lai, H. C. Zhang, H. B. Chen, Q. H. Yang, X. Liu, J. H. Zhao, “Cache Aided Decode-and-Forward Relaying Networks: From the Spatial View,” Wireless Communications and Mobile Computing, pp. 1-9, Apr. 2018, DOI. 10.1155/2018/5963584.
[21] Y. Xiao, V. Rayi, B. Sun, X. Du, F. Hu and M. Galloway, “A survey of key management schemes in wireless sensor networks,” Journal of Computer Communications, vol. 20, no. 11-12, pp. 2314-2341, Sep. 2007.
[22] L. K. Yan and H. Yin, “DroidScope: seamlessly reconstructing the OS and dalvik semantic views for dynamic android malware analysis,” in Proc. 21st USENIX Conf. Security symposium, Bellevue, WA, 2012, pp. 29-29.
[23] S. L. Yang and J. P. He, “Research and implementation of web services in Android network communication framework Volley,” in Proc. 11th Int. Conf. Service Systems and Service Management, Beijing, China, 2014, pp. 1-3.
[24] D. H. You and B. N. Noh, “Android platform based linux kernel rootkit,” in Proc. 6th Int. Conf. Malicious and Unwanted Software, Fajardo, Puerto Rico, 2011, pp. 79-87.
[25] F. Yu, S. Anand, I. Dillig and A. Aiken, “Apposcopy: semantics-based detection of Android malware through static analysis,” in Proc. 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, Hong Kong, China, 2014, pp. 576-587.
[26] Y. Zhang, M. Yang, B. Q. Xu, Z. M. Yang, G. F. Gu, P. Ning, X. S. Wang and B. Y. Zang, “Vetting undesirable behaviors in android apps with permission use analysis,” in Proc. ACM SIGSAC Conf. Computer and communications security, Berlin, Germany, 2013, pp. 611-622.
[27] Y. Zhou and X. Jiang, “Dissecting android malware: characterization and evolution,” in Proc. IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 2012, pp. 95-109.
[28] W. Zhu, Y. J. Wang and Z. Xue, “Study on Android rootkit based on VFS,” Information Security and Communications Privacy, vol. 1, pp. 68-69, Jan. 2013.


[29] Z. C. Zhu, S. S. Tong, B. J. Shen, Z. W. Qi, T. Zhang and M. Zhao, “Confused Smali code analysis for safety of Android system,” Computer Engineering and Design, vol. 37, Feb. 2016.
