Malware Detection Using Static Analysis
ABSTRACT
Android is a free, open-source operating system (OS), which allows an in-depth
understanding of its architecture. Therefore, many manufacturers use this OS to
produce mobile devices (smartphones, smartwatches, and smart glasses) under
different brands, including Google Pixel, Motorola, Samsung, and Sony. Notably, the
adoption of this OS has led to a rapid increase in the number of Android users. However,
unethical authors tend to develop malware for these devices for wealth, fame, or private
purposes. Although practitioners conduct intrusion detection analyses, such as static
analysis, there is an inadequate number of review articles discussing the research efforts
on this type of analysis. Therefore, this study discusses the articles published from 2009
until 2019 and analyses the steps in static analysis (reverse engineering, features, and
classification) with a taxonomy. Following that, the open research issues in static analysis
are also highlighted. Overall, this study serves as guidance for novice security practitioners
and expert researchers in proposing novel research to detect malware through
static analysis.
Subjects Artificial Intelligence, Data Mining and Machine Learning, Mobile and Ubiquitous
Computing, Security and Privacy, Operating Systems
Keywords Android, Review, Static analysis, Machine learning, Features, Malware
Submitted 16 November 2020
Accepted 13 April 2021
Published 11 June 2021
Corresponding authors Rosmalissa Jusoh, [email protected]; Ahmad Firdaus, [email protected]
Academic editor Sedat Akleylek
Additional Information and Declarations can be found on page 39
DOI 10.7717/peerj-cs.522
Copyright 2021 Jusoh et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS
How to cite this article Jusoh R, Firdaus A, Anwar S, Osman MZ, Darmawan MF, Ab Razak MF. 2021. Malware detection using static
analysis in Android: a review of FeCO (features, classification, and obfuscation). PeerJ Comput. Sci. 7:e522
http://doi.org/10.7717/peerj-cs.522

INTRODUCTION
Mobile devices, such as smartphones, iPads, and computer tablets, have become everyday
necessities to perform important tasks, including education, paying bills online, bank
transactions, job information, and leisure. Based on the information from an online
mobile device production website, Android is one of the popular operating systems (OS)
used by manufacturers (Rayner, 2019; Jkielty, 2019). The open-source platform in Android
has made it easier for manufacturers to produce Android devices of various
sizes and types, such as smartphones, smartwatches, smart televisions, and smart glasses. In
recent years, the number of distinct Android devices available worldwide
has increased from 38 in 2009 to over 20,000 in 2016 (Android, 2019a). As a result of the
demand for this Android OS, recent statistics from Statista revealed that the number
of Android malware increased to 26.6 million in March 2018 (Statista, 2019). Moreover,
McAfee discovered a malware known as Grabos, which compromises Android devices and
bypasses Google Play Store security (McAfee, 2019). It was also estimated that 17.5 million
Android smartphones had downloaded the Grabos mobile malware before the applications
were taken down.
Mobile malware is designed to disable a mobile device, allow an attacker to
remotely control the device, or steal personal information (Beal, 2013). Moreover,
these malicious acts are able to run stealthily and bypass permissions if the Android kernel
is compromised by mobile malware (Ma & Sharbaf, 2013; Aubrey-Derrick Schmidt et
al., 2009b). In September 2019, a total of 172 malicious applications were detected on
Google Play Store, with approximately 330 million installations. According to researchers,
the malicious components were hidden inside functional applications. Once such an
application is downloaded, popup advertisements appear and keep
appearing even after the application is closed (O’Donnell, 2019). To detect this
malware, security practitioners conduct malware analysis, which aims to study the
malware characteristics and behaviour. There are three types: dynamic, static, and hybrid
analysis. Table 1 compares static, dynamic, and hybrid analysis in previous
research. Specifically, dynamic analysis studies the execution and
behaviour of the malware (Enck, 2011; Yaqoob et al., 2019). However, dynamic analysis is
incapable of identifying parts of the code that operate outside the monitoring range.
Besides, since dynamic analysis is resource-intensive and demands
high hardware specifications (Enck, 2011), static analysis is an alternative for detecting
malware. It examines malware without executing or running the
application. Additionally, this analysis is able to identify more accurately malware that
would act only under unusual conditions (Castillo, 2011). This is because static analysis
examines all parts of a program, including parts that are excluded in dynamic analysis.
Furthermore, static analysis is able to detect unknown malware just as dynamic analysis
could (Yerima, Sezer & McWilliams, 2014) while requiring fewer resources.
To integrate the characteristics of the static and dynamic methods, a three-layer detection
model called SAMAdroid, which combines static and dynamic characteristics, was proposed
by Saba Arshad et al. (2018). Mobile Sandbox by Spreitzenbarth et al. (2015) uses the
results of static analysis to guide the dynamic analysis and finally perform
classification. The hybrid analysis technique helps to improve accuracy, but
it also has a major drawback: the time and space consumed by the huge number of
malware samples to be detected and analyzed (Fang et al., 2020; Alswaina & Elleithy, 2020).
Table 2 presents the past review articles on Android, with Feizollah et al. (2015)
specifically focusing on features, including static, dynamic, hybrid, and metadata. It
summarizes the features preferred by researchers in their analysis. Comparatively, this study
places emphasis on classification and obfuscation in addition to features. Subsequent
reviews, namely Sufatrio et al. (2015) and Schmeelk, Yang & Aho (2015), highlighted the
survey, taxonomy, challenges, advantages, and limitations of the existing research in the
Android security area, and the techniques in static analysis research on Android. However,
compared to the current review, the aforementioned reviews only presented a few features
and information on static analysis. In the Android permission category (Fang, Han & Li,
SURVEY METHODOLOGY
Methodology
This section describes the method used to retrieve the articles related to malware detection
using static analysis for Android. It covers the use of Web of Science to run the review, the
eligibility and exclusion criteria, the steps of the review process (identification, screening,
and eligibility), and the data analysis.
Identification
The review was performed based on the main journal database in the Web of Science
(WoS). This database covers more than 256 disciplines with millions of journals regarding
the subjects related to network security, computer system, development, and planning.
It also stores over 100 years of comprehensive backfile and citation data established by
Clarivate Analytics (CA), which are ranked through three separate measures, namely
citations, papers, and citations per paper. The search strings in the CA database were
“static analysis”, “malware”, and “Android”.

Table 2 Comparison with previous review articles. Summarization of previous related review articles in detecting malware.
Reference | Title | Year | Citations
Ma & Sharbaf (2013) | Investigation of Static and Dynamic Android Anti-virus Strategies | 2013 | 9
Fang, Han & Li (2014) | Permission-based Android Security: Issues and Countermeasures | 2014 | 132
Feizollah et al. (2015) | A Review on Feature Selection in Mobile Malware Detection | 2015 | 172
Sufatrio et al. (2015) | Securing Android: A Survey, Taxonomy, and Challenges | 2015 | 146
Schmeelk, Yang & Aho (2015) | Android Malware Static Analysis Techniques | 2015 | 21
Pan et al. (2020) | A Systematic Literature Review of Android Malware Detection Using Static Analysis | 2020 | 1
This paper | Malware Detection using Static Analysis for Android: A Review and Open Research Issue | Current paper |
Criterion | Marks
Dataset | X X X
Reverse engineer tools | X X X
All static features | X X X X
All classifications (Machine learning, deep learning, graph, and others) | X
Obfuscation constraints and methods to overcome it | X
There were 430 records identified through database searching. These journals and
conferences are mainly from Computers & Security and IEEE Access, which are listed
in Table 3. The studies related to Android malware detection using
static analysis are collected in the reference section, where they make up a small
proportion of the primary studies. All the studies related to the search terms are taken into
account, and the search range is from January 2009 to December 2019.
Screening
Experimental articles on static analysis were identified, and unrelated articles were omitted.
Initially, the search was restricted to journal articles and excluded review
articles, books, and conference proceedings. To focus specifically on static analysis, the
articles that combined both static and dynamic analyses were removed. Another
selection criterion was the use of English; all non-English articles were
removed to avoid any difficulty in translation. The selected articles were published
from 2009 to 2019, a duration of 10 years. This
duration was suitable for exploring the evolution of research in security areas. Apart from
that, the Android platform was the focus of this study.
Eligibility
Figure 1 depicts the review process, which involved four steps: identification, screening,
eligibility, and analysis. The review was performed in mid-2019. Based on previous
studies, the process used similar keywords related to malware detection, static analysis,
and security. After the identification process, we removed any duplicated articles. During
the screening process, we found 375 documents and removed those unrelated to the area
of interest, which left 172 articles. Lastly, we used
150 articles for the review (Shaffril, Krauss & Samsuddin, 2018).
Static analysis
Mobile malware compromises Android devices (smartphones, smartwatches, and smart
televisions) for wealth, data theft, and personal purposes. Examples of mobile
malware include root exploits, botnets, worms, and Trojans. To detect malware, most
security practitioners perform two types of analysis: dynamic and static.
Specifically, dynamic analysis detects malware by executing
malware and benign applications to monitor and differentiate their behaviours. However,
monitoring all behaviours is costly and requires high specifications in terms of device
memory, CPU, and storage. Furthermore, the malware acts on a device at a certain
time or whenever the attacker decides. Accordingly, as dynamic analysis only
monitors behaviours within a certain time range based on the research period, numerous
malware activities outside the research period might be missed (Feizollah et al., 2013;
Yerima, Sezer & Muttik, 2015; Wei et al., 2017). Furthermore, dynamic analysis requires a
separate and closed virtual environment to run the malware and observe its behaviour on
the system. However, such an isolated setup makes dynamic analysis impractical on the
Android platform due to the increase in power and memory consumption. As power
and memory are the most pressing constraints of Android devices, static analysis is the
alternative to dynamic analysis.
Static analysis is a category of analysis that investigates the code of a malware
application and examines all activities in the application within an unlimited range of time,
without executing the application (Chess & McGraw, 2004). The main step of the static
analysis procedure is the reverse engineering process, which retrieves the whole code and
further scrutinises the structure and substance within the application (Sharif et al., 2008;
Chang & Hwang, 2007; Aafer, Du & Yin, 2013). Therefore, this analysis can examine the
overall code with low memory requirements and minimal CPU usage.
Additionally, the analysis process is fast because the application is not executed. With this
analysis, unknown malware can also be identified, with detection accuracy enhanced through
machine learning approaches (Narudin et al., 2014; Feizollah et al., 2013). Table 4 presents
the advantages and disadvantages of dynamic and static analyses.
Many researchers have published work using static approaches for malware detection
on the Android platform. The static approach in turn contains a number
of approaches. For example, there is the signature-based approach, while other approaches
depend on the detection and classification of the source code. Signature-based detection
relies on malware signatures that are determined and
arranged in advance of inspection (Samra et al., 2019). However, the signature-based approach
is not able to detect unknown malware, even though a signature is a set of features that
uniquely differentiates the executable code (Gibert, Mateu & Planes, 2020).

Table 4 Advantages and limitations of dynamic, static, and hybrid analyses.
Advantages
Dynamic analysis: Able to detect benign applications that abruptly transform into malware
during execution.
Static analysis: Reverse engineering takes a short amount of time. The overall code is
examined, followed by the identification of a possible action. Low resources (e.g., CPU,
memory, network, and storage); therefore, this analysis is suitable for mobile devices
equipped with low specifications.
Limitations
Dynamic analysis: High resources (e.g., CPU, memory, network, and storage). Higher time
consumption to run the application for further analysis and exploration. Possibly omits the
malware activities outside the analysis range. Difficulty in detecting applications that can
hide malicious behaviour when operated. The investigation is continued to determine the
minimal features (e.g., traffic and memory) to detect malware.
Static analysis: Inability to detect a normal application that promptly transforms into
malware. Obfuscation. The investigation is continued to determine the minimal features
(e.g., permission, a function call, and strings) to detect malware.
Hybrid analysis: Waste of time. Requires more space for a huge number of malware samples.
Obfuscation is one of the obstacles in static analysis. It is used by malware authors
in their malicious software to evade intrusion detection or antivirus systems (Wei et
al., 2017). Examples of obfuscation methods are renaming the code, adding
unnecessary code, and encrypting strings. Therefore, security practitioners need to
overcome obfuscation to improve their detection results. Accordingly, the alternatives
adopted by security practitioners are presented in ‘Obfuscation’.
Table 4 shows that static and dynamic analyses share a limitation: both still need to
determine the ideal features in a minimal amount. In detecting malware, features refer
to the attributes or elements used to differentiate an application, which may either be malware
or benign. Security practitioners face obstacles in investigating various features
in all types of categories (e.g., permission, API, directory path, and code-based) while
simultaneously needing to reduce these features. Notably, determining the ideal features in
a minimal amount is crucial to enhance the accuracy of the analyses (e.g., the accuracy of
the predictive model) and reduce data and model complexity (Feizollah et al., 2015).
Figure 2 illustrates the static analysis operation, which consists of several steps. The first
step is the acquisition of benign and malware datasets of Android applications,
each with the (.apk) filename extension. This is followed by reverse engineering
these applications to retrieve the code by extracting a few folders from each
.apk file, which consist of nested files with code (Java or smali). Furthermore, one
.apk would comprise approximately a thousand lines of code. Therefore, with a total of
1,000 applications in one dataset, security practitioners are required to scrutinise
on the order of a million lines of code. With the completion of the reverse engineering, an
analysis is conducted, which involves features. Features consist of a series of application
Dataset
Figure 3 illustrates the sources of Android malware datasets. Notably, the
majority of the datasets were obtained from universities. The datasets are in the form
of Android application packages with the .apk filename extension.
Malgenome (Anonymous, 0000d) is an Android malware dataset that was made
publicly available with permission from its administrator. These malware samples,
which were collected by North Carolina State University (NCSU) from August 2010 to
October 2011, cover multiple families of malware, including botnets and root exploits.
The characterization of the malware families was based on the method of installation,
the way the malware carried the malicious payloads, and its method of activation.
Androzoo (Allix et al., 2018; du Luxembourg, 2016) is another dataset, consisting of
more than three million Android applications (.apk). This dataset
originates from the University of Luxembourg and is offered to the community for research
purposes and to further explore notable developments in the detection of malware that
damages Android. The Drebin (Technische Universität Braunschweig, 2016) dataset also
makes Android malware publicly available under strict requirements. A university in
Braunschweig, Germany, collected 5,560 samples from 179 families. The time
range of the malware is from August 2010 to October 2012. The university
project, known as MobileSandbox, was an initiative to acquire
samples for academia and industry.
Android malware dataset (AMD) is a public Android malware dataset from the
University of South Florida, which consists of 24,650 samples in 71 categorised families.
To obtain this dataset, the user is required to acquire permission from the university and
provide authentic information with evidence. Academia and industry are allowed
to use these samples for research purposes.
The Contagio (MilaParkour, 2019) dataset focuses on mobile
malware, with the condition that the user should submit one sample to obtain another sample.
It provides a dropbox for users to share their mobile malware samples. According to
its blogspot (MilaParkour, 2019), the name of the administrator of this dataset is Mila
Parkour, who is reachable only through email. Based on Table 5, which presents the
research articles and the respective datasets, it can be seen that the dataset providers
receive significant attention from other universities and the industry. It is hoped that this
sharing will enhance the security of Android devices and their users over time.

Table 5 Datasets and the articles that used them.
Dataset | Dataset reference | References of the articles with the use of the respective datasets
Malgenome | Anonymous (0000d) | Yerima, Sezer & McWilliams (2014); Firdaus et al. (2017); Firdaus & Anuar (2015); Firdaus et al. (2018)
Drebin | Anonymous (0000c) | Firdaus et al. (2017); Firdaus et al. (2018); Firdaus et al. (2018)
Android malware dataset (AMD) | | Badhani & Muttoo (2019)
Contagio | MilaParkour (2019) | Feldman, Stadther & Wang (2014); Islamic & Minna (2015)
Androzoo | Université du Luxembourg (2018) | Razak et al. (2019); Firdaus et al. (2017); Razak et al. (2018); Firdaus et al. (2017)

Figure 4 Reverse engineer tools for static analysis. These are examples of reverse engineering
tools that previous researchers have used to extract the code of malware.
Reverse engineer
Static analysis is an activity to investigate the code of an application without executing it. In
order to investigate, security practitioners apply the reverse engineering method. This
method works backwards from the executable file to its source code (Dhaya & Poongodi, 2015).
The reverse engineering process loads the executable into a disassembler to discover what the
program does. Figure 4 illustrates the tools used to perform reverse engineering,
which were adopted by security practitioners to identify Android malware. Table 6
lists the tools adopted in the respective articles.
Features
Once the researchers reverse engineer the executable file using specific tools, they need to
select features from the source code. Feature selection is important in order to increase
the accuracy of the detection system (Feizollah et al., 2015; Chanda & Biswas, 2019; Klaib,
Sara & Hasan, 2020). Figure 5 presents the taxonomy of multiple static features. The next
sections give the details of each type of static feature.
Figure 5 Taxonomy of multiple static features. Each static feature was identified from the
various experiments done using the specific tools and methods.

Advertisement libraries. Since most Android applications are available for free
download, Android developers need to include advertisement libraries (ad libraries) in
the free application for financial purposes. During the run-time of the application, the
ad libraries transfer data regarding users’ activities. The developer then
receives an incentive based on certain metrics of the information. Adrisk (Grace et al., 2012a)
scrutinised and measured the risk of the code of the ad libraries to detect applications
that may harm the users.
Apk, dex and xml properties. Several security practitioners adopted features that
consist of .apk file properties. Two experiments (Kang et al., 2015; Zhou
et al., 2012) examined the authors of malware, because a significant number of Android
malware samples are written by the same person. Therefore, the features include the serial
number of the author, author’s information, name, contact and organization information,
developer certification, author’s ID, and public key fingerprints of the author. Other features
highlighted in this section are the application name, category, package, description, rating
values, rating counts, size, number of zip entries, and common folders (Samra, Kangbin &
Ghanem, 2013; Shabtai, Fledel & Elovici, 2010).
Directory path. A directory path gives access to a specific folder in the operating system
(OS). Security practitioners found that attackers incorporated directory paths
to sensitive folders in their malware. Meanwhile, several paths related to the Android
kernel directory were identified by another study (Firdaus & Anuar, 2015), such as
‘data/local/tmp/rootshell’, ‘/proc’, and ‘/system/bin/su’.
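As an illustration of how such paths can be turned into features, the following minimal sketch checks whether each path occurs in a piece of decompiled code. The watch-list contains only the three example paths quoted above; the function name path_features and the sample code string are hypothetical and are not taken from the cited study.

```python
# Watch-list: the three example paths quoted in the text (Firdaus & Anuar, 2015).
SENSITIVE_PATHS = ("data/local/tmp/rootshell", "/proc", "/system/bin/su")


def path_features(code_text):
    """Map each watched path to 1 if it occurs in the code text, else 0."""
    return {path: int(path in code_text) for path in SENSITIVE_PATHS}


# Hypothetical line of decompiled Java code, used only for demonstration.
sample = 'Runtime.getRuntime().exec("/system/bin/su -c id");'
features = path_features(sample)
print(features)
```

A plain substring test is used here because paths are distinctive strings; a real pipeline would run this over every file recovered by reverse engineering.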
Commands. Two types of commands are available, namely (1) root commands and (2)
botnet commands. Specifically, several root commands of the Unix machine were identified
by Firdaus & Anuar (2015), such as ‘cp’, ‘cat’, ‘kill’, and ‘mount’. Normally, these commands
are used by administrators to execute higher privileged actions in the Unix machine.
Since Android is built on the Linux kernel, a Unix-like system, attackers include
root commands in their malware to control the victim’s Android devices. Therefore, the
identification of root commands is crucial in investigating malware.
The second type of command is a botnet command. One type of malware,
known as a mobile botnet, includes botnet commands in its code, such
as ‘note’, ‘push’, ‘soft’, ‘window’, ‘xbox’, and ‘mark’. The attacker uses these commands
to communicate with the command and control (C&C) server. Droidanalyzer (Seo
et al., 2014) combines API, root command, and botnet command into a set of features to
detect root exploits and mobile botnets.
Apart from ad libraries, certain researchers inspect Android Debug Bridge (ADB)
code. ADB (Android Developers, 2017) is a tool that provides a command-line access
facility for users or developers to communicate with Android mobile devices. This facility
allows the attacker to install unwanted applications and execute various Unix commands
on the victim’s device. Therefore, RODS (Firdaus et al., 2018), a root exploit
detection system, detects root exploit malware with ADB features.
Geographic location. Geographic location is a feature that identifies the origin of the
application. The geographic detector was identified as one of the features in research
by (Steven Arzt et al., 2014). Since 35% of the mobile malware families appeared to
originate from China and 40% from Russia, Ukraine, Belarus,
Latvia, and Lithuania, it was crucial to consider geographic location as one of the
features for the detection of Android malware. For this reason, researchers increased the
risk signal for applications originating from the aforementioned countries.
Manifest file. An Android application is built on top of the application framework, which
provides an interface for the user. The program is distributed as an Android application
package file in the (.apk) format, which is also used to install an application on Android-based
mobile devices. It consists of the META-INF, resource, assets, and library directories, classes.dex,
resources.arsc, and the AndroidManifest.xml file. One of the files, AndroidManifest.xml
(manifest file), is an essential file containing various features, such as permission,
intent, hardware component, and components of the application (activities, services,
broadcast receivers, and content providers) (Android, 2015).
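Because an .apk file is a ZIP archive, its top-level contents can be listed with standard tools. The sketch below is a minimal illustration under stated assumptions: it builds a placeholder archive in memory (dummy bytes, not a real application) so that it runs without any sample file, and the helper names list_apk_entries and make_toy_apk are hypothetical.

```python
import io
import zipfile


def list_apk_entries(apk_bytes):
    """Return the sorted entry names stored in an APK, which is a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return sorted(apk.namelist())


def make_toy_apk():
    """Build a placeholder archive in memory; the contents are dummy bytes."""
    names = (
        "AndroidManifest.xml",
        "classes.dex",
        "resources.arsc",
        "META-INF/MANIFEST.MF",
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        for name in names:
            apk.writestr(name, b"placeholder")
    return buffer.getvalue()


entries = list_apk_entries(make_toy_apk())
print(entries)
```

For a real package, the bytes would be read from the .apk file on disk. Note that the manifest inside a real package is stored in a binary XML encoding, so it has to be decoded by a reverse engineering tool before it can be read as text.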
(a) Permission
Permission is a unique security mechanism for Android devices. To enable a
permission, the user needs to grant it to the application during the installation period.
However, many users accidentally enable certain permissions, which gives access to
sensitive security-relevant resources. Therefore, permission features were examined in
many studies. Several studies used permissions to measure the
risk of an application and to further identify it as malicious (Razak et al., 2018;
Razak et al., 2019). Some other studies, such as (Hao Peng et al., 2012; Samra, Kangbin &
Ghanem, 2013; Walenstein, Deshotels & Lakhotia, 2012; Huang, Tsai & Hsu, 2012; Sahs &
Khan, 2012; Sanz et al., 2013; Talha, Alper & Aydin, 2015; Aung & Zaw, 2013), used the
permission features as the inputs for machine learning prediction.
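To illustrate how permissions become machine learning inputs, the sketch below extracts the requested permissions from a manifest and encodes them as a binary vector. It assumes the manifest has already been decoded to plain-text XML by a reverse engineering tool. The three-permission vocabulary, the package name, and the function names are hypothetical choices made for this example only.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the android: attribute prefix in manifest files.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Hypothetical feature vocabulary, chosen for illustration only.
VOCABULARY = (
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
)


def extract_permissions(manifest_xml):
    """Return the set of permission names requested via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    return {
        element.get(ANDROID_NS + "name")
        for element in root.iter("uses-permission")
    }


def to_vector(permissions):
    """Encode a permission set as a 0/1 vector over VOCABULARY."""
    return [int(name in permissions) for name in VOCABULARY]


# Toy decoded manifest, written for this example.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
</manifest>"""

permissions = extract_permissions(MANIFEST)
vector = to_vector(permissions)
print(vector)
```

Each application then contributes one such vector, labelled as malware or benign, to the training data of a classifier.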
(b) Intent
The intent is declared in the manifest file and allows a component of the application
to request certain functionality from a component of another application. For
example, application A can use a component of application B to manage
photos on the device even though application A does not include such a component. Since
this feature can be exploited by attackers for malicious activities, several experiments used
intent (declared in the manifest file) as one of the features for the detection of malware,
such as (Feizollah et al., 2017; Fazeen & Dantu, 2014).
Network address. Access to the Internet is essential for attackers to retrieve private
information of the victim, change the settings, or execute malicious commands. This
process requires the incorporation of a Uniform Resource Locator (URL) or network
address in the malware code. Examples of sensitive URLs include the Android Market
on Google Play, Gmail, Google calendar, Google documents, and XML schemas. These
features were used in Luoshi et al. (2013), Apvrille & Strazzere (2012), and Mohd Azwan
Hamza & Ab Aziz (2019) for malware detection.
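A minimal sketch of network address extraction is shown below. It uses a deliberately simple regular expression to find URLs in code text and then keeps the host names. The pattern, the function name, and the sample string (which uses reserved example domains) are assumptions made for illustration and do not reproduce the cited works.

```python
import re
from urllib.parse import urlparse

# Simple pattern: a URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_hosts(code_text):
    """Return the sorted host names of the URLs embedded in the code text."""
    hosts = set()
    for url in URL_PATTERN.findall(code_text):
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Hypothetical code fragment, used only for demonstration.
sample = (
    'String a = "http://example.com/gate.php?id=1"; '
    'String b = "https://mail.example.org";'
)
hosts = extract_hosts(sample)
print(hosts)
```

The resulting host names can be compared with a list of sensitive addresses or used directly as features.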
by Zheng, Sun & Lui (2013), Medvet & Mercaldo (2016), Faruki et al. (2013) and Zhao et al.
(2019) to detect Android malware in the static analysis. Further examples of the features
in this section are method (Kim et al., 2018), opcode (Zhao et al., 2019), byte stream @
byte block (Faruki et al., 2013), Dalvik code (Gascon et al., 2013), and code involving
encryption (Gu et al., 2018). The selection of the features by security practitioners is
followed by classification. This process receives the features as input and
determines whether an application is malware or benign (normal).
Figure 6 shows that researchers prefer to investigate permission and API features
compared to others. However, the use of permission features declined from 2013 until
2018. In contrast, the use of API features in experiments increased from six
(2014) to nine (2019). This suggests that the use of API features in static detection
will continue to increase in the following years.
Classification
In the classification process for static analysis, many security analysts use two types of
methods: (1) machine learning (ML) and (2) graph. The following section presents the
ML studies with static features.
set), followed by a prediction of the outputs. Based on a given dataset, the learning process
makes intelligent decisions according to certain algorithms. One type of machine learning
is supervised learning, which uses the data in the training stage to create a function.
Each item of the training data contains an input (features or characteristics) and an output
(class label: malware or benign). The training stage then calculates the
approximate distance between the input and output examples to create a model, which
can classify unknown applications as malware or benign.
There are four types of ML: (1) classical learning, (2) reinforcement learning, (3)
neural network and deep learning, and (4) ensemble method. Figure 7 illustrates the ML
taxonomy, which starts with classical learning.
(a) Supervised Learning
Supervised learning (SL) is a process of learning from previous instances to predict
future classes. Therefore, the prediction of the class label involves the construction of a
concise model from previous experience. The machine learning classifier is then used to
test the unknown class (Kotsiantis, 2007). To detect Android malware with static features,
the SL method is widely used by security practitioners. Accordingly, the previous articles
adopting this method are listed in Table 7.
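As a concrete, simplified illustration of supervised learning on static features, the sketch below classifies a binary feature vector with a k-nearest-neighbour vote under Hamming distance. This classifier is chosen only because it is short; it is not claimed to be the method of any cited article. The training vectors and their feature columns are toy values invented for this example.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors nearest to query."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set invented for illustration. Hypothetical columns:
# [INTERNET, SEND_SMS, READ_CONTACTS, contains '/system/bin/su'].
TRAIN = [
    ([1, 1, 1, 1], "malware"),
    ([1, 1, 0, 1], "malware"),
    ([0, 1, 1, 1], "malware"),
    ([1, 0, 0, 0], "benign"),
    ([0, 0, 0, 0], "benign"),
    ([1, 0, 1, 0], "benign"),
]

prediction_a = knn_predict(TRAIN, [0, 1, 0, 1])
prediction_b = knn_predict(TRAIN, [0, 0, 1, 0])
print(prediction_a, prediction_b)
```

The labelled vectors play the role of the previous instances, and the two query vectors play the role of unknown applications.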
(b) Unsupervised Learning
Unsupervised learning is another type of learning involved in machine learning. It is
a clustering technique where the data is unlabeled, and it has also been used in computer
security areas, including malware detection and forensics (Beverly, Garfinkel & Cardwell,
2011). Clustering refers to the division of a large dataset into smaller data sets whose
members share several similarities. It partitions a given set of objects into a certain number
of clusters (assume k clusters), with one centroid assigned to each cluster. In this case, the
algorithm selects the initial centroids at random from the set of applications, takes each
application from the given dataset, and assigns it to the nearest centroid. Table 7 tabulates
the previous articles that adopted this method.
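The clustering procedure described above resembles k-means, and a minimal sketch under that assumption is given below. The initial centroids are drawn at random from the data, each application is assigned to its nearest centroid, and the centroids are recomputed until the assignment stops changing. The two numeric features per application are hypothetical counts invented for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, rng, max_iter=100):
    """Cluster points into k groups; return (labels, centroids)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return labels, centroids


# Toy data invented for illustration. Hypothetical features per application:
# [number of requested permissions, number of sensitive API calls].
APPS = [[1, 0], [2, 1], [4, 0], [50, 9], [52, 10], [55, 9]]

labels, centroids = kmeans(APPS, k=2, rng=random.Random(0))
print(labels)
```

No class labels are used: the algorithm only groups similar applications, and an analyst still has to decide what each cluster represents.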
(c) Reinforcement learning
A reinforcement learning model consists of an agent, which chooses from a set of actions
A, and an environment, which is described by the state space S (Anderson et al., 2018).
Deep reinforcement learning was introduced as a framework for agents to play Atari games,
in which they often exceed human performance (Volodymyr Mnih et al., 2013; Volodymyr
et al., 2015). Advances in deep learning make it possible to extract high-level features from
raw sensory data, which has led to breakthroughs in computer vision and speech recognition.
In deep reinforcement learning, the agent is required to learn a value function in an
end-to-end way, which takes raw pixels as input and predicts the expected reward for each
action.
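The idea of a value function that scores each action can be illustrated with tabular Q-learning, which is a much simpler stand-in for the deep network described above (a table replaces the neural network, and symbolic states replace raw pixels). The state names, action names, learning rate, and discount factor below are hypothetical values chosen only for the sketch.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update and return the new value of (state, action)."""
    # Best value the agent currently expects from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Move the estimate towards reward + discounted future value.
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Hypothetical action set A and states from the state space S.
actions = ["add_padding", "rename_section"]
q = defaultdict(float)  # every unseen (state, action) pair starts at 0.0

value = q_update(q, "detected", "add_padding", reward=1.0,
                 next_state="evaded", actions=actions)
print(value)
```

Repeating the update over many interactions lets the table converge towards the expected reward of each action in each state. Deep reinforcement learning approximates the same quantity with a network instead of a table.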
Graph. The use of a graph is another method in machine learning and pattern recognition,
which is performed by investigating the data flow and control flow of the code. It is also
capable of identifying unknown malware through the examination of the flow of the code.
This method is preferred by security analysts because the flow remains the same even when
malware authors change the API calls to avoid intrusion detection systems. The types of
analysis in the graph method include the call graph, inter-component call graph (ICCG),
control-flow graph (CFG), and dependence graph. Table 9 lists the previous works of
research on static malware detection using the graph method.
A call graph (also known as a flow graph) is a graph representing the control and data
flow of the application, which investigates the exchange of information between
procedures. A node in the graph represents a procedure or function, and an edge (x, y)
indicates that procedure x calls procedure y. Apposcopy (Feng et al., 2014) presents a new
form of call graph, known as the inter-component call graph (ICCG), to match malware
signatures. The ICCG is a directed graph whose nodes are the components of an application,
and Apposcopy builds it from a call graph and the results of a pointer analysis. The
objective of Apposcopy is to capture the inter-component communication (ICC), call, and
flow relations.
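A call graph of the kind described above can be represented as an adjacency mapping in which each key is a procedure and each value lists the procedures it calls. The sketch below is illustrative only and is not Apposcopy's implementation; the procedure names are hypothetical. It answers a typical static-analysis question: which procedures can be reached from a given entry point?

```python
from collections import deque


def reachable(call_graph, entry):
    """Return every procedure reachable from `entry` by following call edges."""
    seen = {entry}
    queue = deque([entry])
    while queue:
        caller = queue.popleft()
        # An edge (caller, callee) means `caller` calls `callee`.
        for callee in call_graph.get(caller, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical call graph extracted from a decompiled application.
call_graph = {
    "onCreate": ["loadConfig", "startSync"],
    "startSync": ["readContacts", "sendHttp"],
    "loadConfig": [],
    "unusedHelper": ["sendSms"],
}
reached = reachable(call_graph, "onCreate")
print(sorted(reached))
```

In this toy graph, `readContacts` and `sendHttp` are both reachable from `onCreate`, whereas `sendSms` is only called from a helper that the entry point never reaches.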
Another graph, the control flow graph (CFG), is also applied by many security analysts
to investigate malware programmes. Woodpecker (Grace et al., 2012b) builds the CFG
starting from each entry point (activity, service, receiver, content provider) declared
in the androidmanifest.xml file. Furthermore, the execution paths that start from a public
interface or service are discovered through the flow graph. Woodpecker considers such a
path a capability leak if it is neither guarded by a permission requirement nor prevented
from being invoked by another, unrelated application. The
same graph was applied in subsequent works of research, namely Flowdroid (Steven Arzt
et al., 2014), Dendroid (Suarez-Tangil et al., 2014; Sahs & Khan, 2012), Asdroid (Huang
et al., 2014a), Anadroid (Shuying Liang et al., 2013), Adrisk (Grace et al., 2012a), and
Dexteroid (Junaid, Liu & Kung, 2016a).
Another graph is the dependency graph, which illustrates the dependencies of several
objects on each other. An example can be seen in dead code elimination, in which the
graph identifies the dependencies between operations and variables. If no operation
depends on certain variables, these variables are considered dead and should be deleted.
The studies that adopted this type of graph are CHEX (Lu et al., 2012), Dnadroid
(Crussell, Gibler & Chen, 2012), and Droidlegacy (Deshotels, Notani & Lakhotia, 2014b;
Zhou et al., 2013).
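The dead code elimination example above can be sketched with a small dependency analysis. The representation is an assumption made for illustration: each assignment is a pair of a target variable and the variables it reads, and `outputs` names the variables whose values are actually used by the programme.

```python
def dead_variables(assignments, outputs):
    """Return the variables that no live operation depends on."""
    live = set(outputs)
    changed = True
    while changed:
        changed = False
        for target, operands in assignments:
            # A live target keeps every variable it reads alive as well.
            if target in live:
                for operand in operands:
                    if operand not in live:
                        live.add(operand)
                        changed = True
    defined = {target for target, _ in assignments}
    return defined - live


# Hypothetical straight-line programme: (target, variables it depends on).
assignments = [
    ("a", []),
    ("b", ["a"]),
    ("c", ["a"]),       # computed, but only d reads it
    ("d", ["c"]),       # computed, but nothing reads it
    ("result", ["b"]),
]
dead = dead_variables(assignments, outputs=["result"])
print(sorted(dead))
```

Here `c` and `d` are reported as dead, because following the dependencies backwards from `result` only ever reaches `b` and `a`.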
Others. Besides machine learning and graph, several security practitioners adopted
different methods, such as Normalized Compression Distance (NCD). Adopted in the
studies by Desnos (2012b) and Paturi et al. (2013), this method measures the similarities
between malware samples and represents them in the form of a distance matrix. Although
malware evolves over time, some behaviour patterns remain similar across samples.
Calculating the similarities using NCD therefore helps to identify malware samples that
lie at a small distance from each other.
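NCD is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input and xy is the concatenation of x and y. The sketch below uses zlib as the compressor, which is an assumption; the cited studies may use a different compressor. The byte strings stand in for code extracted from applications and are invented for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical samples: a known malware body, a slightly modified variant,
# and an unrelated byte sequence.
known = b"invoke-virtual sendTextMessage; const-string premium; " * 40
variant = known + b"nop; nop; "
unrelated = bytes(range(256)) * 4

print(ncd(known, variant), ncd(known, unrelated))
```

Values near 0 indicate that one sample adds little information beyond the other, and values near 1 indicate unrelated content. Computing `ncd` for every pair of samples yields the distance matrix mentioned above.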
A study known as DelDroid (Hammad, Bagheri & Malek, 2019) implemented a method
called the Multiple-Domain Matrix (MDM). This method models a complex system that
spans multiple domains and is based on the Design-Structure Matrix (DSM) model. An
MDM is formed by connecting DSM models with each other. The study initialised multiple
domains in the MDM to represent the architecture of an Android system for privilege
analysis. To illustrate, incorporating certain definitions into the MDM representation of
the architecture enables DelDroid to identify the communication of the application that
may result in an unauthorised malware attack.
Another previous static experiment was conducted on the MD5 signature of the
application to detect malware (Seo et al., 2014). In the first step, the study assigned
the application level C (the lowest level of suspicion), followed by calculation and
cross-referencing against the database of signatures. The application was recorded if the
result was positive, and it was identified as malware if the resulting level of suspicion
was R. The system also examined the files inside the application to find any matching MD5
signature.
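The signature cross-reference described above can be sketched with Python's standard `hashlib` module. This only illustrates MD5 matching against a known-signature set; it does not reproduce the suspicion levels used by Seo et al. (2014), and the file names and contents are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def matching_files(files, known_signatures):
    """Names of files whose MD5 digest appears in the signature database."""
    return [name for name, content in files.items()
            if md5_hex(content) in known_signatures]


# Hypothetical signature database and unpacked application contents.
known_signatures = {md5_hex(b"malicious payload")}
files = {
    "classes.dex": b"malicious payload",
    "res/icon.png": b"benign image bytes",
}
print(matching_files(files, known_signatures))
```

An exact hash only matches byte-identical files, so a single changed byte defeats it. That limitation is one motivation for the similarity-based hashing schemes discussed next.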
Androsimilar (Faruki et al., 2013) applied a statistical similarity digest hashing
scheme, which inspects the similarity of byte streams based on robust statistical features
of malicious samples. It is also a foot-printing method, which identifies the regions of
statistical similarity with known malware. Following that, it generates variable-length
signatures to detect unknown (zero-day) malware.
The next study is DroidMOSS (Zhou et al., 2012), which distinguishes between
repackaged (modified) and original applications. This function is important because many
repackaged Android applications contain malicious activities. The study used a fuzzy
hashing technique to generate a fingerprint of each application, which was then used to
localise and detect any modifications applied to the original application. It then
calculated the edit distance to measure the similarity between the applications. When the
similarity exceeds a certain value, the application is considered a modified sample.
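The edit-distance comparison of fingerprints can be sketched as follows. The Levenshtein distance is standard, but the similarity formula (one minus the distance divided by the longer length), the fingerprint strings, and the 0.7 threshold are assumptions for illustration and are not taken from DroidMOSS.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the fingerprints are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# Hypothetical fuzzy-hash fingerprints of an original and a candidate app.
original_fp = "a1b2c3d4e5"
candidate_fp = "a1b2x3d4e5"
THRESHOLD = 0.7  # assumed cut-off, not a value from the study

score = similarity(original_fp, candidate_fp)
print(score, score >= THRESHOLD)
```

A candidate whose score is at or above the threshold, but whose fingerprint is not identical to the original, would be flagged as a likely repackaged version.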
In another static experiment, Apvrille & Strazzere (2012) adopted a method known as
risk score weighting, which calculates a risk score based on selected features in the
code. When a feature was identified, the score increased according to certain risky
patterns of properties. In particular, the patterns were weighted by how differently they
occur in normal and malware samples. Lastly, the percentage of the likelihood was
calculated. Figure 9 shows that ML and graph were the most popular methods among security
practitioners in static analysis. The graph method was used more often than ML in 2011,
2012, and 2014, whereas ML was preferred over graph in the other years. Overall, this
reveals that both graph and ML are favoured options in static experiments.
A study started to utilise DL (part of ML) in the static experiment in 2019, and it
also combined DL (convolutional neural network, CNN) with the control flow graph
(CFG). Notably, since API was the only feature utilised in that study, many future
opportunities remain to combine different DL classifiers (recurrent neural network (RNN),
generative adversarial network (GAN), or deep belief network (DBN)) with features other
than API and with different types of graph. It is noteworthy that DL could also be
combined with NCD and MDM.
Obfuscation
Static analysis involves reverse engineering, such as decompiling and disassembling, while
malware developers utilise obfuscation to make the decompiling process more difficult and
confusing. Obfuscation is a technique that makes programmes harder to understand, so that
security analysts fail to distinguish between malware and benign applications. Notably, it
is a well-known obstacle for static analysis. Figure 10 illustrates the types of
obfuscation, which include encryption, oligomorphism, polymorphism, and metamorphism
(Moser, Kruegel & Kirda, 2007; You & Yim, 2010).
The encryption method is extensively practised by malware writers. In this case, the
malware writer identifies the important code or strings that would reveal the malware to a
detector or security practitioner. This code is then encrypted and converted to
ciphertext. Furthermore, various algorithms are available to encrypt the code, such as
Caesar, Playfair, Data Encryption Standard (DES), Advanced Encryption Standard (AES),
and Rivest-Shamir-Adleman (RSA). Therefore, for the security practitioner to understand
the behaviour of the malware, the encrypted code must be decrypted using the correct
decryptor (Wei et al., 2017).
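The effect of string encryption on static analysis can be illustrated with the Caesar cipher, the simplest of the algorithms listed above. The API name being hidden and the shift value are hypothetical, and real malware would typically use a stronger cipher. The point is only that the plaintext string no longer appears in the code until the matching decryptor is applied.

```python
def caesar(text: str, shift: int) -> str:
    """Shift ASCII letters by `shift` positions; other characters are unchanged."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)
    return "".join(result)


# Hypothetical sensitive API name that a static scanner might search for.
api_name = "sendTextMessage"
hidden = caesar(api_name, 3)     # what is stored in the application
restored = caesar(hidden, -3)    # what the decryptor recovers at run time

print(hidden, restored)
```

A scanner that searches the stored strings for `sendTextMessage` finds nothing. The analyst must first recover the shift (the decryptor) before the behaviour becomes visible.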
Oligomorphic malware is capable of mutating or changing its decryptor and is able to
generate multiple decryptors, up to hundreds of types (You & Yim, 2010). Consequently,
the security practitioner needs to try different decryptors multiple times until the code
is returned to the normal string. Nevertheless, this type of obfuscation does not affect
the size or shape of the code. Another type of obfuscation is polymorphism, in which the
mutated decryptor affects the size or shape of the code. Compared to oligomorphism, it is
more advanced due to the incorporation of code transposition, register reassignment, dead
code or nop insertion, and armoring. Meanwhile, metamorphism goes beyond the oligomorphic
and polymorphic types because its mechanism has no decryptor. Therefore, it never exposes
a constant body in memory, which increases the difficulty of detecting the malware through
static investigation.
The following obfuscation methods are regularly used by polymorphic and metamorphic
malware (You & Yim, 2010).
(a) Code transposition
Code transposition is a method that reorders the original code without affecting its
behaviour. It can be performed in two ways. The first is to reorder the original code at
random and insert jumps or unconditional branches. However, security practitioners can
detect this obfuscation by removing those jumps or unconditional branches. The second is
to produce new generations by selecting and reordering independent instructions that have
no impact on one another. This second method is challenging for the malware writer to
implement, but it is also difficult for security practitioners to detect.
(b) Register reassignment
Register reassignment is another method of obfuscation, which switches the registers used
by the code from one generation to another. This method does not change the behaviour of
the code, and the programme stays similar to its original state.
(c) Dead-code/nop insertion
Advantage of obfuscation. Although malware writers or attackers adopt obfuscation to
evade detection, obfuscation also offers the following advantages from other points of
view:
(a) Reduction of the size of the application
Google (Android, 2019b) encourages developers to enable shrinking in their release
builds to remove any unused code and resources. Furthermore, because obfuscation shortens
the names of the classes and members in the code, the developer is able to reduce the size
of the application. Notably, the size of the application is a significant concern on
Android handheld devices (smartphones, smart glasses, and smartwatches) with limited
storage and resources.
(b) The difficulty for the malware writer to understand the obfuscated normal application
To develop malware in certain situations, malware writers need to perform reverse
engineering on a normal application before repackaging it. Obfuscating the normal
(benign) application code therefore confuses them and makes it harder to steal private
information and discover application vulnerabilities (Diego et al., 2004).
(c) Security practitioners can detect malware easily
Obfuscation also facilitates the detection of malware by researchers (Nissim et al.,
2014). To illustrate, there are certain situations where malware regularly adopts similar
obfuscation marks, which do not occur in normal applications. Therefore, security
practitioners are able to detect malware by the presence of these marks. Considering all
these advantages and drawbacks, continuous research on obfuscation is crucial to obtain
better results from the detection of malware through static analysis.
Table 11 The detection of malware, which attacks Android OS, based on previous static analysis. To identify the trends in the detection of malware through static
analysis, this section presents a list of previous works of research, which cover all areas (year, features, and classification).
Year Ref Features (•) Classification (X)
AD API A DP C F G M N CB ML DL Graph Other
2018 Kumar, Kuppusamy & Aghila (2018) • X
2018 Wang et al. (2018) • • • • X
2018 Ma Zhaohui et al. (2017) • X
2017 Feizollah et al. (2017) • X
2017 Kim et al. (2018) • X
2017 Zhou et al. (2017) • X
2017 Pooryousef & Amini (2017) • X
2017 Wu et al. (2018) • X
2017 Chang & De Wang (2017) • • X
2016 Junaid, Liu & Kung (2016a) • X
2016 Atici, Sagiroglu & Dogru (2016) • X X
2016 Medvet & Mercaldo (2016) • X
2016 Wu et al. (2016) • X
2016 Nissim et al. (2016) • X
2015 Sheen, Anitha & Natarajan (2015) • • X
2015 Kang et al. (2015) • • • X
2015 Kang et al. (2015) • • • X
2015 Firdaus & Anuar (2015) • • • X
2015 Talha, Alper & Aydin (2015) • X
2015 Junaid, Liu & Kung (2016a) • • X
2015 Lee, Lee & Lee (2015) •
2014 Feng et al. (2014) • X
2014 Huang et al. (2014a) • • X
2014 Steven Arzt et al. (2014) • X
2014 Yerima, Sezer & Muttik (2014) • • • X
2014 Seo et al. (2014) • • MD5 signature
2014 Suarez-Tangil et al. (2014) • X
2013 Aafer, Du & Yin (2013) • X
2013 Lee & Jin (2013) •
2013 Shuying Liang et al. (2013) • • X
2013 Zhou et al. (2013) • • X
2013 Luoshi et al. (2013) • • • X
2013 Peiravian & Zhu (2013) • • X
2013 Samra, Kangbin & Ghanem (2013) • • X
2013 Gascon et al. (2013) • X
2013 Huang, Tsai & Hsu (2012) • X
2013 Aung & Zaw (2013) • X
2013 Faruki et al. (2013) • Similarity digest hashing
2013 Paturi et al. (2013) • Normalised Compression Distance (NCD)
2013 Apvrille & Apvrille (2013) • X
2013 Yerima et al. (2013) • • X
2013 Borja Sanz et al. (2013) • X
2012 Grace et al. (2012a) • X
2012 Wu et al. (2012) • • X
2012 Bartel et al. (2012) • • X
2012 Grace et al. (2012b) • • X
2010 Shabtai, Fledel & Elovici (2010) • • • X
2009 Aubrey-Derrick Schmidt et al. (2009a) • X
Notes.
AD, Advertisement libraries; API, API; A, apk, dex and XML properties; DP, Directory path; C, Commands; F, Function call; G, Geographic; M, Manifest file; N, Network address or URLs; CB,
Code-based.
Jusoh et al. (2021), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.522
changes to the androidmanifest.xml file from a library that the application depends on are
merged into the final androidmanifest.xml file. Other package files fall into the apk,
XML, and dex properties feature.
Besides the combination of DL and graph, ML and graph were also combined in the
studies by Atici, Sagiroglu & Dogru (2016) and Sahs & Khan (2012). Both studies utilised
the same graph, the control flow graph (CFG), and indicated that the combination of ML
and graph improved the detection result. Therefore, future work is suggested to test this
combination with different static features. Other classification methods (Multiple-Domain
Matrix (MDM), MD5 signature, similarity digest hashing, normalized compression distance
(NCD), and fuzzy hashing) were also useful in the detection of malware with static
features. These methods also open up future work on combinations with ML, DL, and graph.
CONCLUSIONS
Following the interest to explore the recent studies in the static analysis, a review
was performed on the existing studies by past security investigators on Android
malware detection, which was explained through phases (reverse engineer, features, and
Funding
This work was supported by the Ministry of Higher Education (MOHE) for
Fundamental Research Grant Scheme (FRGS) with grant number RDU190190,
FRGS/1/2018/ICT02/UMP/02/13, and Universiti Malaysia Pahang (UMP) internal grant
with grant number RDU1803142. The funders had no role in study design, data collection
and analysis, decision to publish, or preparation of the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
Ministry of Higher Education (MOHE).
Fundamental Research Grant Scheme (FRGS): RDU190190,
FRGS/1/2018/ICT02/UMP/02/13.
Universiti Malaysia Pahang (UMP): RDU1803142.
Competing Interests
The authors declare there are no competing interests.
Author Contributions
• Rosmalissa Jusoh conceived and designed the experiments, performed the computation
work, prepared figures and/or tables, and approved the final draft.
• Ahmad Firdaus conceived and designed the experiments, performed the computation
work, authored or reviewed drafts of the paper, and approved the final draft.
• Shahid Anwar and Mohd Zamri Osman performed the experiments, authored or
reviewed drafts of the paper, and approved the final draft.
• Mohd Faaizie Darmawan analyzed the data, prepared figures and/or tables, and approved
the final draft.
Data Availability
The following information was supplied regarding data availability:
This is a review article.
REFERENCES
Aafer Y, Du W, Yin H. 2013. DroidAPIMiner: mining API-Level Features for Robust
Malware Detection in Android. In: Security and privacy in communication networks.
Sydney, NSW, Australia: SecureComm, 86–103.
Abadi M, Agarwal A, Barham P. 2016. TensorFlow: large-scale machine learning on
heterogeneous distributed systems. Available at http://arxiv.org/abs/1603.04467.
Aktas K, Sen S. 2018. UpDroid: updated Android Malware and its familial classification.
In: Gruschka N, ed. Secure IT Systems. NordSec 2018. Lecture Notes in Computer
Science. Vol. 11252. Cham: Springer DOI 10.1007/978-3-030-03638-6_22.
Al-Rfou R, Alain G, Almahairi A, Angermueller C, Bahdanau D. 2016. Theano: A
Python framework for fast computation of mathematical expressions. arXiv.
Available at http://arxiv.org/abs/1605.02688.
Allahham AA, Rahman MA. 2018. A smart monitoring system for campus using zigbee
wireless sensor networks. International Journal of Software Engineering & Computer
System 4(1):1–14 DOI 10.15282/ijsecs.4.1.2018.1.0034.
Allix K, Bissyandé TF, Klein J, Le Traon Y. 2018. AndroZoo: collecting millions of
android apps for the research community. In: MSR ’16 proceedings of the 13th
international conference on mining software repositories, Austin, Texas. 468–471
DOI 10.1145/2901739.2903508.
Alotaibi A. 2019. Identifying malicious software using deep residual long-short term
memory. IEEE Access 7:163128–163137 DOI 10.1109/ACCESS.2019.2951751.
Alsoghyer S, Almomani I. 2019. Ransomware detection system for android applications.
Electron 8(8):1–36 DOI 10.3390/electronics8080868.
Alswaina F, Elleithy K. 2018. Android malware permission-based multi-class
classification using extremely randomized trees. IEEE Access 6:76217–76227
DOI 10.1109/ACCESS.2018.2883975.
Alswaina F, Elleithy K. 2020. Android malware family classification and analysis: current
status and future directions. Electron 9(6):1–20 DOI 10.3390/electronics9060942.
Amit Y, Geman D. 1997. Shape quantization and recognition with randomized trees.
Neural Computation 9(7):1545–1588 DOI 10.1162/neco.1997.9.7.1545.
Anderson HS, Filar B, Roth P. 2017. Evading Machine Learning Malware Detection. In:
BlackHat DC. 6. Available at
https://www.blackhat.com/docs/us-17/thursday/us-17-Anderson-Bot-Vs-Bot-Evading-Machine-Learning-Malware-Detection-wp.pdf.
Anderson HS, Kharkar A, Filar B, Evans D, Roth P. 2018. Learning to Evade Static PE
Machine Learning Malware Models via Reinforcement Learning. arXiv. Available at
http://arxiv.org/abs/1801.08917.