Abstract—Most existing Android malware detection and categorization techniques are static approaches, which suffer from evasion attacks such as obfuscation. By analyzing program behaviors, dynamic approaches are potentially more resilient against these attacks. Yet existing dynamic approaches mostly rely on characterizing system calls, which are subject to system-call obfuscation. This paper presents DroidCat, a novel dynamic app classification technique that complements existing approaches. By using a diverse set of dynamic features based on method calls and ICC Intents, without involving permissions, app resources, or system calls, while fully handling reflection, DroidCat achieves greater robustness than static approaches as well as dynamic approaches that rely on system calls. The features were distilled from a behavioral characterization study of benign versus malicious apps. Through three complementary evaluation studies with 34,343 apps from various sources spanning the past nine years, we demonstrated the stability of DroidCat in achieving high classification performance and its superior accuracy over two state-of-the-art peer techniques that represent static and dynamic approaches, respectively. Overall, DroidCat consistently achieved 97% F1-measure accuracy for classifying apps evolving over the nine years, whether detecting or categorizing malware, which is 16% to 27% higher than either of the two baselines compared. Further, our experiments with obfuscated benchmarks confirmed the higher robustness of DroidCat relative to these baseline techniques. We also investigated the effects of various design decisions on DroidCat's effectiveness and identified the most important features for our dynamic classification. We found that features capturing app execution structure, such as the distribution of method calls over user code and libraries, are much more important than typical security features such as sensitive flows.
Index Terms—Android, security, malware, dynamic analysis, profiling, detection, categorization, stability, robustness, obfuscation.
1 INTRODUCTION
Android has been the target platform of 97% of malicious mobile apps [1], most of which steal personal information, abuse privileged resources, and/or install additional malicious software [2]. With the Android market growing rapidly, it is critically important to differentiate malware from benign apps (i.e., malware detection). Further, for threat assessment and defense planning, it is also crucial to differentiate malware of different families (i.e., malware categorization by family).

Two main classes of approaches to Android malware detection/categorization have been studied: static and dynamic. Static approaches leverage static code analysis to check whether an app contains abnormal information flows or calling structures [3], [4], [5], [6], [7], [8], matches malicious code patterns [9], [10], requests excessive permissions [11], [12], [13], [14], and/or invokes APIs that are frequently used by malware [15], [16], [17], [18].

Static approaches may have the advantage of being sound and scalable to screening large numbers of apps, yet they cannot always precisely detect malware for three reasons. First, due to the event-driven features of Android, such as lifecycle callbacks and GUI handling, run-time control/data flows are not always statically estimatable; they depend on the run-time environment. This approximation makes static analysis unable to reveal many malware activities. Second, the mere existence of some permissions and/or APIs in code does not always mean that they are actually executed or invoked frequently enough at runtime to cause an attack. Purely checking for the existence of permissions and/or APIs can cause static analysis to wrongly report malware. In particular, since API Level 23, Android has added dynamic permission support such that apps can request, acquire, and revoke permissions at runtime [19]. This new run-time permission mechanism implies that static approaches will not be able to discover when an abnormal permission is requested and granted at runtime, and would suffer from more false alarms if users revoke dangerous permissions after app installation. Third, static approaches have limited capabilities in detecting malicious behaviors that are exercised through dynamic code constructs (e.g., calling sensitive APIs via reflection). These and other limits [20] make static analysis vulnerable to widely adopted detection-evading schemes (e.g., code obfuscation [21] and metamorphism [22]). Recently, resource-centric features have also been used in static code analysis to overcome the above-mentioned limitations [23]. However, such approaches can still be evaded by malware adopting resource obfuscation [24], [25].

In comparison, dynamic approaches provide a complementary way to detect/categorize malware [26], [27], [28], [29]. In particular, behavior-based techniques [30] model program behavioral profiles [31], [32] with system/API call traces (e.g., [33], [34], [35]) and/or resource usage [28], [33]. Machine learning has been increasingly incorporated in these techniques, training classification models from those profiles to distinguish malware from benign apps. However, system-call based malware detectors can still be evaded when an app obfuscates system calls [30], [36], [37], [38]. Sensitive API usage does not necessarily indicate malicious intentions. Abnormal resource usage does not always correspond to abnormal behaviors, either. Generally, behavior-based approaches relying on system-call sequences and/or dependencies may be easily thwarted by system-call obfuscation techniques (e.g., mimicry attack [37] and illusion [38]). A more comprehensive dynamic app classifier is needed to capture varied behavioral profiles and thus be robust to attacks against specific profiles.

Recent dynamic Android malware categorization approaches utilize the histogram [39] or chains (or dependencies) [34] of system calls. A recent dynamic Android malware detection technique [26] differentiates API call counts and strictly matches API signatures to distinguish malware from benign apps. Due to the underlying app features used, both kinds of techniques are subject to replacement attacks [30] (replacing system-call dependencies with semantically equivalent variants) in addition to system-call obfuscation (renaming system-call signatures). Also, the detection approach [26] may not work with apps that adopt the dynamic permission mechanism already introduced in Android [19], due to its reliance on statically retrieved app permissions. Several other malware detectors combine dynamic profiles with static features (e.g., those based on APIs [18] or static permissions [26], [35]), and thus are vulnerable to the same evasion schemes impeding static approaches.

In this paper, we develop a novel app classification technique, DroidCat, based on systematic app-level profiling and supervised learning. DroidCat is developed to not only detect but also categorize Android malware effectively (referred to as the malware detection and malware categorization mode, respectively). Different from existing learning-based dynamic approaches, DroidCat trains its classification model based on a diverse behavioral app profile consisting of features that cover run-time app characteristics from complementary perspectives. DroidCat profiles inter-component communication (ICC) calls and invocations of all methods, including those defined by user code, third-party libraries, and the Android framework, instead of monitoring system calls. Also, it fully handles reflective calls while not using features based on app resources or permissions. DroidCat is thus robust to attacks targeting system calls or exploiting reflection. DroidCat is also robust to attacks targeting specific sensitive APIs, because they are not the only target of the method invocations it profiles.

The features used in DroidCat were decided based on a dynamic characterization study of 136 benign and 135 malicious apps. In the study, we traced the execution of each app, and defined and evaluated 122 behavioral metrics to thoroughly characterize behavioral differences between the two app groups. All these metrics measure the relative occurrence percentage of method invocations or ICCs, which can never be captured by static malware analyzers.

Based on the study, we discovered 70 discriminating metrics with noticeably different values on the two app groups, and included all of them in the feature set. The 70 features are grouped into three feature dimensions: structure, security, and ICC. By training a model with the Random Forest algorithm [40], DroidCat builds a multi-class classifier that predicts whether an app is benign or malicious from a particular malware family. We extensively assessed DroidCat in contrast to DroidSieve [23], a state-of-the-art static app classification technique, and Afonso [27], a state-of-the-art dynamic peer approach. The evaluation experiments were conducted on 17,365 benign and 16,978 malicious apps that span the past nine years.

Our evaluation results revealed very high stability of DroidCat in providing competitive classification accuracy for apps evolving over the years: it achieved 97.4% and 97.8% F1 accuracy for malware detection and malware categorization, respectively, all with small variations across the datasets of varying years. Our comparative study further demonstrated the substantial advantages of DroidCat over both baseline techniques, with 27% and 16% higher F1 accuracy in the detection and categorization mode, respectively. We also assessed the robustness of our approach against a set of malware adopting various sophisticated obfuscation schemes, along with three different sets of benign apps. Our study showed that DroidCat worked robustly well on obfuscated malware with 96% to 97% F1 accuracy, significantly (5% to 46%) higher than the F1 accuracy of either baseline approach. Our analysis of the three techniques' performance with respect to varying decision thresholds further corroborated the consistent advantages of our approach. We also conducted in-depth case studies to assess the performance of DroidCat on individual malware families and various factors that may impact its performance.

In summary, we made the following contributions:
• We developed DroidCat, a novel Android app classification approach based on a new, diverse set of features that capture app behaviors at runtime through short app-level profiling. The features were discovered from a dynamic characterization study that revealed behavioral differences between benign and malicious apps in terms of method calls and ICCs.
• We evaluated DroidCat via three complementary studies versus two state-of-the-art peer approaches as baselines on 34,343 distinct apps spanning year 2009 through year 2017. Our results showed that DroidCat largely outperformed the baselines in stability, classification performance, and robustness in both classification modes, with competitive efficiency.
• We conducted in-depth case studies of DroidCat concerning its performance on individual malware families and various factors that affect its classification capabilities. Our results confirmed the consistently high overall performance of DroidCat, and additionally showed its strong performance on most of the families we examined. We also identified the most effective learning algorithm and dynamic features for DroidCat, and demonstrated the low sensitivity of DroidCat to the coverage of dynamic inputs.
• We released for public access DroidCat and our benchmark suites, to facilitate reproduction of our results and the development/evaluation of future malware detection and categorization techniques.

• Haipeng Cai is with the School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99163. E-mail: [email protected]
• Na Meng, Barbara Ryder, and Daphne Yao are with the Department of Computer Science, Virginia Tech, Blacksburg, VA.
Manuscript received February 19, 2018; revised September, 2018.

2 MOTIVATING EXAMPLE
Malware developers increasingly adopt various obfuscation techniques (e.g., code reflection) to evade security checks [41]. Figure 1 shows five code excerpts from a real trojan-horse sample of FakeInst, a dominant [42] malware family that sends SMS texts to premium-rate numbers. Except for data encryption (by calling m6 on line 3), this malware heavily uses reflection to invoke methods, including Android APIs, so as to access privileged resources such as the device id (lines 1–11). In addition, to exploit the SMS service (lines 13–18), it retrieves the text and number needed for the malicious messaging via reflection (lines 27–29) and then invokes sendSms (line 30), which calls ad.notify.SmsItem::send via reflection again (lines 20–23). While simple reflection with string constants (e.g., lines 22, 27, 28) can be deobfuscated by static analysis [43], [44] (at extra cost), more complex cases may not be (e.g., lines 4, 5, 15, 16, where the class and method names are retrieved from a database object mdb). As a result, static code-based features related to APIs and sensitive flows would not be extracted from the app, and techniques based on such features would not detect the security threats. Also, the malicious behavior in this sample is exhibited in its code only, not reflected in its resource/asset files (e.g., configuration and UI layout); thus approaches bypassing
code analysis (e.g., DroidSieve [23]) might not succeed either. Further, malware developers can easily obfuscate app resources too [25]. In these situations, we believe that a robust dynamic approach is a necessary complement for defending against such malware samples.

1  // in ad.notify.Settings::getImei(Context context)
2  // m6 returns 'phone'; cls returns 'android.telephony.TelephonyManager'
3  TelephonyManager tm = context.getSystemService(m6(b, b-1, x|76));
4  Class c = Class.forName(mdb.cls(ci));
5  Method m = c.getMethod(mdb.met(mi), null); // met returns 'getDeviceId'
6  return m.invoke(tm, null);
7
8  // in NotificationApplication::onCreate(); cls returns 'ad.notify.Settings'
9  Class c = Class.forName(mdb.cls(ci)); // met returns 'getImei'
10 Method m = c.getMethod(mdb.met(mi), new Class<Context>[1]);
11 adUrl += m.invoke(null, context);
12
13 // in ad.notify.SmsItem::send(String str, String str2)
14 // cls returns 'android.telephony.SmsManager'
15 Class c = Class.forName(mdb.cls(ci)); // met returns 'sendTextMessage'
16 Method m = c.getMethod(mdb.met(mi), new Class<Object>[5]);
17 SmsManager smsManager = SmsManager.getDefault();
18 m.invoke(smsManager, str, null, str2, null, null);
19
20 // in ad.notify.OperaUpdateActivity::sendSms(String str, String str2)
21 Class c = Class.forName(mdb.cls(ci)); // cls returns 'ad.notify.SmsItem'
22 Method m = c.getMethod("send", new Class<String>[2]);
23 Boolean bs = m.invoke(null, str, str2);
24
25 // in ad.notify.OperaUpdateActivity::threadOperationRun(int i, Object o)
26 SmsItem smsItem = getSmsItem(ad.notify.NotifyApplication.smsIndex);
27 Class c = Class.forName("ad.notify.SmsItem");
28 Field f1 = c.getField("number"); int number = f1.get(smsItem);
29 Field f2 = c.getField("text"); Object text = f2.get(smsItem);
30 sendSms(number, text);

Fig. 1. Code excerpts from a FakeInst malware sample: the complex and heavy use of reflection can thwart static code-based feature extraction.

3 BACKGROUND
Android applications. Programmers develop Android apps primarily using Java, and build them into app package (i.e., APK) files. Each APK file can contain three software layers: user code (userCode), Android libraries (SDK), and third-party libraries if any (3rdLib). An Android app typically comprises four components as follows [45]: Activities, which deal with the UI and handle user interaction with the device screen; Services, which handle background processing associated with an application; Broadcast Receivers, which handle communication between the Android OS and applications; and Content Providers, which handle data storage and management (e.g., database) issues.
ICC. Components interact with each other through ICC objects—mainly Intents. If both the sender and the receiver of an Intent are within the same app, we classify the ICC as internal; otherwise, it is external. If an Intent has the receiver explicitly specified in its content, we classify the ICC as explicit; otherwise, it is implicit.
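To make these two orthogonal ICC classifications concrete, the following is a minimal sketch (our own illustration, not part of DroidCat's released implementation; the record fields are hypothetical) of how a traced Intent could be labeled along both axes:

    # Hypothetical trace record: which app sent/received the Intent and whether
    # the Intent names its receiver component explicitly.
    def classify_icc(record, this_app):
        scope = "internal" if (record["sender_app"] == this_app and
                               record["receiver_app"] == this_app) else "external"
        kind = "explicit" if record["explicit_target"] else "implicit"
        return scope, kind

    # Example: an implicit Intent delivered to another app is "external implicit".
    print(classify_icc({"sender_app": "com.example.a",
                        "receiver_app": "com.other.b",
                        "explicit_target": False}, "com.example.a"))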
Lifecycle methods and callbacks. Each app component follows a prescribed lifecycle that defines how the component is created, used, and destroyed. Correspondingly, developers are allowed to override various lifecycle methods (e.g., onCreate(), onStart(), and onDestroy()) to define program behaviors when the corresponding events happen. Developers can also override other event handlers (e.g., onClick()) or define new callbacks to implement extra logic when other interesting events occur.
Security-relevant APIs. There are sensitive APIs that acquire personal information of users, such as locations and contacts. For example, Location.getLatitude() and Location.getLongitude() retrieve GPS location coordinates. We consider these APIs sources of potential sensitive information flows. There are also output APIs that send data out of the current component via network or storage. We consider them sinks of potential sensitive information flows. If an app's execution trace has any (control-flow) paths from sources to sinks, the app might be malicious due to potential sensitive data leakage.

4 FEATURE DISCOVERY AND COMPUTATION
At the core of our approach are its features, which are computed from app execution traces. Although we could extract many features from these traces, not every feature is a good differentiator of malicious apps from benign ones. Therefore, with a relatively small dataset (136 benign and 135 malicious apps) (Section 4.1), we first conducted a systematic dynamic characterization study by defining and measuring 122 metrics (Section 4.2) as possible features. Based on the comparison between the two groups of apps, we decided which metrics were good differentiation factors, and thus included them in our feature set (Section 4.4). The central objective of this exploratory study is to discover the features to be used by DroidCat.

4.1 Benchmarks
Our characterization study used a benchmark suite of both benign apps and malicious apps. To collect benign apps, we downloaded the top 3,000 most popular free apps in Google Play at the end of year 2015 as our initial candidate pool. Next, we randomly selected an app from the pool and checked whether it met the following three criteria: (1) the minimum supported SDK version is 4.4 (API 19) or above, (2) the instrumented APK file runs successfully with inputs by Monkey [46], and (3) navigating the app with Monkey inputs for ten minutes covers at least 50% of user code (measured with our characterization toolkit DroidFax [47], which includes a statement-coverage measurement tool that works directly with APKs and instruments each statement in user code to track coverage at runtime). If an app met all criteria, we further checked it with VirusTotal [48] to confirm that the app was benign. As such, we obtained 136 benign apps. For malicious apps, we started with the MalGenome dataset [49], the most widely used malware collection. We found 135 apps meeting the above criteria, and confirmed them all as malware using VirusTotal. The APK sizes of our benchmarks vary from 2.9MB to 25.6MB. Recall that this characterization study is exploratory, with the goal of identifying robust and discriminating dynamic features for app classification; thus we aimed at a relatively small scale (in terms of the benchmark suite size).

4.2 Metrics Definition
Based on collected execution traces, we characterized app behaviors by defining 122 metrics in three orthogonal dimensions: structure, ICC, and security (Table 1). Intuitively, the more diversely these metrics capture app execution, the more completely they characterize app behaviors. These metrics measure not only the existence of certain method invocations or ICCs, but also their relative occurrence frequencies and distribution. For brevity, we will only discuss a few metrics in the paper; a detailed description of all metrics can be found at http://chapering.github.io/droidfax/metrics.htm.

TABLE 1
Metrics for dynamic characterization and feature selection
Dimension | # of Metrics | Exemplar Metric | # of Substantially Disparate Metrics | # of Noticeably Different Metrics
Structure | 63 | The percentage of method calls whose definitions are in user code. | 15 | 32
ICC | 7 | The percentage of external implicit ICCs. | 2 | 5
Security | 52 | The percentage of sinks reachable by at least one path from a sensitive source. | 19 | 33
Total | 122 | | 36 | 70

Structure dimension contains 63 metrics on the distributions of method calls, their declaring classes, and caller-callee links. 31 of these metrics describe the distributions of all method calls among the three code layers
(i.e., user code, third-party libraries, and Android SDK) or among different components. The other 32 metrics describe the distributions of a specific kind of methods—callbacks (including lifecycle methods and event handlers). One example Structure metric is the percentage of method calls to the SDK layer. Another example is the percentage of Activity lifecycle callbacks over all callbacks invoked.
ICC dimension contains 7 metrics to describe ICC distributions. Since there are two ways to classify ICCs, internal vs. external and implicit vs. explicit, enumerating all possible combinations leads to four metrics. The other three metrics are defined based on the type of data contained in the Intent object associated with an ICC: the Intent carries data in its URI field only, in its extras field only, or in both. One example ICC metric is the percentage of ICCs that carry data through the URI only. Another example is, out of all ICCs exercised, the percentage that are implicit and external.
Security dimension contains 52 metrics to describe distributions of sources, sinks, and the reachability between them through method-level control flows. The reachability is used to differentiate, among all exercised sources/sinks, those that are risky. If a source reaches at least one sink, it is considered a risky source. Similarly, a risky sink is reachable from at least one source. Both of these indicate security vulnerabilities, because sensitive data may be leaked when flowing from sources to sinks. For example, one Security metric is the percentage of method calls targeting sources over all method calls. Another example is the percentage of exercised sinks that are risky.
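As an illustration of how such percentage metrics can be derived from a trace, the following is a minimal sketch (our own simplification, assuming each traced call has already been attributed to a code layer and each ICC record carries its scope and target kind; the field names are hypothetical):

    from collections import Counter

    def structure_and_icc_metrics(method_calls, iccs):
        """method_calls: list of dicts like {"callee_layer": "userCode"|"SDK"|"3rdLib"}.
        iccs: list of (scope, kind) pairs such as ("external", "implicit")."""
        layer_counts = Counter(c["callee_layer"] for c in method_calls)
        total_calls = max(len(method_calls), 1)   # avoid division by zero
        icc_counts = Counter(iccs)
        total_iccs = max(len(iccs), 1)
        return {
            # Structure: distribution of method calls over the three code layers.
            "pct_calls_user_code": 100.0 * layer_counts["userCode"] / total_calls,
            "pct_calls_sdk":       100.0 * layer_counts["SDK"] / total_calls,
            "pct_calls_3rdlib":    100.0 * layer_counts["3rdLib"] / total_calls,
            # ICC: share of exercised ICCs that are both external and implicit.
            "pct_external_implicit_icc":
                100.0 * icc_counts[("external", "implicit")] / total_iccs,
        }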
4.3 Metrics (Feature) Computation
To compute the 122 metrics for an Android app, we first instrumented the program for execution trace collection. Specifically, we used Soot [50] to transform each app's APK, along with the SDK library (android.jar), into Jimple code (Soot's intermediate representation), and then inserted into the Jimple code probes to run-time monitors for tracing every method call (including those targeting SDK APIs and third-party library functions) and every ICC Intent. We also labeled additional information for instrumented classes and methods to facilitate metric computation. For instance, we marked the component type of each instrumented class, the category of each instrumented callback, and the source or sink property of each relevant SDK API. To decide the component type of a class such as Foo, we applied Class Hierarchy Analysis (CHA) [51] to identify all of its superclasses. If Foo extends any of the four known component types, such as Activity, its component type is labeled accordingly. We used the method-type mapping list in [47] to label the category of callbacks and the source/sink property of APIs. Exception handling and reflection are two widely used Java constructs. Accordingly, our instrumentation fully tracks two special kinds of method and ICC calls: (1) those made via reflection, and (2) those due to exceptional control flows [52] (e.g., calls from catch blocks and finally blocks).
Next, we ran the instrumented APK of each app on an Android emulator [53] to collect execution traces, which include all method calls and ICCs exercised. Note that we do not monitor OS-level system calls, because we want DroidCat to be robust to any attacks targeting system calls. Our instrumentation is not limited to sensitive APIs, either. By ensuring that sensitive APIs are not the only target scope of method-call profiling, we make DroidCat more robust against attacks targeting sensitive APIs. Prior work shows that even without invoking malicious system calls or sensitive APIs, some malicious apps can still conduct attacks by manipulating other apps via ICCs [54], [55], [56], [57]. Thus, we also trace ICCs to further reveal behavioral differences between benign and malicious apps.
To characterize the dynamic behaviors of apps, we need to run each instrumented app for a sufficiently long time using various inputs so as to cover as many program paths as possible. Manually entering inputs to apps is very inefficient. In order to quickly trigger diverse executions of an app, we used Monkey [46] to randomly generate inputs. To balance efficiency and code coverage, we set Monkey to feed every app for ten minutes. (DroidCat only executes each app for five minutes; we investigate the effect of dynamic coverage on the effectiveness of DroidCat in Section 7.3.) Once the trace for an app is collected via the probed run-time monitors, most of the 122 metrics are computed through straightforward trace statistics. The metrics involving risky sources/sinks are calculated through a dynamic call graph built from the trace, which facilitates reachability computation.
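The reachability computation mentioned above can be illustrated with a minimal sketch (our own coarse simplification, not the DroidFax implementation): build a dynamic call graph from the caller-callee pairs observed in the trace and mark a source as risky if any sink is reachable from it.

    from collections import defaultdict

    def risky_sources(call_edges, sources, sinks):
        """call_edges: iterable of (caller, callee) pairs observed in the trace.
        sources/sinks: sets of method signatures labeled as sources/sinks."""
        graph = defaultdict(set)
        for caller, callee in call_edges:
            graph[caller].add(callee)
        risky = set()
        for src in sources:
            # Depth-first search from each exercised source node.
            stack, seen = [src], set()
            while stack:
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                if node in sinks and node != src:
                    risky.add(src)
                    break
                stack.extend(graph[node])
        return risky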
4.4 Metrics (Feature) Selection
To identify the metrics that well differentiate the two app groups, we measured the value of each metric on every benchmark app, and then computed the mean values separately over the benign and malware benchmarks. If a metric had a mean value difference greater than or equal to 5%, we considered the behavioral profiles of the two groups substantially disparate with respect to the metric. If a metric had a difference greater than or equal to 2%, we said the behavioral profiles were noticeably different with respect to the metric. We experimented with various thresholds chosen heuristically, and found that these two (5% and 2%) reasonably well represent two major levels of differentiation between our malware and benign samples.
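The selection step itself amounts to comparing per-group means against the two thresholds; a minimal sketch (our own illustration, with hypothetical variable names) follows:

    import numpy as np

    def select_metrics(benign, malicious, names, strong=5.0, weak=2.0):
        """benign, malicious: 2-D arrays of shape (num_apps, num_metrics),
        with each metric expressed as a percentage in [0, 100]."""
        gap = np.abs(benign.mean(axis=0) - malicious.mean(axis=0))
        substantially_disparate = [n for n, g in zip(names, gap) if g >= strong]
        noticeably_different = [n for n, g in zip(names, gap) if g >= weak]
        # The 70 noticeably different metrics form DroidCat's feature set.
        return substantially_disparate, noticeably_different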
As shown in Table 1, by comparing mean metric values across the app groups, we found 36 substantially disparate metrics and 70 noticeably different metrics. We show the top 10 differentiating metrics in Figure 2. The ten metrics are listed on the Y-axis, and the X-axis corresponds to mean metric values, which vary from 0% to 100%. Each metric listed on the Y-axis corresponds to a red bar showing the mean value over all malicious apps and a green bar representing the mean over all benign ones. The whisker on each bar represents the standard error of the mean. Empirically, these 10 metrics best demonstrate the behavioral differences between malicious and benign apps.
In the structure dimension, malicious apps call fewer methods defined in the SDK and more methods defined in user code, and involve more callbacks relevant to the UI. This indicates that user operations may trigger excessive or unexpected computation. For instance, on average, SDK
Fig. 2. Top-10 differentiating metrics between malware and benign apps revealed by our exploratory characterization study. (Bar chart: mean values with standard-error whiskers for malware versus benign apps, on metrics including SDK->SDK calls, UserCode->SDK calls, 3rdLib->SDK calls, Activity lifecycle callbacks, View event handlers, system status event handlers, external explicit ICCs, and risky source invocations.)
Fig. 3. DroidCat overview: it trains a multi-class classifier using benign and malicious apps and then classifies unknown apps. (Workflow: app instrumentation; execution while monitoring ordinary method calls and ICC Intents, fully handling calls made via reflection and/or exceptional control flows; behavioral feature computation from the execution traces; supervised learning of the multi-class classifier; and testing/classification of new apps.)
We implemented the learning component of DroidCat in Python, using the Scikit-learn toolkit [59], to train and test the classifier. We have open-sourced the entire DroidCat system (including the feature computation component) and our datasets at https://chapering.github.io/droidcat.
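As a rough illustration of this learning component, the sketch below (our own simplification, assuming the 70 behavioral features per app are already available as a matrix; it is not the released implementation, and the split ratio is illustrative) trains the default Random Forest model and reports weighted-average F1, where each label is either BENIGN or a malware family name:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    def train_and_test(X, y):
        """X: (num_apps, 70) array of behavioral feature values in [0, 100].
        y: per-app labels, e.g. "BENIGN", "FakeInst", "DroidKungFu", ..."""
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)  # 70/30 hold-out split for illustration
        clf = RandomForestClassifier(n_estimators=128, random_state=0)  # 128 trees, as in Section 7.3
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        # Weighted-average F1 over all classes, as reported in the paper.
        return clf, f1_score(y_te, pred, average="weighted")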
6 EVALUATION
For a comprehensive assessment of DroidCat's capabilities in malware detection and categorization, we conducted three complementary evaluation studies. In Study I, we aim to gauge the stability of DroidCat by applying it to four longitudinal datasets (across nine years) to see how well it works for apps in the evolving Android ecosystem. In Study II, we compare the prediction performance of DroidCat against state-of-the-art peer approaches, including a static and a dynamic approach, by applying the three techniques to the two newest datasets among the four. In Study III, we measure the robustness against obfuscation of the three techniques using an obfuscation benchmark suite along with varying benign sample sets. We first describe our evaluation datasets in Section 6.1 and our procedure in Sections 6.2 and 6.3, and then present the evaluation results of the three studies in Sections 6.4, 6.5, and 6.6, respectively.

6.1 Datasets
TABLE 2
Main datasets used in our evaluation studies
Dataset | Period | Benign apps: Source | Benign apps: #Apps | Malware: Source | Malware: #Apps | Malware: #Families
D1617 | 2016-2017 | GP,AZ | 5,346 | VS,AZ | 3,450 | 153
D1415 | 2014-2015 | GP,AZ | 6,545 | VS,AZ | 3,190 | 163
D1213 | 2012-2013 | GP,AZ | 5,035 | VS,AZ,DB,MG | 9,084 | 192
D0911 | 2009-2011 | AZ | 439 | VS,AZ,DB,MG | 1,254 | 88

Table 2 lists the various datasets (named by the ids in the first column) used in our evaluation experiments. Each dataset includes a number (fourth column) of benign apps and a number (sixth column) of malware in a number (last column) of families. Apps in each of these four datasets are all from the same period (range of years, in the second column), according to the age of each app measured by its first-seen date, which we obtained from VirusTotal [48]. The table also gives the sources (third and fifth columns) of the samples. AndroZoo (AZ) [60] is an online Android app collection that archives both benign and malicious apps. Google Play (GP) [61] is the official Android app store. VirusShare (VS) [62] is a database of malware of various kinds, including Android malware. The Drebin dataset (DB) is a set of malware shared in [15], and (Malware) Genome (MG) is a malware set shared in [63]. We also used an obfuscation benchmark suite along with benign apps from AZ and GP for Study III (as detailed in Section 6.6).
Concerning the overhead of dynamic analysis, we randomly chose a subset of samples from each respective source, except for the MG dataset, for which we used all samples given its relatively small size. A few apps were discarded during the benchmark collection because they could not be unzipped, were missing resource files (e.g., assets), or could not be successfully instrumented, installed, or traced. In particular, for the D1617 and D1415 datasets, which we used for a comparative study (Study II), we also discarded samples on which any of the three compared techniques failed in its analysis. We did not apply any of the selection criteria used in the characterization study (Section 4.1). The numbers (of samples) listed in the table are those of the remaining samples actually used in our studies. In all, our datasets include 17,365 benign apps and 16,978 malware, for a total of 34,343 samples. The ages of these samples ranged across the past nine years (i.e., 2009–2017). We note that no identical samples were shared by any two of our datasets (although some samples in one dataset might be evolved versions of samples in another). In cases where the original source datasets (e.g., DB and MG) overlap, we removed the common samples from one of them (e.g., we dismissed MG samples from the original DB dataset). We also ensured that these four datasets did not overlap with the dataset used in our characterization study—we excluded the 136 GP apps and 135 MG malware used in that study when forming the four datasets in Table 2. The reason was to avoid relevant biases (e.g., overfitting), since the characterization dataset was used for developing/tuning DroidCat (i.e., for discovering/selecting its features).
6.2 Experimental Setup
For the baseline techniques, we consider both static and dynamic approaches to Android malware prediction. In particular, we compare DroidCat to DroidSieve, a state-of-the-art static malware detection and categorization approach. DroidSieve characterizes an app with resource-centric features (e.g., permissions extracted from the manifest file of an APK) in addition to code (syntactic) features, and then uses these features to train an Extra Trees model that is later used for predicting the label of a given app. We chose Afonso, a state-of-the-art dynamic approach for app classification, as the other baseline technique. Afonso traces in an app the invocations of Android APIs and system calls in specified lists, and uses the call frequencies to differentiate malware from benign apps based on a Random Forest model.
To enable our comparative studies, we obtained the feature computation code from the DroidSieve authors and implemented the learning component. With the help of the authors, we were able to reproduce the performance results on part of the datasets used in the original evaluation of this technique, hence gaining confidence in the correctness of our implementation. We implemented the Afonso tool according to the API and system call lists provided by the authors. We developed DroidCat as described earlier. To compute the features and prediction results with the two baselines, we followed the exact settings described in the respective original papers (and, for DroidSieve, as clarified by its authors via emails, since we initially had difficulties getting performance results close to the originally reported ones). In particular, to produce the execution traces required by Afonso, we ran each app on a Nexus One emulator with API Level 23, 2G RAM, and 1G SD storage for 5 minutes, as triggered by Monkey random inputs (the same as for DroidCat, as described in Section 4.3). All of our experiments were performed on a Ubuntu 15.04 workstation with 8G DDR memory and a 2.6GHz processor.
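For reference, the per-app tracing run can be driven by a small wrapper around the standard Monkey tool; a minimal sketch (our own illustration, assuming the emulator is already running and the app's package name is known; the event budget and timings are illustrative rather than the exact values scripted in our experiments) is:

    import subprocess

    def run_monkey(package, minutes=5, throttle_ms=500, seed=42):
        """Feed an installed app with pseudo-random UI/system events via Monkey,
        bounded by a wall-clock timeout; the trace itself is emitted by the probes
        our instrumentation inserted into the APK."""
        events = 100000  # deliberately large; the timeout below ends the run
        cmd = ["adb", "shell", "monkey", "-p", package,
               "--throttle", str(throttle_ms), "-s", str(seed), "-v", str(events)]
        try:
            subprocess.run(cmd, timeout=minutes * 60, check=False)
        except subprocess.TimeoutExpired:
            pass  # expected: stop after the tracing window elapses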
6.3 Methodology
We evaluated DroidCat in each of its two working modes: (1) malware detection, in which it labels a given app as either benign or malicious, and (2) malware categorization, in which it labels a given malware with the predicted malware family. To facilitate the assessment of DroidCat in these two different modes, we simply treat DroidCat as a multi-class classifier, with a different number of classes to differentiate in each mode (e.g., two classes in the detection mode, and two or more classes in the categorization mode).
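Concretely, the two modes only differ in how the ground-truth labels are formed before training; the minimal sketch below (our own illustration; it assumes per-app family labels where benign apps carry the family "BENIGN", and that the categorization mode predicts among malware families only, which is one possible reading of this setup) prepares the inputs for the same multi-class training pipeline sketched earlier:

    import numpy as np

    def prepare_mode(X, families, mode):
        """X: feature matrix (NumPy array); families: per-app ground truth such as
        ["BENIGN", "FakeInst", ...]. The detection mode collapses labels into two
        classes; the categorization mode keeps only malware samples and predicts
        one class per family."""
        families = np.asarray(families)
        if mode == "detection":
            y = np.where(families == "BENIGN", "BENIGN", "MALWARE")
            return X, y
        keep = families != "BENIGN"
        return X[keep], families[keep]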
For Study I, we ran four tests of DroidCat, each using one of the four datasets (D0911 through D1617). For Study
II, we executed DroidCat and the two baselines on D1617 and D1415, because these two are the most recent datasets. We used three obfuscation datasets for Study III, in which we ran three tests of the three techniques accordingly.
detection (1.34% standard deviation) than for malware categorization by families (2.38% standard deviation).

Finding 1: DroidCat achieved mean F1 accuracy of 97.39% and 97.84%, and AUC of 0.95-0.98 and 0.94-0.98, for malware detection and categorization, respectively. It was also stable in classifying apps from different years within 2009–2017, evidenced by small standard deviations in F1 of 1.34-2.38% across the nine years.

Fig. 7. ROC curves with AUCs of DroidCat versus baselines for malware detection on datasets D1415 (left) and D1617 (right). (Legend AUCs: DroidCat 0.9715 and 0.9759, Afonso 0.8204 and 0.8777, DroidSieve 0.9267 and 0.9018, for D1415 and D1617 respectively.)

6.5 Study II: Comparative Classification Performance
In this study, we aim to compare our approach to the two
performance across the three datasets, since they all use the same malware set (i.e., the 1,214 Praguard malware).
Our results show that DroidCat largely surpassed the baseline techniques in every performance metric on each of the three datasets for malware detection. In the malware categorization mode, DroidSieve achieved an F1 of 92%, the closest to that of DroidCat (97%) among all our comparative experiments. We note that DroidSieve achieved 99% F1 for categorizing the MG subset of our malware set here, and over 99% F1 for detecting the MG malware from benign-app sets different from ours [23]. The considerable drop in accuracy when 237 more malware and many different benign apps were trained and tested suggests the potential overfitting of this technique to particular datasets. On the other hand, compared to our results in Study II, DroidSieve's performance did not change much due to obfuscation. Thus, the technique did tend to be obfuscation-resilient, and its performance variation with respect to the original evaluation results in [23] seems to be mainly attributable to its instability.
Afonso appeared to be resilient against obfuscation too, but only for malware detection. The substantial performance drop (by over 30% in F1) because of obfuscation indicates its weak obfuscation resiliency for categorizing malware. Meanwhile, its considerable performance variations for malware detection across the three datasets corroborate the instability of this technique. In contrast, DroidCat tended to be both robust and stable. The robustness was evidenced by the small differences in performance metrics between this study and Study II. Its performance variations across the three datasets were also quite small, showing its stability even in the presence of complicated obfuscation.
We also computed the ROC curves and AUCs of the three techniques on each of the three datasets. The contrasts between DroidCat and the two baselines were similar to those seen in Study II. The AUC numbers (0.97–0.99 for DroidCat) show considerable advantages of our approach as well (0.05 and 0.09 greater AUC than any baseline for detection and categorization, respectively). In all, the ROC results confirmed that DroidCat is robust to various obfuscation schemes, with respect to varying decision thresholds, more so than the two baselines.
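For reference, this threshold-varying analysis can be reproduced with standard tooling; a minimal sketch (our own illustration, assuming per-app malicious-class probabilities from any of the trained classifiers) is:

    from sklearn.metrics import roc_curve, auc

    def roc_and_auc(y_true, malicious_prob):
        """y_true: 1 for malware, 0 for benign; malicious_prob: the classifier's
        predicted probability of the malicious class (e.g., from predict_proba)."""
        fpr, tpr, thresholds = roc_curve(y_true, malicious_prob)
        return fpr, tpr, auc(fpr, tpr)

    # Example usage with the Random Forest sketched earlier (detection mode):
    #   idx = list(clf.classes_).index("MALWARE")
    #   prob = clf.predict_proba(X_te)[:, idx]
    #   y_bin = [int(label == "MALWARE") for label in y_te]
    #   fpr, tpr, area = roc_and_auc(y_bin, prob)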
Conclusions. On overall average, DroidCat achieved a 96.64% F1, compared to 79.59% by Afonso and 79.62% by DroidSieve in the detection mode. In the categorization mode, DroidCat also significantly outperformed the two baseline techniques, with 5–46% higher F1. The ROC results corroborated the robustness merits of our approach compared to the baselines. In absolute terms, the accuracy and AUC numbers revealed that DroidCat can work highly effectively with obfuscated apps.

Finding 3: DroidCat exhibited superior robustness to both state-of-the-art techniques compared, by achieving 96% to 97% F1 accuracy on malware that adopted sophisticated obfuscation schemes along with varying sets of benign apps, significantly higher than the two baselines.

7 IN-DEPTH CASE STUDIES
We have conducted in-depth case studies on a subset of our datasets to assess the capabilities of our approach in classifying apps with respect to individual malware families. Through the case studies, we also investigated the effects of various design factors on the performance of DroidCat. We summarize our methodology and findings below. Further details can be found in our technical report on DroidCat [67].

7.1 Setup and Methodology
We started with the characterization study dataset and added to it malware samples from years 2016 and 2017 in the wild, resulting in 287 benign apps and 388 malware. The majority of these apps adopted various obfuscation strategies (e.g., reflection, data encryption, and class/method renaming). The malware samples were in 15 popular families, including DroidDream, BaseBridge, and DroidKungFu, which were among the most evasive families according to a prior study [24], and FakeInst and OpFake, which are known to combine multiple obfuscation schemes (renaming, string encryption, and native payload). We applied the same hold-out validation procedure, and the same three accuracy metrics (P, R, F1), as used in the evaluation experiments (Section 6.3).

7.2 Results
For malware categorization, DroidCat performed perfectly (with 100% F1) for the majority (11) of the 15 studied families. In particular, these 11 families include the three that were previously considered highly evasive: DroidDream, BaseBridge, and DroidKungFu. Previous tools studied in [24] achieved no more than a 54% detection rate (i.e., the recall metric in our study) on these three families. By weighted average over all the 15 classes, DroidCat achieved 97.3% precision, 96.8% recall, and 97.0% F1. In the malware detection mode, DroidCat worked even more effectively, with 97.1% precision, 99.4% recall, and 98.2% F1. These results are largely consistent with what we obtained from the extensive evaluation studies (Studies I through III).

7.3 Effects of Design Factors
Feature set choice. We investigated several alternative feature sets, including the full set of 122 metrics, the set of metrics in each of the three dimensions, and the set of 36 substantially disparate metrics (see Table 1). We found that D* (the default set of 70 features used by DroidCat) worked the best, suggesting that adding more features does not necessarily improve classification performance. The Structure features had significantly better effectiveness than the ICC and Security features.
Most important dynamic features. To see which specific features are the most important to our technique, we computed the importance ranking [23], [68] of the 70 features used by DroidCat. We found that Structure features consistently dominated the top of the list, especially when there was a greater number of classes that our classifier had to differentiate. In particular, two subcategories of Structure features contributed the most: (1) the distribution of method/class invocations over the three code layers, and (2) callback invocations for lifecycle management. The Security features were generally less important, with the ICC features being the least important. Among all Security features, those capturing risky control flows and accesses to sensitive data/operations of particular kinds (e.g., sinks for SMS_MMS) exhibited the greatest significance. The very few ICC features included in these top rankings contributed more to distinguishing benign apps from malware than to distinguishing malware families.
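Such a ranking can be obtained directly from the trained ensemble; a minimal sketch (our own illustration using the impurity-based importances exposed by scikit-learn's Random Forest, which is one of several ranking options; the exact procedure in [23], [68] may differ) is:

    def rank_features(clf, feature_names, top_k=10):
        """clf: a fitted RandomForestClassifier; feature_names: the 70 metric names."""
        ranked = sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda pair: pair[1], reverse=True)
        return ranked[:top_k]

    # Example: print the ten most influential behavioral features.
    # for name, score in rank_features(clf, metric_names):
    #     print(f"{name}: {score:.3f}")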
Learning algorithm choice. In addition to the Random Forest algorithm (RF, with 128 trees) used by default, we
experimented with seven other learning algorithms for DroidCat. Our results show that RF performed significantly better than all the alternatives. A Support Vector Machine [69] (SVM) with a linear kernel had the second-best effectiveness, while SVM with an rbf kernel performed the worst. Naive Bayes [70] with a Bernoulli distribution had the third-best effectiveness, while with a Gaussian distribution it had the second-worst effectiveness. Neither Decision Trees [71] nor k-Nearest Neighbors [72] worked as well as the best settings of the above three.
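A comparison of this kind can be scripted in a few lines; the following is a minimal sketch (our own illustration: hyperparameters other than the 128 trees are scikit-learn defaults rather than values from the paper, and 5-fold cross-validation is used here for brevity whereas the paper uses hold-out validation):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import BernoulliNB, GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    CANDIDATES = {
        "RF-128":       RandomForestClassifier(n_estimators=128, random_state=0),
        "SVM-linear":   SVC(kernel="linear"),
        "SVM-rbf":      SVC(kernel="rbf"),
        "NB-Bernoulli": BernoulliNB(),
        "NB-Gaussian":  GaussianNB(),
        "DecisionTree": DecisionTreeClassifier(random_state=0),
        "kNN":          KNeighborsClassifier(),
    }

    def compare_algorithms(X, y):
        # Mean weighted F1 over 5-fold cross-validation for each candidate learner.
        return {name: cross_val_score(model, X, y, cv=5,
                                      scoring="f1_weighted").mean()
                for name, model in CANDIDATES.items()}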
Input coverage. We repeated our case studies on a new dataset obtained by applying the same coverage filter used in our characterization study. Only apps for which 10-minute Monkey inputs covered at least 50% of user code were selected, resulting in 136 benign and 145 malicious apps. The user-code coverage for these 281 apps ranged from 50% to 100% (mean 66%, standard deviation 12%). The higher-coverage dataset contained 10 app categories: BENIGN and 9 malware families. We applied the same hold-out validation as used in the other experiments. For malware detection and (9-class) malware categorization, DroidCat gained consistent increases in each of the three performance metrics (P, R, F1) on the higher-coverage datasets compared to our results without the coverage filter. Yet, the increases were all quite small (at most 1.5%). These small differences indicate that the performance of DroidCat does not appear to be very sensitive to the user-code coverage of the run-time inputs.

8 EFFICIENCY
The primary source of analysis overhead for all three techniques compared is the cost of feature extraction. For dynamic approaches like DroidCat and Afonso, this cost includes the time for tracing each app, which is five minutes in both techniques. Specifically, DroidCat took 353.9 seconds for feature computation and 0.01 seconds for testing, on average per app. In contrast, Afonso took 521.74 seconds for feature computation and 0.015 seconds for testing per app. As expected, the tracing time dominated the total feature computation cost in DroidCat and Afonso. Also, in these two techniques, the testing time is almost negligible, mainly because their feature vectors are both relatively small (at most 122 features per app in DroidCat, and 163 features per app in Afonso). As a static approach, DroidSieve does not incur tracing cost. Its average feature computation cost was 74.19 seconds per app. However, DroidSieve uses very large feature vectors (over 20,000 features per app), causing a substantial cost for its testing phase (on average 3.52 seconds per app). Concerning storage cost, DroidCat and Afonso took 21KB and 32KB per app, respectively, mainly for storing the traces. DroidSieve does not incur trace storage cost, and it took 0.4KB per app for storing feature files.
In all, DroidCat appeared to be reasonably efficient as a dynamic malware analysis approach, and was lighter-weight than the peer dynamic approach Afonso. DroidSieve was the most efficient among the three techniques, due to its lack of tracing overheads. However, given the substantially superior performance of DroidCat over DroidSieve, the additional cost incurred by DroidCat can be seen as justified.

9 LIMITATIONS AND THREATS TO VALIDITY
The difficulty and overhead of tracing a large number of apps present challenges to dynamic analysis, which constrained the scale of our studies. While reasonably large for a dynamic analysis of Android apps, our datasets may still be relatively small in size compared to those used by many static approaches. In particular, considering our datasets split by ranges of years, our samples from each period may not be representative of the app population of that period. For this reason, our results are potentially subject to overfitting. To mitigate this limitation, we have considered benchmarks from diverse sources. Recall that the goal of DroidCat is to complement static approaches in scenarios where they are inapplicable (Section 2). In all, our experimental results and conclusions are best interpreted with respect to the datasets used in our studies.
Prior studies have shown that learning-based malware detectors are subject to class imbalances in training datasets [73], [74]. Our results also suffer from this issue, as our datasets contain imbalanced benign and malware samples, as well as imbalanced malware families. There were two causes for these imbalances: (1) our data sources do not provide balanced sample sets, and (2) for a fair evaluation we needed to use exactly the same samples for evaluating DroidCat against the two baselines, thus we had to discard some samples for which the features for any technique could not be computed (Section 6.1), which further complicated our control of data balance. On the other hand, however, the imbalances enabled us to additionally assess the stability of our approach against the baselines: for instance, in Study I, our results revealed that the performance of DroidCat was not much affected by imbalance of either kind (more benign apps in D1617 and D1415, or more malware in D1213 and D0911). We also note that all the datasets on which we compared DroidCat to the baselines contained much less malware than benign samples. This kind of imbalance resembles real-world situations, in which there is indeed much less malware than benign apps.
Intuitively, the more app code is covered by the dynamic inputs, the more app behaviors can be captured and utilized by our approach. We thus conducted a dedicated study in this regard. Our results confirmed that with higher-coverage inputs DroidCat improved in effectiveness. However, the effectiveness differences were small (<2%) between two experiments involving datasets that had large differences (20%) in code coverage. Nevertheless, these results may not be generalizable; more conclusive results would require more extensive studies on the effect of input coverage. Also, although DroidCat relies on capturing app behavioral patterns in execution composition and structure (instead of modeling explicit malicious behaviors through suspicious permission accesses and/or data flows), reasonable coverage is still required for producing usable traces to enable the feature computation.
We aimed to leverage a diverse set of behavioral features (in three orthogonal dimensions) to make DroidCat robust to various evasion attacks that target specific kinds of dynamic profiles. To empirically examine the robustness, we purposely used an obfuscation benchmark suite in the evaluation. Further, in the case studies, we used datasets in which the majority of apps adopted a variety of evasion techniques, including complex/heavy reflection, data encryption, and class/method renaming. However, other types of evasion (especially anti-dynamic-analysis) attacks [75] have not been explicitly covered in our experiments. For instance, some malware might detect and then evade particular kinds of run-time environments (e.g., emulators). In our evaluation, the dynamic features were all extracted from traces gathered on an Android emulator. The high classification performance we achieved suggests
that our approach seems robust against emulator-evasion leveraged the graphs to categorize malware [39]. Dash et
attacks. On the other hand, after our features are revealed, al. generated features at different levels, including pure
attackers could take adversarial approaches to impede the system calls and higher-level behavioral patterns like file
computation of our dynamic features or pollute the code to system access which conflate sequences of related system
make our features less discriminatory. calls [34]. Some of the malware detection techniques have
DroidCat works at app level without any modification been applied to family categorization as well [23], [78], [79].
of the Android framework and/or OS as in [18], [76]. This Discussion. Table 6 compares our approach to
design makes DroidCat easier to use and more adaptable to representative recent peer works in terms of classification
rapid evolution of the Android ecosystem, but it does not capability with respect to the three possible settings and
handle dynamically loaded code or native code yet. robustness against various analysis challenges. For the
Meanwhile, DroidCat requires app instrumentation, which settings that a tool was not designed to work in, the
may constitute an impediment for its use by end users. A capability was not applicable (hence noted as N/A).
more common deployment setting would be to use Almost all the static approaches compared are
DroidCat for batch screening by an app vetting service (e.g., vulnerable to reflection as they use features based on APIs.
as part of an app store), where the instrumentation, tracing, Marvin [18] as a hybrid technique also suffers from this
learning, and prediction can be packed in one holistic vulnerability as it relies on a number of static API-based
automated process of the service. Finally, our technique features. Techniques using features on static permissions,
follows a learning-based approach using features that can such as DroidSIFT [79], Drebin [15], and StormDroid [26],
be contrived, thus it may be vulnerable to sophisticated face challenges due to run-time permissions [19], [85]
attacks such as mimicry and poisoning [77]. which are increasingly adopted by (over one third already
of) Android apps [86]. The use of features based on system
10 R ELATED W ORK calls comprises the resiliency of DroidScribe [34],
Madam [35], and Afonso [27] against obfuscation schemes
Dynamic Characterization for Android Apps. There have targeting system calls [30], [36], [38]. DroidSieve [23] gains
been only a few studies broadly characterizing run-time high accuracy with resilience against reflection by reducing
behaviors of Android apps. Zhou et al. manually analyzed code analysis and using resource-centered features, but
1,200 samples to understand malware installation methods, may not detect malware that expresses malicious behaviors
activation mechanisms, and the nature of carried malicious only in code while with benign resources/assets. Our
payloads [49]. Cai et al. instrumented 114 benign apps for comparative study results presented in this paper have
tracing method calls and ICCs, and investigated the supported this hypothesis. In addition, like a few other
dynamic behaviors of benign apps [82]. These studies works that extract features (other than permission) from
either focus on malicious apps or benign ones. Canfora et resource files [15], [18], [55], it may not work with malware
al. profiled Android apps to characterize their resource with resources obfuscated [24], [25], [41].
usage and leveraged the profiles to detect malware [83]. We In contrast, DroidCat adopts a purely dynamic
profiled method and ICC invocations in our approach that resolves reflective calls at runtime, thus it is
characterization study as in [82] yet with both benign and fully resilient against even complex cases of reflection. It
malicious samples. Also, our study aimed at not only relies on no features from resource files or based on system
behavior understanding [49], [82]. We further utilized the calls, thus it is robust against obfuscation targeting those
understanding for app classification like [83] yet with features. While it remains to be studied if it well adapts to
different behavioral profiles and not only for malware Android ecosystem evolution, DroidCat would not be
detection (but also for categorizing malware by families). much affected by run-time permissions as it does not use
Android Malware Detection. Most previous detection related features. Also, compared to prior approaches
techniques utilized static app features based on API typically focusing on API calls, DroidCat characterizes the
calls [15], [16], [17], [35], [78], [79], [80] and/or invocations of all methods and ICCs.
permissions [15], [18], [23], [26], [35]. ICCDetector [55] We omitted in the table the effectiveness numbers (e.g.,
modeled ICC patterns to identify malware that exhibits detection rate and accuracy) for these compared works
different ICC characteristics from benign apps. Besides because they are not comparable: the numbers all came
static features, a few works enhanced their capability by from varied evaluation datasets. In this paper, we have
exploiting dynamic features (i.e., hybrid approaches) such extensively studied two of the listed approached versus
as messaging traffic [26], file/network operations [18], and ours on the same datasets. Nonetheless, in terms of any of
system/API calls [35]. However, approaches relying on the effectiveness metrics we considered, DroidCat
static code analysis are generally vulnerable to reflection appeared to have very promising performance relative to
and other code obfuscation schemes [20], which are widely the state-of-the-art peer approaches.
adopted in Android apps (especially in malware) [41].
Suarez-Tangil et al. [23] mined non-code (e.g.,
resources/assets) features for more robust detection.
Static-analysis challenges have motivated dynamic approaches, of which ours is not the first. Afonso et al. [27] built dynamic features on system/API call frequencies for malware detection, similar to [29], where occurrences of unique call sites were used as features. A recent static technique, MamaDroid [81], and its dynamic variant [84] model app behaviors based on the transition probabilities between abstracted API calls in the form of Markov chains.
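As an illustration of this kind of feature (a minimal sketch under assumed inputs, not the implementation of MamaDroid or its dynamic variant), transition probabilities between consecutive abstracted calls can be flattened into a feature vector as follows; the package-level abstraction and the trace are made up.

# Illustrative sketch: Markov-chain transition probabilities over an abstracted call trace.
from collections import Counter
from itertools import product

STATES = ["self", "android", "java", "other"]   # assumed package-level abstraction

def markov_features(abstract_calls):
    """One transition probability per ordered pair of states, estimated from the trace."""
    transitions = Counter(zip(abstract_calls, abstract_calls[1:]))
    vector = []
    for src, dst in product(STATES, STATES):
        out_total = sum(transitions[(src, d)] for d in STATES)
        vector.append(transitions[(src, dst)] / out_total if out_total else 0.0)
    return vector

# Hypothetical sequence of abstracted calls observed for one app
print(markov_features(["self", "android", "android", "java", "self", "android"]))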
Android Malware Categorization. Approaches have been proposed to categorize malware into known families. Xu et al. traced system calls, investigated three alternative ways to graphically represent the traces, and then classified malware based on these graph representations [39].

However, a resource-feature-based approach such as [23] may not detect malware that expresses malicious behaviors only in code while carrying benign resources/assets. Our comparative study results presented in this paper have supported this hypothesis. In addition, like a few other works that extract features (other than permission) from resource files [15], [18], [55], it may not work on malware whose resources are obfuscated [24], [25], [41].

In contrast, DroidCat adopts a purely dynamic approach that resolves reflective calls at runtime, and thus is fully resilient even to complex cases of reflection. It relies on no features drawn from resource files or system calls, and thus is robust against obfuscation targeting those features. While it remains to be studied whether it adapts well to Android ecosystem evolution, DroidCat would not be much affected by run-time permissions as it does not use related features. Also, compared to prior approaches that typically focus on API calls, DroidCat characterizes the invocations of all methods and ICCs.
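To make this kind of structural profile more concrete, the sketch below derives simple distribution features from a dynamic trace of method calls and ICC Intents; the trace format, package prefixes, and app package name are assumptions for illustration, and the snippet is not our actual implementation.

# Illustrative sketch: where the observed calls land (user code, SDK, or third-party
# libraries), plus how many ICC Intents were seen; all inputs are hypothetical.
SDK_PREFIXES = ("android.", "java.", "javax.")

def call_distribution_features(callee_methods, icc_intents, app_package):
    """Fractions of calls into user code, SDK, and other libraries, plus an ICC count."""
    user = sdk = lib = 0
    for callee in callee_methods:              # fully qualified callee names
        if callee.startswith(app_package):
            user += 1
        elif callee.startswith(SDK_PREFIXES):
            sdk += 1
        else:
            lib += 1
    total = max(user + sdk + lib, 1)
    return [user / total, sdk / total, lib / total, len(icc_intents)]

print(call_distribution_features(
    ["com.example.app.Main.onCreate",
     "android.telephony.SmsManager.sendTextMessage",
     "okhttp3.OkHttpClient.newCall"],
    icc_intents=["android.intent.action.SEND"],
    app_package="com.example.app"))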
We omitted from Table 6 the effectiveness numbers (e.g., detection rate and accuracy) for these compared works because they are not comparable: the numbers all came from varied evaluation datasets. In this paper, we have extensively studied two of the listed approaches versus ours on the same datasets. Nonetheless, in terms of any of the effectiveness metrics we considered, DroidCat appeared to have very promising performance relative to the state-of-the-art peer approaches.
TABLE 6
Comparison of recent works on Android malware classification in capability and robustness. DET: detection, CAT: family categorization, SYSC: system call, RT_PERM: run-time permission, RES: resource, OBF: obfuscation. DET and CAT indicate classification capability; the remaining columns indicate robustness against the respective analysis challenges (✓ = yes, ✗ = no).

Technique            Year       Approach   DET   CAT   Reflection   SYSC_OBF   RT_PERM   RES_OBF
DroidMiner [78]      2014       Static     ✓     ✓     ✗            ✓          ✓         ✓
DroidSIFT [79]       2014       Static     ✓     ✓     ✗            ✓          ✗         ✓
Drebin [15]          2014       Static     ✓     N/A   ✗            ✓          ✗         ✗
MudFlow [80]         2015       Static     ✓     N/A   ✗            ✓          ✓         ✓
Afonso et al. [27]   2015       Dynamic    ✓     N/A   unknown      ✗          ✓         ✓
Marvin [18]          2015       Hybrid     ✓     N/A   ✗            ✓          ✗         ✗
Madam [35]           2016       Hybrid     ✓     N/A   unknown      ✗          ✗         ✓
ICCDetector [55]     2016       Static     ✓     N/A   ✗            ✓          ✓         ✗
DroidScribe [34]     2016       Dynamic    N/A   ✓     ✓            ✗          ✓         ✓
StormDroid [26]      2016       Hybrid     ✓     N/A   ✗            ✓          ✗         ✓
MamaDroid [81]       2017       Static     ✓     N/A   ✗            ✓          ✓         ✓
DroidSieve [23]      2017       Static     ✓     ✓     ✓            ✓          ✗         ✗
DroidCat             this work  Dynamic    ✓     ✓     ✓            ✓          ✓         ✓
11 CONCLUSION
We presented DroidCat, a dynamic app classification technique that detects and categorizes Android malware with high accuracy. Features that capture the structure of app executions are at the core of our approach, in addition to those based on ICC and security-sensitive accesses. We empirically showed that this diverse, novel set of dynamic features enabled the superior robustness of DroidCat against analysis challenges such as heavy and complex use of reflection, resource obfuscation, system-call obfuscation, use of run-time permissions, and other evasion schemes. These challenges impede most existing peer approaches, a real concern since the app traits leading to the challenges are increasingly prevalent in the modern Android ecosystem.

Through extensive evaluation and in-depth case studies, we have shown the superior stability of our approach in achieving high classification performance compared to two state-of-the-art peer approaches, one static and one dynamic. Meanwhile, in absolute terms, DroidCat achieved significantly higher accuracy than the peer approaches studied for both malware detection and family categorization. Thus, DroidCat constitutes a promising solution complementary to existing alternatives.

REFERENCES
[1] "Android malware accounts for 97% of malicious mobile apps," http://www.scmagazineuk.com/updated-97-of-malicious-mobile-malware-targets-android/article/422783/, 2015.
[2] "The ultimate android malware guide: What it does, where it came from, and how to protect your phone or tablet," http://www.digitaltrends.com/android/the-ultimate-android-malware-guide-what-it-does-where-it-came-from-and-how-to-protect-your-phone-or-tablet/.
[3] K. Lu, Z. Li, V. P. Kemerlis, Z. Wu, L. Lu, C. Zheng, Z. Qian, W. Lee, and G. Jiang, "Checking more and alerting less: Detecting privacy leakages via enhanced data-flow analysis and peer voting," in NDSS, 2015.
[4] W. Yang, X. Xiao, B. Andow, S. Li, T. Xie, and W. Enck, "AppContext: Differentiating malicious and benign mobile app behaviors using context," in ICSE, 2015.
[5] Y. Feng, S. Anand, I. Dillig, and A. Aiken, "Apposcopy: Semantics-based detection of Android malware through static analysis," in FSE, 2014.
[6] H. Gascon, F. Yamaguchi, D. Arp, and K. Rieck, "Structural detection of Android malware using embedded call graphs," in Proceedings of the 2013 ACM Workshop on Artificial Intelligence and Security, 2013.
[7] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, "RiskRanker: Scalable and accurate zero-day Android malware detection," in MobiSys, 2012, pp. 281–294.
[8] H. Cai and J. Jenkins, "Leveraging historical versions of Android apps for efficient and precise taint analysis," in MSR, 2018, pp. 265–269.
[9] B. Wolfe, K. Elish, and D. Yao, "High precision screening for Android malware with dimensionality reduction," in International Conference on Machine Learning and Applications, 2014.
[10] K. Griffin, S. Schneider, X. Hu, and T.-C. Chiueh, "Automatic generation of string signatures for malware detection," in RAID, 2009.
[11] H. Kang, J.-w. Jang, A. Mohaisen, and H. K. Kim, "Detecting and classifying android malware using static analysis along with creator information," Int. J. Distrib. Sen. Netw., 2015.
[12] H. Peng, C. Gates, B. Sarma, N. Li, Y. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy, "Using probabilistic generative models for ranking risks of Android apps," in CCS, 2012.
[13] B. P. Sarma, N. Li, C. Gates, R. Potharaju, C. Nita-Rotaru, and I. Molloy, "Android permissions: A perspective combining risks and benefits," in Proceedings of the 17th ACM Symposium on Access Control Models and Technologies, 2012.
[14] W. Enck, M. Ongtang, and P. D. McDaniel, "On lightweight mobile phone application certification," in CCS, 2009, pp. 235–245.
[15] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, "Drebin: Effective and explainable detection of Android malware in your pocket," in NDSS, 2014.
[16] Y. Aafer, W. Du, and H. Yin, "DroidAPIMiner: Mining API-level features for robust malware detection in Android," in SecureComm, 2013.
[17] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, "DroidMat: Android malware detection through manifest and API calls tracing," in Proceedings of Asia Joint Conference on Information Security, 2012, pp. 62–69.
[18] M. Lindorfer, M. Neugschwandtner, and C. Platzer, "Marvin: Efficient and comprehensive mobile app classification through static and dynamic analysis," in COMPSAC, vol. 2, 2015, pp. 422–433.
[19] "Requesting permission at run time," https://developer.android.com/training/permissions/requesting.html, 2015.
[20] A. Moser, C. Kruegel, and E. Kirda, "Limits of static analysis for malware detection," in ACSAC, 2007, pp. 421–430.
[21] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, "Semantics-aware malware detection," in IEEE Symposium on Security and Privacy, 2005, pp. 32–46.
[22] J. Lee, K. Jeong, and H. Lee, "Detecting metamorphic malwares using code graphs," in SAC, 2010, pp. 1970–1977.
[23] G. Suarez-Tangil, S. K. Dash, M. Ahmadi, J. Kinder, G. Giacinto, and L. Cavallaro, "DroidSieve: Fast and accurate classification of obfuscated Android malware," in CODASPY, 2017, pp. 309–320.
[24] D. Maiorca, D. Ariu, I. Corona, M. Aresu, and G. Giacinto, "Stealth attacks: An extended insight into the obfuscation effects on Android malware," Computers & Security, vol. 51, pp. 16–31, 2015.
[25] G. Square, "DexGuard," https://www.guardsquare.com/en/dexguard, 2017.
[26] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu, "StormDroid: A streaminglized machine learning-based system for detecting Android malware," in AsiaCCS, 2016, pp. 377–388.
[27] V. M. Afonso, M. F. de Amorim, A. R. A. Grégio, G. B. Junquera, and P. L. de Geus, "Identifying Android malware using dynamically obtained features," Journal of Computer Virology and Hacking Techniques, vol. 11, no. 1, pp. 9–17, 2015.
[28] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss, ""Andromaly": A behavioral malware detection framework for Android devices," Journal of Intelligent Information Systems, vol. 38, no. 1, pp. 161–190, 2012.
[29] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, "Crowdroid: Behavior-based malware detection system for Android," in Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices, 2011, pp. 15–26.
[30] J. Ming, Z. Xin, P. Lan, D. Wu, P. Liu, and B. Mao, "Impeding behavior-based malware analysis via replacement attacks to malware specifications," Journal of Computer Virology and Hacking Techniques, pp. 1–15, 2016.
[31] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda, "Scalable behavior-based malware clustering," in NDSS, 2009.
[32] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, "CopperDroid: Automatic reconstruction of Android malware behaviors," in NDSS, 2015.
[33] H. S. Galal, Y. B. Mahdy, and M. A. Atiea, "Behavior-based features model for malware detection," Journal of Computer Virology and Hacking Techniques, pp. 1–9, 2015.
[34] S. K. Dash, G. Suarez-Tangil, S. Khan, K. Tam, M. Ahmadi, J. Kinder, and L. Cavallaro, "DroidScribe: Classifying Android malware based on runtime behavior," Mobile Security Technologies, 2016.
[35] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, "Madam: Effective and efficient behavior-based Android malware detection and prevention," TDSC, 2016.
[36] W. Ma, P. Duan, S. Liu, G. Gu, and J.-C. Liu, "Shadow attacks: Automatically evading system-call-behavior based malware detection," J. Comput. Virol., 2012.
[37] S. Forrest, S. Hofmeyr, and A. Somayaji, "The evolution of system-call monitoring," in ACSAC, 2008, pp. 418–430.
[38] A. Srivastava, A. Lanzi, J. Giffin, and D. Balzarotti, "Operating system interface obfuscation and the revealing of hidden operations," in DIMVA, 2011, pp. 214–233.
[39] L. Xu, D. Zhang, M. A. Alvarez, J. A. Morales, X. Ma, and J. Cavazos, "Dynamic Android malware classification using graph-based representations," in Cyber Security and Cloud Computing (CSCloud), 2016 IEEE 3rd International Conference on, 2016, pp. 220–231.
[40] T. K. Ho, "Random decision forests," in International Conference on Document Analysis and Recognition, 1995.
[41] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of Android malware and Android analysis techniques," ACM Computing Surveys (CSUR), vol. 49, no. 4, p. 76, 2017.
[42] "Over 60 percent of Android malware comes from one malware family: Fakeinstaller," http://tech.firstpost.com/news-analysis/over-60-percent-of-android-malware-comes-from-one-malware-family-mcafee-48109.html, 2012.
[43] D. Octeau, D. Luchaup, M. Dering, S. Jha, and P. McDaniel, "Composite constant propagation: Application to Android inter-component communication analysis," in ICSE, 2015, pp. 77–88.
[44] B. Bichsel, V. Raychev, P. Tsankov, and M. Vechev, "Statistical deobfuscation of Android applications," in CCS, 2016, pp. 343–355.
[45] "Android app components," http://www.tutorialspoint.com/android/android_application_components.htm.
[46] Google, "Android Monkey," http://developer.android.com/tools/help/monkey.html, 2015.
[47] H. Cai and B. Ryder, "DroidFax: A toolkit for systematic characterization of Android applications," in ICSME, 2017, pp. 643–647.
[48] "VirusTotal," https://www.virustotal.com/.
[49] Y. Zhou and X. Jiang, "Dissecting Android malware: Characterization and evolution," in Proceedings of IEEE Symposium on Security and Privacy, 2012, pp. 95–109.
[50] P. Lam, E. Bodden, O. Lhoták, and L. Hendren, "Soot - a Java bytecode optimization framework," in Cetus Users and Compiler Infrastructure Workshop, 2011, pp. 1–11.
[51] J. Dean, D. Grove, and C. Chambers, "Optimization of object-oriented programs using static class hierarchy analysis," in ECOOP, 1995.
[52] H. Cai and R. Santelices, "Diver: Precise dynamic impact analysis using dependence-based trace pruning," in ASE, 2014, pp. 343–348.
[53] Google, "Android emulator," http://developer.android.com/tools/help/emulator.html, 2015.
[54] D. Octeau, P. McDaniel, S. Jha, A. Bartel, E. Bodden, J. Klein, and Y. L. Traon, "Effective inter-component communication mapping in Android with Epicc: An essential step towards holistic security analysis," in USENIX Security Symposium, 2013.
[55] K. Xu, Y. Li, and R. H. Deng, "ICCDetector: ICC-based malware detection on Android," TIFS, vol. 11, no. 6, pp. 1252–1264, 2016.
[56] J. Jenkins and H. Cai, "Dissecting Android inter-component communications via interactive visual explorations," in ICSME, 2017, pp. 519–523.
[57] ——, "ICC-Inspect: Supporting runtime inspection of Android inter-component communications," in MobileSoft, 2018, pp. 80–83.
[58] S. B. Kotsiantis, "Supervised machine learning: A review of classification techniques," 2007.
[59] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[60] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon, "AndroZoo: Collecting millions of Android apps for the research community," in MSR, 2016, pp. 468–471.
[61] "Google Play store," https://play.google.com/store, 2018.
[62] "VirusShare," https://virusshare.com/, 2018.
[63] Y. Zhou and X. Jiang, "Dissecting Android malware: Characterization and evolution," in Proc. of the IEEE Symposium on Security and Privacy, 2012, pp. 95–109.
[64] T. Dietterich, "Overfitting and undercomputing in machine learning," ACM Computing Surveys, vol. 27, no. 3, pp. 326–327, 1995.
[65] R. B. Rao, G. Fung, and R. Rosales, "On the dangers of cross-validation. An experimental evaluation," in SIAM International Conference on Data Mining, 2008, pp. 588–596.
[66] contagiodump.blogspot.com/.
[67] H. Cai, N. Meng, B. Ryder, and D. Yao, "DroidCat: Unified dynamic detection of Android malware," Tech. Rep. TR-17-01, January 2017, http://hdl.handle.net/10919/77523.
[68] D. Cournapeau, "Machine learning in Python," http://scikit-learn.org/stable/supervised_learning.html#supervised-learning, 2016.
[69] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., 1995.
[70] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2003.
[71] J. R. Quinlan, "Induction of decision trees," Mach. Learn., 1986.
[72] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175–185, 1992.
[73] S. Roy, J. DeLoach, Y. Li, N. Herndon, D. Caragea, X. Ou, V. P. Ranganath, H. Li, and N. Guevara, "Experimental study with real-world data for Android app security analysis using machine learning," in ACSAC, 2015, pp. 81–90.
[74] K. Allix, T. F. Bissyandé, Q. Jérome, J. Klein, Y. Le Traon et al., "Empirical assessment of machine learning-based malware detectors for Android," EMSE, vol. 21, no. 1, pp. 183–211, 2016.
[75] S. Rasthofer, S. Arzt, S. Triller, and M. Pradel, "Making malory behave maliciously: Targeted fuzzing of Android execution environments," in ICSE, 2017.
[76] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratantonio, V. Van Der Veen, and C. Platzer, "Andrubis–1,000,000 apps later: A view on current Android malware behaviors," in International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, 2014, pp. 3–17.
[77] S. Venkataraman, A. Blum, and D. Song, "Limits of learning-based signature generation with adversaries," in NDSS, 2008.
[78] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras, "DroidMiner: Automated mining and characterization of fine-grained malicious behaviors in Android applications," in European Symposium on Research in Computer Security, Springer, 2014, pp. 163–182.
[79] M. Zhang, Y. Duan, H. Yin, and Z. Zhao, "Semantics-aware Android malware classification using weighted contextual API dependency graphs," in CCS, 2014, pp. 1105–1116.
[80] V. Avdiienko, K. Kuznetsov, A. Gorla, A. Zeller, S. Arzt, S. Rasthofer, and E. Bodden, "Mining apps for abnormal usage of sensitive data," in ICSE, 2015, pp. 426–436.
[81] E. Mariconti, L. Onwuzurike, P. Andriotis, E. De Cristofaro, G. Ross, and G. Stringhini, "MamaDroid: Detecting Android malware by building Markov chains of behavioral models," in NDSS, 2017.
[82] H. Cai and B. Ryder, "Understanding Android application programming and security: A dynamic study," in ICSME, 2017, pp. 364–375.
[83] G. Canfora, E. Medvet, F. Mercaldo, and C. A. Visaggio, "Acquiring and analyzing app metrics for effective mobile malware detection," in Proceedings of the 2016 ACM International Workshop on Security and Privacy Analytics, 2016.
[84] L. Onwuzurike, M. Almeida, E. Mariconti, J. Blackburn, G. Stringhini, and E. De Cristofaro, "A family of droids: Analyzing behavioral model based Android malware detection via static and dynamic analysis," arXiv preprint arXiv:1803.03448, 2018.
[85] M. Dilhara, H. Cai, and J. Jenkins, "Automated detection and repair of incompatible uses of runtime permissions in Android apps," in MobileSoft, 2018, pp. 67–71.
[86] Google, "Android Developer Dashboard," http://developer.android.com/about/dashboards/index.html, 2016, accessed online 09/20/2016.