0% found this document useful (0 votes)

27 views14 pages

Tifs 18

Uploaded by

Nguyễn Thành An

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

27 views14 pages

Tifs 18

Uploaded by

Nguyễn Thành An

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

8, AUGUST 2015 1

DroidCat: Effective Android Malware Detection

and Categorization via App-Level Profiling
Haipeng Cai, Na Meng, Barbara Ryder, and Daphne Yao

Abstract—Most existing Android malware detection and categorization techniques are static approaches, which suffer from evasion
attacks such as obfuscation. By analyzing program behaviors, dynamic approaches are potentially more resilient against these attacks.
Yet existing dynamic approaches mostly rely on characterizing system calls which are subject to system-call obfuscation. This paper
presents DroidCat, a novel dynamic app classification technique, to complement existing approaches. By using a diverse set of
dynamic features based on method calls and ICC Intents without involving permission, app resources, or system calls while fully
handling reflection, DroidCat achieves superior robustness than static approaches as well as dynamic approaches relying on system
calls. The features were distilled from a behavioral characterization study of benign versus malicious apps. Through three
complementary evaluation studies with 34,343 apps from various sources and spanning the past nine years, we demonstrated the
stability of DroidCat in achieving high classification performance and superior accuracy compared to two state-of-the-art peer
techniques that represent both static and dynamic approaches. Overall, DroidCat achieved 97% F1-measure accuracy consistently for
classifying apps evolving over the nine years, detecting or categorizing malware, 16% to 27% higher than any of the two baselines
compared. Further, our experiments with obfuscated benchmarks confirmed higher robustness of DroidCat over these baseline
techniques. We also investigated the effects of various design decisions on DroidCat’s effectiveness and the most important features
for our dynamic classification. We found that features capturing app execution structure such as the distribution of method calls over
user code and libraries are much more important than typical security features such as sensitive flows.

Index Terms—Android, security, malware, dynamic analysis, profiling, detection, categorization, stability, robustness, obfuscation.

1 I NTRODUCTION
Android has been the target platform of 97% malicious apps can request, acquire, and revoke permissions at
mobile apps [1], most of which steal personal information, runtime [19]. This new run-time permission mechanism
abuse privileged resources, and/or install additional implies that static approaches will not be able to discover
malicious software [2]. With the Android market growing when an abnormal permission is requested and granted at
rapidly, it is critically important to differentiate malware runtime, and would suffer from more false alarms if users
from benign apps (i.e., malware detection). Further, for revoke dangerous permissions after app installation. Third,
threat assessment and defense planning, it is also crucial to static approaches have limited capabilities in detecting
differentiate malware of different families (i.e., malware malicious behaviors that are exercised through dynamic
categorization by family). code constructs (e.g., calling sensitive APIs via reflection).
Two main classes of approaches to Android malware These and other limits [20] make static analysis vulnerable
detection/categorization have been studied: static and to widely adopted detection-evading schemes (e.g., code
dynamic. Static approaches leverage static code analysis to obfuscation [21] and metamorphism [22]). Recently,
check whether an app contains abnormal information flows resource-centric features are also used in static code
or calling structures [3], [4], [5], [6], [7], [8], matches analysis to overcome the above-mentioned limitations [23].
malicious code patterns [9], [10], requests for excessive However, such approaches can still be evaded by malware
permissions [11], [12], [13], [14], and/or invokes APIs that adopting resource obfuscation [24], [25].
are frequently used by malware [15], [16], [17], [18]. In comparison, dynamic approaches provide a
Static approaches may have the advantage of being complementary way to detect/categorize malware [26],
sound and scalable to screening large numbers of apps, yet [27], [28], [29]. In particular, behavior-based techniques [30]
they cannot always precisely detect malware for three model program behavioral profiles [31], [32] with
reasons. First, due to the event-driven features of Android, system/API call traces (e.g., [33], [34], [35]) and/or
such as lifecycle callbacks and GUI handling, run-time resource usage [28], [33]. Machine learning has been
control/data flows are not always statically estimatable; increasingly incorporated in these techniques, training
they depend on the run-time environment. This classification models from those profiles to distinguish
approximation makes static analysis unable to reveal many malware from benign apps. However, system-call based
malware activities. Second, the mere existence of some malware detectors can still be evaded when an app
permissions and/or APIs in code does not always mean obfuscates system calls [30], [36], [37], [38]. Sensitive API
that they are actually executed or invoked frequently at usage does not necessarily indicate malicious intentions.
runtime to cause an attack. Purely checking for existence of Abnormal resource usage does not always correspond to
permissions and/or APIs can cause static analysis to abnormal behaviors, either. Generally, behavior-based
wrongly report malware. In particular since API Level 23, approaches relying on system-call sequences and/or
Android has added dynamic permission support such that dependencies may be easily thwarted by system-call
obfuscation techniques (e.g., mimicry attack [37] and
• Haipeng Cai is with the School of Electrical Engineering and Computer illusion [38]). A more comprehensive dynamic app
Science, Washington State University, Pullman, WA, 99163. classifier is needed to capture varied behavioral profiles
E-mail: [email protected]
• Na Meng, Barbara Ryder, and Daphne Yao are with the Department of and thus be robust to attacks against specific profiles.
Computer Science, Virginia Tech, Blacksburg, VA. Recent dynamic Android malware categorization
Manuscript received February 19, 2018; revised September, 2018. approaches utilize the histogram [39] or chains (or
dependencies) [34] of system calls. A recent dynamic sophisticated obfuscation schemes, along with three
Android malware detection technique [26] differentiates different sets of benign apps. Our study showed that
API call counts and strictly matches API signatures to DroidCat worked robustly well on obfuscated malware
distinguish malware from benign apps. Due to the with 96% to 97% F1 accuracy, significantly (5% to 46%)
underlying app features used, both kinds of techniques are higher than the F1 accuracy of either baseline approach.
subject to replacement attacks [30] (replacing system-call Our analysis of the three techniques’ performance with
dependencies with semantically equivalent variants) in respect to varying decision thresholds further corroborated
addition to system-call obfuscation (renaming system-call the consistent advantages of our approach. We also
signatures). Also, the detection approach [26] may not conducted in-depth case studies to assess the performance
work with apps that adopt the dynamic permission of DroidCat on individual malware families and various
mechanism already introduced in Android [19], due to its factors that may impact its performance.
reliance on statically retrieved app permissions. Several In summary, we made the following contributions:
other malware detectors combine dynamic profiles with • We developed DroidCat, a novel Android app
static features (e.g., those based on APIs [18] or static classification approach based on a new, diverse set of
permissions [26], [35]), and thus are vulnerable to the same features that capture app behaviors at runtime through
evasion schemes impeding static approaches. short app-level profiling. The features were discovered
In this paper, we develop a novel app classification from a dynamic characterization study that revealed
technique, DroidCat, based on systematic app-level behavioral differences between benign and malicious
profiling and supervised learning. DroidCat is developed apps in terms of method calls and ICCs.
to not only detect but also categorize Android malware • We evaluated DroidCat via three complementary
effectively (referred to as the malware detection and malware studies versus two state-of-the-art peer approaches as
categorization mode, respectively). Different from existing baselines on 34,343 distinct apps spanning year 2009
learning-based dynamic approaches, DroidCat trains its through year 2017. Our results showed that DroidCat
classification model based on a diverse behavioral app largely outperformed the baselines in stability,
profile consisting of features that cover run-time app classification performance, and robustness in both
characteristics in complementary perspectives. DroidCat classification modes, with competitive efficiency.
profiles inter-component communication (ICC) calls and • We conducted in-depth case studies of DroidCat
invocations of all methods, including those defined by user concerning its performance on individual malware
code, third-party libraries, and the Android framework, families and various factors that affect its classification
instead of monitoring system calls. Also, it fully handles capabilities. Our results confirmed the consistently
reflective calls while not using features based on app high overall performance of DroidCat, and additionally
resources or permissions. DroidCat is thus robust to attacks showed its strong performance on most of the families
targeting system calls or exploiting reflection. DroidCat is we examined. We also identified the most effective
also robust to attacks targeting specific sensitive APIs, learning algorithm and dynamic features for DroidCat,
because they are not the only target of method invocations. and demonstrated the low sensitivity of DroidCat to
The features used in DroidCat were decided based on a the coverage of dynamic inputs.
dynamic characterization study of 136 benign and 135 • We released for public access DroidCat and our
malicious apps. In the study, we traced the execution of benchmark suites, to facilitate reproduction of our
each app, defined and evaluated 122 behavioral metrics to results and the development/evaluation of future
thoroughly characterize behavioral differences between the malware detection and categorization techniques.
two app groups. All these metrics measure the relative
occurrence percentage of method invocations or ICCs, which
can never be captured by static malware analyzers. 2 M OTIVATING E XAMPLE
Based on the study, we discovered 70 discriminating Malware developers increasingly adopt various
metrics with noticeably different values on the two app obfuscation techniques (e.g., code reflection) to evade
groups, and included all of them in the feature set. The 70 security checks [41]. Figure 1 shows five code excerpts in a
features are grouped into three feature dimensions: real trojan-horse sample of a dominant [42] malware family
structure, security, and ICC. By training a model with the FakeInst which sends SMS texts to premium-rate numbers.
Random Forest algorithm [40], DroidCat builds a Except for data encryption (by calling m6 on line 3), this
multi-class classifier that predicts whether an app is benign malware heavily uses reflection to invoke methods
or malicious from a particular malware family. We including Android APIs so as to access privileged resources
extensively assessed DroidCat in contrast to DroidSieve [23], such as device id (lines 1–11). In addition, to exploit the
a state-of-the-art static app classification technique, and SMS service (lines 13–18), it retrieves the text and number
Afonso [27], a state-of-the-art dynamic peer approach. The needed for the malicious messaging via reflection (line
evaluation experiments are conducted on 17,365 benign 27–29) and then invokes sendSms (line 30) which calls
and 16,978 malicious apps that span the past nine years. ad.notify.SmsItem::send via reflection again (lines 20–23).
Our evaluation results revealed very-high stability of While simple reflection with string constants (e.g., lines
DroidCat in providing competitive classification accuracy 22,27,28) can be deobfuscated by static analysis [43], [44] (at
for apps evolving over the years: it achieved 97.4% and extra cost), more complex cases may not be (e.g., lines
97.8% F1 accuracy for malware detection and malware 4,5,15,16 where the class and method names are retrieved
categorization, respectively, all with small variations across from a database object mdb). As a result, static code-based
the datasets of varying years. Our comparative study features related to APIs and sensitive flows would not be
further demonstrated the substantial advantages of extracted from the app, and techniques based on such
DroidCat over both baseline techniques, with 27% and 16% features would not detect the security threats. Also, the
higher F1 accuracy in the detection and categorization malicious behaviour in this sample is exhibited in its code
mode, respectively. We also assessed the robustness of our only, not reflected in its resource/asset files (e.g.,
approach against a set of malware adopting various configuration and UI layout); thus approaches bypassing

2
1 // in ad.notify.Settings::getImei(Context context) sensitive information flows. There are also output APIs that
2 // m6 returns ’phone’; cls returns ’android.telephony.TelephonyManager’ send data out of the current component via network or
3 TelephonyManager tm = context.getSystemService(m6(b, b−1, x|76)); storage. We consider them sinks of potential sensitive
4 Class c = Class.forName(mdb.cls(ci)); information flows. If an app’s execution trace has any
5 Method m = c.getMethod(mdb.met(mi),null); //met returns ’getDeviceId’
6 return m.invoke(tm, null); (control-flow) paths from sources to sinks, the app might
7 be malicious due to potential sensitive data leakage.
8 // in NotificationApplication::onCreate(); cls returns ’ad.notify.Settings’
9 Class c = Class.forName(mdb.cls(ci)); //met returns ’getImei’
10 Method m = c.getMethod(mdb.met(mi), new Class<Context>[1]); 4 F EATURE D ISCOVERY AND C OMPUTATION
11 adUrl += m.invoke(null, context);
12 At the core of our approach are its features that are
13 // in ad.notify.SmsItem::send(String str, String str2) computed from app execution traces. Although from these
14 // cls returns ’android.telephony.SmsManager’
15 Class c = Class.forName(mdb.cls(ci)); //met returns ’sendTextMessage’
traces we could extract many features, not every feature is
16 Method m = c.getMethod(mdb.met(mi), new Class<Object>[5]); a good differentiator of malicious apps from benign ones.
17 SmsManager smsManager = SmgManager.getDefault(); Therefore, with a relatively small dataset (136 benign and
18 m.invoke(smsManager, str, null, str2, null, null) 135 malicious apps) (Section 4.1), we first conducted a
19
20 // in ad.notify.OperaUpdateActivity::sendSms(String str, String str2)
systematic dynamic characterization study by defining and
21 Class c = Class.forName(mdb.cls(ci)); // cls returns ’ad.notify.SmsItem’ measuring 122 metrics (Section 4.2) as possible features.
22 Method m = c.getMethod("send", new Class<String>[2]); Based on the comparison between the two groups of apps,
23 Boolean bs = m.invoke(null, str, str2);
we decided which metrics were good differentiation
24
25 // in ad.notify.OperaUpdateActivity::threadOperationRun(int i, Object o)
factors, and thus included them into our feature set
26 SmsItem smsItem=getSmsItem(ad.notify.NotifyApplication.smsIndex); (Section 4.4). The central objective of this exploratory study
27 Class c = Class.forName("ad.notify.SmsItem"); is to discover the features to be used by DroidCat.
28 Field f1 = c.getField("number"); int number = f1.get(smsItem);
29 Field f2 = c.getField ("text"); Object text = f2.get(smsItem);
30 sendSms(number, text); 4.1 Benchmarks
Fig. 1. Code excerpts from a FakeInst malware sample: the complex and Our characterization study used a benchmark suite of both
heavy use of reflection can thwart static code-based feature extraction. benign apps and malicious apps. To collect benign apps,
code analysis (e.g., DroidSieve [23]) might not succeed we downloaded the top 3,000 most popular free apps in
either. Further, malware developers can easily obfuscate Google Play at the end of year 2015 as our initial candidate
app resources too [25]. In these situations, we believe that a pool. Next, we randomly selected an app from the pool and
robust dynamic approach is a necessary complement for checked whether it met the following three criteria: (1) the
defending against such malware samples. minimum supporting SDK version is 4.4 (API 19) or above,
(2) the instrumented APK file runs successfully with inputs
by Monkey [46], and (3) navigating the app with Monkey
3 BACKGROUND inputs for ten minutes covers at least 50% of user code (we
Android applications. Programmers develop Android used our characterization toolkit DroidFax [47] which
apps primarily using Java, and build them into app includes a statement-coverage measurement tool directly
package (i.e., APK) files. Each APK file can contain three working with APKs, which instruments each statement in
software layers: user code (userCode), Android libraries user code to track coverage at runtime). If an app met all
(SDK), and third-party libraries if any (3rdLib). An Android criteria, we further checked it with VirusTotal [48] to
app typically comprises four components as follows [45]: confirm if the app was benign. As such we obtained 136
Activities which deal with UI and handle user interaction to benign apps. For malicious apps, we started with the
the device screen, Services which handle background MalGenome dataset [49], the most widely used malware
processing associated with an application, Broadcast collection. We found 135 apps meeting the above criteria,
Receivers which handle communication between Android and confirmed them all as malware using VirusTotal. The
OS and applications, and Content Providers which handle APK sizes of our benchmarks vary from 2.9MB to 25.6MB.
data storage and management (e.g., database) issues. Recall that this characterization study is exploratory with
ICC. Components interact with each other through ICC the goal of identifying robust and discriminating dynamic
objects—mainly Intents. If both the sender and the receiver features for app classification, thus we aimed at a relatively
of an Intent are within the same app, we classify the ICC as small scale (in terms of the benchmark suite size).
internal; otherwise, it is external. If an Intent has the receiver
explicitly specified in its content, we classify the ICC as 4.2 Metrics Definition
explicit; otherwise, it is implicit. Based on collected execution traces, we characterized app
Lifecycle methods and callbacks. Each app component behaviors by defining 122 metrics in three orthogonal
follows a prescribed lifecycle that defines how this dimensions: structure, ICC, and security (Table 1).
component is created, used, and destroyed. Intuitively, the more diversely these metrics capture app
Correspondingly, developers are allowed to overwrite execution, the more completely they characterize app
various lifecycle methods (e.g., onCreate(), onStart(), and behaviors. These metrics measure not only the existence of
onDestroy()) to define program behaviors when the events certain method invocations or ICCs, but also their relative
happen. Developers can also overwrite other event occurrence frequencies and distribution. For brevity, we
handlers (e.g., onClick()) or define new callbacks to will only discuss a few metrics in the paper; detailed
implement extra logic when other interesting events occur. description of all metrics can be found at
Security-relevant APIs. There are sensitive APIs that http://chapering.github.io/droidfax/metrics.htm.
acquire personal information of users like locations and Structure dimension contains 63 metrics on the
contacts. For example, Location.getLatitude() and distributions of method calls, their declaring classes, and
Location.getLongitude() retrieve GPS location caller-callee links. 31 of these metrics describe the
coordinates. We consider these APIs sources of potential distributions of all method calls among three code layers

3
TABLE 1
Metrics for dynamic characterization and feature selection
# of Substantially # of Noticeably
Dimension # of Metrics Exemplar Metric Disparate Metrics Different Metrics
Structure 63 The percentage of method calls whose definitions are in user code. 15 32
ICC 7 The percentage of external implicit ICCs. 2 5
Security 52 The percentage of sinks reachable by at least one path from a sensitive source 19 33
Total 122 36 70

(i.e. user code, third-party libraries, and Android SDK), or do not monitor OS-level system calls, because we want
among different components. The other 32 metrics describe DroidCat to be robust to any attacks targeting system calls.
the distributions of a specific kind of methods—callbacks Our instrumentation is not limited to sensitive APIs, either.
(including lifecycle methods and event handlers). One By ensuring that sensitive APIs are not the only target
example Structure metric is the percentage of method calls scope of method-call profiling, we make DroidCat more
to the SDK layer. Another example is the percentage of robust against attacks targeting sensitive APIs. Prior work
Activity lifecycle callbacks over all callbacks invoked. shows that even without invoking malicious system calls or
ICC dimension contains 7 metrics to describe ICC sensitive APIs, some malicious apps still can conduct
distributions. Since there are two ways to classify ICCs, attacks by manipulating other apps via ICCs [54], [55], [56],
internal vs. external, and implicit vs. explicit, enumerating all [57]. Thus, we also trace ICCs to further reveal behavioral
possible combinations leads to four metrics. The other three differences between benign and malicious apps.
metrics are defined based on the type of data contained in To characterize the dynamic behaviors of apps, we need
the Intent object associated with an ICC: the Intent carries to run each instrumented app for a sufficiently long time
data in either its URI or extras field only, or both. One using various inputs to cover as many program paths as
example ICC metric is the percentage of ICCs that carry possible. Manually entering inputs to apps is very
data through URI only. Another example is, out of all ICCs inefficient. In order to quickly trigger diverse executions of
exercised, the percentage that are implicit and external. an app, we used Monkey [46] to randomly generate inputs.
Security dimension contains 52 metrics to describe To balance between efficiency and code coverage, we set
distributions of sources, sinks, and the reachability between Monkey to feed every app for ten minutes. (DroidCat only
them through method-level control flows. The reachability executes each app for five minutes; we investigated the
is used to differentiate from all exercised sources/sinks that effect of dynamic coverage on the effectiveness of DroidCat
are risky. If a source reaches at least one sink, it is in Section 7.3.) Once the trace for an app is collected via the
considered a risky source. Similarly, a risky sink is reachable probed run-time monitors, most of the 122 metrics are
from at least one source. Both of these indicate security computed through straightforward trace statistics. The
vulnerabilities, because sensitive data may be leaked when metrics involving risky sources/sinks are calculated
flowing from sources to sinks. For example, a Security through a dynamic call graph built from the trace, which
metric is the percentage of method calls targeting sources facilitates reachability computation.
over all method calls. Another example is the percentage of
exercised sinks that are risky.
4.4 Metrics (Feature) Selection
To identify any metric that well differentiates between the
4.3 Metrics (Feature) Computation two app groups, we measured the value of each metric on
To compute the 122 metrics of an Android app, we first every benchmark app, and then computed the mean values
instrumented the program for execution trace collection. separately for all benign and malware benchmarks. If a
Specifically, we used Soot [50] to transform each app’s APK metric had a mean value difference greater than or equal to
along with the SDK library (android.jar) into Jimple 5%, we considered the behavioral profile of the two groups
code (Soot’s intermediate representation), and then inserted substantially disparate with respect to the metric. If a
in the Jimple code probes to run-time monitors for tracing metric had a difference greater than or equal to 2%, we said
every method call (including those targeting SDK APIs and the behavioral profile was noticeably different with respect
third-party library functions) and every ICC Intent. We also to the metric. We experimented with various thresholds
labeled additional information for instrumented classes chosen heuristically, and found these two (5% and 2%)
and methods to facilitate metric computation. For instance, reasonably well represent two major levels of
we marked the component type for each instrumented differentiation between our malware and benign samples.
class, the category of each instrumented callback, and the As shown in Table 1, by comparing mean metric values
source or sink property of each relevant SDK API. To across app groups, we found 36 substantially disparate
decide the component type of a class such as Foo, we metrics, and 70 noticeably different metrics. We show the
applied Class Hierarchy Analysis (CHA) [51] to identify all top 10 differentiating metrics in Figure 2. There are ten
the superclasses. If Foo extends any of the four known metrics listed on the Y-axis, and the X-axis corresponds to
component types such as Activity, its component type is mean metric values, which vary from 0% to 100%. Each
labeled accordingly. We used the method-type mapping list metric listed on Y-axis corresponds to: a red bar to show
in [47] to label the category of callbacks and the the mean value of all malicious apps, and a green bar to
source/sink property of APIs. Exception handling and represent the mean of all benign ones. The whisker on each
reflection are two widely used Java constructs. Accordingly, bar represents the standard error of the mean. Empirically,
our instrumentation fully tracks two special kinds of these 10 metrics best demonstrate the behavioral
method and ICC calls: (1) those made via reflection, and (2) differences between malicious and benign apps.
those due to exceptional control flows [52] (e.g., calls from In the structure dimension, malicious apps call fewer
catch blocks and finally blocks). methods defined in SDK and more methods defined in user
Next, we ran the instrumented APK of each app on an code, and involve more callbacks relevant to UI. This
Android emulator [53] to collect execution traces, which indicates that user operations may trigger excessive or
include all method calls and ICCs exercised. Note that we unexpected computation. For instance, on average, SDK

4
SDK‐>SDK calls 8% malware benign
60%
UserCode‐>SDK calls 45%
19%
Structure
3rdLib‐>SDK calls 28%
12%
Activity lifecycle callback 92%
73%
View event handler 84%
56%
System status event handler 5%
23%
External explicit ICC 42%
10%
ICC

ICC carrying URI data only 26%

14%
Logging sink invocation 9%
Security

21%
Risky source invocation 39%
27%

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Fig. 2. Top-10 differentiating metrics between malware and benign apps revealed by our exploratory characterization study.
Benign Feature Computation Training
apps Instrumented Supervised
Instrumentation apps learning
Malicious
Behavioral
apps
Monitoring ordinary method Profiling features classification
calls and ICC Intents, and fully Multi-class result
handling calls via reflection Execution classifier
Android and/or exceptional control flows. Feature extraction
traces
apps Testing
Fig. 3. DroidCat overview: it trains a multi-class classifier using benign and malicious apps and then classifies unknown apps.

APIs account for 80% function calls in the execution of

Implication 2: Malware may use more explicit ICCs to
malicious apps, but 91% function calls in benign apps’
potentially attack specific external components, or disse-
executions. Concerning caller-callee links, For instance,
minate potentially malicious URIs more often via ICCs.
SDK->SDK accounts for 60% of method calls in benign apps,
but only accounts for 8% in malware. In comparison,
UserCode->SDK and 3rdLib->SDK are the most frequent In the security dimension, malware invokes more risky
caller-callee links in malware. In terms of callbacks, source APIs, but fewer logging sink APIs. By executing
malicious apps involve 92% Activity lifecycle callbacks more risky sources, malware may cause sensitive data
and 84% View event handlers. Both numbers are leakage. As shown in Figure 2, among all invoked source
significantly larger than their counterparts in benign apps: APIs, malicious apps have 39% risky sources, while benign
73% and 56%. However, malicious apps involve a much ones only have 27%. Among all invoked sink APIs, logging
smaller portion of system status event handlers than sinks accounts for 21% for benign apps, but 9% for
benign ones, which is 5% vs. 23%. malware, which means users may log more frequently in
benign apps to secure their sensitive data operations.
Besides, we also observe that sink APIs like Network access
Implication 1: Malware tended to invoke SDK APIs and Messaging (SMS/MMS), are invoked more frequently
more often from user code or third-party libraries, and by malware than benign apps.
define more UI callbacks indicating that user operations
on them may trigger excessive/unexpected computation. Implication 3: Malicious apps exhibit less logging
practice than benign ones. They execute more risky sour-
ces, which may lead to sensitive data leakage.
In the ICC dimension, malware involves more external
explicit ICCs with more URI data carried by Intents. This
means that malware uses explicit ICCs more often to
potentially exploit specific external components, or sends 5 T HE DroidCat A PPROACH
more URI data via ICCs to disseminate potentially Based on our characterization study, we developed
malicious URIs. Specifically, on average, malware DroidCat, an app classification approach leveraging
executions contain 42% external explicit ICCs, while benign systematic profiling and supervised learning, to decide
app executions only contain 10%. In reality, to whether a given app is benign or belongs to a particular
communicate with external components, benign apps malware family. As shown in Figure 3, there are two phases
usually leverage implicit instead of explicit ICCs. This is in our approach: training and testing. For training,
reasonable because developers usually know little about DroidCat takes both benign and malicious apps as inputs.
the run-time environment of a user’s phone, such as what For each app, it computes the 70 metrics as behavioral
external components are available to receive Intents. features as described above. The features are then provided
Therefore, external implicit ICCs can be used to flexibly to supervised machine learning to train a multi-class
detect all available potential receivers before any Intent is classifier [58], using the Random Forest algorithm [40]. For
sent. In contrast, the frequently used external explicit ICCs testing, given an arbitrary app, DroidCat computes the
by malware seem suspicious. Among data-carrying ICCs, same set of behavioral features and then feeds these
malicious apps have 30% ICCs carrying URI data, while features to the classifier to decide whether the app is
this number is only 10% in benign apps. benign or a member of a malware family.

5
We implemented the learning component of DroidCat in of these samples ranged across the past nine years (i.e.,
Python, using the Scikit-learn toolkit [59], to train and test 2009–2017). We note that there were not exactly the same
the classifier. We provided open source the entire DroidCat samples shared by any two of our datasets (although some
system (including the feature computation component) and samples in one dataset might be the evolved versions of
our datasets at https://chapering.github.io/droidcat. samples in another). In cases where the original datasets
(e.g., DB and MG) overlap, we removed the common
samples from either dataset (e.g., we dismissed MG
6 E VALUATION samples from the original DB dataset). We also ensured
For a comprehensive assessment of DroidCat’s capabilities that these four datasets did not overlap with the dataset
in malware detection and categorization, we conducted used in our characterization study—we excluded the 136
three complementary evaluation studies. In Study I, we aim GP apps and 135 MG malware used in that study when
to gauge the stability of DroidCat by applying it to four forming the four datasets in Table 2. The reason was to
longitudinal datasets (across nine years) to see how well it avoid relevant biases (e.g., overfitting) since the
works for apps in the evolving Android ecosystem. In characterization dataset was used for developing/tuning
Study II, we compare the prediction performance of DroidCat (i.e., for discovering/selecting its features).
DroidCat against state-of-the-art peer approaches, including
a static and a dynamic approach, by applying the three 6.2 Experimental Setup
techniques to two newest datasets among the four. In Study
III, we measure the robustness against obfuscation of the For the baseline techniques, we consider both static and
three techniques using an obfuscation benchmark suite dynamic approaches to Android malware prediction. In
along with varying benign sample sets. We first describe particular, we compare DroidCat to DroidSieve, a
our evaluation datasets in Section 6.1 and procedure in state-of-the-art static malware detection and categorization
Sections 6.2 and 6.3, and then present the evaluation results approach. DroidSieve characterizes an app with
of the three studies in Sections 6.4, 6.5, and 6.6, respectively. resource-centric features (e.g., use permissions extracted
from the manifest file of an APK) in addition to code
(syntactic) features, and then uses these features to train an
6.1 Datasets Extra Trees model that is later used for predicting the label
TABLE 2 of a given app. We chose Afonso as another baseline
Main datasets used in our evaluation studies technique, a state-of-the-art dynamic approach for app
Benign apps Malware
Dataset Period Source #Apps Source #Apps #Families classification. Afonso traces in an app the invocations of
D1617 2016-2017 GP,AZ 5,346 VS,AZ 3,450 153 Android APIs and system calls in specified lists, and uses
D1415 2014-2015 GP,AZ 6,545 VS,AZ 3,190 163 the call frequencies to differentiate malware from benign
D1213 2012-2013 GP,AZ 5,035 VS,AZ,DB,MG 9,084 192 apps based on a Random Forest model.
D0911 2009-2011 AZ 439 VS,AZ,DB,MG 1,254 88
To enable our comparative studies, we obtained the
Table 2 lists the various datasets (named by ids in the feature computation code from the DroidSieve authors and
first column) used in our evaluation experiments. Each implemented the learning component. With help of the
dataset includes a number (fourth column) of benign apps authors, we were able to reproduce the performance results
and a number (sixth column) of malware in a number (the against part of the datasets used in the original evaluation
last column) of families. Apps in each of these four datasets of this technique hence gained confidence about the
are all from the same period (range of years, in the second correctness of our implementation. We implemented the
column), according to the age of each app measured by its Afonso tool according to the API and system call lists
first-seen date we obtained from VirusTotal [48]. The table provided by the authors. We developed DroidCat as
gives the sources (third and fifth columns) of the samples. described earlier. To compute the features and prediction
AndroZoo (AZ) [60] is an online Android app collection results with the two baselines, we followed the exact
that archives both benign and malicious apps. Google Play settings as described in the respective original papers (and
(GP) [61] is the official Android app store. VirusShare by the authors of DroidSieve via emails for which we
(VS) [62] is a database of malware of various kinds initially had difficulties getting performance results close to
including Android malware. The Drebin dataset (DB) is a originally reported ones). In particular, to produce the
set of malware shared in [15], and (Malware) Genome execution traces required by Afonso, we ran each app on a
(MG) is a malware set shared in [63]. We also used an Nexus One emulator with API Level 23, 2G RAM, and 1G
obfuscation benchmark suite along with benign apps from SD storage for 5 minutes as triggered by Monkey random
AZ and GP for Study III (as detailed in Section 6.6). inputs (same as for DroidCat as described in Section 4.3).
Concerning the overhead of dynamic analysis, we All of our experiments were performed on a Ubuntu 15.04
randomly chose a subset of samples from each respective workstation with 8G DDR and a 2.6GHz processor.
source, except for the MG dataset which we used all
samples therein given its relatively small size. A few apps
were discarded during the benchmark collection, because 6.3 Methodology
they could not be unzipped, were missing resource files We evaluated DroidCat in each of its two working modes:
(e.g., assets), or could not be successfully instrumented, (1) malware detection, in which it labels a given app as either
installed, or traced. In particular, for the D1617 and D1415 benign or malicious, and (2) malware categorization, in which
datasets which we used for a comparative study (Study II), it labels a given malware with the predicted malware family.
we also discarded samples with which any of the three To facilitate the assessment of DroidCat in these two different
compared techniques failed in its analysis. We did not modes, we simply treat DroidCat as a multi-class classifier,
apply any of the selection criteria in the characterization with different number of classes to differentiate in different
study (Section 4.1). The numbers (of samples) listed in the modes (e.g., two classes in the detection mode, and two or
table are those of the remaining samples actually used in more classes in the categorization mode).
our studies. In all, our datasets include 17,365 benign apps For Study I, we ran four tests of DroidCat, each using
and 16,978 malware, for a total of 34,343 samples. The age one of the four datasets (D0911 through D1617). For Study

6
II, we executed DroidCat and the two baselines on D1617 1.0 1.0

and D1415, because these two are the most recent datasets.
0.8 0.8
We used three obfuscation datasets for Study III, in which
we ran three tests of the three techniques accordingly.

True Positive Rate

True Positive Rate
0.6 0.6
In each test of these three studies, we sorted apps of
each class by their age (first-seen date) and split the apps by 0.4 0.4

the date at 70 percentile, and then we held out the 30%

D1617 (0.9759) D1617 (0.9476)
newest ones from each class for testing while using the rest 0.2
D1415 (0.9715)
0.2
D1415 (0.9414)
D1213 (0.9500) D1213 (0.9844)
for training. We used this hold-out validation in order to D0911 (0.9748) D0911 (0.9779)
0.0 0.0
avoid overfitting [64]: samples used for fitting a 0.0 0.2 0.4 0.6
False Positive Rate
0.8 1.0 0.0 0.2 0.4 0.6
False Positive Rate
0.8 1.0

classification model are never used in validating the model.

Our evaluation studies did not involve any re-sampling (as Fig. 4. DroidCat ROC curves with AUCs for malware detection (left) and
in cross validation) either, so as to avoid causing biases in categorization (right) on four datasets (D0911 through D1617).
the validation results [65]. This experiment design also
makes sure that we never use a model trained on newer Finally, the overall F1 is computed with:
samples to test older samples—doing so would not be 2 ∗ Poverall ∗ Roverall
sensible with respect to the practical use scenarios of a F 1overall = (5)
malware detector and the evolution of apps. Poverall + Roverall
The three studies share the same set of metrics for To further assess the capabilities of our approach versus
evaluating the performance of the compared techniques in the baselines, we compute the receiver operating
predicting apps of each class. We compute these metrics for characteristic (ROC) curve for each technique and relevant
each class and then average the metrics values among all dataset it applied to. These curves show how a binary
classes to obtain the overall performance of each technique. classifier performs with respect to varying decision
Specifically, for each class Ci , we assessed a technique’s thresholds, as opposed to one (default) threshold
performance with the following three metrics: associated with an F1 score. They also depict various
Precision (P) measures among all the apps labeled as tradeoffs between true positive and false positive rates.
“Ci ” by the technique, how many of them actually belong Thus, the curves complement the three accuracy metrics (P,
to that class. R, and F1), together constituting a comprehensive measure
# of apps belonging to Ci of the classification performance. In particular, we used the
Pi = . (1) prediction probabilities produced by the classifier to
Total # of apps labeled as “Ci00
compute these curves. For multi-class classification (i.e., the
Recall (R) measures among all apps belonging to Ci , malware categorization mode), we wrap the classifier with
how many of them are labeled by the technique as “Ci ”. a one-vs-all classifier to compute the curve for each class,
and then average all per-class curves to produce an
# of apps labeled as “Ci00 averaged curve. For each ROC curve, we also report the
Ri = . (2)
Total # of apps belonging to Ci area under curve (AUC) as a summary metric for the ROC.
F1 score (F1) is the harmonic mean of precision and
recall. It can be interpreted as a weighted average of the 6.4 Study I: Performance Stability
precision and recall. TABLE 3
2 ∗ Pi ∗ Ri DroidCat performance for malware detection and categorization.
F 1i = . (3) Detection Categorization
Pi + R i Dataset P R F1 P R F1
D1617 99.31% 99.27% 99.28% 94.79% 94.74% 94.54%
Note that the technique only labels apps with “C100 , “C2 ”, D1415 97.26% 97.09% 97.16% 97.84% 97.75% 97.70%
. . . , and never uses any label like “not Ci ”. To facilitate the D1213 96.38% 96.04% 96.12% 99.73% 99.71% 99.70%
metrics computation with respect to a particular class like D0911 97.19% 96.96% 97.00% 99.48% 99.43% 99.44%
C1 , we treat all apps with other labels like “C2 ”, “C3 ”, . . . as mean 97.53% 97.34% 97.39% 97.96% 97.91% 97.84%
“not Ci ”. For example, suppose there are 10 apps belonging stdev 1.25% 1.37% 1.34% 2.27% 2.28% 2.38%
to C3 . The technique labels 11 apps with “C3 ”, but only 8 of
them actually belong to C3 . As a result, P3 = 8/11 = 73% Table 3 lists the classification performance of DroidCat in
because only 8 out of the 11 “C3 ”-labeled apps are identified terms of the three metrics we defined earlier: precision (P ),
correctly. R3 = 8/10 = 80% because only 8 out of the 10 C3 recall (R), and F1-measure accuracy (F 1). Each row gives
apps are labeled correctly. F 13 = 2 ∗ 73% ∗ 80%/(73% + the results for the two working modes of DroidCat against
80%) = 76%. one of the four datasets used in this study. For instance, on
With the above effectiveness metrics computed for each the D1617 dataset (apps in year 2016 through year 2017),
class, we further evaluated the overall effectiveness of the DroidCat had 99.31% precision and 99.27% recall for
technique by computing the weighted average among detection (binary classification), and 94.54% F1 for
classes. Intuitively, the larger the number of apps in a class, categorizing malware into families. The last two rows show
the more weight its effectiveness metrics should have. The the mean performance metrics values over the four datasets
malware families vary greatly in size, so we weight each of DroidCat in each of the two modes, and the associated
family’s contribution to the average by its relative size to standard deviation (stdev) of the mean. For instance, for
the entire testing set. Formally, if we use Γ to represent P malware detection, DroidCat achieved an average F1
or R, and use ni to represent the number of testing samples accuracy of 97.39% with a standard deviation of 1.34%
in Ci , then the overall effectiveness in terms of precision across all the four datasets spanning the past nine years.
and recall can be computed for N classes with As shown, the performance of DroidCat depended on
PN both the dataset it was applied to and the working mode.
i=1 Γi ∗ ni Specifically, it had the highest accuracy (99.28%) on the
Γoverall = P . (4)
N newest (D1617) dataset for malware detection, yet
i=1 ni
7
performed the best (with 99.70% F1) on the second oldest DroidCat Afonso DroidSieve
(D1213) dataset for malware family categorization. 99.28% 97.16%
Similarly, the worst-case performance also was associated 100%
90% 87.72%
with different mode for different dataset: lowest detection 81.90%
80% 72.31% 81.89%
accuracy (96.12%) was for D1213 while lowest 70%
categorization accuracy (94.54%) was seen by D1617. 60%
Intuitively, it is more challenging to differentiate more 50%
40%
classes. Our results show that overall DroidCat performed 30%
almost equally well (97.39% versus 97.84%) for detection 20%
and categorization modes. While our features were 10%
discovered originally from a characterization study that 0%
P R F1 P R F1
only concerned two classes (malware versus benign apps),
this overall contrast suggests that the features could also D1617 D1415
well differentiate among varied malware families. On the Fig. 5. DroidCat versus baselines for malware detection.
other hand, however, we observed that over the years the
categorization performance appeared to decline gradually DroidCat Afonso DroidSieve
while the performance for detection did not. What this 100% 94.54% 97.70%
implies is that the features seem to be less robust against 90% 86.95% 88.79%
88.75% 81.78%
the evolution of malware than for differentiating benign 80%
apps from all kinds of malware as a whole. 70%
60%
Figure 4 depicts the ROC curves and associated AUCs
50%
(in the parentheses) of DroidCat on the four datasets for the 40%
two modes. The results show that DroidCat worked the best 30%
for the newest (D1617) dataset in the detection mode, with 20%
an AUC of almost 0.98 (close to the ideal case of 1.0). On 10%
0%
the other three datasets, our classifier was also highly P R F1 P R F1
accurate with different thresholds. The right chart indicates
D1617 D1415
that it performed the worst for categorizing malware in the
D1415 dataset. Nevertheless, even this lowest performance Fig. 6. DroidCat versus baselines for malware categorization.
was still highly competitive (0.94 AUC). The categorization
accuracy was noticeably higher on other datasets. performance metrics (x axis), for the two datasets (D1617
Conclusions. Overall, DroidCat exhibited highly and D14515) used in this study. Our results revealed
competitive performance for any mode and dataset, with considerable advantages of DroidCat over the two
F1 ranging from 94.54% to 99.73%, and AUC from 0.94 to state-of-the-art techniques: on the D1617 dataset, DroidCat
0.98. Importantly, DroidCat appeared to be quite stable in had 3% higher precision, 19% higher recall, and 11% higher
classifying apps seen in the 9-year span we studied, as F1 accuracy than the better-performing baseline DroidSieve.
supported by several observations. First, the standard The advantage of DroidCat over the peer dynamic approach
deviations of performance metrics over the span were Afonso was even greater (27% higher F1). On the D1415
generally small, suggesting DroidCat worked for both old dataset, the improvement of DroidCat over the
and new datasets with promising performance. Second, it is better-performing baseline Afonso was also substantial (15%
noteworthy that the performance of DroidCat was not much higher F1), albeit the gap between the two baselines was
affected by largely varying dataset sizes or the imbalance quite small (0.1%). Across both datasets, DroidSieve was
between malware and benign samples: nor did there exist a more accurate for malware detection than Afonso.
clear correlation between the performance and sample sizes
1.0 1.0
or the imbalances. Third, a larger number of families did
not necessarily make DroidCat perform worse in malware
0.8 0.8
categorization either. In comparison between its two
modes, DroidCat tended to be even more stable for malware
True Positive Rate

True Positive Rate

0.6 0.6
detection (1.34% standard deviation) than for malware
categorization by families (2.38% standard deviation). 0.4 0.4

0.2 0.2
Finding 1: DroidCat achieved mean F1 accuracy of DroidCat (0.9715)
Afonso (0.8204)
DroidCat (0.9759)
Afonso (0.8777)
97.39% and 97.84%, and AUC of 0.95-0.98 and 0.0
DroidSieve (0.9267)
0.0
DroidSieve (0.9018)
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.94-0.98, for malware detection and categorization, re- False Positive Rate False Positive Rate
spectively. It was also stable in classifying apps from diffe- Fig. 7. ROC curves with AUCs of DroidCat versus baselines for malware
rent years within 2009–2017, evidenced by small standard detection on datasets D1415 (left) and D1617 (right).
deviations in F1 of 1.34-2.38% across the nine years. 1.0 1.0

0.8 0.8
True Positive Rate

True Positive Rate

0.6 0.6
6.5 Study II: Comparative Classification Performance
In this study, we aim to compare our approach to the two 0.4 0.4

baselines in their classification capabilities. In particular, we

0.2 0.2
conducted comparative analysis in the two classification DroidCat (0.9414) DroidCat (0.9476)
Afonso (0.8165) Afonso (0.8807)
working modes considered. For malware detection, DroidSieve (0.8655) DroidSieve (0.8675)
0.0 0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Figure 5 shows the contrast among the three techniques False Positive Rate False Positive Rate
(the three bars in each group) in terms of the three Fig. 8. ROC curves with AUCs of DroidCat versus baselines for malware
categorization on datasets D1415 (left) and D1617 (right).
8
Figure 6 depicts the contrast among the three In addition, both baselines tended to be much less stable
techniques in their performance in categorizing malware than our approach, as evidenced by the variations of
into families. As in the detection mode, DroidSieve performance across varied datasets.
outperformed Afonso for the D1617 dataset, while for the
D1415 dataset Afonso was more accurate. However, Finding 2: DroidCat outperformed the state-of-the-art
considering the gaps in F1, Afonso was overall more techniques compared, with up to 27% and 16% higher F1,
competitive than DroidSieve, opposite to the contrast in the and 0.15 and 0.13 greater AUC, for malware detection
detection mode. On the other hand, the results clearly show and categorization, respectively. DroidCat also appeared
the significant merits of our approach over both baselines. to be noticeably more stable than the two baseline techni-
Relative to the better-performing peer approach, DroidCat ques over time when achieving competitive performance.
had about 6% higher F1 accuracy on the D1617 dataset and
over 9% higher accuracy on the D1415 dataset, although
the gaps were lesser than those in the detection mode.
Interestingly, while Afonso was originally evaluated for 6.6 Study III: Robustness
malware detection only [27], our results revealed that it
performed more accurately for malware categorization. The TABLE 4
Datasets used in the study on robustness (Study III)
performance of DroidSieve varied very slightly between
Benign apps Malware (from Praguard)
these two working modes, though. Dataset Period Source #Apps obf% Period #Apps obf%
Across the two datasets, our results also revealed the OBF1617 2016-2017 GP,AZ 3,196 57.38%
superiority of DroidCat to both baselines in stability. Afonso OBF1415 2014-2015 AZ 4,462 25.59% 2010-2012 1,214
100%
OBF1213 2012-2013 AZ 4,804 12.57% (59 families)
achieve noticeably higher performance on the older
(D1415) dataset than on the newer (D1617) dataset, For this study, we used three obfuscation datasets as
regardless of the classification modes (although the gap described in Table 4. We intended to only use benign apps
was smaller for family categorization). DroidSieve was from AZ, but had to include GP apps of year 2017 because
similarly unstable to Afonso, if not worse, although it of the shortage of AZ apps of that year. For malware, we
performed significantly better on D1617 (than on D1415, focused on those from Praguard [24], an obfuscation
also in both working modes). In addition, both baselines benchmark suite including the 1260 MG apps and 237
had achieved considerably higher performance (e.g., malware from the Contagio dataset [66] all transformed
constantly over 90% F1) on other datasets (mostly older using multiple obfuscation schemes combined. Praguard
than the two we used here) [23], [27], which further suggest has several subsets each using a different scheme
their likely instability. On the other hand, we have been combination. We used the most obfuscated subset while
able to obtain very similar performance numbers to those removing those corresponding to the 135 MG malware
originally reported with respect to the originally used used in our characterization study (to avoid possible
datasets. Thus, both baseline techniques seem to be more overfitting biases). From the remaining 1362 apps, we were
likely to be overfitted to particular datasets relative to our able to compute features of 1214 apps for both DroidCat
technique, which again backs up the stability advantage of and the two baselines. The table also lists the percentage of
DroidCat over the two baselines. apps that are obfuscated (obf%). All the apps’ age was
Figures 7 and 8 depict the ROC curves and again determined by their first-seen dates. Note that even
corresponding AUCs (shown in parentheses) of DroidCat benign apps were increasingly obfuscated, which further
versus the baselines on the two datasets used in this study, justifies the importance of app classifiers being robust
for detection and categorization, respectively. The results against obfuscation. We applied the methodology as
clearly show the advantages of our approach over the two described in Section 6.3. In addition, we confirmed that in
baselines, regardless of the dataset and classification mode each data split (based on 70-percentile first-seen dates)
considered, as also evidenced by the constantly higher there are non-trivial portions of obfuscated samples of each
AUCs of DroidCat. In fact, the curves revealed that DroidCat class in both training and testing sets. As for Studies I and
was more accurate than the baselines at any decision II, we ensured that the three benign sets did not overlap.
threshold. Between the two baselines, Afonso outperformed TABLE 5
DroidSieve only on one dataset for categorization, while in Robustness of DroidCat versus baselines
other cases DroidSieve was more accurate for varying Detection
Technique Perf. OBF1617 OBF1415 OBF1213 Average Cate.
thresholds. This contrast was consistent with how these
prior approaches compared in terms of F1 accuracy. P 97.46% 96.86% 96.40% 96.85% 97.26%
DroidCat R 97.34% 96.71% 96.19% 96.69% 97.07%
Conclusions. Overall, DroidCat surpassed both baseline F1 97.33% 96.66% 96.13% 96.64% 97.06%
P 98.43% 71.07% 85.37% 83.91% 54.55%
techniques in each of the two classification working modes, R 52.07% 86.73% 93.81% 79.89% 56.47%
and in most cases the advantages were highly significant. Afonso
F1 68.11% 78.13% 89.39% 79.59% 51.03%
Across the two modes, the performance gap between the P 86.44% 87.98% 85.08% 86.48% 91.81%
baselines and DroidCat was even greater in the malware DroidSieve R 83.34% 85.94% 81.77% 83.67% 93.49%
F1 80.51% 82.67% 76.09% 79.62% 92.27%
detection mode. Between the two baselines, DroidSieve
appeared to be more competitive for malware detection, Table 5 lists classification performance (Perf.) numbers
while for malware family categorization Afonso was (P, R, and F1) of DroidCat versus Afonso and DroidSieve in
considerably more accurate on one dataset. Afonso the detection and categorization (Cate.) modes on the three
appeared to be more effective for malware categorization datasets (second row) used in this study. The F1 numbers
than for malware detection, while DroidSieve had similar are highlighted in boldface. The numbers in the sixth
performance between these two modes. Knowing that column are the averages over the three datasets, weighted
Afonso classification was based on call frequencies (i.e., by the dataset sizes. For instance, on the OBF1617 dataset,
sheer counts of calls), the substantial performance merits of DroidCat had a 97.33% F1 accuracy, versus 68.11% by Afonso
DroidCat over Afonso suggest that relative statistics of calls and 80.51% by DroidSieve, all in the detection mode. In the
and call distributions are more effective than sheer counts. categorization mode, the three techniques had the same

9
performance across the three datasets since they all use the DroidCat. We summarize our methodology and findings
same malware set (i.e., the 1214 Praguard malware). below. Further details can be found in our technical report
Our results show that DroidCat largely surpassed the on DroidCat [67].
baseline techniques in any performance metric on any of
the three datasets for malware detection. In the malware
categorization mode, DroidSieve achieved an F1 (92%), the 7.1 Setup and Methodology
closest to that of DroidCat (97%) among all our comparative We started with the characterization study dataset and
experiments. We note that DroidSieve achieved 99% F1 for added into it malware samples from years 2016 and 2017 in
categorizing the MG subset of our malware set here, and the wild, resulting in 287 benign apps and 388 malware.
over 99% F1 for detecting the MG malware from The majority of these apps adopted various obfuscation
benign-app sets different from ours [23]. The considerable strategies (e.g., reflection, data encryption, and
drop in accuracy, when 237 more malware and many class/method renaming). The malware samples were in 15
different benign apps were trained and tested, suggested popular families, including DroidDream, BaseBridge, and
the potential overfitting of this technique to particular DroidKungFu which were among the most evasive families
datasets. On the other hand, compared to our results in according to a prior study [24], and FakeInst and OpFake
Study II, DroidSieve performance did not change much due which are known to combine multiple obfuscation schemes
to obfuscation. Thus, the technique tended to be (renaming, string encryption, and native payload). We
obfuscation-resilient indeed, and its performance variation applied the same hold-out validation procedure, and the
with respect to the original evaluation results in [23] seems same three accuracy metrics (P, R, F1) as used in the
to be mainly attributed to its instability. evaluation experiments (Section 6.3).
Afonso appeared to be resilient against obfuscation too,
but only for malware detection. The substantial
7.2 Results
performance drop (by over 30% in F1) because of
obfuscation indicates its weak obfuscation resiliency for For malware categorization, DroidCat performed perfectly
categorizing malware. Meanwhile, its considerable (with 100% F1) for the majority (11) of the (15) studied
performance variations for malware detection across the families. In particular, these 11 families include the three
three datasets corroborate the instability of this technique. that were previously considered highly evasive:
In contrast, DroidCat tended to be both robust and stable. DroidDream, BaseBridge, and DroidKungFu. Previous tools
The robustness was evidenced by the small difference in studied [24] achieved no more than 54% detection rate (i.e,
performance metrics between this study and Study II. Its the recall metric in our study) on these three families. By
performance variations across the three datasets were also weighted average, over all the 15 classes, DroidCat achieved
quite small, showing its stability even in the presence of 97.3% precision, 96.8% recall, and 97.0% F1. In the malware
complicated obfuscation. detection mode, DroidCat worked even more effectively,
We also computed the ROC curves and AUCs of the with 97.1% precision, 99.4% recall, and 98.2% F1. These
three techniques on each of the three datasets. The results are largely consistent with what we obtained from
contrasts between DroidCat and the two baselines was the extensive evaluation studies (Studies I through III).
similar to those seen in Study II. The AUC numbers
(0.97–0.99 for DroidCat) show considerable advantages of 7.3 Effects of Design Factors
our approach as well (0.05 and 0.09 greater AUC than any
baseline for detection and categorization, respectively). In Feature set choice. We investigated several alternative
all, the ROC results confirmed that DroidCat is robust to feature sets, including the full set of 122 metrics, the set of
various obfuscation schemes, with respect to varying metrics in each of the three dimensions, and the set of 36
decision thresholds, more than the two baselines. substantially disparate metrics (see Table 1). We found that
D* (the default set of 70 features used by DroidCat) worked
Conclusions. On overall average, DroidCat achieved a the best, suggesting that adding more features does not
96.64% F1, compared to 79.59% by Afonso and 79.62% by necessarily improve classification performance. The
DroidSieve in the detection mode. In the categorization Structure features had significantly better effectiveness than
mode, DroidCat also significantly outperformed the two ICC and Security features.
baseline techniques, with 5–46% higher F1. ROC results Most important dynamic features. To see which
corroborated the robustness merits of our approach, specific features are the most important to our technique,
compared to the baselines. In absolute terms, the accuracy we computed the importance ranking [23], [68] of the 70
and AUC numbers revealed that DroidCat can work highly features used by DroidCat. We found that Structure features
effectively with obfuscated apps. consistently dominated the top list, especially when there
were a greater number of classes that our classifier had to
Finding 3: DroidCat exhibited superior robustness to differentiate. In particular, two subcategories of Structure
both state-of-the-art techniques compared, by achieving features contributed the most: (1) distribution of
96% to 97% F1 accuracy on malware that adopted sophi- method/class invocation over the three code layers, and (2)
sticated obfuscation schemes along with varying sets of callback invocation for lifecycle management. The Security
benign apps, significantly higher than the two baselines. features were generally less important, with the ICC
features being the least important. Among all Security
features, those capturing risky control flows and accesses to
sensitive data/operations of particular kinds (e.g., sinks for
7 I N -D EPTH C ASE S TUDIES SMS_MMS) exhibited the greatest significance. The very
We have conducted in-depth case studies on a subset of our few ICC features included in these top rankings
datasets to access the capabilities of our approach in contributed more to identifying benign apps from malware
classifying apps with respect to individual malware than to distinguishing malware families.
families. Through the case studies, we also investigated the Learning algorithm choice. In addition to the Random
effects of various design factors on the performance of Forest algorithm (RF, with 128 trees) used by default, we

10
experimented DroidCat with seven other learning for a dynamic analysis of Android apps, our datasets may
algorithms. Our results show that RF performed still be relatively small in size compared to those used by
significantly better than all the alternatives. Support Vector many static approaches. In particular, considering our
Machine [69] (SVM) with linear kernel had the second best datasets split by ranges of years, our samples from each
effectiveness, while SVM with rbf kernel performed the period may not be representative of the app population of
worst. Naive Bayes [70] with Bernoulli distribution had the that period. For this reason, our results are potentially
third best effectiveness, while with Gaussian distribution it subject to overfitting. To mitigate this limitation, we have
had the second worst effectiveness. Neither Decision considered benchmarks from diverse sources. Recall the
Trees [71] nor k-Nearest Neighbors [72] worked as well as goal of DroidCat is to complement static approaches in
the best setting of the above three. scenarios where they are inapplicable (Section 2). In all, our
Input coverage. We have repeated our case studies on a experimental results and conclusions are best interpreted
new dataset obtained by applying the same coverage filter with respect to the datasets used in our studies.
used in our characterization study. Only apps for which Prior studies have shown that learning-based malware
10-minute Monkey inputs covered at least 50% of user code detectors are subject to class imbalances in training
were selected, resulting in 136 benign and 145 malicious datasets [73], [74]. Our results also suffer from this
apps. The user-code coverage for these 281 apps ranged subjection as our datasets contain imbalanced benign and
from 50% to 100% (mean 66%, standard deviation 12%). malware samples, as well as imbalanced malware families.
The higher-coverage dataset contained 10 app categories: There were two causes for these imbalances: (1) our data
BENIGN and 9 malware families. We applied the same sources do not provide balanced sample sets, and (2) for
held-out validation as used in other experiments. For fair evaluation we needed to use exactly the same samples
malware detection and (9-class) malware categorization, for evaluating DroidCat against the two baselines, thus we
DroidCat gained consistent increases in each of the three had to discard some samples for which the features for any
performance metrics (P, R, F1) on the higher-coverage technique cannot be computed (Section 6.1), which further
datasets compared to our results without the coverage perplexed our control of data balance. On the other hand,
filter. Yet, the increases were all quite small (at most 1.5%). however, the imbalances enabled us to additionally assess
These small differences indicate that the performance of the stability of our approach against the baselines: for
DroidCat did not appear to be very sensitive to the instance, in Study I, our results revealed that the
user-code coverage of run-time inputs. performance of DroidCat was not much affected by the
imbalance of both kinds (more benign apps, in D1617 and
8 E FFICIENCY D1415, or more malware, in D1213 and D0911). We also
note that all the datasets against which we compared
The primary source of analysis overhead of all the three DroidCat to the baselines contained much less malware
techniques compared is the cost for feature extraction. For than benign samples. This kind of imbalance resembles
dynamic approaches like DroidCat and Afonso, this cost real-world situations in which we do have much fewer
includes the time for tracing each app, which is five malware than benign apps.
minutes in both techniques. Specifically, DroidCat took
353.9 seconds for feature computation and 0.01 seconds for Intuitively, the more app code covered by the dynamic
testing, on average per app. In contrast, Afonso took 521.74 inputs, the more app behaviors can be captured and
seconds for feature computation and 0.015 seconds for utilized by our approach. We thus conducted a dedicated
testing per app. As expected, the tracing time dominated study in this regard. Our results confirmed that with
the total feature computation cost in DroidCat and Afonso. higher-coverage inputs DroidCat improved in
Also, in these two techniques, the testing time is almost effectiveness. However, the effectiveness differences were
negligible, mainly because their feature vectors are both small (<2%) between two experiments involving datasets
relatively small (at most 122 features per app in DroidCat, that had large differences (20%) in code coverage.
and 163 features per app in Afonso). As a static approach, Nevertheless, these results may not be generalizable; more
DroidSieve does not incur tracing cost. Its average feature conclusive results would need more extensive studies on
computation cost was 74.19 seconds per app. However, the effect of input coverage. Also, although DroidCat relies
DroidSieve uses very-large feature vectors (over 20,000 on capturing app behavioral patterns in execution
features per app), causing its substantial cost for the testing composition and structure (instead of modeling explicit
phase (on average 3.52 seconds per app). Concerning the malicious behaviors through suspicious permission access
storage cost, DroidCat and Afonso took 21KB and 32KB per and/or data flows), reasonable coverage is still required for
app, respectively, mainly for storing the traces. DroidSieve producing usable traces to enable the feature computation.
does not incur trace storage cost, and it took 0.4KB per app We aimed to leverage a diverse set of behavioral features
for storing feature files. (in three orthogonal dimensions) to make DroidCat robust
In all, DroidCat appeared to be reasonably efficient as a to various evasion attacks that target specific kinds of
dynamic malware analysis approach, and was dynamic profiles. To empirically examine the robustness,
lighter-weight than the peer dynamic approach Afonso. we purposely used an obfuscation benchmark suite in the
DroidSieve was the most efficient among the three evaluation. Further, in the case studies, we used datasets in
techniques, due to its lack of tracing overheads. However, which the majority of apps adopted a variety of evasion
given the substantially superior performance of DroidCat techniques, including complex/heavy reflection, data
over DroidSieve, the additional cost incurred by DroidCat encryption, and class/method renaming. However, other
can be seen to be justified. types of evasion (especially anti-dynamic-analysis)
attacks [75] have not been explicitly covered in our
experiments. For instance, some malware might detect and
9 L IMITATIONS AND T HREATS TO VALIDITY then evade particular kinds of run-time environments (e.g.,
The difficulty and overhead of tracing a large number of emulator). In our evaluation, the dynamic features were all
apps present challenges to dynamic analysis, which extracted from traces gathered on an Android emulator.
constrained the scale of our studies. While reasonably large The high classification performance we achieved suggests

11
that our approach seems robust against emulator-evasion leveraged the graphs to categorize malware [39]. Dash et
attacks. On the other hand, after our features are revealed, al. generated features at different levels, including pure
attackers could take adversarial approaches to impede the system calls and higher-level behavioral patterns like file
computation of our dynamic features or pollute the code to system access which conflate sequences of related system
make our features less discriminatory. calls [34]. Some of the malware detection techniques have
DroidCat works at app level without any modification been applied to family categorization as well [23], [78], [79].
of the Android framework and/or OS as in [18], [76]. This Discussion. Table 6 compares our approach to
design makes DroidCat easier to use and more adaptable to representative recent peer works in terms of classification
rapid evolution of the Android ecosystem, but it does not capability with respect to the three possible settings and
handle dynamically loaded code or native code yet. robustness against various analysis challenges. For the
Meanwhile, DroidCat requires app instrumentation, which settings that a tool was not designed to work in, the
may constitute an impediment for its use by end users. A capability was not applicable (hence noted as N/A).
more common deployment setting would be to use Almost all the static approaches compared are
DroidCat for batch screening by an app vetting service (e.g., vulnerable to reflection as they use features based on APIs.
as part of an app store), where the instrumentation, tracing, Marvin [18] as a hybrid technique also suffers from this
learning, and prediction can be packed in one holistic vulnerability as it relies on a number of static API-based
automated process of the service. Finally, our technique features. Techniques using features on static permissions,
follows a learning-based approach using features that can such as DroidSIFT [79], Drebin [15], and StormDroid [26],
be contrived, thus it may be vulnerable to sophisticated face challenges due to run-time permissions [19], [85]
attacks such as mimicry and poisoning [77]. which are increasingly adopted by (over one third already
of) Android apps [86]. The use of features based on system
10 R ELATED W ORK calls comprises the resiliency of DroidScribe [34],
Madam [35], and Afonso [27] against obfuscation schemes
Dynamic Characterization for Android Apps. There have targeting system calls [30], [36], [38]. DroidSieve [23] gains
been only a few studies broadly characterizing run-time high accuracy with resilience against reflection by reducing
behaviors of Android apps. Zhou et al. manually analyzed code analysis and using resource-centered features, but
1,200 samples to understand malware installation methods, may not detect malware that expresses malicious behaviors
activation mechanisms, and the nature of carried malicious only in code while with benign resources/assets. Our
payloads [49]. Cai et al. instrumented 114 benign apps for comparative study results presented in this paper have
tracing method calls and ICCs, and investigated the supported this hypothesis. In addition, like a few other
dynamic behaviors of benign apps [82]. These studies works that extract features (other than permission) from
either focus on malicious apps or benign ones. Canfora et resource files [15], [18], [55], it may not work with malware
al. profiled Android apps to characterize their resource with resources obfuscated [24], [25], [41].
usage and leveraged the profiles to detect malware [83]. We In contrast, DroidCat adopts a purely dynamic
profiled method and ICC invocations in our approach that resolves reflective calls at runtime, thus it is
characterization study as in [82] yet with both benign and fully resilient against even complex cases of reflection. It
malicious samples. Also, our study aimed at not only relies on no features from resource files or based on system
behavior understanding [49], [82]. We further utilized the calls, thus it is robust against obfuscation targeting those
understanding for app classification like [83] yet with features. While it remains to be studied if it well adapts to
different behavioral profiles and not only for malware Android ecosystem evolution, DroidCat would not be
detection (but also for categorizing malware by families). much affected by run-time permissions as it does not use
Android Malware Detection. Most previous detection related features. Also, compared to prior approaches
techniques utilized static app features based on API typically focusing on API calls, DroidCat characterizes the
calls [15], [16], [17], [35], [78], [79], [80] and/or invocations of all methods and ICCs.
permissions [15], [18], [23], [26], [35]. ICCDetector [55] We omitted in the table the effectiveness numbers (e.g.,
modeled ICC patterns to identify malware that exhibits detection rate and accuracy) for these compared works
different ICC characteristics from benign apps. Besides because they are not comparable: the numbers all came
static features, a few works enhanced their capability by from varied evaluation datasets. In this paper, we have
exploiting dynamic features (i.e., hybrid approaches) such extensively studied two of the listed approached versus
as messaging traffic [26], file/network operations [18], and ours on the same datasets. Nonetheless, in terms of any of
system/API calls [35]. However, approaches relying on the effectiveness metrics we considered, DroidCat
static code analysis are generally vulnerable to reflection appeared to have very promising performance relative to
and other code obfuscation schemes [20], which are widely the state-of-the-art peer approaches.
adopted in Android apps (especially in malware) [41].
Suarez-Tangil et al. [23] mined non-code (e.g.,
resources/assets) features for more robust detection.
Static-analysis challenges have motivated dynamic
11 C ONCLUSION
approaches, of which ours is not the first. Afonso et al. [27] We presented DroidCat, a dynamic app classification
built dynamic features on system/API call frequencies for technique that detects and categorizes Android malware
malware detection, similar to [29] where occurrences of with high accuracy. Features that capture the structure of
unique callsites were used as features. A recent static app executions are at the core of our approach, in addition
technique MamaDroid [81] and its dynamic variant [84] to those based on ICC and security sensitive accesses. We
model app behaviors based on the transition probabilities empirically showed that this diverse, novel set of dynamic
between abstracted API calls in the form of Markov chains. features enabled the superior robustness of DroidCat against
Android Malware Categorization. Approaches have analysis challenges such as heavy and complex use of
been proposed to categorize malware into known families. reflection, resource obfuscation, system-call obfuscation,
Xu et al. traced system calls, investigated three alternative use of run-time permissions, and other evasion schemes.
ways to graphically represent the traces, and then These challenges impede most existing peer approaches, a

12
TABLE 6
Comparison of recent works on Android malware classification in capability and robustness. DET: detection, CAT: family categorization, SYSC:
system call, RT_PERM: run-time permission, RES: resource, OBF: obfuscation.
Classification Capability Robustness against Analysis Challenges
Technique Year Approach DET CAT Reflection SYSC_OBF RT_PERM RES_OBF
DroidMiner [78] 2014 Static 3 3 7 3 3 3
DroidSIFT [79] 2014 Static 3 3 7 3 7 3
Drebin [15] 2014 Static 3 N/A 7 3 7 7
MudFlow [80] 2015 Static 3 N/A 7 3 3 3
Afonso et al. [27] 2015 Dynamic 3 N/A unknown 7 3 3
Marvin [18] 2015 Hybrid 3 N/A 7 3 7 7
Madam [35] 2016 Hybrid 3 N/A unknown 7 7 3
ICCDetector [55] 2016 Static 3 N/A 7 3 3 7
DroidScribe [34] 2016 Dynamic N/A 3 3 7 3 3
StormDroid [26] 2016 Hybrid 3 N/A 7 3 7 3
MamaDroid [81] 2017 Static 3 N/A 7 3 3 3
DroidSieve [23] 2017 Static 3 3 3 3 7 7
DroidCat this work Dynamic 3 3 3 3 3 3

real concern since the app traits leading to the challenges [16] Y. Aafer, W. Du, and H. Yin, “DroidAPIMiner: Mining API-level
are increasingly prevalent in modern Android ecosystem. features for robust malware detection in Android,” in SecureComm,
2013.
Through extensive evaluation and in-depth case [17] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu, “Droid-
studies, we have shown the superior stability of our mat: Android malware detection through manifest and API calls
approach in achieving high classification performance, tracing,” in Proceedings of Asia Joint Conference on Information Secu-
compared to two state-of-the-art peer approaches, one rity, 2012, pp. 62–69.
[18] M. Lindorfer, M. Neugschwandtner, and C. Platzer, “Marvin: Effi-
static and one dynamic. Meanwhile, in absolute terms, cient and comprehensive mobile app classification through static
DroidCat achieved significantly higher accuracy than the and dynamic analysis,” in COMPSAC, vol. 2, 2015, pp. 422–433.
peer approaches studied for both malware detection and [19] “Requesting permission at run time,” https://
developer.android.com/training/permissions/requesting.html,
family categorization. Thus, DroidCat constitutes a 2015.
promising solution complementary to existing alternatives. [20] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis for
malware detection,” in ACSAC, 2007, pp. 421–430.
[21] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bry-
R EFERENCES ant, “Semantics-aware malware detection,” in IEEE Symposium on
Security and Privacy, 2005, pp. 32–46.
[1] “Android malware accounts for 97% of malicious mobile apps,” [22] J. Lee, K. Jeong, and H. Lee, “Detecting metamorphic malwares
http://www.scmagazineuk.com/updated-97-of-malicious- using code graphs,” in SAC, 2010, pp. 1970–1977.
mobile-malware-targets-android/article/422783/, 2015. [23] G. Suarez-Tangil, S. K. Dash, M. Ahmadi, J. Kinder, G. Giacinto,
[2] “The ultimate android malware guide: What it does, and L. Cavallaro, “DroidSieve: Fast and accurate classification of
where it came from, and how to protect your phone obfuscated Android malware,” in CODASPY, 2017, pp. 309–320.
or tablet,” http://www.digitaltrends.com/android/the- [24] D. Maiorca, D. Ariu, I. Corona, M. Aresu, and G. Giacinto, “Ste-
ultimate-android-malware-guide-what-it-does-where-it-came alth attacks: An extended insight into the obfuscation effects on
-from-and-how-to-protect-your-phone-or-tablet/. Android malware,” Computers & Security, vol. 51, pp. 16–31, 2015.
[3] K. Lu, Z. Li, V. P. Kemerlis, Z. Wu, L. Lu, C. Zheng, Z. Qian, [25] G. Square, “Dexguard,” https://www.guardsquare.com/en/
W. Lee, and G. Jiang, “Checking more and alerting less: Detecting dexguard, 2017.
privacy leakages via enhanced data-flow analysis and peer vo- [26] S. Chen, M. Xue, Z. Tang, L. Xu, and H. Zhu, “Stormdroid: A stre-
ting.” in NDSS, 2015. aminglized machine learning-based system for detecting Android
[4] W. Yang, X. Xiao, B. Andow, S. Li, T. Xie, and W. Enck, “Appcon- malware,” in AsiaCCS, 2016, pp. 377–388.
text: Differentiating malicious and benign mobile app behaviors [27] V. M. Afonso, M. F. de Amorim, A. R. A. Grégio, G. B. Junquera,
using context,” in ICSE, 2015. and P. L. de Geus, “Identifying Android malware using dynami-
[5] Y. Feng, S. Anand, I. Dillig, and A. Aiken, “Apposcopy: cally obtained features,” Journal of Computer Virology and Hacking
Semantics-based detection of Android malware through static Techniques, vol. 11, no. 1, pp. 9–17, 2015.
analysis,” in FSE, 2014. [28] A. Shabtai, U. Kanonov, Y. Elovici, C. Glezer, and Y. Weiss, ““An-
[6] H. Gascon, F. Yamaguchi, D. Arp, and K. Rieck, “Structural de- dromaly": A behavioral malware detection framework for An-
tection of Android malware using embedded call graphs,” in Pro- droid devices,” Journal of Intelligent Information Systems, vol. 38,
ceedings of the 2013 ACM Workshop on Artificial Intelligence and Se- no. 1, pp. 161–190, 2012.
curity, 2013. [29] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani, “Crowdroid:
[7] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “Riskranker: behavior-based malware detection system for Android,” in Procee-
scalable and accurate zero-day Android malware detection,” in dings of the 1st ACM workshop on Security and privacy in smartphones
MobiSys, 2012, pp. 281–294. and mobile devices. ACM, 2011, pp. 15–26.
[8] H. Cai and J. Jenkins, “Leveraging historical versions of android [30] J. Ming, Z. Xin, P. Lan, D. Wu, P. Liu, and B. Mao, “Impe-
apps for efficient and precise taint analysis,” in MSR, 2018, pp. ding behavior-based malware analysis via replacement attacks to
265–269. malware specifications,” Journal of Computer Virology and Hacking
[9] B. Wolfe, K. Elish, and D. Yao, “High precision screening for An- Techniques, pp. 1–15, 2016.
droid malware with dimensionality reduction,” in International [31] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda,
Conference on Machine Learning and Applications, 2014. “Scalable behavior-based malware clustering,” in NDSS, 2009.
[10] K. Griffin, S. Schneider, X. Hu, and T.-C. Chiueh, “Automatic gene- [32] K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, “Copperdroid: Au-
ration of string signatures for malware detection,” in RAID, 2009. tomatic reconstruction of android malware behaviors.” in NDSS,
[11] H. Kang, J.-w. Jang, A. Mohaisen, and H. K. Kim, “Detecting and 2015.
classifying android malware using static analysis along with crea- [33] H. S. Galal, Y. B. Mahdy, and M. A. Atiea, “Behavior-based features
tor information,” Int. J. Distrib. Sen. Netw., 2015. model for malware detection,” Journal of Computer Virology and
[12] H. Peng, C. Gates, B. Sarma, N. Li, Y. Qi, R. Potharaju, Hacking Techniques, pp. 1–9, 2015.
C. Nita-Rotaru, and I. Molloy, “Using probabilistic generative mo- [34] S. K. Dash, G. Suarez-Tangil, S. Khan, K. Tam, M. Ahmadi, J. Kin-
dels for ranking risks of Android apps,” in CCS, 2012. der, and L. Cavallaro, “Droidscribe: Classifying Android malware
[13] B. P. Sarma, N. Li, C. Gates, R. Potharaju, C. Nita-Rotaru, and based on runtime behavior,” Mobile Security Technologies, 2016.
I. Molloy, “Android permissions: A perspective combining risks [35] A. Saracino, D. Sgandurra, G. Dini, and F. Martinelli, “Madam:
and benefits,” in Proceedings of the 17th ACM Symposium on Access Effective and efficient behavior-based Android malware detection
Control Models and Technologies, 2012. and prevention,” TDSC, 2016.
[14] W. Enck, M. Ongtang, and P. D. McDaniel, “On lightweight mobile [36] W. Ma, P. Duan, S. Liu, G. Gu, and J.-C. Liu, “Shadow attacks:
phone application certification,” in CCS, 2009, pp. 235–245. Automatically evading system-call-behavior based malware de-
[15] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, and K. Rieck, tection,” J. Comput. Virol., 2012.
“Drebin: Effective and explainable detection of Android malware [37] S. Forrest, S. Hofmeyr, and A. Somayaji, “The evolution of
in your pocket,” in NDSS, 2014. system-call monitoring,” in ACSAC, 2008, pp. 418–430.

13
[38] A. Srivastava, A. Lanzi, J. Giffin, and D. Balzarotti, “Operating [64] T. Dietterich, “Overfitting and undercomputing in machine lear-
system interface obfuscation and the revealing of hidden operati- ning,” ACM Computing Surveys, vol. 27, no. 3, pp. 326–327, 1995.
ons,” in DIMVA, 2011, pp. 214–233. [65] R. B. Rao, G. Fung, and R. Rosales, “On the dangers of
[39] L. Xu, D. Zhang, M. A. Alvarez, J. A. Morales, X. Ma, and J. Cava- cross-validation. an experimental evaluation,” in SIAM Internatio-
zos, “Dynamic Android malware classification using graph-based nal Conference on Data Mining, 2008, pp. 588–596.
representations,” in Cyber Security and Cloud Computing (CSCloud), [66] contagiodump.blogspot.com/.
2016 IEEE 3rd International Conference on, 2016, pp. 220–231. [67] H. Cai, N. Meng, B. Ryder, and D. Yao, “Droidcat: Unified dyna-
[40] T. K. Ho, “Random decision forests,” in International Conference on mic detection of Android malware,” Tech. Rep. TR-17-01, January
Document Analysis and Recognition, 1995. 2017, http://hdl.handle.net/10919/77523.
[41] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, “The [68] D. Cournapeau, “Machine learning in python,” http://scikit-
evolution of Android malware and Android analysis techniques,” learn.org/stable/supervised_learning.html#supervised-learning,
ACM Computing Surveys (CSUR), vol. 49, no. 4, p. 76, 2017. 2016.
[42] “Over 60 percent of Android malware comes from one malware [69] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn.,
family: Fakeinstaller,” http://tech.firstpost.com/news-analysis/ 1995.
over-60-percent-of-android-malware-comes-from-one-malware- [70] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach,
family-mcafee-48109.html, 2012. 2003.
[43] D. Octeau, D. Luchaup, M. Dering, S. Jha, and P. McDa- [71] J. R. Quinlan, “Induction of decision trees,” Mach. Learn., 1986.
niel, “Composite constant propagation: Application to Android [72] N. S. Altman, “An Introduction to Kernel and Nearest-Neighbor
inter-component communication analysis,” in ICSE, 2015, pp. Nonparametric Regression,” The American Statistician, vol. 46,
77–88. no. 3, pp. 175–185, 1992.
[44] B. Bichsel, V. Raychev, P. Tsankov, and M. Vechev, “Statistical de- [73] S. Roy, J. DeLoach, Y. Li, N. Herndon, D. Caragea, X. Ou, V. P.
obfuscation of Android applications,” in CCS, 2016, pp. 343–355. Ranganath, H. Li, and N. Guevara, “Experimental study with
[45] “Android app components,” http://www.tutorialspoint.com/ real-world data for Android app security analysis using machine
android/android_application_components.htm. learning,” in ACSAC, 2015, pp. 81–90.
[46] Google, “Android Monkey,” http://developer.android.com/ [74] K. Allix, T. F. Bissyandé, Q. Jérome, J. Klein, Y. Le Traon et al.,
tools/help/monkey.html, 2015. “Empirical assessment of machine learning-based malware detec-
[47] H. Cai and B. Ryder, “DroidFax: A toolkit for systematic characte- tors for Android,” EMSE, vol. 21, no. 1, pp. 183–211, 2016.
rization of Android applications,” in ICSME, 2017, pp. 643–647. [75] S. Rasthofer, S. Arzt, S. Triller, and M. Pradel, “Making malory
[48] “VirusTotal,” https://www.virustotal.com/. behave maliciously: Targeted fuzzing of Android execution envi-
[49] Y. Zhou and X. Jiang, “Dissecting Android malware: Characteriza- ronments,” in ICSE, 2017.
tion and evolution,” in Proceedings of IEEE Symposium on Security [76] M. Lindorfer, M. Neugschwandtner, L. Weichselbaum, Y. Fratan-
and Privacy, 2012, pp. 95–109. tonio, V. Van Der Veen, and C. Platzer, “Andrubis–1,000,000 apps
[50] P. Lam, E. Bodden, O. Lhoták, and L. Hendren, “Soot - a Java later: A view on current Android malware behaviors,” in Internati-
bytecode optimization framework,” in Cetus Users and Compiler onal Workshop on Building Analysis Datasets and Gathering Experience
Infrastructure Workshop, 2011, pp. 1–11. Returns for Security, 2014, pp. 3–17.
[51] J. Dean, D. Grove, and C. Chambers, “Optimization of [77] S. Venkataraman, A. Blum, and D. Song, “Limits of learning-based
object-oriented programs using static class hierarchy analysis,” in signature generation with adversaries,” in NDSS, 2008.
ECOOP, 1995. [78] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras, “Droidmi-
[52] H. Cai and R. Santelices, “Diver: Precise dynamic impact ana- ner: Automated mining and characterization of fine-grained mali-
lysis using dependence-based trace pruning,” in ASE, 2014, pp. cious behaviors in Android applications,” in European Symposium
343–348. on Research in Computer Security. Springer, 2014, pp. 163–182.
[53] Google, “Android emulator,” http://developer.android.com/ [79] M. Zhang, Y. Duan, H. Yin, and Z. Zhao, “Semantics-aware An-
tools/help/emulator.html, 2015. droid malware classification using weighted contextual api depen-
[54] D. Octeau, P. McDaniel, S. Jha, A. Bartel, E. Bodden, J. Klein, and dency graphs,” in CCS, 2014, pp. 1105–1116.
Y. L. Traon, “Effective inter-component communication mapping [80] V. Avdiienko, K. Kuznetsov, A. Gorla, A. Zeller, S. Arzt, S. Rastho-
in Android with Epicc: An essential step towards holistic security fer, and E. Bodden, “Mining apps for abnormal usage of sensitive
analysis,” in USENIX Security Symposium, 2013. data,” in ICSE, 2015, pp. 426–436.
[55] K. Xu, Y. Li, and R. H. Deng, “ICCDetector: ICC-Based malware [81] E. Mariconti, L. Onwuzurike, P. Andriotis, E. De Cristofaro,
detection on Android,” TIFS, vol. 11, no. 6, pp. 1252–1264, 2016. G. Ross, and G. Stringhini, “Mamadroid: Detecting android mal-
[56] J. Jenkins and H. Cai, “Dissecting Android inter-component com- ware by building markov chains of behavioral models,” in NDSS,
munications via interactive visual explorations,” in ICSME, 2017, 2017.
pp. 519–523. [82] H. Cai and B. Ryder, “Understanding Android application pro-
[57] ——, “ICC-Inspect: supporting runtime inspection of android gramming and security: A dynamic study,” in ICSME, 2017, pp.
inter-component communications,” in MobileSoft, 2018, pp. 80–83. 364–375.
[58] S. B. Kotsiantis, “Supervised machine learning: A review of classi- [83] G. Canfora, E. Medvet, F. Mercaldo, and C. A. Visaggio, “Acqui-
fication techniques,” 2007. ring and analyzing app metrics for effective mobile malware de-
[59] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, tection,” in Proceedings of the 2016 ACM on International Workshop
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., on Security And Privacy Analytics, 2016.
“Scikit-learn: Machine learning in python,” Journal of Machine Le- [84] L. Onwuzurike, M. Almeida, E. Mariconti, J. Blackburn, G. String-
arning Research, vol. 12, no. Oct, pp. 2825–2830, 2011. hini, and E. De Cristofaro, “A family of droids: Analyzing be-
[60] K. Allix, T. F. Bissyandé, J. Klein, and Y. Le Traon, “AndroZoo: havioral model based android malware detection via static and
Collecting millions of android apps for the research community,” dynamic analysis,” arXiv preprint arXiv:1803.03448, 2018.
in MSR, 2016, pp. 468–471. [85] M. Dilhara, H. Cai, and J. Jenkins, “Automated detection and
[61] “Google play store,” https://play.google.com/store, 2018. repair of incompatible uses of runtime permissions in android
[62] “Virusshare,” https://virusshare.com/, 2018. apps,” in MobileSoft, 2018, pp. 67–71.
[63] Y. Zhou and X. Jiang, “Dissecting Android malware: Characteriza- [86] Google, “Android Developer Dashboard,” http://
tion and evolution,” in Proc. of the IEEE Symposium on Security and developer.android.com/about/dashboards/index.html, 2016,
Privacy, 2012, pp. 95–109. accessed online 09/20/2016.

Iso 3691-4 2023 (E)
50% (2)
Iso 3691-4 2023 (E)
7 pages
Assessment in Double Entry Accounting
No ratings yet
Assessment in Double Entry Accounting
7 pages
Frito-Lay: Operations Management in Manufacturing
No ratings yet
Frito-Lay: Operations Management in Manufacturing
2 pages
Ducati Monster S4RS 2006 Parts List WWW - Manualedereparatie.info PDF
No ratings yet
Ducati Monster S4RS 2006 Parts List WWW - Manualedereparatie.info PDF
120 pages
BUILDING AND ENHANCING NEW LITERACIES ACROSS THE CURRICULUM Module 2
No ratings yet
BUILDING AND ENHANCING NEW LITERACIES ACROSS THE CURRICULUM Module 2
11 pages
Diagrama Esmeriladora Dewalt Dwe4120 b3 PDF
No ratings yet
Diagrama Esmeriladora Dewalt Dwe4120 b3 PDF
4 pages
Farm Machinery & Equipment - W2
No ratings yet
Farm Machinery & Equipment - W2
33 pages
SINAMICS G120 PN at S7-1200 DOCU V1d0 en
No ratings yet
SINAMICS G120 PN at S7-1200 DOCU V1d0 en
63 pages
BSCPL Tech Spec MLTP Botanical R00
No ratings yet
BSCPL Tech Spec MLTP Botanical R00
57 pages
Facilitator's CALA Guide: Learning Area: CALA Type: Level: Topic: Duration
No ratings yet
Facilitator's CALA Guide: Learning Area: CALA Type: Level: Topic: Duration
8 pages
Fashion Polka Dot Background Business PPT Templates
No ratings yet
Fashion Polka Dot Background Business PPT Templates
25 pages
Discovering Optimal Features Using Static Analysis and A Genetic Search Based Method For Android Malware Detection
No ratings yet
Discovering Optimal Features Using Static Analysis and A Genetic Search Based Method For Android Malware Detection
25 pages
En User Instructions Gasfires
No ratings yet
En User Instructions Gasfires
40 pages
SBST1303 - MAY2020 - Take Home Exam
No ratings yet
SBST1303 - MAY2020 - Take Home Exam
9 pages
A Hybrid Analysis-Based Approach To Android Malware Family Classification
No ratings yet
A Hybrid Analysis-Based Approach To Android Malware Family Classification
23 pages
PPG Sewage PDF
No ratings yet
PPG Sewage PDF
4 pages
Real-Time Behavior Analysis and Identification For Android Application
No ratings yet
Real-Time Behavior Analysis and Identification For Android Application
12 pages
Edu 210 Quiz
No ratings yet
Edu 210 Quiz
4 pages
16.experimental Comparison of Features and Classifiers For Android Malware Detection
No ratings yet
16.experimental Comparison of Features and Classifiers For Android Malware Detection
12 pages
كاتلوج 2
No ratings yet
كاتلوج 2
44 pages
Deep Learning Based Android Malware Detection Using Real PDF
No ratings yet
Deep Learning Based Android Malware Detection Using Real PDF
11 pages
User Manual 2195612
No ratings yet
User Manual 2195612
2 pages
A Comparative Study of Static, Dynamic and Hybrid
No ratings yet
A Comparative Study of Static, Dynamic and Hybrid
4 pages
Systemair Fans KVO Data Sheet Eng PDF
No ratings yet
Systemair Fans KVO Data Sheet Eng PDF
4 pages
Droiddeep: Using Deep Belief Network To Characterize and Detect Android Malware
No ratings yet
Droiddeep: Using Deep Belief Network To Characterize and Detect Android Malware
14 pages
Applsci 12 10755 v2
No ratings yet
Applsci 12 10755 v2
12 pages
Android Application Malware Analysis
No ratings yet
Android Application Malware Analysis
29 pages
Investing For Inclusion Exploring Lgbti Lens
No ratings yet
Investing For Inclusion Exploring Lgbti Lens
48 pages
PROJECT
No ratings yet
PROJECT
72 pages
Academic Calendar Spring 2018 FINAL
No ratings yet
Academic Calendar Spring 2018 FINAL
1 page
A Review On The Use of Deep Learning in Android Malware Detection PDF
No ratings yet
A Review On The Use of Deep Learning in Android Malware Detection PDF
17 pages
Malware - Me Project Document
No ratings yet
Malware - Me Project Document
31 pages
Ntdroid: Android Malware Detection Using Network Traffic: Features
No ratings yet
Ntdroid: Android Malware Detection Using Network Traffic: Features
12 pages
Malware Detection in Android in Different Application Categories
No ratings yet
Malware Detection in Android in Different Application Categories
6 pages
Malware - Me Project Document
No ratings yet
Malware - Me Project Document
28 pages
Machine Learning Aided Android Malware Classification
No ratings yet
Machine Learning Aided Android Malware Classification
21 pages
Android Malware
No ratings yet
Android Malware
21 pages
tddd17 Android Malware Fabha972 Andwi954 PDF
No ratings yet
tddd17 Android Malware Fabha972 Andwi954 PDF
11 pages
Phannarak CV
No ratings yet
Phannarak CV
2 pages
A Comprehensive Survey On Machine Learning Techniques For Android Malware Detection
No ratings yet
A Comprehensive Survey On Machine Learning Techniques For Android Malware Detection
12 pages
Droiddetector: Android Malware Characterization and Detection Using Deep Learning
No ratings yet
Droiddetector: Android Malware Characterization and Detection Using Deep Learning
10 pages
Codaspy 17 Droidsieve
No ratings yet
Codaspy 17 Droidsieve
12 pages
Malware Detection Using Static Analysis
No ratings yet
Malware Detection Using Static Analysis
54 pages
MS015 User Manual Multi
No ratings yet
MS015 User Manual Multi
90 pages
CSC 407 Report Body
No ratings yet
CSC 407 Report Body
38 pages
Iet-Ifs 2014 0099
No ratings yet
Iet-Ifs 2014 0099
8 pages
BT Nhóm
No ratings yet
BT Nhóm
16 pages
Report Documentation: Key Words: - Android Security, Malware Detection, Characterization, Deep Learning
No ratings yet
Report Documentation: Key Words: - Android Security, Malware Detection, Characterization, Deep Learning
1 page
Detection of Malicious Android Apps Using Machine Learning Techniques
No ratings yet
Detection of Malicious Android Apps Using Machine Learning Techniques
7 pages
An Effective End-To-End Android Malware Detection Method - Research Base Paper PDF
No ratings yet
An Effective End-To-End Android Malware Detection Method - Research Base Paper PDF
10 pages
Eagle Point
100% (1)
Eagle Point
5 pages
RC1665 - Mindi Puspita Anggraeni
No ratings yet
RC1665 - Mindi Puspita Anggraeni
5 pages
Shi Et Al. - 2023 - SFCGDroid Android Malware Detection Based On Sens
No ratings yet
Shi Et Al. - 2023 - SFCGDroid Android Malware Detection Based On Sens
10 pages
Machine Learning Approach For Malware de
No ratings yet
Machine Learning Approach For Malware de
11 pages
Significant Permission Identification For Machine Learning Based Android Malware Detection
No ratings yet
Significant Permission Identification For Machine Learning Based Android Malware Detection
10 pages
Mamadroid: Detecting Android Malware by Building Markov Chains of Behavioral Models (Extended Version)
No ratings yet
Mamadroid: Detecting Android Malware by Building Markov Chains of Behavioral Models (Extended Version)
34 pages
A Review of Deep Learning Models To Detect Malware in Android Applications
No ratings yet
A Review of Deep Learning Models To Detect Malware in Android Applications
9 pages
A Comprehensive Review Paper
No ratings yet
A Comprehensive Review Paper
36 pages
Feature Engineering and Evaluation For Android Malware Detection Scheme
No ratings yet
Feature Engineering and Evaluation For Android Malware Detection Scheme
18 pages
Malware Detection in Android Applications
No ratings yet
Malware Detection in Android Applications
3 pages
Spre It Zen Barth 2013 Mobile Sandbox
No ratings yet
Spre It Zen Barth 2013 Mobile Sandbox
8 pages
SOA - A Malware Detection System Using A Hybrid Approach of Multi-Heads Attention-Based Control Flow Traces and Image Visualization
No ratings yet
SOA - A Malware Detection System Using A Hybrid Approach of Multi-Heads Attention-Based Control Flow Traces and Image Visualization
47 pages
Harsha
No ratings yet
Harsha
13 pages
Great Debaters
No ratings yet
Great Debaters
51 pages
03 - Product Specification
No ratings yet
03 - Product Specification
4 pages
ActDroid An Active Learning Framework For Android Malware Detection
No ratings yet
ActDroid An Active Learning Framework For Android Malware Detection
32 pages
Android Malware Detection Using Machine Learning
No ratings yet
Android Malware Detection Using Machine Learning
4 pages
7.analysis and Detection of Malware in Android Applications Using Machine Learning
No ratings yet
7.analysis and Detection of Malware in Android Applications Using Machine Learning
55 pages
Heuristic-Based Malware Detection For Android Using Machine Learning
No ratings yet
Heuristic-Based Malware Detection For Android Using Machine Learning
6 pages
Malware Detection Using ML
No ratings yet
Malware Detection Using ML
19 pages
1 s2.0 S2667305323001436 Main
No ratings yet
1 s2.0 S2667305323001436 Main
10 pages
FINAL REVIEW PAPER Android Dynamic Malware Analysis
No ratings yet
FINAL REVIEW PAPER Android Dynamic Malware Analysis
12 pages
Iron Ore Mining Feasibility Study Word
No ratings yet
Iron Ore Mining Feasibility Study Word
13 pages
A Hybrid Approach For Android Mal Ware Detection
No ratings yet
A Hybrid Approach For Android Mal Ware Detection
15 pages
CIC AndMal 2017
No ratings yet
CIC AndMal 2017
5 pages
Topher Eufaula Layton Resume
No ratings yet
Topher Eufaula Layton Resume
2 pages
V25I0107
No ratings yet
V25I0107
6 pages
Android Malware Detection Report
No ratings yet
Android Malware Detection Report
9 pages
Towards A Fair Comparison and Realistic Evaluation Framework of Android Malware
No ratings yet
Towards A Fair Comparison and Realistic Evaluation Framework of Android Malware
18 pages
Odusami2018 Chapter AndroidMalwareDetectionASurvey
No ratings yet
Odusami2018 Chapter AndroidMalwareDetectionASurvey
12 pages
Research BT4260
No ratings yet
Research BT4260
5 pages
IEEE Xplore Citation Plain Text Download 2025.1.5.19.1.38
No ratings yet
IEEE Xplore Citation Plain Text Download 2025.1.5.19.1.38
9 pages
Hybrid Android Malware Detection and Classification Using Deep Neural Networks
No ratings yet
Hybrid Android Malware Detection and Classification Using Deep Neural Networks
26 pages
Multimodal ML Approach
No ratings yet
Multimodal ML Approach
16 pages
A Deep Learning Based Android Malware Detection System With Static Analysis
No ratings yet
A Deep Learning Based Android Malware Detection System With Static Analysis
7 pages
Spre It Zen Barth 2015 Mobile Sandbox
No ratings yet
Spre It Zen Barth 2015 Mobile Sandbox
13 pages
University Chemistry 1st Edition Peter E Siska Ebook and TestBank Bundle Fast Access
No ratings yet
University Chemistry 1st Edition Peter E Siska Ebook and TestBank Bundle Fast Access
325 pages
Spre It Zen Barth 2013 Mobile Sandbox
No ratings yet
Spre It Zen Barth 2013 Mobile Sandbox
9 pages
Intel AI Global Impact Festival 2025 - Flyer
No ratings yet
Intel AI Global Impact Festival 2025 - Flyer
1 page
Veracode Essentials: Definitive Reference for Developers and Engineers
From Everand
Veracode Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

Tifs 18

Uploaded by

Tifs 18

Uploaded by

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

DroidCat: Effective Android Malware Detection

ICC carrying URI data only 26%

APIs account for 80% function calls in the execution of

True Positive Rate

the date at 70 percentile, and then we held out the 30%

classification model are never used in validating the model.

True Positive Rate

True Positive Rate

baselines in their classification capabilities. In particular, we

You might also like