ABSTRACT
How do we know a program does what it claims to do? After clustering Android apps by their description topics, we identify outliers in each cluster with respect to their API usage. A “weather” app that sends messages thus becomes an anomaly; likewise, a “messaging” app would typically not be expected to access the current location. Applied on a set of 22,500+ Android applications, our CHABADA prototype identified several anomalies; additionally, it flagged 56% of novel malware as such, without requiring any known malware patterns.

[Figure: the CHABADA workflow — 1. App collection, 2. Topics, 3. Clusters — with example clusters such as “Weather + Travel”, “Map”, and “Themes”.]
Some clusters are populated by applications that cover substantially different topics. For instance, SlideIT free Keyboard, a popular tool for inserting text by sliding a finger along the keyboard letters, ended up in the language cluster, since more than half of its description is about language support. Romanian Racing, a racing game, also ended up in the same cluster, mainly because its brief description mentions multi-language support.

We also had some rare cases of applications that were associated with completely unexpected clusters. Hamster Life, a pet training game, ended up in the “religious wallpapers” cluster.

Uncommon behavior.
Some apps were tagged as outliers because their behavior, although benign and clearly described, is simply very uncommon for the clusters these applications belong to. For instance, SoundCloud – Music & Audio was tagged as an anomaly mainly because it records audio. This is expected behavior, since the application connects to a platform for finding and sharing new music, and it allows users to record the music they produce. However, recording audio is not one of the common features of Cluster 1, and consequently SoundCloud was tagged as an anomaly. Similarly, Llama – Location Profiles, which changes ringtones depending on context and location, was tagged as an anomaly because it is not common for personalization applications to access the user’s location and calendar. This application, though, uses this information to automatically switch between vibrate and ringtone modes when the user is, for instance, at home, at the office, or in a meeting.
Benign outliers.
In some clusters, the uncommon behavior CHABADA identified was, in fact, the lack of malicious behavior. For instance, Cluster 13 (“money”) contains several applications that use advertisement libraries and thus access sensitive information. As a consequence, Mr. Will’s Stud Poker and Mr. Will’s Draw Poker were tagged as anomalies because they do not access sensitive information, despite being poker games.

CHABADA is behavior-agnostic: It cannot determine whether an application is good or bad, only whether its behavior is common or not.
4.2 Malware Detection
Let us now turn to RQ2: Can our technique be used to identify malicious Android applications? For this purpose, we used the dataset of Zhou et al. [28], containing more than 1,200 known malicious apps for Android. In their raw form, these apps lack metadata such as title or description. As many of these apps are repackaged versions of an original app, we were able to collect the appropriate description from the Google Play Store: we used the title of the application and the package identifier to search for the right match in the Store. For 72 apps we found exactly the same package identifier, and for another 116 we found applications whose package identifier was very similar; we manually checked that each match was correct. As with our original set of “benign” apps (Section 2.1), we only kept applications with an English description, reducing the set to 172 apps.
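As an illustration, the matching step might look like the following sketch. It is not the procedure reported in the paper: difflib and the 0.9 similarity threshold are our own illustrative assumptions, and ambiguous matches were in any case verified by hand.

    # Hypothetical sketch of matching a malware sample to its Play Store
    # counterpart by package identifier: exact match first, then the most
    # similar identifier. difflib and the 0.9 threshold are illustrative
    # assumptions; the paper's matches were verified manually.
    from difflib import SequenceMatcher

    def find_match(package_id, store_ids, threshold=0.9):
        if package_id in store_ids:  # exact match (72 of the samples)
            return package_id
        def similarity(candidate):
            return SequenceMatcher(None, package_id, candidate).ratio()
        best = max(store_ids, key=similarity)
        return best if similarity(best) >= threshold else None

    # "com.example.racing" is close enough to "com.example.racing2"
    print(find_match("com.example.racing", ["com.example.racing2", "org.foo"]))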
As a malware detector, we again used the OC-SVM model; this time, however, we used it as a classifier—that is, the SVM model makes a binary decision on whether an element is part of the same distribution or not.
4.2.1 Classification using topic clusters
We ran the classification on the “benign” set of apps used in the previous study, and we included the 172 malware samples for which we could find the corresponding description. As before, we trained the OC-SVM only on “benign” applications, excluding those manually classified as “malicious” during the previous experiment (Section 4.1). Within each cluster, we trained the OC-SVM on the APIs of 90% of these apps, and then used it as a classifier on a test set composed of the known malicious apps in that cluster plus the remaining 10% of benign apps. We thus simulated a situation in which the malware attack is entirely novel—CHABADA must correctly identify the malware as such without knowing any previous malware patterns.

As in the previous experiment, we repeated this 10 times, each time with a different test set. The malicious applications are not equally distributed across clusters, since they are assigned to clusters based on their descriptions; in our evaluation setting, with our data set, the number of malicious applications per cluster ranges from 0 to 39.
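As a rough sketch of this procedure, consider the following; it assumes binary API-usage feature vectors (one numpy row per app) and uses scikit-learn's OneClassSVM in place of the paper's OC-SVM, with placeholder hyperparameters rather than the paper's settings.

    # Minimal sketch of the per-cluster classification, assuming binary
    # API-usage feature vectors as numpy arrays (one row per app).
    # scikit-learn's OneClassSVM stands in for the paper's OC-SVM; the
    # kernel and nu values are placeholders, not the paper's settings.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import OneClassSVM

    def classify_cluster(benign_X, malicious_X, n_runs=10, seed=0):
        """Train on 90% of a cluster's benign apps; test on the other
        10% plus the cluster's known malicious apps."""
        detected, false_alarms = [], []
        folds = KFold(n_splits=n_runs, shuffle=True, random_state=seed)
        for train_idx, test_idx in folds.split(benign_X):
            oc_svm = OneClassSVM(kernel="rbf", nu=0.1)
            oc_svm.fit(benign_X[train_idx])
            # predict() returns +1 ("same distribution") or -1 (outlier)
            if len(malicious_X):
                detected.append(np.mean(oc_svm.predict(malicious_X) == -1))
            false_alarms.append(
                np.mean(oc_svm.predict(benign_X[test_idx]) == -1))
        return np.mean(detected) if detected else 0.0, np.mean(false_alarms)

With ten folds, each benign app serves as test data exactly once; clusters without malicious apps contribute only to the false-alarm rate.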
The results of our classification are shown in Table 6; we report the averages over the 10 runs. CHABADA correctly identifies 56% of the malicious apps as such, while only 16% of “benign” apps are misclassified. If our approach were used to guard against yet unknown malicious behavior, it would detect the majority of malware as such.

Table 6: Checking APIs and descriptions within topic clusters (our approach)

                   Predicted as malicious   Predicted as benign
   Malicious apps  96.5 (56%)               75.5 (44%)
   Benign apps     353.9 (16%)              1,884.4 (84%)

Compared against standard malware detectors, these results of course leave room for improvement—but that is because existing malware detectors compare against known malware, whose signatures and behavior are already known. For instance, accessing all user accounts on the device, as the London Restaurants app does, is a known pattern of malicious behavior. In practice, our approach would thus complement such detectors, specifically targeting novel attacks that differ from existing malware—but whose API usage is sufficiently abnormal to be flagged as an outlier.

In our sample, even without knowing existing malware patterns, CHABADA detects the majority of malware as such.
4.2.2 Classification without clustering
We further evaluate the effectiveness of our approach by comparing it against alternatives. To show the impact of topic clustering, we compare our classification results against a setting in which the OC-SVM is trained on sensitive APIs and NLP-preprocessed words from the descriptions alone—that is, all applications form one big cluster. As Table 7 shows, the malware detection rate decreases dramatically, which demonstrates the benefit of our clustering approach.

Table 7: Checking APIs and descriptions in one single cluster

                   Predicted as malicious   Predicted as benign
   Malicious apps  41 (24%)                 131 (76%)
   Benign apps     334.9 (15%)              1,903.1 (85%)

Classifying without clustering yields more false negatives.
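This baseline can be sketched as a single OC-SVM over a combined feature space. The toy data below is purely illustrative, and CountVectorizer merely approximates the NLP preprocessing the paper applies to descriptions.

    # Sketch of the no-clustering baseline: one OC-SVM over all apps,
    # trained on description words plus sensitive-API flags. The toy
    # data is illustrative only; CountVectorizer merely approximates
    # the NLP preprocessing (stop words, stemming) used in the paper.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import OneClassSVM

    descriptions = [
        "track your jogging route on a map",       # hypothetical app 1
        "weather forecast with radar and alerts",  # hypothetical app 2
    ]
    api_flags = np.array([[1, 0],   # one row per app, one column per
                          [0, 1]])  # sensitive API (1 = API is used)

    vectorizer = CountVectorizer(stop_words="english", binary=True)
    words = vectorizer.fit_transform(descriptions).toarray()
    X = np.hstack([words, api_flags])  # one big, undifferentiated cluster

    oc_svm = OneClassSVM(kernel="rbf", nu=0.1).fit(X)
    print(oc_svm.predict(X))  # +1 = inlier, -1 = outlier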
4.2.3 Classification using given categories
Finally, one may ask why our specific approach of clustering by description topics is needed at all, as one could just as easily use the given store categories. To this end, we clustered the applications based on their categories in the Google Play Store and repeated the experiment with the resulting 30 clusters. The results (Table 8) demonstrate the general benefits of clustering; however, topic clustering, as in our approach, is still clearly superior. (Additionally, one may argue that a category is something that some librarian would have to assign, thus requiring more work and more data.)

Table 8: Checking APIs and descriptions within Google Play Store categories

                   Predicted as malicious   Predicted as benign
   Malicious apps  81.6 (47%)               90.4 (53%)
   Benign apps     356.9 (16%)              1,881.1 (84%)

Clustering by description topics is superior to clustering by given categories.
4.3 Limitations and Threats to Validity
Like any empirical study, our evaluation is subject to threats to validity, many of which are induced by limitations of our approach. The most important threats and limitations are listed below.

External validity. CHABADA relies on establishing a relationship between description topics and program features from existing, assumed mostly benign, applications. We cannot claim that these relationships hold in other app ecosystems or transfer to them. We have documented our steps to allow easy replication of our approach.

Free apps only. Our sample of 22,521 apps comprises free applications only, i.e. applications that need to generate income through ads, purchases, or donations. Not considering paid applications makes our dataset biased. However, the bias shifts “normality” towards apps supported by ads and other income methods, which are closer to the undesired behavior exposed by malware. Our results are thus conservative and would rather improve with a greater fraction of paid applications, which can be expected to be benign.

App and malware bias. Our sample also only reflects the top 150 downloads from each category in the Google Play Store. This sample is biased towards frequently used applications and towards lesser-used categories; likewise, our selection of malware (Section 4) may or may not be representative of current threats. Since we do not know which apps Android users actually use, and how, these samples may be biased. Again, we allow for easy reproduction of our approach.

Researcher bias. Our evaluation of outliers is based on the classification by a single person, who is a co-author of this paper. This poses the risk of researcher bias, i.e. the desire of an author to come up with the best possible results. To counter this threat, we are making our dataset publicly available (Section 6).

Native code and obfuscation. We limit our analyses to Dalvik bytecode; we do not analyze native code. Hence, an application might rely on native code or use obfuscation to perform covert behavior. But then, such features may again characterize outliers; also, neither of these would change the set of APIs that must be called.

Static analysis. As we rely on static API usage, we suffer from the limitations typical of static analysis. In particular, we may miss behavior induced through reflection, i.e. code generated at runtime. Although there exist techniques to statically analyze Java code that uses reflection, such techniques are not directly applicable to Android apps [5, 8]; in the long run, dynamic analysis paired with test generation may be a better option.

Static API declarations. Since we extract API calls statically, we may consider API calls that are never executed by the app. Checking statically whether an API call is reachable is an instance of the (undecidable) halting problem. As a workaround, we decided to consider an API only if the corresponding permission is also declared in the manifest (see the sketch after this list).

Sensitive APIs. Our detection of sensitive APIs (Section 3.2) relies on the mapping by Felt et al. [7], which now, two years later, may be partially outdated. Incorrect or missing entries in the mapping would make CHABADA miss or misclassify relevant behavior of the app.
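The workaround from the “Static API declarations” item above can be sketched as a simple filter; the API-to-permission mapping below is a tiny hypothetical stand-in for the full mapping of Felt et al. [7].

    # Sketch of the "Static API declarations" workaround: keep a
    # statically extracted API call only if the permission it requires
    # is declared in the app's manifest. The two entries below are a
    # hypothetical stand-in for the full mapping of Felt et al. [7].
    API_PERMISSIONS = {
        "android.telephony.SmsManager.sendTextMessage":
            "android.permission.SEND_SMS",
        "android.location.LocationManager.getLastKnownLocation":
            "android.permission.ACCESS_FINE_LOCATION",
    }

    def filter_apis(extracted_apis, manifest_permissions):
        """Drop sensitive API calls whose permission is never declared;
        without the permission, such calls could not succeed anyway."""
        return [api for api in extracted_apis
                if API_PERMISSIONS.get(api) in manifest_permissions]

    # The location API is kept; the SMS API is filtered out.
    print(filter_apis(list(API_PERMISSIONS),
                      {"android.permission.ACCESS_FINE_LOCATION"}))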
5. RELATED WORK
While this work may be the first to generally check app descriptions against app behavior, it builds on a history of previous work combining natural language processing and software development.

5.1 Mining App Descriptions
Most closely related to our work is the WHYPER framework of Pandita et al. [19]. Just like our approach, WHYPER attempts to automate the risk assessment of Android apps and applies natural language processing to app descriptions. The aim of WHYPER is to tell whether the need for sensitive permissions (such as access to contacts or calendar) is motivated in the application description. In contrast to CHABADA, which fully automatically learns which topics are associated with which APIs (and, by extension, which permissions), WHYPER requires manual annotation of the sentences describing the need for permissions. Also, CHABADA goes beyond permissions in two ways: first, it focuses on APIs, which provide a more detailed view; second, it aims for general mismatches between expectations and implementations.

The very idea of app store mining was introduced one year earlier, when Harman et al. mined the BlackBerry app store [9]. They focused on app metadata to find patterns, such as a correlation between consumer rating and download rank, but did not download or analyze the apps themselves.

Our characterization of “normal” behavior comes from mining related applications; in general, we assume that what most applications in a well-maintained store do is also what most users would expect to be legitimate. In contrast, recent work by Lin et al. [14] suggests crowdsourcing to infer what users expect from specific privacy settings; just as we found, Lin et al. also highlight that privacy expectations vary between app categories. Such information from users can well complement what we infer from app descriptions.
5.2 Behavior/Description Mismatches
Our approach is also related to techniques that apply natural language processing to infer specifications from comments and documentation. Tan et al. [24] extract implicit program rules from program corpora and use these rules to automatically detect inconsistencies between comments and source code, indicating either bugs or bad comments. Their rules apply to the ordering and nesting of calls and resource accesses (“fa must not be called from fb”).

Høst and Østvold [11] learn from program corpora which verbs and phrases are normally associated with specific method calls, and use these to identify misnamed methods.

Pandita et al. [20] identify sentences that describe code contracts in more than 2,500 sentences of API documents; these contracts can be checked either through tests or static analysis.

All these approaches compare program code against formal program documentation, whose semi-formal nature makes it easier to extract requirements. In contrast, CHABADA works on end-user documentation, which is decoupled from the program structure.
5.3 Detecting Malicious Apps
There is a large body of industrial products and research prototypes that focus on identifying known malicious behavior. Most influential for our work was the paper by Zhou and Jiang [28], who use the permissions requested by applications as a filter to identify potentially malicious applications; the actual detection uses static analysis to compare sequences of API calls against those of known malware. In contrast to all these approaches, CHABADA identifies outliers even without knowing what constitutes malicious behavior.

The TAINTDROID system [6] tracks dynamic information flow within Android apps and thus can detect usage of sensitive information. Such dynamic flow information would yield far more precise behavior insights than static API usage; similarly, profilers such as ProfileDroid [26] would provide better information. However, both TAINTDROID and ProfileDroid require a representative set of executions. Integrating such techniques into CHABADA, combined with automated test generation [12, 27, 15, 1], would make it possible to learn normal and abnormal patterns of information flow; this is part of our future work (Section 6).
6. CONCLUSION AND CONSEQUENCES
By clustering apps by description topics and identifying outliers by API usage within each cluster, our CHABADA approach effectively identifies applications whose behavior would be unexpected given their description. We have identified several examples of false and misleading advertising; as a side effect, we obtained a novel, effective detector for yet unknown malware. Just as mining software archives has opened new opportunities for empirical software engineering, we see that mining apps and their descriptions opens several new opportunities for automated checking of natural-language requirements.

During our work, we have gained a number of insights into the Android app ecosystem that call for action. First and foremost, application vendors must be much more explicit about what their apps do to earn their income. App store suppliers such as Google should introduce better standards to avoid deceptive or incomplete advertising. Second, the way Android asks its users for permissions is broken. Regular users will not understand what “allow access to the device identifier” means, nor do they have the means to check what is actually being done with their sensitive data, nor would they understand the consequences. Users do understand, though, what regular apps do; and CHABADA is set to point out and highlight differences, which should be far easier to grasp.

Although our present approach came to be by exploring and refining several alternatives, we are well aware that it is by no means perfect or complete. Our future work will focus on the following topics:

Detailed Behavior patterns. Static API usage is a rather broad abstraction for characterizing what an app does and does not do. More advanced methods could focus on the interaction of APIs, notably information flow between APIs.

Dynamic Behavior. Exploring actual executions would give a far more detailed view of what an app actually does—in particular, concrete values for all APIs accessing remote resources. We are working on GUI test generators for Android apps that aim for coverage of specific APIs or dynamic information flow.

Natural Language Processing. The state of the art in natural language processing can retrieve much more than just topics. Looking at dependencies between words (such as conjunctions, subject-verb, verb-object) could retrieve much more detailed patterns. Likewise, leveraging known ontologies would help in identifying synonyms.

A Rosetta Stone for Topics and Behavior. By mining thousands of applications, we can associate natural-language descriptions with specific program behavior. The resulting mapping between natural language and program fragments helps in program understanding as well as in the synthesis of programs and tests.

To allow easy reproduction and verification of our work, we have packaged all data used within this work for download. In particular, we have prepared a 50 MB dataset with the exact data that goes into CHABADA, including app names, descriptions, other metadata, permissions, and API usage. All of this can be found on the CHABADA web site:

http://www.st.cs.uni-saarland.de/chabada/

7. ACKNOWLEDGMENTS
We thank Vitalii Avdiienko, Juan Pablo Galeotti, Clemens Hammacher, Konrad Jamrozik, and Sascha Just for their helpful feedback on earlier revisions of this paper. Special thanks go to Andrea Fischer, Joerg Schad, Stefan Richter, and Stefan Schuh for their valuable assistance during the project.

This work was funded by the European Research Council (ERC) Advanced Grant “SPECMATE – Specification Mining and Testing”.
8. REFERENCES
[1] D. Amalfitano, A. R. Fasolino, P. Tramontana, S. De Carmine, and A. M. Memon. Using GUI ripping for automated testing of Android applications. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 258–261, New York, NY, USA, 2012. ACM.
[2] K. W. Y. Au, Y. F. Zhou, Z. Huang, and D. Lie. PScout: analyzing the Android permission specification. In ACM Conference on Computer and Communications Security (CCS), pages 217–228, New York, NY, USA, 2012. ACM.
[3] A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack surface: An application to Android. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 274–277, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] E. Bodden, A. Sewe, J. Sinschek, H. Oueslati, and M. Mezini. Taming reflection: Aiding static analysis in the presence of reflection and custom class loaders. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 241–250, New York, NY, USA, 2011. ACM.
[6] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. In USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 1–6, Berkeley, CA, USA, 2010. USENIX Association.
[7] A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner. Android permissions demystified. In ACM Conference on Computer and Communications Security (CCS), pages 627–638, New York, NY, USA, 2011. ACM.
[8] C. Fritz, S. Arzt, S. Rasthofer, E. Bodden, A. Bartel, J. Klein, Y. le Traon, D. Octeau, and P. McDaniel. Highly precise taint analysis for Android applications. Technical Report TUD-CS-2013-0113, EC SPRIDE, 2013.
[9] M. Harman, Y. Jia, and Y. Zhang. App store mining and analysis: MSR for app stores. In IEEE Working Conference on Mining Software Repositories (MSR), pages 108–111, 2012.
[10] K. A. Heller, K. M. Svore, A. D. Keromytis, and S. J. Stolfo. One class support vector machines for detecting anomalous Windows registry accesses. In ICDM Workshop on Data Mining for Computer Security (DMSEC), 2003.
[11] E. W. Høst and B. M. Østvold. Debugging method names. In European Conference on Object-Oriented Programming (ECOOP), pages 294–317. Springer, 2009.
[12] C. Hu and I. Neamtiu. Automating GUI testing for Android applications. In International Workshop on Automation of Software Test (AST), pages 77–83, New York, NY, USA, 2011. ACM.
[13] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[14] J. Lin, S. Amini, J. I. Hong, N. Sadeh, J. Lindqvist, and J. Zhang. Expectation and purpose: understanding users’ mental models of mobile app privacy through crowdsourcing. In ACM Conference on Ubiquitous Computing (UbiComp), pages 501–510, New York, NY, USA, 2012. ACM.
[15] A. Machiry, R. Tahiliani, and M. Naik. Dynodroid: an input generation system for Android apps. In European Software Engineering Conference held jointly with ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 224–234, New York, NY, USA, 2013. ACM.
[16] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman, editors, Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
[17] L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139–154, 2002.
[18] A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[19] R. Pandita, X. Xiao, W. Yang, W. Enck, and T. Xie. WHYPER: Towards automating risk assessment of mobile applications. In USENIX Security Symposium, pages 527–542, 2013.
[20] R. Pandita, X. Xiao, H. Zhong, T. Xie, S. Oney, and A. Paradkar. Inferring method specifications from natural language API descriptions. In ACM/IEEE International Conference on Software Engineering (ICSE), 2012.
[21] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1):53–65, 1987.
[22] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[23] R. Stevens, J. Ganz, P. Devanbu, H. Chen, and V. Filkov. Asking for (and about) permissions used by Android apps. In IEEE Working Conference on Mining Software Repositories (MSR), pages 31–40, San Francisco, CA, 2013.
[24] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /* iComment: Bugs or bad comments? */. In ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pages 145–158, 2007.
[25] A. Wasylkowski, A. Zeller, and C. Lindig. Detecting object usage anomalies. In European Software Engineering Conference held jointly with ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 35–44, New York, NY, 2007. ACM.
[26] X. Wei, L. Gomez, I. Neamtiu, and M. Faloutsos. ProfileDroid: multi-layer profiling of Android applications. In ACM Annual International Conference on Mobile Computing and Networking (MobiCom), pages 137–148, New York, NY, USA, 2012. ACM.
[27] W. Yang, M. R. Prasad, and T. Xie. A grey-box approach for automated GUI-model generation of mobile applications. In International Conference on Fundamental Approaches to Software Engineering (FASE), pages 250–265, Berlin, Heidelberg, 2013. Springer-Verlag.
[28] Y. Zhou and X. Jiang. Dissecting Android malware: Characterization and evolution. In IEEE Symposium on Security and Privacy (SP), pages 95–109, Washington, DC, USA, 2012. IEEE Computer Society.