ABSTRACT
How do we know a program does what it claims to do? After clustering Android apps by their description topics, we identify outliers in each cluster with respect to their API usage. A “weather” app that sends messages thus becomes an anomaly; likewise, a “messaging” app would typically not be expected to access the current location. Applied on a set of 22,500+ Android applications, our CHABADA prototype identified several anomalies; additionally, it flagged 56% of novel malware as such, without requiring any known malware patterns.

[Figure: the CHABADA workflow — 1. App collection, 2. Topics, 3. Clusters — with example clusters such as “Weather + Travel”, “Map”, and “Themes”.]
Some clusters are populated by applications that cover substantially different topics. For instance, SlideIT free Keyboard, a popular tool for inserting text by sliding a finger along the keyboard letters, ended up in the language cluster, since more than half of its description is about language support. Romanian Racing, a racing game, also ended up in the same cluster, mainly because its brief description mentions multi-language support.

We also had some rare cases of applications that were associated with completely unexpected clusters. Hamster Life, a pet training game, ended up in the “religious wallpapers” cluster.

Uncommon behavior.
Some apps were tagged as outliers because their behavior, although benign and clearly described, is simply very uncommon for the clusters these applications belong to. For instance, SoundCloud – Music & Audio was tagged as an anomaly mainly because it records audio. This is expected behavior, since the application connects to a platform for finding and sharing new music, and it allows users to record the music they produce. However, recording audio is not one of the common features of Cluster 1, and consequently SoundCloud was tagged as an anomaly. Similarly, Llama – Location Profiles, which changes ringtones depending on context and location, was tagged as an anomaly because it is not common for personalization applications to access the user’s location and calendar. This application, though, uses this information to automatically switch between vibrate and ringtone modes when the user is, for instance, at home, at the office, or in a meeting.
Benign outliers.
In some clusters, the uncommon behavior CHABADA identified was, in fact, the lack of malicious behavior. For instance, Cluster 13 (“money”) contains several applications that use advertisement libraries and thus access sensitive information. As a consequence, Mr. Will’s Stud Poker and Mr. Will’s Draw Poker were tagged as anomalies because they do not access sensitive information, despite being poker games.

CHABADA is behavior-agnostic: It cannot determine whether an application is good or bad, only whether its behavior is common or not.
4.2 Malware Detection
Let us now turn to RQ2: Can our technique be used to identify malicious Android applications? For this purpose, we used the dataset of Zhou et al. [28], containing more than 1,200 known malicious apps for Android. In their raw form, these apps lack metadata such as title or description. As many of these apps are repackaged versions of an original app, we were able to collect the appropriate description from the Google Play Store: we used the title of the application and the package identifier to search for the right match in the Store. For 72 apps we found exactly the same package identifier, and for another 116 we found applications whose package identifier was very similar; we manually checked that each match was correct. As with our original set of “benign” apps (Section 2.1), we only kept applications with an English description, reducing the set to 172 apps.
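As an illustration, the matching step might look like the following sketch. It is not the procedure reported in the paper: difflib and the 0.9 similarity threshold are our own illustrative assumptions, and ambiguous matches were in any case verified by hand.

    # Hypothetical sketch of matching a malware sample to its Play Store
    # counterpart by package identifier: exact match first, then the most
    # similar identifier. difflib and the 0.9 threshold are illustrative
    # assumptions; the paper's matches were verified manually.
    from difflib import SequenceMatcher

    def find_match(package_id, store_ids, threshold=0.9):
        if package_id in store_ids:  # exact match (72 of the samples)
            return package_id
        def similarity(candidate):
            return SequenceMatcher(None, package_id, candidate).ratio()
        best = max(store_ids, key=similarity)
        return best if similarity(best) >= threshold else None

    # "com.example.racing" is close enough to "com.example.racing2"
    print(find_match("com.example.racing", ["com.example.racing2", "org.foo"]))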
As a malware detector, we again used the OC-SVM model; this time, however, we used it as a classifier—that is, the SVM model makes a binary decision on whether an element is part of the same distribution or not.
4.2.1 Classification using topic clusters
We ran the classification on the “benign” set of apps used in the previous study, and we included the 172 malware samples for which we could find the corresponding description. As before, we trained the OC-SVM only on “benign” applications, excluding those manually classified as “malicious” during the previous experiment (Section 4.1). Within each cluster, we trained the OC-SVM on the APIs of 90% of these apps, and then used it as a classifier on a test set composed of the known malicious apps in that cluster plus the remaining 10% of benign apps. We thus simulated a situation in which the malware attack is entirely novel—CHABADA must correctly identify the malware as such without knowing any previous malware patterns.

As in the previous experiment, we repeated this 10 times, each time with a different test set. The malicious applications are not equally distributed across clusters, since they are assigned to clusters based on their descriptions; in our evaluation setting, with our data set, the number of malicious applications per cluster ranges from 0 to 39.
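As a rough sketch of this procedure, consider the following; it assumes binary API-usage feature vectors (one numpy row per app) and uses scikit-learn's OneClassSVM in place of the paper's OC-SVM, with placeholder hyperparameters rather than the paper's settings.

    # Minimal sketch of the per-cluster classification, assuming binary
    # API-usage feature vectors as numpy arrays (one row per app).
    # scikit-learn's OneClassSVM stands in for the paper's OC-SVM; the
    # kernel and nu values are placeholders, not the paper's settings.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import OneClassSVM

    def classify_cluster(benign_X, malicious_X, n_runs=10, seed=0):
        """Train on 90% of a cluster's benign apps; test on the other
        10% plus the cluster's known malicious apps."""
        detected, false_alarms = [], []
        folds = KFold(n_splits=n_runs, shuffle=True, random_state=seed)
        for train_idx, test_idx in folds.split(benign_X):
            oc_svm = OneClassSVM(kernel="rbf", nu=0.1)
            oc_svm.fit(benign_X[train_idx])
            # predict() returns +1 ("same distribution") or -1 (outlier)
            if len(malicious_X):
                detected.append(np.mean(oc_svm.predict(malicious_X) == -1))
            false_alarms.append(
                np.mean(oc_svm.predict(benign_X[test_idx]) == -1))
        return np.mean(detected) if detected else 0.0, np.mean(false_alarms)

With ten folds, each benign app serves as test data exactly once; clusters without malicious apps contribute only to the false-alarm rate.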
The results of our classification are shown in Table 6; we report the averages over the 10 runs. CHABADA correctly identifies 56% of the malicious apps as such, while only 16% of “benign” apps are misclassified. If our approach were used to guard against yet unknown malicious behavior, it would detect the majority of malware as such.

Table 6: Checking APIs and descriptions within topic clusters (our approach)

                   Predicted as malicious   Predicted as benign
   Malicious apps  96.5 (56%)               75.5 (44%)
   Benign apps     353.9 (16%)              1,884.4 (84%)

Compared against standard malware detectors, these results of course leave room for improvement—but that is because existing malware detectors compare against known malware, whose signatures and behavior are already known. For instance, accessing all user accounts on the device, as the London Restaurants app does, is a known pattern of malicious behavior. In practice, our approach would thus complement such detectors, specifically targeting novel attacks that differ from existing malware—but whose API usage is sufficiently abnormal to be flagged as an outlier.

In our sample, even without knowing existing malware patterns, CHABADA detects the majority of malware as such.
4.2.2 Classification without clustering
We further evaluate the effectiveness of our approach by comparing it against alternatives. To show the impact of topic clustering, we compare our classification results against a setting in which the OC-SVM is trained on sensitive APIs and NLP-preprocessed words from the descriptions alone—that is, all applications form one big cluster. As Table 7 shows, the malware detection rate decreases dramatically, which demonstrates the benefit of our clustering approach.

Table 7: Checking APIs and descriptions in one single cluster

                   Predicted as malicious   Predicted as benign
   Malicious apps  41 (24%)                 131 (76%)
   Benign apps     334.9 (15%)              1,903.1 (85%)

Classifying without clustering yields more false negatives.
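This baseline can be sketched as a single OC-SVM over a combined feature space. The toy data below is purely illustrative, and CountVectorizer merely approximates the NLP preprocessing the paper applies to descriptions.

    # Sketch of the no-clustering baseline: one OC-SVM over all apps,
    # trained on description words plus sensitive-API flags. The toy
    # data is illustrative only; CountVectorizer merely approximates
    # the NLP preprocessing (stop words, stemming) used in the paper.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import OneClassSVM

    descriptions = [
        "track your jogging route on a map",       # hypothetical app 1
        "weather forecast with radar and alerts",  # hypothetical app 2
    ]
    api_flags = np.array([[1, 0],   # one row per app, one column per
                          [0, 1]])  # sensitive API (1 = API is used)

    vectorizer = CountVectorizer(stop_words="english", binary=True)
    words = vectorizer.fit_transform(descriptions).toarray()
    X = np.hstack([words, api_flags])  # one big, undifferentiated cluster

    oc_svm = OneClassSVM(kernel="rbf", nu=0.1).fit(X)
    print(oc_svm.predict(X))  # +1 = inlier, -1 = outlier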
4.2.3 Classification using given categories
Finally, one may ask why our specific approach of clustering by description topics is needed at all, as one could just as easily use the given store categories. To this end, we clustered the applications based on their categories in the Google Play Store and repeated the experiment with the resulting 30 clusters. The results (Table 8) demonstrate the general benefits of clustering; however, topic clustering, as in our approach, is still clearly superior. (Additionally, one may argue that a category is something that some librarian would have to assign, thus requiring more work and more data.)

Table 8: Checking APIs and descriptions within Google Play Store categories

                   Predicted as malicious   Predicted as benign
   Malicious apps  81.6 (47%)               90.4 (53%)
   Benign apps     356.9 (16%)              1,881.1 (84%)

Clustering by description topics is superior to clustering by given categories.
4.3 Limitations and Threats to Validity
Like any empirical study, our evaluation is subject to threats to validity, many of which are induced by limitations of our approach. The most important threats and limitations are listed below.

External validity. CHABADA relies on establishing a relationship between description topics and program features from existing, assumed mostly benign, applications. We cannot claim that these relationships hold in other app ecosystems or transfer to them. We have documented our steps to allow easy replication of our approach.

Free apps only. Our sample of 22,521 apps comprises free applications only, i.e. applications that need to generate income through ads, purchases, or donations. Not considering paid applications makes our dataset biased. However, the bias shifts “normality” towards apps supported by ads and other income methods, which are closer to the undesired behavior exposed by malware. Our results are thus conservative and would rather improve with a greater fraction of paid applications, which can be expected to be benign.

App and malware bias. Our sample also only reflects the top 150 downloads from each category in the Google Play Store. This sample is biased towards frequently used applications and towards lesser-used categories; likewise, our selection of malware (Section 4) may or may not be representative of current threats. Since we do not know which apps Android users actually use, and how, these samples may be biased. Again, we allow for easy reproduction of our approach.

Researcher bias. Our evaluation of outliers is based on the classification by a single person, who is a co-author of this paper. This poses the risk of researcher bias, i.e. the desire of an author to come up with the best possible results. To counter this threat, we are making our dataset publicly available (Section 6).

Native code and obfuscation. We limit our analyses to Dalvik bytecode; we do not analyze native code. Hence, an application might rely on native code or use obfuscation to perform covert behavior. But then, such features may again characterize outliers; also, neither of these would change the set of APIs that must be called.

Static analysis. As we rely on static API usage, we suffer from the limitations typical of static analysis. In particular, we may miss behavior induced through reflection, i.e. code generated at runtime. Although there exist techniques to statically analyze Java code that uses reflection, such techniques are not directly applicable to Android apps [5, 8]; in the long run, dynamic analysis paired with test generation may be a better option.

Static API declarations. Since we extract API calls statically, we may consider API calls that are never executed by the app. Checking statically whether an API call is reachable is an instance of the (undecidable) halting problem. As a workaround, we decided to consider an API only if the corresponding permission is also declared in the manifest (see the sketch after this list).

Sensitive APIs. Our detection of sensitive APIs (Section 3.2) relies on the mapping by Felt et al. [7], which now, two years later, may be partially outdated. Incorrect or missing entries in the mapping would make CHABADA miss or misclassify relevant behavior of the app.
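The workaround from the “Static API declarations” item above can be sketched as a simple filter; the API-to-permission mapping below is a tiny hypothetical stand-in for the full mapping of Felt et al. [7].

    # Sketch of the "Static API declarations" workaround: keep a
    # statically extracted API call only if the permission it requires
    # is declared in the app's manifest. The two entries below are a
    # hypothetical stand-in for the full mapping of Felt et al. [7].
    API_PERMISSIONS = {
        "android.telephony.SmsManager.sendTextMessage":
            "android.permission.SEND_SMS",
        "android.location.LocationManager.getLastKnownLocation":
            "android.permission.ACCESS_FINE_LOCATION",
    }

    def filter_apis(extracted_apis, manifest_permissions):
        """Drop sensitive API calls whose permission is never declared;
        without the permission, such calls could not succeed anyway."""
        return [api for api in extracted_apis
                if API_PERMISSIONS.get(api) in manifest_permissions]

    # The location API is kept; the SMS API is filtered out.
    print(filter_apis(list(API_PERMISSIONS),
                      {"android.permission.ACCESS_FINE_LOCATION"}))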
5. RELATED WORK
While this work may be the first to generally check app descriptions against app behavior, it builds on a history of previous work combining natural language processing and software development.

5.1 Mining App Descriptions
Most closely related to our work is the WHYPER framework of Pandita et al. [19]. Just like our approach, WHYPER attempts to automate the risk assessment of Android apps and applies natural language processing to app descriptions. The aim of WHYPER is to tell whether the need for sensitive permissions (such as access to contacts or calendar) is motivated in the application description. In contrast to CHABADA, which fully automatically learns which topics are associated with which APIs (and, by extension, which permissions), WHYPER requires manual annotation of the sentences describing the need for permissions. Also, CHABADA goes beyond permissions in two ways: first, it focuses on APIs, which provide a more detailed view; second, it aims for general mismatches between expectations and implementations.

The very idea of app store mining was introduced one year earlier, when Harman et al. mined the BlackBerry app store [9]. They focused on app metadata to find patterns, such as a correlation between consumer rating and download rank, but did not download or analyze the apps themselves.

Our characterization of “normal” behavior comes from mining related applications; in general, we assume that what most applications in a well-maintained store do is also what most users would expect to be legitimate. In contrast, recent work by Lin et al. [14] suggests crowdsourcing to infer what users expect from specific privacy settings; just as we found, Lin et al. also highlight that privacy expectations vary between app categories. Such information from users can well complement what we infer from app descriptions.
5.2 Behavior/Description Mismatches
Our approach is also related to techniques that apply natural language processing to infer specifications from comments and documentation. Tan et al. [24] extract implicit program rules from program corpora and use these rules to automatically detect inconsistencies between comments and source code, indicating either bugs or bad comments. Their rules apply to the ordering and nesting of calls and resource accesses (“fa must not be called from fb”).

Høst and Østvold [11] learn from program corpora which verbs and phrases are normally associated with specific method calls, and use these to identify misnamed methods.

Pandita et al. [20] identify sentences that describe code contracts in more than 2,500 sentences of API documents; these contracts can be checked either through tests or static analysis.

All these approaches compare program code against formal program documentation, whose semi-formal nature makes it easier to extract requirements. In contrast, CHABADA works on end-user documentation, which is decoupled from the program structure.
5.3 Detecting Malicious Apps
There is a large body of industrial products and research prototypes that focus on identifying known malicious behavior. Most influential for our work was the paper by Zhou and Jiang [28], who use the permissions requested by applications as a filter to identify potentially malicious applications; the actual detection uses static analysis to compare sequences of API calls against those of known malware. In contrast to all these approaches, CHABADA identifies outliers even without knowing what constitutes malicious behavior.

The TAINTDROID system [6] tracks dynamic information flow within Android apps and thus can detect usage of sensitive information. Such dynamic flow information would yield far more precise behavior insights than static API usage; similarly, profilers such as ProfileDroid [26] would provide better information. However, both TAINTDROID and ProfileDroid require a representative set of executions. Integrating such techniques into CHABADA, combined with automated test generation [12, 27, 15, 1], would make it possible to learn normal and abnormal patterns of information flow; this is part of our future work (Section 6).
6. CONCLUSION AND CONSEQUENCES
By clustering apps by description topics and identifying outliers by API usage within each cluster, our CHABADA approach effectively identifies applications whose behavior would be unexpected given their description. We have identified several examples of false and misleading advertising; as a side effect, we obtained a novel, effective detector for yet unknown malware. Just as mining software archives has opened new opportunities for empirical software engineering, we see that mining apps and their descriptions opens several new opportunities for automated checking of natural-language requirements.

During our work, we have gained a number of insights into the Android app ecosystem that call for action. First and foremost, application vendors must be much more explicit about what their apps do to earn their income. App store suppliers such as Google should introduce better standards to avoid deceptive or incomplete advertising. Second, the way Android asks its users for permissions is broken. Regular users will not understand what “allow access to the device identifier” means, nor do they have the means to check what is actually being done with their sensitive data, nor would they understand the consequences. Users do understand, though, what regular apps do; and CHABADA is set to point out and highlight differences, which should be far easier to grasp.

Although our present approach came to be by exploring and refining several alternatives, we are well aware that it is by no means perfect or complete. Our future work will focus on the following topics:

Detailed Behavior patterns. Static API usage is a rather broad abstraction for characterizing what an app does and does not do. More advanced methods could focus on the interaction of APIs, notably information flow between APIs.

Dynamic Behavior. Exploring actual executions would give a far more detailed view of what an app actually does—in particular, concrete values for all APIs accessing remote resources. We are working on GUI test generators for Android apps that aim for coverage of specific APIs or dynamic information flow.

Natural Language Processing. The state of the art in natural language processing can retrieve much more than just topics. Looking at dependencies between words (such as conjunctions, subject-verb, verb-object) could retrieve much more detailed patterns. Likewise, leveraging known ontologies would help in identifying synonyms.

A Rosetta Stone for Topics and Behavior. By mining thousands of applications, we can associate natural-language descriptions with specific program behavior. The resulting mapping between natural language and program fragments helps in program understanding as well as in the synthesis of programs and tests.

To allow easy reproduction and verification of our work, we have packaged all data used within this work for download. In particular, we have prepared a 50 MB dataset with the exact data that goes into CHABADA, including app names, descriptions, other metadata, permissions, and API usage. All of this can be found on the CHABADA web site:

http://www.st.cs.uni-saarland.de/chabada/

7. ACKNOWLEDGMENTS
We thank Vitalii Avdiienko, Juan Pablo Galeotti, Clemens Hammacher, Konrad Jamrozik, and Sascha Just for their helpful feedback on earlier revisions of this paper. Special thanks go to Andrea Fischer, Joerg Schad, Stefan Richter, and Stefan Schuh for their valuable assistance during the project.

This work was funded by the European Research Council (ERC) Advanced Grant “SPECMATE – Specification Mining and Testing”.
8. REFERENCES
[1] D. Amalfitano, A. R. Fasolino, P. Tramontana, S. De Carmine, and A. M. Memon. Using GUI ripping for automated testing of Android applications. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 258–261, New York, NY, USA, 2012. ACM.
[2] K. W. Y. Au, Y. F. Zhou, Z. Huang, and D. Lie. PScout: analyzing the Android permission specification. In ACM Conference on Computer and Communications Security (CCS), pages 217–228, New York, NY, USA, 2012. ACM.
[3] A. Bartel, J. Klein, M. Monperrus, and Y. Le Traon. Automatically securing permission-based software by reducing the attack surface: An application to Android. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 274–277, 2012.
[4] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[5] E. Bodden, A. Sewe, J. Sinschek, H. Oueslati, and M. Mezini. Taming reflection: Aiding static analysis in the presence of reflection and custom class loaders. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 241–250, New York, NY, USA, 2011. ACM.
[6] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. In USENIX Conference on Operating Systems Design and Implementation (OSDI), pages 1–6, Berkeley, CA, USA, 2010. USENIX Association.
[7] A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner. Android permissions demystified. In ACM Conference on Computer and Communications Security (CCS), pages 627–638, New York, NY, USA, 2011. ACM.
[8] C. Fritz, S. Arzt, S. Rasthofer, E. Bodden, A. Bartel, J. Klein, Y. le Traon, D. Octeau, and P. McDaniel. Highly precise taint analysis for Android applications. Technical Report TUD-CS-2013-0113, EC SPRIDE, 2013.
[9] M. Harman, Y. Jia, and Y. Zhang. App store mining and analysis: MSR for app stores. In IEEE Working Conference on Mining Software Repositories (MSR), pages 108–111, 2012.
[10] K. A. Heller, K. M. Svore, A. D. Keromytis, and S. J. Stolfo. One class support vector machines for detecting anomalous Windows registry accesses. In ICDM Workshop on Data Mining for Computer Security (DMSEC), 2003.
[11] E. W. Høst and B. M. Østvold. Debugging method names. In European Conference on Object-Oriented Programming (ECOOP), pages 294–317. Springer, 2009.
[12] C. Hu and I. Neamtiu. Automating GUI testing for Android applications. In International Workshop on Automation of Software Test (AST), pages 77–83, New York, NY, USA, 2011. ACM.
[13] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[14] J. Lin, S. Amini, J. I. Hong, N. Sadeh, J. Lindqvist, and J. Zhang. Expectation and purpose: understanding users’ mental models of mobile app privacy through crowdsourcing. In ACM Conference on Ubiquitous Computing (UbiComp), pages 501–510, New York, NY, USA, 2012. ACM.
[15] A. Machiry, R. Tahiliani, and M. Naik. Dynodroid: an input generation system for Android apps. In European Software Engineering Conference held jointly with ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 224–234, New York, NY, USA, 2013. ACM.
[16] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman, editors, Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
[17] L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139–154, 2002.
[18] A. K. McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
[19] R. Pandita, X. Xiao, W. Yang, W. Enck, and T. Xie. WHYPER: Towards automating risk assessment of mobile applications. In USENIX Security Symposium, pages 527–542, 2013.
[20] R. Pandita, X. Xiao, H. Zhong, T. Xie, S. Oney, and A. Paradkar. Inferring method specifications from natural language API descriptions. In ACM/IEEE International Conference on Software Engineering (ICSE), 2012.
[21] P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1):53–65, 1987.
[22] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[23] R. Stevens, J. Ganz, P. Devanbu, H. Chen, and V. Filkov. Asking for (and about) permissions used by Android apps. In IEEE Working Conference on Mining Software Repositories (MSR), pages 31–40, San Francisco, CA, 2013.
[24] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /* iComment: Bugs or bad comments? */. In ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pages 145–158, 2007.
[25] A. Wasylkowski, A. Zeller, and C. Lindig. Detecting object usage anomalies. In European Software Engineering Conference held jointly with ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 35–44, New York, NY, 2007. ACM.
[26] X. Wei, L. Gomez, I. Neamtiu, and M. Faloutsos. ProfileDroid: multi-layer profiling of Android applications. In ACM Annual International Conference on Mobile Computing and Networking (MobiCom), pages 137–148, New York, NY, USA, 2012. ACM.
[27] W. Yang, M. R. Prasad, and T. Xie. A grey-box approach for automated GUI-model generation of mobile applications. In International Conference on Fundamental Approaches to Software Engineering (FASE), pages 250–265, Berlin, Heidelberg, 2013. Springer-Verlag.
[28] Y. Zhou and X. Jiang. Dissecting Android malware: Characterization and evolution. In IEEE Symposium on Security and Privacy (SP), pages 95–109, Washington, DC, USA, 2012. IEEE Computer Society.