¹We had to exclude some techniques and tools from our study because either they were not available or we were not able to install them despite seeking their authors' help.
test inputs for each of the apps considered and measured the coverage achieved by the different tools on each app. Although code coverage is a well-understood and commonly used measure, it is normally a gross approximation of behavior. Ultimately, test input generation tools should generate inputs that are effective at revealing faults in the code under test. For this reason, in our study we also measured how many of the inputs generated by a tool resulted in one or more failures (identified as uncaught exceptions) in the apps considered. We also performed additional manual and automated checks to make sure that the thrown exceptions represented actual failures. Because of the fragmentation of the Android ecosystem, another important characteristic of the tools considered is their ability to work on different hardware and software configurations; we therefore also assessed this aspect by running the tools on different versions of the Android environment. Finally, we evaluated the ease of use of the tools by assessing how difficult it was to install and run them and how much manual work their use involved. Although this is a very practical aspect, and one that normally receives only limited attention in research prototypes, (reasonable) ease of use can enable replication studies and allow other researchers to build on an existing technique and tool.

Our results show that, although the existing techniques and tools we studied are effective, they also have weaknesses and limitations, and there is room for improvement. In our analysis of the results, we discuss such limitations and identify future research directions that, if suitably investigated, could lead to more effective and efficient testing tools for Android. To allow other researchers to replicate our studies and build on our work, we made all of our experimental infrastructure and data publicly available at http://www.cc.gatech.edu/~orso/software/androtest.

The main contributions of this paper are:
• A survey of the main existing test input generation techniques for apps that run on the Android operating system.
• An extensive comparative study of such techniques and tools performed on over 60 real-world Android apps.
• An analysis of the results that discusses strengths and weaknesses of the different techniques considered and highlights possible future directions in the area.
• A set of artifacts, consisting of experimental infrastructure as well as data, that are freely available and allow for replicating our work and building on it.

The remainder of the paper is structured as follows. Section 2 provides background information on the Android environment and apps. Section 3 discusses the test input generation techniques and tools that we consider in our study. Section 4 describes our study setup and presents our results. Section 5 analyzes and discusses our findings. Finally, Section 6 concludes the paper.
2. THE ANDROID PLATFORM
Android applications are mainly written in Java, although some applications with high performance demands delegate critical parts of the implementation to native code written in C or C++. During the build process, Java source code is first compiled into Java bytecode, then translated into Dalvik bytecode, and finally stored into a machine-executable file in .dex format. Apps are distributed in the form of apk files, which are compressed folders containing dex files, native code (when present), and other application resources.

Android applications run on top of a stack of three other main software layers, as represented in Figure 1.

Figure 1: The Android architecture. [From top to bottom: Android apps and pre-installed apps; the Android framework, which exposes the Android API; the Android runtime (Dalvik VM, ART, and the Zygote daemon); native libraries reached through JNI; and the Linux kernel and libraries.]

The Android framework, which lies below the Android apps, provides an API to access facilities without dealing with the low-level details of the operating system. So far, there have been 20 different framework releases and consequent changes in the API. Framework versioning is the first element that causes the fragmentation problem in Android. Since it takes several months for a new framework release to become predominant on Android devices, most of the devices in the field run older versions of the framework. As a consequence, Android developers should make an effort to keep their apps compatible with older versions of the framework, and it is therefore particularly desirable to test apps on different hardware and software configurations before releasing them.

In the Android runtime layer, the Zygote daemon manages the applications' execution by creating a separate Dalvik Virtual Machine for each running app. Dalvik Virtual Machines are register-based VMs that interpret Dalvik bytecode. The most recent version of Android includes radical changes in the runtime layer, as it introduces ART (Android Run-Time), a new runtime environment that dramatically improves app performance and will eventually replace the Dalvik VM.

At the bottom of the Android software stack stands a customized Linux kernel, which is responsible for the main functionality of the system. A set of native code libraries, such as WebKit, libc, and SSL, communicate directly with the kernel and provide a basic hardware abstraction to the runtime layer.

Android applications
Android applications declare their main components in the AndroidManifest.xml file. Components can be of four different types:
• Activities are the components in charge of an app's user interface. Each activity is a window containing various UI elements, such as buttons and text areas. Developers control the behavior of each activity by implementing appropriate callbacks for each life-cycle phase (i.e., created, paused, resumed, and destroyed). Activities react to user input events, such as clicks, and are consequently the primary target of testing tools for Android.
• Services are application components that can perform long-running operations in the background. Unlike activities, they do not provide a user interface, and consequently they are usually not a direct target of Android testing tools, although they might be tested indirectly through some activities.
• Broadcast Receivers and Intents allow inter-process communication. Applications can register broadcast receivers to be notified, by means of intents, about specific system events; apps can thus, for instance, react whenever a new SMS is received, a new connection is available, or a new call is being made. Broadcast receivers can be declared either in the manifest file or at runtime, in the application code (see the sketch after this list). In order to properly explore the behavior of an app, testing tools should be aware of the relevant broadcast receivers, so that they can trigger the right intents.
• Content Providers act as a structured interface to shared data stores, such as the contacts and calendar databases. Applications may have their own content providers and may make them available to other apps. Like all software, the behavior of an app may highly depend on the state of such content providers (e.g., on whether the list of contacts is empty or contains duplicates). As a consequence, testing tools should "mock" content providers in an attempt to make tests deterministic and achieve higher coverage of an app's behavior.
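To make the component model concrete, the following minimal sketch shows an activity that implements the life-cycle callbacks mentioned above and registers a broadcast receiver at run time rather than in the manifest. The class name and the chosen intent filter are illustrative assumptions, not code from any tool or app discussed in this paper; the point is that a testing tool that only parses AndroidManifest.xml would never see this receiver.

```java
import android.app.Activity;
import android.content.BroadcastReceiver;
import android.content.Context;
import android.content.Intent;
import android.content.IntentFilter;
import android.net.ConnectivityManager;
import android.os.Bundle;

// Hypothetical activity illustrating life-cycle callbacks and run-time
// (non-manifest) registration of a broadcast receiver.
public class NetworkWatcherActivity extends Activity {

    // Reacts to connectivity changes; invisible to tools that only read the manifest.
    private final BroadcastReceiver connectivityReceiver = new BroadcastReceiver() {
        @Override
        public void onReceive(Context context, Intent intent) {
            boolean offline = intent.getBooleanExtra(
                    ConnectivityManager.EXTRA_NO_CONNECTIVITY, false);
            // App-specific reaction to the system event would go here.
        }
    };

    @Override
    protected void onCreate(Bundle savedInstanceState) {   // "created" phase
        super.onCreate(savedInstanceState);
    }

    @Override
    protected void onResume() {                             // "resumed" phase
        super.onResume();
        // Register at run time: only intents matching this filter are delivered.
        registerReceiver(connectivityReceiver,
                new IntentFilter(ConnectivityManager.CONNECTIVITY_ACTION));
    }

    @Override
    protected void onPause() {                              // "paused" phase
        super.onPause();
        unregisterReceiver(connectivityReceiver);
    }

    @Override
    protected void onDestroy() {                            // "destroyed" phase
        super.onDestroy();
    }
}
```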
Despite being GUI-based and mainly written in Java, Android apps differ significantly from standalone Java GUI applications and manifest somewhat different kinds of bugs [13, 14]. Existing test input generation tools for Java [11, 20, 21] can therefore not be used to test Android apps without adaptation, and custom tools must be adopted instead. For this reason, a great deal of research has been performed in this area, and several test input generation techniques and tools for Android applications have been proposed. The next section provides an overview of the main existing tools in this arena.
3. EXISTING ANDROID TESTING TOOLS: AN OVERVIEW
As we mentioned in the Introduction, there are several existing test input generation tools for Android. The primary goal of these tools is to detect existing faults in Android apps, and app developers are thus typically the main stakeholders, as they can automatically test their application and fix discovered issues before deploying it. The dynamic traces generated by these tools, however, can also be the starting point of more specific analyses, which can be of primary interest to Android market maintainers and end users. In fact, Android apps heavily use features such as native code, reflection, and code obfuscation, which hit the limitations of almost every static analysis tool [4, 10]. Thus, to explore the behavior of Android apps and overcome such limitations, it is common practice to resort to dynamic analysis and use test input generation tools to explore enough behaviors for the analysis. Google, for instance, is known to run every app on its cloud infrastructure to simulate how it might work on user devices and to look for malicious behavior [16]; only apps that pass this test are listed in the Google Play market. Finally, users can analyze apps focusing on specific aspects, such as observing possible leaks of sensitive information [9] and profiling battery, memory, or networking usage [33].

Test input generation tools can either analyze the app in isolation or focus on the interaction of the app with other apps or with the underlying framework. Whatever the final usage of these tools, the challenge is to generate relevant inputs that exercise as much behavior of the application under test as possible. As Android apps are event-driven, inputs normally come in the form of events, which can either mimic user interactions (UI events), such as clicks, scrolls, and text inputs, or system events, such as the notification of a newly received SMS. Testing tools can generate such inputs following different strategies: they can generate them randomly or follow a systematic exploration strategy. In the latter case, the exploration can either be guided by a model of the app, constructed statically or dynamically, or exploit techniques that aim to achieve as much code coverage as possible. Along a different dimension, testing tools can treat Android apps as either a black box or a white box; in the latter case, they consider the code structure. Grey-box approaches are also possible: these typically extract high-level properties of the app, such as the list of activities and the UI elements contained in each activity, in order to generate events that will likely expose unexplored behavior.

Table 1 provides an overview of the test input generation tools for Android that have been presented in different venues; to the best of our knowledge, this list is complete. The table classifies the tools according to the dimensions discussed above. Moreover, it reports other relevant features of each tool: whether it is publicly available, only described in a paper, or distributed under a company's restrictive policies; whether it requires the source code of the application under test; and whether it requires instrumentation, either of the application or of the underlying Android framework. The following sections provide more details on each of these tools.
Table 1: Overview of existing test input generation tools for Android (X = yes, × = no).

Name                  | Available | Instrumentation (Platform / App) | Events (UI / System) | Exploration strategy | Needs source code | Testing strategy
Monkey [23]           | X         | × / ×                            | X / ×                | Random               | ×                 | Black-box
Dynodroid [17]        | X         | X / ×                            | X / X                | Random               | ×                 | Black-box
DroidFuzzer [35]      | X         | × / ×                            | × / ×                | Random               | ×                 | Black-box
IntentFuzzer [28]     | X         | × / ×                            | × / ×                | Random               | ×                 | White-box
Null IntentFuzzer [24]| X         | × / ×                            | × / ×                | Random               | ×                 | Black-box
GUIRipper [1]         | X (a)     | × / X                            | X / ×                | Model-based          | ×                 | Black-box
ORBIT [34]            | ×         | × / ×                            | X / ×                | Model-based          | X                 | Grey-box
A3E-Depth-first [5]   | X         | × / X                            | X / ×                | Model-based          | ×                 | Black-box
SwiftHand [7]         | X         | × / X                            | X / ×                | Model-based          | ×                 | Black-box
PUMA [12]             | X         | × / X                            | X / ×                | Model-based          | ×                 | Black-box
A3E-Targeted [5]      | ×         | × / X                            | X / ×                | Systematic           | ×                 | Grey-box
EvoDroid [18]         | ×         | × / X                            | X / ×                | Systematic           | ×                 | White-box
ACTEve [3]            | X         | X / X                            | X / X                | Systematic           | X                 | White-box
JPF-Android [31]      | X         | × / ×                            | X / ×                | Systematic           | X                 | White-box

a) Not open source.
3.1 Random exploration strategy
The first class of test input generation tools we consider employs a random strategy to generate inputs for Android apps. In its simplest form, the random strategy generates only UI events: randomly generating system events would be highly inefficient, as there are too many such events, and applications usually react to only a few of them, and only under specific conditions. Many tools in this category aim to test inter-application communication by randomly generating values for intents. Intent fuzzers, despite being test input generators, have quite a different purpose: by randomly generating inputs, they mainly generate invalid ones, thus testing the robustness of an app. These tools are also quite effective at revealing security vulnerabilities, such as denial-of-service vulnerabilities. We now provide a brief description of the tools that fall in this category and their salient characteristics.

Monkey [23] is the most frequently used tool to test Android apps, since it is part of the Android developer toolkit and thus requires no additional installation effort. Monkey implements the most basic random strategy: it treats the app under test as a black box and can only generate UI events. Users have to specify the number of events they want Monkey to generate; once this upper bound has been reached, Monkey stops.

Dynodroid [17] is also based on random exploration, but it has several features that make its exploration more efficient than Monkey's. First of all, it can generate system events, and it does so by checking which ones are relevant for the application; Dynodroid gets this information by monitoring when an application registers a listener within the Android framework, which is why it requires instrumenting the framework. The random event generation strategy of Dynodroid is also smarter than the one Monkey implements: it can either select the events that have been selected least frequently (Frequency strategy) or take the context into account (BiasedRandom strategy), that is, select more often those events that are relevant in more contexts. For our study we used the BiasedRandom strategy, which is the default one. An additional improvement of Dynodroid is the ability to let users manually provide inputs (e.g., for authentication) when the exploration is stalling.
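Neither Monkey's nor Dynodroid's implementation is reproduced in this paper; the following simplified, hypothetical sketch only illustrates the general idea behind a frequency-biased random choice of this kind: relevant events that have been fired less often get a proportionally higher chance of being selected.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Simplified, hypothetical sketch of biased-random event selection:
// relevant events that have been fired less often receive a higher weight.
final class BiasedRandomChooser<E> {
    private final Map<E, Integer> fireCount = new HashMap<>();
    private final Random random = new Random();

    E choose(List<E> relevantEvents) {
        double[] weights = new double[relevantEvents.size()];
        double total = 0;
        for (int i = 0; i < relevantEvents.size(); i++) {
            int count = fireCount.getOrDefault(relevantEvents.get(i), 0);
            weights[i] = 1.0 / (1 + count);   // least-used events weigh more
            total += weights[i];
        }
        double pick = random.nextDouble() * total;   // roulette-wheel selection
        for (int i = 0; i < weights.length; i++) {
            pick -= weights[i];
            if (pick <= 0) {
                E chosen = relevantEvents.get(i);
                fireCount.merge(chosen, 1, Integer::sum);
                return chosen;
            }
        }
        return relevantEvents.get(relevantEvents.size() - 1);   // numerical fallback
    }
}
```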
Null intent fuzzer [24] is an open-source basic intent fuzzer that aims to reveal crashes of activities that do not properly check input intents. While quite effective at revealing this type of problem, it is fairly specialized and not effective at detecting other kinds of issues.

Intent Fuzzer [28] mainly tests how an app can interact with other apps installed on the same device. It includes a static analysis component, built on top of FlowDroid [4], that identifies the expected structure of intents, so that the fuzzer can generate them accordingly. This tool has been shown to be effective at revealing security issues. Maji et al. worked on a similar intent fuzzer [19], but their tool has more limitations than Intent Fuzzer.

DroidFuzzer [35] differs from the other tools, which mainly generate UI events or intents: it solely generates inputs for activities that accept MIME data types such as AVI, MP3, and HTML files. Its authors show how the tool could make some video player apps crash. DroidFuzzer is supposedly implemented as an Android application; however, it is not available, and the authors did not reply to our request for the tool.

In general, the advantage of random test input generators is that they can generate events efficiently, which makes them particularly suitable for stress testing. However, a random strategy can hardly generate highly specific inputs. Moreover, these tools are not aware of how much behavior of the application has already been covered, and thus are likely to generate redundant events that do not help the exploration. Finally, they do not have a stopping criterion that indicates the success of the exploration, but rather resort to a manually specified timeout.
3.2 Model-based exploration strategy
Following the example of several Web crawlers [8, 22, 27] and GUI testing tools for standalone applications [11, 20, 21], some Android testing tools build and use a GUI model of the application to generate events and systematically explore the application's behavior. These models are usually finite state machines with activities as states and events as transitions. Some tools build more precise models by differentiating the state of activity elements when representing states (e.g., the same activity with a button enabled and disabled would be represented as two separate states). Most tools build such models dynamically and terminate when all the events that can be triggered from all the discovered states lead to already explored states.
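As a rough illustration of this family of techniques, and not the code of any specific tool, the sketch below crawls a GUI model depth-first: it abstracts the current screen into a state, fires events whose targets have not yet been visited, and stops when every event from every discovered state leads to an already known state. The primitives (abstractCurrentScreen, enabledEvents, fire, restartApp) are hypothetical placeholders for tool-specific machinery.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical skeleton of a dynamic, model-based (DFS) GUI exploration.
abstract class ModelBasedCrawler {
    abstract String abstractCurrentScreen();            // e.g., activity plus widget states
    abstract List<String> enabledEvents(String state);  // events fireable in this state
    // Assumed to first bring the app back to `state` (replaying or restarting
    // as needed), then inject `event`, and return the abstraction of the result.
    abstract String fire(String state, String event);
    abstract void restartApp();                          // back to the starting state

    void explore() {
        Set<String> visited = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        restartApp();
        String start = abstractCurrentScreen();
        visited.add(start);
        stack.push(start);
        while (!stack.isEmpty()) {
            String state = stack.peek();
            String next = null;
            for (String event : enabledEvents(state)) {
                String target = fire(state, event);
                if (!visited.contains(target)) { next = target; break; }
            }
            if (next == null) {
                stack.pop();          // every event leads to an already known state
            } else {
                visited.add(next);
                stack.push(next);     // descend depth-first into the new state
            }
        }
    }
}
```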
GUIRipper [1], which later became MobiGUITAR [2], dynamically builds a model of the app under test by crawling it from a starting state. When visiting a new state, it keeps a list of events that can be generated on the current state of the activity and systematically triggers them. GUIRipper implements a DFS strategy, and it restarts the exploration from the starting state when it cannot detect new states. It generates only UI events, so it cannot expose behavior of the app that depends on system events. GUIRipper has two characteristics that make it unique among model-based tools. First, it allows for exploring an application from different starting states (this, however, has to be done manually). Moreover, it allows testers to provide a set of input values that can be used during the exploration. GUIRipper is publicly available, but unfortunately it is not open source, and its binary version is compiled for Windows machines.

ORBIT [34] implements the same exploration strategy as GUIRipper, but it statically analyzes the application's source code to understand which UI events are relevant for a specific activity. It is thus expected to be more efficient than GUIRipper, as it should generate only relevant inputs. However, the tool is unfortunately not available, as it is the property of Fujitsu Labs. It is unclear whether ORBIT requires any instrumentation of the platform or of the application to run, but we believe that this is not the case.

A3E-Depth-first [5] is an open-source tool that implements two distinct and complementary strategies; the first one implements a depth-first search on a dynamic model of the application. In essence, it implements the same exploration strategy as the previous tools. Its model representation is more abstract than the one used by other tools, as it represents each activity as a single state, without considering different states of the elements of the activity. This abstraction does not allow the tool to distinguish some states that are actually different, and it may lead to missing behavior that would be easy to exercise if a more accurate model were used.

SwiftHand [7] has as its main goal maximizing the coverage of the application under test. Similarly to the previously mentioned tools, SwiftHand uses a dynamic finite state machine model of the app, and one of its main characteristics is that it optimizes the exploration strategy to minimize app restarts while crawling. SwiftHand generates only touch and scroll UI events and cannot generate system events.

PUMA [12] is a novel tool that includes a generic UI automator providing the random exploration also implemented by Monkey. The novelty of this tool is not in its exploration strategy but rather in its design. PUMA is a framework that can be easily extended to implement any dynamic analysis on Android apps using the basic Monkey exploration strategy. Moreover, it allows for easily implementing different exploration strategies, as the framework provides a finite state machine representation of the app, and it also allows for easily redefining the state representation and the logic used to generate events. PUMA is publicly available and open source. It is, however, only compatible with the most recent releases of the Android framework.

Using a model of the application should intuitively lead to better results in terms of code coverage, since a model limits the number of redundant inputs that a random approach would generate. The main limitation of these tools lies in the state representation they use, as they all create new states only when some event triggers changes in the GUI. Some events, however, may change the internal state of the app without affecting the GUI. In such situations these algorithms would miss the change, consider the event irrelevant, and continue the exploration in a different direction. A common scenario in which this problem occurs is in the presence of services, as services do not have any user interface (see Section 2).
3.3 Systematic exploration strategy
Some application behavior can only be revealed upon providing specific inputs. This is the reason why some Android testing tools use more sophisticated techniques, such as symbolic execution and evolutionary algorithms, to guide the exploration towards previously uncovered code.

A3E-Targeted [5] provides an alternative exploration strategy that complements the one described in Section 3.2. The targeted approach relies on a component that, by means of taint analysis, builds the Static Activity Transition Graph of the app. Such a graph is an alternative to the dynamic finite state machine model of the depth-first exploration and allows the tool to cover activities more efficiently by generating intents. While the tool is available in a public repository, this strategy does not seem to be.

EvoDroid [18] relies on evolutionary algorithms to generate relevant inputs. In the evolutionary-algorithm framework, EvoDroid represents individuals as sequences of test inputs and implements the fitness function so as to maximize coverage. EvoDroid is no longer publicly available; we provide more details about this in Section 4.1.

ACTEve [3] is a concolic-testing tool that symbolically tracks events from the point in the framework where they are generated up to the point where they are handled in the app. For this reason, ACTEve needs to instrument both the framework and the app under test. ACTEve handles both system and UI events.

JPF-Android [32] extends Java PathFinder (JPF), a popular model checking tool for Java, to support Android apps, which makes it possible to verify apps against specific properties. Liu et al. were the first to investigate the possibility of extending JPF to work with Android apps [15]. What they present, however, is mainly a feasibility study; they themselves admit that developing the necessary components would require considerable additional engineering effort. Van der Merwe et al. went beyond that and properly implemented and open-sourced the necessary extensions to use JPF with Android. JPF-Android aims to explore all paths in an Android app and can identify deadlocks and runtime exceptions. The current limitations of the tool, however, seriously limit its practical applicability.

Implementing a systematic strategy has clear benefits when it comes to exploring behavior that would be hard to reach with random techniques. Compared to random techniques, however, these tools are considerably less scalable.
4. EMPIRICAL STUDY
To evaluate the test input generation tools that we considered (see Section 3), we deployed them along with a group of Android applications on a common virtualized infrastructure. This infrastructure aims to ease the comparison of testing tools for Android, and we make it available so that researchers and practitioners can easily evaluate new Android testing tools against existing ones in the future. Our evaluation does not include all the tools listed in Table 1; we explain in Section 4.1 why we had to exclude some of them. The following sections provide more details on the virtualized infrastructure (Section 4.3) and on the set of mobile apps that we considered as benchmarks for our study (Section 4.2).

Our study evaluated these tools according to four main criteria:

C1: Effectiveness of the exploration strategy. The inputs that these tools generate should ideally cover as much behavior of the app under test as possible. Since code coverage is a common proxy for behavior coverage, we evaluate the statement coverage that each tool achieves on each benchmark, and we compare the different tools in terms of the code coverage they achieve.
C2: Fault detection ability. The primary goal of test input generators is to expose existing faults. We therefore evaluate, for each tool, how many failures it triggers for each app, and we then compare the effectiveness of the different tools in terms of the failures they trigger.
C3: Ease of use. Usability should be a primary concern for all tool developers, even when tools are just research prototypes. We evaluate the usability of each tool by considering how much effort it took us to install and use it.
C4: Android framework compatibility. One of the major problems in the Android ecosystem is fragmentation. Test input generation tools for Android should therefore ideally run on multiple versions of the Android framework, so that developers can assess how their app behaves in different environments.

Each of these criteria is addressed separately in Section 4.4 (C1), Section 4.5 (C2), Section 4.6 (C3), and Section 4.7 (C4).

4.1 Selected Tools
Our evaluation could not consider all the tools listed in Table 1. First, we decided to exclude intent fuzzers, as these tools do not aim to test the whole behavior of an app, but rather to expose possible security vulnerabilities; it would therefore have been hard to compare their results with those of the other test input generators. We initially considered evaluating DroidFuzzer, IntentFuzzer, and Null Intent Fuzzer separately. However, Null Intent Fuzzer requires manually selecting each activity that the fuzzer should target, which makes it infeasible to evaluate in large-scale experiments, and we had to exclude DroidFuzzer because the tool is not publicly available and we were not able to reach the authors by email.

Among the remaining tools, we also had to exclude EvoDroid and ORBIT. EvoDroid used to be publicly available on its project website, and we tried to install and run it. We also contacted the authors after we ran into problems with missing dependencies, but despite their willingness to help, we never managed to obtain all the files we needed and to get the tool to work. Obtaining the source code and fixing the issues ourselves was unfortunately not an option, due to the contractual agreements with their funding agencies. Moreover, at the time of writing the tool is no longer available, even as a closed-source package. The problem with ORBIT is that it is proprietary software of Fujitsu Labs, and the authors could therefore not share the tool with us.

We also excluded JPF-Android, even though the tool is publicly available: it expects users to manually specify the input sequence that JPF should consider for verification, and this would have been too time consuming to do for all the benchmarks we considered. Finally, we could not evaluate A3E-Targeted, since this strategy was not available in the public distribution of the tool at the time of our experiments.
4.2 Mobile App Benchmarks
To evaluate the selected tools, we needed a common set of benchmarks. Obtaining large sets of Android apps is not an issue, as binaries can be directly downloaded from app markets, and there are many platforms, such as F-Droid², that distribute open-source Android applications. However, many of the tools under study are not maintained and could easily crash on apps that use recent features. For our experiments we therefore combined all the open-source mobile application benchmarks that were considered in the evaluation of at least one tool, and we retrieved the same version of each benchmark reported in the corresponding paper.

² http://f-droid.org

PUMA and A3E were originally evaluated on a set of apps downloaded from the Google Play market. We excluded these apps from our dataset because some tools need the application source code, and it would therefore have been impossible to run them on these benchmarks.

We collected 68 applications in total: 52 of them come from the Dynodroid paper [17], 3 from GUIRipper [1], 5 from ACTEve [3], and 10 from SwiftHand [7]. Table 2 reports the whole list of apps that we collected, together with the corresponding version and category. For each app, the table reports whether it was part of the evaluation benchmarks of a specific tool and whether the app crashed while being exercised by a tool during our evaluation.

Table 2: List of subject apps. X indicates that the app was used originally in the tool's evaluation and ⊗ indicates that the app crashed while being exercised by the tool in our experiments (tools considered: Monkey, ACTEve, DynoDroid, A3E, GUIRipper, PUMA, SwiftHand).

Name               Ver.     Category    Tool marks
Amazed             2.0.2    Casual      ⊗ X ⊗
AnyCut             0.5      Productiv.  X
Divide&Conquer     1.4      Casual      X ⊗
LolcatBuilder      2        Entertain.  X
MunchLife          1.4.2    Entertain.  X
PasswordMakerPro   1.1.7    Tools       X
Photostream        1.1      Media       X
QuickSettings      1.9.9.3  Tools       X
RandomMusicPlay    1        Music       X X
SpriteText         -        Sample      X
SyncMyPix          0.15     Media       X
Triangle           -        Sample      X
A2DP Volume        2.8.11   Transport   X
aLogCat            2.6.1    Tools       X
AardDictionary     1.4.1    Reference   X
BaterryDog         0.1.1    Tools       X
FTP Server         2.2      Tools       X
Bites              1.3      Lifestyle   X
Battery Circle     1.81     Tools       X
Addi               1.98     Tools       X
Manpages           1.7      Tools       X
Alarm Clock        1.51     Productiv.  X
Auto Answer        1.5      Tools       X
HNDroid            0.2.1    News        X
Multi SMS          2.3      Comm.       X
World Clock        0.6      Tools       X ⊗
Nectroid           1.2.4    Media       X
aCal               1.6      Productiv.  X
Jamendo            1.0.6    Music       X
AndroidomaticK.    1.0      Comm.       ⊗ X
Yahtzee            1        Casual      X
aagtl              1.0.31   Tools       X
Mirrored           0.2.3    News        X
Dialer2            2.9      Productiv.  X
FileExplorer       1        Productiv.  X
Gestures           1        Sample      X
HotDeath           1.0.7    Card        X
ADSdroid           1.2      Reference   ⊗ X
myLock             42       Tools       X
LockPatternGen.    2        Tools       X
aGrep              0.2.1    Tools       ⊗ ⊗ X ⊗ ⊗ ⊗
K-9Mail            3.512    Comm.       X
NetCounter         0.14.1   Tools       X ⊗
Bomber             1        Casual      X
FrozenBubble       1.12     Puzzle      ⊗ X ⊗ ⊗ ⊗
AnyMemo            8.3.1    Education   ⊗ X ⊗ X⊗
Blokish            2        Puzzle      X
ZooBorns           1.4.4    Entertain.  X ⊗
ImportContacts     1.1      Tools       X
Wikipedia          1.2.1    Reference   ⊗ X
KeePassDroid       1.9.8    Tools       X
SoundBoard         1        Sample      X
CountdownTimer     1.1.0    Tools       X
Ringdroid          2.6      Media       ⊗ X⊗ ⊗ ⊗
SpriteMethodTest   1.0      Sample      X
BookCatalogue      1.6      Tools       X
Translate          3.8      Productiv.  X
TomdroidNotes      2.0a     Social      X
Wordpress          0.5.0    Productiv.  X ⊗
Mileage            3.1.1    Finance     X
Sanity             2.11     Comm.       X⊗
DalvikExplorer     3.4      Tools       ⊗ X
MiniNoteViewer     0.4      Productiv.  X
MyExpenses         1.6.0    Finance     X⊗
LearnMusicNotes    1.2      Puzzle      X
TippyTipper        1.1.3    Finance     X
WeightChart        1.0.4    Health      X⊗
WhoHasMyStuff      1.0.7    Productiv.  X
4.3 Experimental Setup
We ran our experiments on Ubuntu 14.04 virtual machines running on a Linux server. We used Oracle VirtualBox³ as our virtualization software and Vagrant⁴ to manage these virtual machines. Each virtual machine was configured with 2 cores and 6GB of RAM. Inside the virtual machine we installed the test input generation tools, the Android apps, and three versions of the Android SDK: version 10 (Gingerbread), version 16 (Ice Cream Sandwich), and version 19 (KitKat). We chose these versions based on their popularity, to satisfy tool dependencies, and because they were the most recent at the time of the experiments. The emulator was configured to use 4GB of RAM, and each tool was allowed to run for 1 hour on each benchmark application. For every run, our infrastructure creates a new emulator with the necessary tool configuration and later destroys it, to avoid side effects between tools and applications. Given that many testing tools and applications are non-deterministic, we repeated each experiment 10 times, and we report the mean values over all runs.

³ Oracle VM VirtualBox – http://virtualbox.org
⁴ Vagrant – http://vagrantup.com

Coverage and System Log Collection.
For each run, we collected the code coverage of the app under test. We selected Emma⁵ as our code coverage tool because it is officially supported and available with the Android platform. However, since Emma does not allow exporting raw statement coverage results, we parsed the HTML coverage reports to extract line coverage information for the comparison between tools. In particular, we used this information to compute, for each tool pair, 1) the number of statements covered by both tools, and 2) the number of statements covered by each of them separately.

⁵ Emma: a free Java code coverage tool – http://emma.sourceforge.net/
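This pairwise comparison amounts to simple set operations over the covered source lines. The minimal sketch below is a hypothetical helper, not the actual analysis script used in the study; it assumes the covered lines of each tool have already been extracted into sets of identifiers.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: given the sets of source lines covered by two tools,
// compute 1) lines covered by both and 2) lines covered by each tool alone.
final class PairwiseCoverage {
    static void compare(Set<String> coveredByA, Set<String> coveredByB) {
        Set<String> both = new HashSet<>(coveredByA);
        both.retainAll(coveredByB);                     // covered by both tools

        Set<String> onlyA = new HashSet<>(coveredByA);
        onlyA.removeAll(coveredByB);                    // covered only by tool A

        Set<String> onlyB = new HashSet<>(coveredByB);
        onlyB.removeAll(coveredByA);                    // covered only by tool B

        System.out.printf("both=%d onlyA=%d onlyB=%d%n",
                both.size(), onlyA.size(), onlyB.size());
    }
}
```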
Figure 2: Variance of the coverage across the applications over 10 runs. [Plot of statement coverage (%) per tool: monkey, acteve, dynodroid, a3e, guiripper, puma.]

Figure 3: Progressive coverage. [Mean statement coverage (%) over time in minutes (0 to 60) for monkey, acteve, dynodroid, a3e, guiripper, and puma.]
To each benchmark, we added a broadcast receiver to save intermediate coverage results to disk. Dynodroid used a similar strategy, and this was necessary both to collect coverage from the applications before they were restarted by the test input generation tools and to track the progress of the tools at regular intervals.

SwiftHand is an exception to this protocol. This tool internally instruments the app under test to collect branch coverage and to keep track of the app's lifecycle. The instrumentation is critical to the tool's functioning but conflicts with Android's Emma instrumentation, a conflict we could not resolve within two weeks. Thus, we decided not to collect and compare the statement coverage information of this tool with the others. The branch coverage information collected by SwiftHand on the benchmark applications is, however, available with our dataset.

To gather the different failures in the app, we also collected the entire system log (also called logcat) from the emulator running the app under test. From these logs, we extracted, in a semi-automated fashion, the failures that occurred while the app was being tested. Specifically, we wrote a script to find patterns of exceptions or errors in the log file and extract them along with their available stack traces. We then manually analyzed them to discard any failures not related to the app's execution (e.g., failures of the tools themselves or initialization errors of other apps in the Android emulator). All unique instances of the remaining failures were considered for our results.
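The pattern-matching step can be approximated with a simple scan of the captured logcat output. The sketch below is hypothetical and is not the script used in the study; it merely illustrates collecting each unique exception or error header once, after which manual triage is still needed, as described above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: scan a saved logcat dump for exception/error headers
// and keep each unique type/message pair once.
final class LogcatExceptionScanner {
    private static final Pattern EXCEPTION_LINE =
            Pattern.compile("([A-Za-z_$][\\w.$]*(?:Exception|Error))(?::\\s*(.*))?");

    static Set<String> uniqueFailures(String logcatFile) throws IOException {
        Set<String> failures = new LinkedHashSet<>();
        for (String line : Files.readAllLines(Paths.get(logcatFile))) {
            Matcher m = EXCEPTION_LINE.matcher(line);
            if (m.find()) {
                failures.add(m.group(1)
                        + (m.group(2) != null ? ": " + m.group(2) : ""));
            }
        }
        return failures;   // still requires manual triage of unrelated failures
    }
}
```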
4.4 C1: Exploration Strategy Effectiveness
Test input generation tools for Android implement different strategies to explore as much behavior of the application under test as possible. Section 3 presented an overview of the three main strategies, that is, random, model-based, and systematic. Although some of these tools have already been evaluated according to how much code coverage they can achieve, it is still unclear whether any strategy is better than the others in practice. Previous evaluations were either incomplete, because they did not include comparisons with other tools, or they were, in our opinion, biased. Since we believe that the most critical resource is time, tools should be evaluated on how much coverage they can achieve within a certain time limit. Tools such as Dynodroid and EvoDroid, instead, have been compared to Monkey by comparing the coverage achieved given the same number of generated events. We thus set up the experiment by running each tool 10 times on each application of Table 2, and we collected the achieved coverage as described in Section 4.3.

Figure 2 reports the variance of the mean coverage over 10 runs that each tool achieved on the considered benchmarks. We can see that on average Dynodroid and Monkey perform better than the other tools, followed by ACTEve. The other three tools (i.e., A3E, GUIRipper, and PUMA) achieve quite low coverage.

Despite this, even the tools that on average achieve low coverage can reach very high coverage (approximately 80%) for a few apps. We manually investigated these apps and saw that they are, as expected, the simplest ones. The two outliers for which every tool achieved very high coverage are DivideAndConquer and RandomMusicPlayer. The first one is a game that accepts only touches and swipes as events, and they can be provided without much logic in order to proceed with the game. RandomMusicPlayer is a music player that randomly plays music; the possible user actions are quite limited, as there is only one activity with 4 buttons.

Similarly, there are some applications for which every tool, even the ones that performed best, achieved very low coverage (i.e., lower than 5%). Two of these apps, K9mail (a mail client) and PasswordMakerPro (an app to generate and store authentication data), highly depend on external factors, such as the availability of a valid account. Such inputs are nearly impossible to generate automatically, and therefore every tool stalls at the beginning of the exploration. A few tools provide an option to manually interact with the application first and then use the tool to perform subsequent test input generation. However, we did not use this feature, both for scalability reasons and over concerns of giving such tools an unfair advantage from the manual intelligence.

Figure 3 reports the progressive coverage of each tool over the maximum time bound we gave, i.e., 60 minutes. The plot reports the mean coverage across all apps over the 10 runs. This plot evidences an interesting finding: all tools hit their maximum coverage within a few minutes (between 5 and 10 minutes). The only exception is GUIRipper. The likely reason is that GUIRipper frequently restarts the exploration from the starting state, and this operation takes time. This is the main problem that SwiftHand addresses by implementing an exploration strategy that limits the number of restarts.

4.5 C2: Fault Detection Ability
The final goal of test input generation tools is to expose faults in the app under test. Therefore, besides code coverage, we checked how many failures each tool can reveal within a time budget of one hour per app. None of the Android tools can identify failures other than runtime exceptions, although there is some promising work that goes in that direction [36].

Figure 4 reports the results of this study. Numbers on the y axis represent the cumulative unique failures across the 10 runs across […]
Figure 4: [Cumulative unique failures triggered by each tool; legend: java.io, java.net, java.lang, library, custom, android.database, android.content; y axis: Failures.]

Table 3: Ease of use and compatibility of each tool with the most common Android framework versions.

Name                 Ease Use        Compatibility
Monkey [23]          NO_EFFORT       any
Dynodroid [17]       NO_EFFORT       v.2.3
GUIRipper [1]        MAJOR_EFFORT    any
A3E-Depth-first [5]  LITTLE_EFFORT   any
Figure 5: Pairwise comparison of tools in terms of coverage and failures triggered. [The plots in the top-right section show percent statement coverage of the tools, and the ones in the bottom-left section show the absolute number of failures triggered. The gray bars in all plots show commonalities between the pairs of tools.]
4.7 C4: Android Framework Compatibility
As discussed above (C4), test input generation tools should run on multiple versions of the Android framework in order to let developers assess the quality of their app in different environments. Therefore, we ran each tool on three popular Android framework releases, as described in Section 4.3, and assessed whether it could work correctly.

Table 3 reports the results of this study. The table shows that 4 out of 7 tools do not offer this feature. Some tools (PUMA and SwiftHand) are compatible only with the most recent releases of the Android framework, while others (ACTEve and Dynodroid) are bound to a very old one. ACTEve and Dynodroid could be made compatible with other framework versions, but this would require instrumenting them first. SwiftHand and PUMA, instead, are not compatible with older releases of the Android framework because they use features of the underlying framework that are not available in previous releases.
5. DISCUSSION AND FUTURE RESEARCH DIRECTIONS
The experiments presented in Section 4 report unexpected results: the random exploration strategies implemented by Monkey and Dynodroid obtain higher coverage than the more sophisticated strategies implemented by the other tools. It thus seems that Android applications differ from standalone Java applications, for which random strategies have been shown to be highly inefficient compared to systematic strategies [11, 20, 21]. Our results show that 1) most of the behavior can be exercised by generating only UI events, and 2) to expose this behavior the random approach is effective enough.

Considering the four criteria of the study, Monkey would clearly be the winner among the existing test input generation tools, since it achieves, on average, the best coverage, it reports the largest number of failures, it is easy to use, and it works on every platform version. This does not mean that the other tools should not be considered. Quite the opposite: every other tool has strong points that, if properly implemented and combined, can lead to significant improvements. We now list some of the features that some tools already implement and that other tools should consider:

System events. Dynodroid and ACTEve can generate system events besides standard UI events. Even if the behavior of an app may depend only partially on system events, generating them can reveal failures that would be hard to uncover otherwise.

Minimize restarts. Progressive coverage shows that tools that frequently restart their exploration from the starting point need more time to achieve their maximum coverage. The search algorithm that SwiftHand implements aims to minimize such restarts and thus allows tools to achieve high coverage in less time.

Manually provided inputs. Specific behaviors can sometimes only be explored by providing specific inputs, which may be hard to generate randomly or by means of systematic techniques. Some tools, such as Dynodroid and GUIRipper, let users manually provide values that the tool can later use during the analysis. This feature is highly desirable, since it allows tools to explore the app in the presence of login forms and similar complex inputs.

Multiple starting states. The behavior of many applications depends on the underlying content providers. An email client, for instance, would show an empty inbox unless the content provider contains some messages. GUIRipper starts exploring the application from different starting states (e.g., when the content provider is empty and when it has some entries). Even if this has to be done manually by the user, by properly creating snapshots of the app, it makes it possible to explore behavior that would otherwise be hard to reach.

Avoid side effects among different runs. Using a tool on multiple apps requires resetting the environment to avoid side effects across runs. Although in our experiments we used a fresh emulator instance for each run, we realized that some tools, such as Dynodroid and A3E, had capabilities to (partially) clean up the environment by uninstalling the application and deleting its data. We believe that every tool should reuse the environment for efficiency reasons, but should do so without affecting its accuracy.

During our study, we also identified limitations that significantly affect the effectiveness of all the tools. We report these limitations, together with desirable and missing features, so that they can be the focus of future research in this area:

Reproducible test cases. None of the tools allows failures to be easily reproduced. They report uncaught runtime exceptions in the logfile, but they do not generate test cases that can later be rerun. We believe that this is an essential feature that every tool of this type should have.

Debugging support. The lack of reproducible test cases makes it hard to identify the root cause of failures. The stack trace of the runtime exception is the only information a developer can use, and this information is lost in the execution logs. Testing tools for Android should make failures more visible and should provide more information to ease debugging. In our evaluation, we had a hard time understanding whether failures were caused by real faults or rather by limitations of the emulator; more information about each failure would have helped in this task.

Mocking. Most apps for which tools achieved low coverage highly depend on environment elements such as content providers. It is impossible to explore most of K9mail's functionality unless a valid email account is already set up and there are existing emails in the account. GUIRipper alleviates this problem by letting users prepare different snapshots of the app. We believe that working on a proper mocking infrastructure for Android apps would be a significant contribution, as it would lead to drastic code coverage improvements (a minimal sketch of this idea follows this list).

Sandboxing. Testing tools should also provide proper sandboxing, that is, they should block operations that may have disruptive effects (for instance, sending emails using a valid account, or allowing critical changes using a real social networking account). None of the tools takes this problem into account.

Focus on the fragmentation problem. While C4 of our evaluation showed that some tools can run on multiple versions of the Android framework, none of them is specifically designed for cross-device testing. While this is a different testing problem, we believe that testing tools for Android should also move in this direction, as fragmentation is the major problem that Android developers have to face.
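A mocking layer of the kind advocated in the list above can already be prototyped with Android's own mock testing classes. The sketch below is a simplified illustration under that assumption, not a proposal from the tools studied here: it serves a fixed, fake contact list through a MockContentResolver so that an exploration never depends on the contacts actually stored on the device.

```java
import android.database.Cursor;
import android.database.MatrixCursor;
import android.net.Uri;
import android.test.mock.MockContentProvider;
import android.test.mock.MockContentResolver;

// Simplified illustration: a fake contacts provider with deterministic content.
class FakeContactsProvider extends MockContentProvider {
    @Override
    public Cursor query(Uri uri, String[] projection, String selection,
                        String[] selectionArgs, String sortOrder) {
        MatrixCursor cursor = new MatrixCursor(new String[] {"_id", "display_name"});
        cursor.addRow(new Object[] {1, "Alice Example"});   // deterministic test data
        cursor.addRow(new Object[] {2, "Bob Example"});
        return cursor;
    }
}

class MockedEnvironment {
    // Resolver that routes queries for the contacts authority to the fake provider.
    static MockContentResolver contactsResolver() {
        MockContentResolver resolver = new MockContentResolver();
        resolver.addProvider("com.android.contacts", new FakeContactsProvider());
        return resolver;
    }
}
```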
6. CONCLUSION
In this paper, we presented a comparative study of the main existing test input generation techniques and corresponding tools for Android. We evaluated these tools according to four criteria: code coverage, fault detection capabilities, ease of use, and compatibility with multiple Android framework versions. After presenting the results of this comparison, we discussed strengths and weaknesses of the different techniques and highlighted potential avenues for future research in this area. All of our experimental infrastructure and data are publicly available at http://www.cc.gatech.edu/~orso/software/androtest, so that other researchers can replicate our studies and build on our work.

Acknowledgments
We would like to thank the authors of the tools (specifically, Saswat Anand, Domenico Amalfitano, Aravind Machiry, Tanzirul Azim, Wontae Choi, and Shuai Hao) for making their tools available and for answering our clarification questions regarding the tool setup.
7. REFERENCES
[1] D. Amalfitano, A. R. Fasolino, P. Tramontana, S. De Carmine, and A. M. Memon. Using GUI Ripping for Automated Testing of Android Applications. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ASE 2012, pages 258–261, New York, NY, USA, 2012. ACM.
[2] D. Amalfitano, A. R. Fasolino, P. Tramontana, B. D. Ta, and A. M. Memon. MobiGUITAR – a tool for automated model-based testing of mobile apps. IEEE Software, PP(99):NN–NN, 2014.
[3] S. Anand, M. Naik, M. J. Harrold, and H. Yang. Automated Concolic Testing of Smartphone Apps. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 59:1–59:11, New York, NY, USA, 2012. ACM.
[4] S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. Le Traon, D. Octeau, and P. McDaniel. FlowDroid: Precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 259–269, New York, NY, USA, 2014. ACM.
[5] T. Azim and I. Neamtiu. Targeted and Depth-first Exploration for Systematic Testing of Android Apps. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA '13, pages 641–660, New York, NY, USA, 2013. ACM.
[6] A. Bartel, J. Klein, Y. Le Traon, and M. Monperrus. Dexpler: Converting Android Dalvik Bytecode to Jimple for Static Analysis with Soot. In Proceedings of the ACM SIGPLAN International Workshop on State of the Art in Java Program Analysis, SOAP '12, pages 27–38, New York, NY, USA, 2012. ACM.
[7] W. Choi, G. Necula, and K. Sen. Guided GUI Testing of Android Apps with Minimal Restart and Approximate Learning. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA '13, pages 623–640, New York, NY, USA, 2013. ACM.
[8] V. Dallmeier, M. Burger, T. Orth, and A. Zeller. WebMate: Generating Test Cases for Web 2.0. In Software Quality. Increasing Value in Software and Systems Development, pages 55–69. Springer, 2013.
[9] W. Enck, P. Gilbert, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1–6, Berkeley, CA, USA, 2010. USENIX Association.
[10] A. Gorla, I. Tavecchia, F. Gross, and A. Zeller. Checking app behavior against app descriptions. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 1025–1035, New York, NY, USA, June 2014. ACM.
[11] F. Gross, G. Fraser, and A. Zeller. EXSYST: Search-based GUI Testing. In Proceedings of the 34th International Conference on Software Engineering, ICSE '12, pages 1423–1426, Piscataway, NJ, USA, 2012. IEEE Press.
[12] S. Hao, B. Liu, S. Nath, W. G. Halfond, and R. Govindan. PUMA: Programmable UI-automation for large-scale dynamic analysis of mobile apps. In Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys '14, pages 204–217, New York, NY, USA, 2014. ACM.
[13] C. Hu and I. Neamtiu. Automating GUI Testing for Android Applications. In Proceedings of the 6th International Workshop on Automation of Software Test, AST '11, pages 77–83, New York, NY, USA, 2011. ACM.
[14] M. Kechagia, D. Mitropoulos, and D. Spinellis. Charting the API minefield using software telemetry data. Empirical Software Engineering, pages 1–46, 2014.
[15] Y. Liu, C. Xu, and S. Cheung. Verifying Android applications using Java PathFinder. Technical report, The Hong Kong University of Science and Technology, 2012.
[16] H. Lockheimer. Google Bouncer. http://googlemobile.blogspot.com.es/2012/02/android-and-security.html.
[17] A. Machiry, R. Tahiliani, and M. Naik. Dynodroid: An Input Generation System for Android Apps. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2013, pages 224–234, New York, NY, USA, 2013. ACM.
[18] R. Mahmood, N. Mirzaei, and S. Malek. EvoDroid: Segmented Evolutionary Testing of Android Apps. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, New York, NY, USA, 2014. ACM.
[19] A. K. Maji, F. A. Arshad, S. Bagchi, and J. S. Rellermeyer. An empirical study of the robustness of inter-component communication in Android. In Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), DSN '12, pages 1–12, 2012.
[20] L. Mariani, M. Pezze, O. Riganelli, and M. Santoro. AutoBlackTest: Automatic Black-Box Testing of Interactive Applications. In Proceedings of the 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, ICST '12, pages 81–90, Washington, DC, USA, 2012. IEEE Computer Society.
[21] A. Memon, I. Banerjee, and A. Nagarajan. GUI Ripping: Reverse Engineering of Graphical User Interfaces for Testing. In Proceedings of the 10th Working Conference on Reverse Engineering, WCRE '03, pages 260–, Washington, DC, USA, 2003. IEEE Computer Society.
[22] A. Mesbah, A. van Deursen, and S. Lenselink. Crawling Ajax-based Web Applications through Dynamic Analysis of User Interface State Changes. ACM Transactions on the Web (TWEB), 6(1):3:1–3:30, 2012.
[23] The Monkey UI Android testing tool. http://developer.android.com/tools/help/monkey.html.
[24] Intent fuzzer, 2009. http://www.isecpartners.com/tools/mobile-security/intent-fuzzer.aspx.
[25] D. Octeau, S. Jha, and P. McDaniel. Retargeting Android Applications to Java Bytecode. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE '12, pages 6:1–6:11, New York, NY, USA, 2012. ACM.
[26] Rebecca Murtagh. Number of apps available in leading app stores as of July 2014. http://searchenginewatch.com/article/2353616/Mobile-Now-Exceeds-PC-The-Biggest-Shift-Since-the-Internet-Began, July 2014.
[27] S. Roy Choudhary, M. R. Prasad, and A. Orso. X-PERT: Accurate Identification of Cross-browser Issues in Web Applications. In Proceedings of the 2013 International Conference on Software Engineering, ICSE '13, pages 702–711, Piscataway, NJ, USA, 2013. IEEE Press.
[28] R. Sasnauskas and J. Regehr. Intent Fuzzer: Crafting Intents of Death. In Proceedings of the 2014 Joint International Workshop on Dynamic Analysis (WODA) and Software and System Performance Testing, Debugging, and Analytics (PERTEA), WODA+PERTEA 2014, pages 1–5, New York, NY, USA, 2014. ACM.
[29] Smali/baksmali, an assembler/disassembler for the dex format used by Dalvik. https://code.google.com/p/smali.
[30] Statista. Number of apps available in leading app stores as of July 2014. http://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/, August 2014.
[31] H. van der Merwe, B. van der Merwe, and W. Visser. Verifying Android Applications Using Java PathFinder. SIGSOFT Softw. Eng. Notes, 37(6):1–5, Nov. 2012.
[32] H. van der Merwe, B. van der Merwe, and W. Visser. Execution and Property Specifications for JPF-Android. SIGSOFT Softw. Eng. Notes, 39(1):1–5, Feb. 2014.
[33] X. Wei, L. Gomez, I. Neamtiu, and M. Faloutsos. ProfileDroid: Multi-layer Profiling of Android Applications. In Proceedings of the 18th Annual International Conference on Mobile Computing and Networking, Mobicom '12, pages 137–148, New York, NY, USA, 2012. ACM.
[34] W. Yang, M. R. Prasad, and T. Xie. A Grey-box Approach for Automated GUI-model Generation of Mobile Applications. In Proceedings of the 16th International Conference on Fundamental Approaches to Software Engineering, FASE '13, pages 250–265, Berlin, Heidelberg, 2013. Springer-Verlag.
[35] H. Ye, S. Cheng, L. Zhang, and F. Jiang. DroidFuzzer: Fuzzing the Android Apps with Intent-Filter Tag. In Proceedings of the International Conference on Advances in Mobile Computing & Multimedia, MoMM '13, pages 68:68–68:74, New York, NY, USA, 2013. ACM.
[36] R. N. Zaeem, M. R. Prasad, and S. Khurshid. Automated generation of oracles for testing user-interaction features of mobile apps. In Proceedings of the 2014 IEEE International Conference on Software Testing, Verification, and Validation, ICST '14, pages 183–192, Washington, DC, USA, 2014. IEEE Computer Society.