Automated Testing of Android Apps: A Systematic Literature Review
of venues8 and leveraged the DBLP9 repository to collect the digital object identifiers of the publications in order to crawl abstracts and all publication metadata. Since this search process considers major journal and conference venues, the resulting set of literature papers should be a representative collection of the state-of-the-art.

C. Exclusion Criteria

After execution of our search based on the provided keywords, a preliminary manual scanning showed that the results are rather coarse-grained, since they included a number of irrelevant or less relevant publications that, nonetheless, matched10 the keywords. It is, thus, necessary to perform a fine-grained inclusion/exclusion in order to focus on a consistent and reliable set of primary publications and reduce the eventual effort in further in-depth examination. For this SLR, we have applied the following exclusion criteria.

1) Papers that are not written in English are filtered out, since English is the common language spoken in the worldwide scientific peer-reviewing community.

2) Short papers are excluded, mainly because such papers are often work-in-progress or idea papers: on the one hand, short papers are generally not mature, and, on the other hand, many of them will eventually appear later in a full paper format. In the latter case, mature works are likely to already be included in our final set. In this paper, we take a given publication as a short paper when it has fewer than four pages (included) in IEEE/ACM-like double-column format11 or fewer than eight pages (included) in LNCS-like single-column format, as short papers are likely to be four pages in double-column format and eight pages in single-column format.

3) Papers that are irrelevant to testing Android apps are excluded. Our search keywords indeed included broad terms such as mobile and smartphone, as we aimed at finding all papers related to Android even when the term "Android" was not specifically included in the title and abstract. By doing so, we have excluded papers that only deal with mobile apps for other platforms such as iOS and Windows.

4) Duplicated papers are removed. It is quite common for authors to publish an extended version of their conference paper to a journal venue. However, these papers share most of the ideas and approach steps. To consider both of them would result in a biased weighting of the metrics in the review. To mitigate this, we identify duplicate papers by first comparing paper titles, abstracts, and authors, and by then manually checking whether a given pair of records share a major part of their contents. We filter out the least recent publication when duplication is confirmed.

5) Papers that conduct comparative evaluations, including surveys on different approaches of testing Android apps, are excluded. Such papers indeed do not introduce new technical contributions for testing Android apps.

6) Papers in which the testing approach targets the operating system, networks, or hardware, rather than mobile apps, are excluded.

7) Papers that assess12 existing testing methods are also filtered out. The publications that they discuss are supposed to be already included in our search results.

8) Papers demonstrating how to set up environments and platforms to retrieve runtime data from Android apps are excluded. These papers are also relevant to Android app testing, but they do not focus on new testing methodology.

9) Finally, some of our keywords (e.g., "detection" of issues, "testing" of apps) have led to the retrieval of irrelevant literature works that must be excluded. We have mainly identified two types of such papers: the first includes papers that perform detection of malicious apps using machine learning (and not testing); the second includes papers that describe the building of complex platforms, adopting existing mature testing methodologies.

We refer to all collected papers that remain after the application of the exclusion criteria as primary publications. These publications are the basis for extracting review data.

D. Review Protocol

Concretely, the review is conducted in two phases: First, we perform an abstract review and a quick full-paper scan to filter out irrelevant papers based on the exclusion criteria defined above. At the end of this phase, the set of primary publications is known. Subsequently, we perform a full review of each primary publication and extract relevant information that is necessary for answering all of our research questions.

In practice, we have split our primary publications among all the coauthors to conduct the data extraction step. We have further crosschecked all the extracted results: when some results are in disagreement, informal discussions are conducted until a consensus is reached.

III. PRIMARY PUBLICATIONS SELECTION

Table II summarizes statistics of collected papers during the search phase. Overall, our repository search and major venue search have yielded in total 9259 papers.

Following the exclusion criteria in Section II, the number of papers satisfying the matching requirements immediately drops from 9259 to 472. We then manually go through the title and abstract of each paper to further dismiss those that match the exclusion criteria. After this step, the set of papers is reduced to 255 publications. Subsequently, we go through the full content of papers

8 http://www.ccf.org.cn/sites/ccf/paiming.jsp; we only take into account the software engineering and security categories since, from what we have observed, they host the majority of papers related to testing Android apps.
9 http://dblp.uni-trier.de
10 The keywords were found, for example, to be mentioned in the related work sections of the identified papers.
11 Note that we have actually kept a short paper entitled "GuiDiff: a regression testing tool for graphical user interface" because it is very relevant to our study and it does not have an extended version released in the following years.
12 For example, [23] and [24] proposed tools and algorithms for measuring the code coverage of testing methods.
TABLE II
SUMMARY OF THE SELECTION OF PRIMARY PUBLICATIONS

enumerated overall six recurring testing objectives such as bug/defect detection.

Test targets: This dimension summarizes the representative targets on which testing approaches focus. In particular, for testing Android apps, the GUI/Event and ICC/interapplication communication (IAC) are recurrently targeted. For simplicity, we regroup all the other targets, such as normal code analysis, into General.

Test levels: This dimension checks the different levels (also known as phases) at which the test activities are performed. Indeed, it is common knowledge that software testing is very important and has to be applied at many levels such as unit testing, integration testing, etc. Android apps, as a specific type of software, also need to go through a thorough testing process before being released to public markets. In this dimension, we sum up the targeted testing phases/levels of the examined approaches, to understand what the state-of-the-art has focused on so far.

Test techniques: Finally, the fourth dimension focuses on the fundamental methodologies (e.g., fuzzing or mutation) that are followed to perform the tests, as well as the testing environments (e.g., on emulated hardware) and testing types (e.g., black-box testing).

V. LITERATURE REVIEW

We now report on the findings of this SLR in light of the research questions that we have raised in Section II-A.

A. What Concerns do the Approaches Focus on?

Our review investigates both the objectives that testing approaches seek to achieve and the app elements that are targeted by the test cases. Test objectives focus on problems that can be located anywhere in the code, while test targets focus on specific app elements that normally involve only certain types of code (e.g., functionality).

1) Test Objectives: Android testing research has tackled various objectives, including the assessment of apps against nonfunctional properties such as app efficiency in terms of energy consumption, and functional requirements such as the presence of bugs. We discuss in this section some recurrent test objectives from the literature.

Concurrency: Android apps expose a concurrency model that combines multithreading and asynchronous event-based dispatch, which may lead to subtle concurrency errors because of unforeseen thread interleaving coupled with nondeterministic reordering of asynchronous tasks. These error-prone features are however useful and increasingly becoming common in the development of efficient and feature-rich apps. To mitigate concurrency issues, several works have been proposed, notably for detecting races such as data races, event-based races, etc. in Android apps. As an example, Maiya et al. [62] have built DroidRacer, which identifies data races (i.e., read and write operations happening in parallel) by computing the happens-before relation on execution traces that are generated systematically through running test scenarios against Android apps. Bielik et al. [47] later have proposed a novel algorithm for scaling the inference of happens-before relations. Hu et al. [9] presented a work for verifying and reproducing event-based races, where they have found that both imprecise Android component modeling and implicit happens-before relations could result in false positives when detecting potential races.
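To make the class of errors targeted by such race detectors concrete, the hypothetical snippet below shows a common Android pattern in which a background thread writes a field that a UI-thread callback reads without any synchronization or happens-before ordering; whether the read observes the initialized value depends on thread interleaving. All class, field, and method names here are illustrative and are not taken from the surveyed tools.

public class QueryActivity extends android.app.Activity {
    private String result;  // shared between the background thread and the UI thread
    private final android.os.Handler ui = new android.os.Handler(android.os.Looper.getMainLooper());

    void startQuery() {
        new Thread(() -> {
            String response = fetchFromNetwork();  // long-running work off the UI thread
            result = response;                     // (1) unsynchronized write
            ui.post(this::showResult);             // asynchronous event dispatched to the UI thread
        }).start();
        showResult();                              // (2) read racing with the write in (1)
    }

    void showResult() {
        // Depending on the interleaving of (1) and (2), 'result' may still be null here,
        // which is exactly the kind of race a happens-before analysis would flag.
        android.util.Log.d("QueryActivity", "result length = " + (result == null ? -1 : result.length()));
    }

    private String fetchFromNetwork() { return "response"; }
}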
Security: As shown by Li et al. [22], the Android research community is extensively working on providing tools and approaches for solving various security problems for Android apps. Some of these works involve app testing, e.g., to observe defective behavior [57] and malicious behavior [79] and to track data leaks [75]. For example, Yan et al. [78] have built a novel and comprehensive approach for the detection of resource leaks
using test criteria based on neutral cycles: sequences of GUI events should have a "neutral" effect and should not increase the usage of resources. Hay et al. [45] dynamically detected interapplication communication vulnerabilities in Android apps.

Performance: Android apps are sensitive to performance issues. When a program thread becomes expensive, the system may stop app execution after warning on the user interface that the "Application [is] Not Responding." The literature includes several contributions on highlighting issues related to the performance of Android apps such as poor responsiveness [29] and exception handling [55]. Yang et al. [74], for example, have proposed a systematic testing approach to uncover and quantify common causes of poor responsiveness of Android apps. Concretely, they explicitly extend the delay for typical problematic operations, using the test amplification approach, to demonstrate the effects of expensive actions that can be observed by users.
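As a concrete illustration of how responsiveness problems can be surfaced during testing, the sketch below installs Android's StrictMode thread policy so that accidental disk or network operations on the main thread are logged while tests exercise the app. The policy calls are standard Android APIs; where the policy is installed (here, a debug Application subclass) is an assumption made for illustration.

public class DebugApplication extends android.app.Application {
    @Override
    public void onCreate() {
        super.onCreate();
        // Flag expensive operations executed on the main (UI) thread: such operations
        // are typical causes of "Application Not Responding" (ANR) dialogs.
        android.os.StrictMode.setThreadPolicy(
                new android.os.StrictMode.ThreadPolicy.Builder()
                        .detectDiskReads()
                        .detectDiskWrites()
                        .detectNetwork()
                        .penaltyLog()   // write each violation to logcat instead of crashing
                        .build());
    }
}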
Energy: One of the biggest differences between traditional PCs and portable devices is the fact that portable devices may run on battery power, which can get depleted during app usage. A number of research works have investigated energy consumption hotspots arising from software design defects or unwanted service execution (e.g., advertisement), or have leveraged energy fingerprints to detect mobile malware. As an example, Wan et al. [42] presented a technique for detecting display energy hotspots to guide developers in improving the energy efficiency of their apps. Since each activity performed on a battery-powered device drains a certain amount of energy from it, if the normal energy consumption is known for a device, any additionally used energy should be flagged as abnormal.
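The last sentence suggests a simple flagging rule; a minimal sketch of that idea follows, assuming a per-device baseline of normal energy draw is available from prior profiling (the class, baseline value, and tolerance are illustrative).

public final class EnergyMonitor {
    private final double baselineMilliAmpHours;  // known "normal" consumption for the scenario
    private final double tolerance;              // allowed relative deviation, e.g., 0.20 for 20%

    public EnergyMonitor(double baselineMilliAmpHours, double tolerance) {
        this.baselineMilliAmpHours = baselineMilliAmpHours;
        this.tolerance = tolerance;
    }

    /** Flags a test run whose measured consumption exceeds the baseline by more than the tolerance. */
    public boolean isAbnormal(double measuredMilliAmpHours) {
        return measuredMilliAmpHours > baselineMilliAmpHours * (1.0 + tolerance);
    }
}
// Example: new EnergyMonitor(12.0, 0.20).isAbnormal(16.5) returns true, i.e., the run draws suspiciously more energy.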
Compatibility: Android apps often suffer from compatibility issues, where a given app can run successfully on a device characterized by a range of OS versions while failing on others [85]. This is mainly due to the fragmentation of the Android ecosystem brought by its open-source nature. Every vendor, theoretically, can have its own customized system (e.g., for supporting specific low-level hardware), and the screen size of its released devices can vary as well. To address compatibility problems, there is a need to devise scalable and efficient approaches for performing compatibility testing before releasing an app into markets. Indeed, as pointed out by Vilkomir et al. [59], it is expensive and time-consuming to consider testing all device variations. The authors thus proposed to address the issue with a combinatorial approach, which attempts to select an optimal set of mobile devices for practical testing. Zhang et al. [41] leveraged a statistical approach to optimize the compatibility testing strategy, where the test sequence is generated by a K-means statistical algorithm.

Bug/Defect:14 Like most software, Android apps are often buggy, usually leading to runtime crashes. Due to the high competition among apps in the Android ecosystem, defect identification is critical, since defects can be detrimental to user rating and adoption [86]. Indeed, researchers in this field leverage various testing techniques such as fuzz testing, mutation testing, and search-based testing to dynamically explore Android apps to pinpoint defective behavior [57], GUI bugs [84], intent defects [72], crashing faults [11], etc.

Table III characterizes the publications selected for our SLR in terms of the objectives discussed above. Through our in-depth examination, the most considered testing objective is bug/defect, accounting for 23.3% of the selected publications.

2) Test Targets: Test approaches in software development generally target core functionality code. Since Android apps are written in Java, the literature on Android app testing focused on Android specificities, mainly on how to address GUI testing with a complex event mechanism as well as intercomponent and interapplication communications.

GUI/Event: Android implements an event-driven graphical user interface system, making Android app testing challenging, since apps intensively interact with user inputs, introducing uncertainty and nondeterminism. It is generally complicated to model the UI/system events, because doing so not only needs the knowledge of the set of GUI widgets and their supported actions (e.g., click for buttons) but also requires the knowledge of system events (e.g., receiving a phone call), which however are usually unknown in advance. Consequently, it is generally difficult to assemble a valid set of input event sequences for a given Android app with respect to coverage, precision, and compactness test criteria [87]. The Android testing community has proposed many approaches to address this challenge. For example, Android-GUITAR, an extension of the GUITAR tool [88], was proposed to model the structure and execution behavior of the Android GUI through a formalism called GUI forests and event-flow graphs. Dynodroid [89] applies a dynamic approach to generate inputs by instrumenting the Android framework to record the reaction of events.
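For readers unfamiliar with what such GUI/event tests look like in practice, the hedged sketch below drives a hypothetical login screen with the standard Espresso API; the activity name and view IDs are placeholders rather than artifacts of any surveyed work. Automated input-generation tools essentially synthesize sequences of these events on their own.

import static androidx.test.espresso.Espresso.onView;
import static androidx.test.espresso.action.ViewActions.click;
import static androidx.test.espresso.action.ViewActions.typeText;
import static androidx.test.espresso.assertion.ViewAssertions.matches;
import static androidx.test.espresso.matcher.ViewMatchers.isDisplayed;
import static androidx.test.espresso.matcher.ViewMatchers.withId;

import androidx.test.ext.junit.rules.ActivityScenarioRule;
import androidx.test.ext.junit.runners.AndroidJUnit4;
import org.junit.Rule;
import org.junit.Test;
import org.junit.runner.RunWith;

@RunWith(AndroidJUnit4.class)
public class LoginScreenTest {
    // Launches the (hypothetical) activity under test before each test method.
    @Rule
    public ActivityScenarioRule<LoginActivity> rule = new ActivityScenarioRule<>(LoginActivity.class);

    @Test
    public void loginButtonShowsGreeting() {
        onView(withId(R.id.username)).perform(typeText("alice"));     // GUI event: text input
        onView(withId(R.id.login_button)).perform(click());           // GUI event: click
        onView(withId(R.id.greeting)).check(matches(isDisplayed()));  // oracle on the resulting UI state
    }
}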
ICC/IAC: ICC and IAC15 enable a loose coupling among components [90], [91], thus reducing the complexity of developing Android apps by offering a generic means to reuse existing functionality (e.g., obtaining the contact list). Unfortunately, ICC/IAC also come with a number of security issues, among which the potential for implementing component hijacking, broadcast injection, etc. [92]. Researchers have then investigated various testing approaches to highlight such issues in Android apps. IntentDroid [45], for instance, performs comprehensive IAC security testing for inferring Android IAC integrity vulnerabilities. It utilizes lightweight platform-level instrumentation, implemented through debug breakpoints, to recover IAC-relevant app-level behavior. IntentFuzzer [58], on the other hand, leverages fuzz testing techniques to detect capability leaks (e.g., permission escalation attacks) in Android apps.
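To give a flavor of what intent fuzzing involves, the illustrative snippet below builds Intents with randomized actions and extras and fires them at an exported component. This is a simplified sketch of the general idea only: the target package and component are hypothetical, and real tools such as IntentFuzzer additionally monitor crashes and permission checks around this core loop.

import android.content.Context;
import android.content.Intent;
import java.util.Random;

public final class SimpleIntentFuzzer {
    private static final String[] ACTIONS = {
            Intent.ACTION_VIEW, Intent.ACTION_SEND, "com.example.CUSTOM_ACTION"};
    private final Random random = new Random();

    /** Sends 'rounds' randomly shaped intents to one exported activity of the app under test. */
    public void fuzz(Context context, int rounds) {
        for (int i = 0; i < rounds; i++) {
            Intent intent = new Intent(ACTIONS[random.nextInt(ACTIONS.length)]);
            // Hypothetical exported component of the app under test.
            intent.setClassName("com.example.victim", "com.example.victim.ExportedActivity");
            intent.putExtra("id", random.nextInt());                       // well-typed extra
            intent.putExtra("payload", randomString(random.nextInt(64)));  // unexpected/malformed extra
            intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
            try {
                context.startActivity(intent);   // a crash here reveals poorly validated ICC input
            } catch (RuntimeException e) {
                android.util.Log.w("SimpleIntentFuzzer", "component rejected intent", e);
            }
        }
    }

    private String randomString(int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) sb.append((char) ('!' + random.nextInt(90)));
        return sb.toString();
    }
}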
General: For all other publications that did not address the above two popular targets, the category General applies. Publications with targets like normal code analysis are grouped into this category.

Table IV characterizes the test targets discussed above. The most frequently addressed testing target is GUI/Event, accounting for 45.6% of the selected publications. Meanwhile, there are

14 Terminologically, the aforementioned objectives could also be categorized as bug/defect problems (e.g., concurrency issues). To make the summarization more meaningful in this work, we only flag publications as bug/defect as long as their main focuses are bug/defect problems, e.g., when they address the gap between an app's misbehavior and the developer's original design.
15 IAC is actually ICC where the communicating components are from different apps.
TABLE III
TEST OBJECTIVES IN THE LITERATURE

only 12 publications targeted ICC/IAC. A total of 44 publications are regrouped under the General category.

Insights from RQ1—on Targets and Objectives
– "Bug/defect" has been the most trending concern among the Android research community. "Compatibility" testing, which is necessary for detecting issues that plague the fragmented Android ecosystem, remains understudied. Similarly, we note that because mobile devices are quickly getting powerful, developers build increasingly complex apps with services exploring hardware multicore capabilities. Therefore, the community should invest more effort in approaches for concurrency testing.
– Our review has also confirmed that the GUI is of paramount importance in modern software development for guaranteeing a good user experience. In Android apps, the GUI actions and reactions are intertwined with the app logic, increasing the challenges of analyzing app code for defects. For example, modeling GUI behavior while taking into account potential runtime interruption by system events (e.g., an incoming phone call) is necessary, yet not trivial. These challenges have created opportunities in Android research: as our literature review shows, most test approaches target the GUI or the event mechanism. The community now needs to focus on transforming the approaches into scalable tools that will perform deeper security analyses and accurate defect identification in order to improve the overall quality of apps distributed in markets.

B. Which Test Levels are Addressed?

Development of Android apps involves the classical steps of traditional software development. Therefore, there are opportunities in various phases to perform tests with specific emphasis and purpose. The software testing community commonly acknowledges four levels of software testing [127], [128]. Our literature review has identified that Android researchers have proposed approaches which considered unit/regression testing, integration testing, and system testing. Acceptance testing, which involves end-users evaluating whether the app complies with their needs and requirements, still faces a lack of research effort in the literature.

Unit testing is usually applied at the beginning of the development of Android apps; unit tests are usually written by the developers themselves and can be taken as a type of white-box testing. Unit testing
TABLE IV
TEST TARGETS IN THE LITERATURE

TABLE V
RECURRENT TESTING PHASES

TABLE VI
TEST METHOD EMPLOYED IN THE LITERATURE

individual functionalities in a white-box testing scenario, are limited to a few approaches.

C. How are the Test Approaches Built?

Our review further investigates the approaches in-depth to characterize the methodologies they leverage, the types of tests that are implemented, as well as the tool support they have exploited. In this paper, we refer to test technique as a broad concept describing all the technical aspects related to testing, while we constrain the term test methodology to specifically describe the concrete methodology that a test approach applies.

1) Test Methodologies: Table VI enumerates all the testing methodologies we observed in our examination.

Model-based testing is a testing methodology that goes one step further than traditional methodologies by automatically generating test cases based on a model that describes the functionality of the system under test. Although such a methodology incurs a substantial, usually manual, effort to design and build the model, the eventual test approach is often extensive, since test cases can be automatically generated and executed. Our review has revealed that model-based testing is the most common methodology used in the Android testing literature: 63% of publications involve some model-based testing steps. Takala et al. [123] presented a comprehensive documentation of their experiences in applying model-based GUI testing to Android apps. They typically discuss how model-based testing and test automation are implemented, how apps are modeled, as well as how tests are designed and executed.
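To make the idea concrete, the small sketch below models an app's GUI as a directed graph whose edges are user events and derives test sequences by enumerating paths from the start screen. This is an illustrative reduction of model-based testing with invented screen and event names, not the algorithm of any particular surveyed tool.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class GuiModel {
    /** Edge of the GUI model: firing 'event' on screen 'from' leads to screen 'to'. */
    record Transition(String from, String event, String to) {}

    private final Map<String, List<Transition>> outgoing = new HashMap<>();

    public void addTransition(String from, String event, String to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(new Transition(from, event, to));
    }

    /** Generates event sequences (test cases) by exhaustively walking the model up to a given depth. */
    public List<List<String>> generateTests(String startScreen, int maxDepth) {
        List<List<String>> tests = new ArrayList<>();
        explore(startScreen, new ArrayDeque<>(), tests, maxDepth);
        return tests;
    }

    private void explore(String screen, Deque<String> path, List<List<String>> tests, int depth) {
        if (depth == 0 || !outgoing.containsKey(screen)) {
            if (!path.isEmpty()) tests.add(new ArrayList<>(path));  // emit one maximal event sequence
            return;
        }
        for (Transition t : outgoing.get(screen)) {
            path.addLast(t.event());
            explore(t.to(), path, tests, depth - 1);
            path.removeLast();
        }
    }

    public static void main(String[] args) {
        GuiModel model = new GuiModel();
        model.addTransition("Login", "typeCredentials", "Login");
        model.addTransition("Login", "clickLogin", "Home");
        model.addTransition("Home", "openSettings", "Settings");
        System.out.println(model.generateTests("Login", 3));
    }
}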
Search-based testing uses metaheuristic search techniques to generate software tests [129], with the aim of detecting as many bugs as possible, especially the most critical ones, in the system under test. In [105], Mahmood et al. developed an evolutionary testing framework for Android apps. Evolutionary testing is a form of search-based testing, where an individual corresponds to a test case, and a population comprised of many
TABLE VIII
SUMMARY OF BASIC TOOLS THAT ARE FREQUENTLY LEVERAGED BY OTHER TESTING APPROACHES

TABLE IX
ASSESSMENT METRICS (E.G., FOR COVERAGE, ACCURACY)

requires only a few extra steps and abstractions. Because testers do not need to maintain a set of fake objects and interfaces, it is even preferable for complex apps.

Sikuli uses visual technology to automate GUI testing through screenshot images. It is particularly useful when there is no easy way to obtain the app source code or the internal structure of graphic interfaces. Lin et al. [106], [113] leveraged Sikuli in their work to enable record-and-replay testing of Android apps, where the user interactions are saved beforehand in Sikuli test formats (as screenshot images).

Insights from RQ3—on Used Techniques
– Given the complexity of interactions among components in Android apps as well as with the operating system, it is not surprising that most approaches in the literature resort to "model-based" techniques, which build models for capturing the overall structure and behavior of apps to facilitate testing activities (e.g., input generation, execution scenario selection, etc.).
– The unavailability of source code for market apps makes white-box techniques less attractive than grey-box and black-box testing for assessing apps in the wild. Nevertheless, our SLR shows that the research community has not sufficiently explored testing approaches that would directly benefit app developers during the development phase.
– Tool support for building testing approaches is abundant. The use of the Robotium open-source test framework by numerous approaches once again demonstrates the importance of making tools available to stimulate research.

D. To What Extent are the Approaches Validated?

Several aspects must be considered when assessing the effectiveness of a testing approach. We consider in this SLR the measurements performed on code coverage as well as on accuracy. We also investigate the use of a ground truth to validate performance scores, as well as the size of the experimental dataset.

Coverage is a key aspect for estimating how well the program is tested. Larger coverage generally correlates with higher possibilities of exposing potential bugs and vulnerabilities, as well as uncovering malicious behavior. There are numerous coverage metrics leveraged by state-of-the-art works. For example, for evaluating Code Coverage, metrics such as LoC (Lines of Code) [11], [102], [105], Block [97], Method [108], [115], and Branch [114] have been proposed in our community. In order to profile the Accuracy of testing approaches, other coverage metrics are also proposed in the literature, such as bugs [42] and vulnerabilities [45] (e.g., how many known vulnerabilities can the evaluated testing approach cover?). Table IX enumerates the coverage metrics used in the literature, where LoC appears to be the most frequently considered metric.
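As a minimal illustration of how such a line-coverage score is computed once execution traces are available, the sketch below compares the set of executed line numbers against the set of executable lines of the app. Both sets are assumed to come from an instrumentation step; the class itself is illustrative and not a surveyed tool.

import java.util.HashSet;
import java.util.Set;

public final class LineCoverage {
    /** Percentage of executable lines that were exercised by at least one test run. */
    public static double percent(Set<Integer> executableLines, Set<Integer> executedLines) {
        if (executableLines.isEmpty()) return 0.0;
        Set<Integer> covered = new HashSet<>(executedLines);
        covered.retainAll(executableLines);            // ignore spurious trace entries
        return 100.0 * covered.size() / executableLines.size();
    }

    public static void main(String[] args) {
        Set<Integer> executable = Set.of(10, 11, 12, 20, 21, 30);
        Set<Integer> executed = Set.of(10, 11, 20);
        System.out.printf("LoC coverage: %.1f%%%n", percent(executable, executed)); // prints 50.0%
    }
}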
Ground truth refers to a reference dataset where each element is labeled. In this SLR, we consider two types of ground truths. The first is related to malware detection approaches: the ground truth then contains apps labeled as benign or malicious. As an example, the Drebin [132] dataset has recurrently been leveraged as ground truth to evaluate testing approaches [133]. The second is related to vulnerability and bug detection: the ground truth represents code that is flagged to be vulnerable or buggy based on the observation of bug reports submitted by end users or bug fix histories committed by developers [55], [84].

The Dataset Size is the number of apps tested in the experimental phase. We can see from Fig. 9 that most works (ignoring outliers) carried out experiments on no more than 100 apps, with a median number of 8 apps. Compared with the distribution of the number of evaluated apps summarized in an SLR of static analysis of Android apps [22], where the median and maximum numbers are, respectively, 374 and 318 515, the number of apps considered by testing approaches is far smaller. This result is
Fig. 10. Trend of testing types. (a) Black-box. (b) White-box. (c) Grey-box.

somewhat expected, as testing approaches (or dynamic analysis approaches) are generally not scalable.

Insights from RQ4—on Approach Validation
Although literature works always include an evaluation section providing evidence (often through comparison) that their approaches are effective, their reproducibility is still challenged by the fact that there is a lack of established ground truth and benchmarks. Yet, reproducibility is essential to ensure that the field is indeed progressing based on a baseline performance, instead of relying on subjective observation by authors and on datasets with variable characteristics.

VI. DISCUSSION

Research on Android app testing has been prolific in the past years. Our discussion will focus on the trends that we observed while performing this SLR, as well as on the challenges that the community should still attempt to address.

A. Trend Analysis

The development of the different branches in the taxonomy is disparate.

Fig. 10 illustrates the trend in testing types over the years. Together, black-box and grey-box testing are involved in 90% of the research works. Their evolution is thus reflected by the overall evolution of research publications (cf. Fig. 4). White-box testing remains low in all years.

Fig. 11 presents the evolution over time of works addressing different test levels. Unit/regression and integration testing phases include a low, but stable, number of works every year. Overall, system testing has been heavily used in the literature and has even doubled between 2012 and 2014. System testing of Android apps is favored since app execution is done on a specific virtual machine environment with numerous runtime dependencies: it is not straightforward to isolate a single block for unit/regression testing or to test the integration of two components without interference from other components. Nevertheless, with the increasing use of code instrumentation [14], there are new opportunities to eventually slice Android apps for performing more grey-box and white-box testing.

Trend analysis of testing methods in Fig. 12 confirms that model-based testing is dominating the literature of Android app testing, and its evolution is reflected in the overall evolution of testing approaches. Most approaches indeed start by constructing a GUI model or a call graph to generate efficient test cases. In the last couple of years, mutation testing has been appearing in the literature, similarly to the search-based techniques.

With regard to testing targets, Fig. 13(a)–(b) shows that the graphical user interfaces, as well as the event mechanism, are continuously at the core of research approaches. Since Android Activities (i.e., the UIs) are the main entry points for executing the test cases, the community will likely continue to develop black-box and grey-box test strategies that increase interactions with the GUI to improve code coverage. Intercomponent and interapplication communications, on the other hand, have been popular targets around 2014.

With regard to testing objectives, Fig. 13(c)–(h) shows that security concerns have attracted a significant amount of research, although the output has been decreasing in the last couple of years. Bug/defect identification, however, has somewhat stabilized.

B. Evaluation of Authors

Android testing is a new field of research, which has attracted several contributions over the years due to the multiple opportunities that it offers for researchers to apply theoretical advances in the domain of software testing. We emphasize the attractiveness of the field by showing in Fig. 14 the evolution of single authors contributing to research approaches. We count
chips, while most PCs running Android emulators are assembled with Intel chips; running ARM-based emulators on Intel-based PCs is extremely slow, and this gap has caused problems for emulator-based testing approaches [95].

4) Evaluating Testing Approaches Fairly: Frequently, researchers complain about the fact that our community has not provided a reliable coverage estimator to approximate the coverage (e.g., code coverage) of testing approaches and to fairly compare them [12], [29], [41], [43]. Although some outstanding progress has been made for developing estimation tools [23], our SLR still indicates that there does not exist any universally accepted tool that supports fair comparison among testing approaches. We, therefore, urge our fellow researchers to appropriately resolve this open issue and subsequently contribute to our community a reliable artefact benefiting many aspects of future research studies.

5) Addressing Usability Defects: The majority of the research studies focuses on functional defects of Android apps. Usability defects, however, do not attract as much research attention, even though users are concerned about them [53]. A usability defect, like poor responsiveness [74], is a major drawback of Android apps and receives massive complaints from users. Bad view organization on the screen arising from incompatibility, as well as repetitive imprecise recognition of user gestures, also implies a bad user experience.

E. New Research Directions

In light of the SLR summary of the state-of-the-art and considering the new challenges reported in the literature, there are opportunities for exploring new testing applications to improve the quality of Android apps or/and increase confidence in using them safely. We now enumerate three example directions, which are as follows.

1) Validation of App Updates: Android app developers regularly update their apps for various reasons, including keeping them attractive to the user base.17 Unfortunately, recent studies [134] have shown that updates of Android apps often come with more security vulnerabilities and functional defects. In this context, the community could investigate and adapt regression techniques for identifying defect-prone or unsafe updates. To accelerate the identification of such issues in updates, one can consider exploring approaches with behavioral equivalence, e.g., using "record and replay" test-case generation techniques.
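One hedged way to operationalize this direction is sketched below: a previously recorded event trace is replayed on both the old and the updated APK, and the observable outputs are compared to flag behavioral divergence. The trace replayer and the observation function are assumed to exist, e.g., built on top of a record-and-replay tool; every name in the sketch is illustrative.

import java.util.List;

public final class UpdateValidator {
    /** Replays one recorded event trace against an installed app version and returns what was observed. */
    interface TraceReplayer {
        List<String> replay(String apkPath, List<String> recordedEvents);
    }

    private final TraceReplayer replayer;

    UpdateValidator(TraceReplayer replayer) {
        this.replayer = replayer;
    }

    /**
     * Flags an update as suspicious when the same recorded interaction produces
     * different observable behavior (screens, dialogs, crashes) on the two versions.
     */
    boolean isBehaviorPreserved(String oldApk, String newApk, List<String> recordedEvents) {
        List<String> before = replayer.replay(oldApk, recordedEvents);
        List<String> after = replayer.replay(newApk, recordedEvents);
        return before.equals(after);
    }
}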
2) Accounting for the Ecosystem Fragmentation: As previously highlighted, the fragmentation of the Android ecosystem (with a high variety of operating system versions on which a given app will be running, as well as a diversity of hardware specifications) is a serious challenge for performing tests that can expose all issues that a user might encounter in his specific device runtime environment. There is still room to investigate test optimization and prioritization for Android to cover a majority of devices and operating system versions. For example, on top of modeling apps, researchers could consider modeling the framework (and its variabilities) and account for it during test execution.

3) Code Prioritization Versus Test Prioritization: Finally, we note that Android apps are becoming larger and larger in terms of size, including obsolete code for functionalities that are no longer needed, or code that accounts for the diversity of devices (and their OS versions). For example, in large companies, because of developer rotation, "dead" code/functionality may remain hidden in plain sight in app code without development teams taking the risk of removing it. As a result, the effort spent in maintaining those apps increases continuously, and consequently the testing effort required to verify the functional correctness of those apps also grows. Therefore, to alleviate this problem, we argue that testing such apps clearly necessitates optimizing the selection of code that must be tested in priority. Test case prioritization must then be performed in conjunction with a code optimization process to focus on actively used code w.r.t. user interactions with the app.
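A common baseline for the test-prioritization side of this idea is the greedy "additional coverage" ordering sketched below, which repeatedly picks the test case that covers the largest amount of not-yet-covered code units. The coverage sets would come from instrumented runs; the data in the example is illustrative.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class GreedyPrioritizer {
    /** Orders test cases so that each next test adds the largest amount of still-uncovered code. */
    public static List<String> prioritize(Map<String, Set<String>> coveragePerTest) {
        Map<String, Set<String>> remaining = new LinkedHashMap<>(coveragePerTest);
        Set<String> covered = new HashSet<>();
        List<String> order = new ArrayList<>();
        while (!remaining.isEmpty()) {
            String best = null;
            int bestGain = -1;
            for (Map.Entry<String, Set<String>> e : remaining.entrySet()) {
                Set<String> gain = new HashSet<>(e.getValue());
                gain.removeAll(covered);              // only count code not covered yet
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = e.getKey();
                }
            }
            covered.addAll(remaining.remove(best));
            order.add(best);
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> cov = new LinkedHashMap<>();
        cov.put("t1", Set.of("LoginActivity", "HomeActivity"));
        cov.put("t2", Set.of("HomeActivity"));
        cov.put("t3", Set.of("SettingsActivity", "HomeActivity"));
        System.out.println(prioritize(cov)); // [t1, t3, t2]
    }
}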
VII. THREATS TO VALIDITY

We have identified the following threats to validity in our study.

On potential misses of literature—We have not considered for our review books and Master or Ph.D. dissertations related to Android testing. This threat is mitigated by the fact that the content of such publications is eventually presented in peer-reviewed venues that we have considered. We have also considered only publications written in English. Nevertheless, while searching with the compiled English keywords, we have also found a few papers written in other languages, such as German and Chinese. The number of such non-English papers remains, however, significantly small compared with the collected English literature, suggesting that our SLR is likely complete. Last but not least, although we have refined our search keywords several times, it is still possible that some synonyms are missed in this paper. To mitigate this, we believe that natural language processing could be leveraged to disclose such synonyms. We, therefore, consider it as future work toward engineering sound keywords for supporting SLRs.

On data extraction errors—Given that papers are often imprecise with information related to the aspects that we have investigated, the extracted data may not have been equally reliable for all approaches, and data aggregation can still include several errors, as warned by Turner et al. [135] for such studies. We have nevertheless strived to mitigate this issue by applying a crosschecking mechanism on the extracted results, following the suggestion of Brereton et al. [20]. To further alleviate this, we plan to validate our extracted results with their original authors.

On the representativeness of data sources and metrics—We have implemented the "major venues search" based on the venue ranking provided by the CCF. This ranking is not only potentially biased toward a specific community of researchers but may also change from one year to another. A replication of this study based on other rankings may lead to a different primary publications set, although the overall findings will likely remain

17 https://savvyapps.com/blog/how-often-should-you-update-your-app
the same, since most major venues continue to be so across years and across ranking systems.

The aspects and metrics investigated in this approach may also not be exhaustive or representative of everything that characterizes testing. Nevertheless, these metrics have been collected from the testing literature to build the taxonomy and are essential for comparing approaches.

VIII. RELATED WORK

Mobile operating systems, in particular the open-source Android platform, have been fertile ground for research in software engineering and security. Several surveys and reviews have been performed on approaches for securing [136], [137] or statically analyzing Android apps [22]. An SLR is indeed important to analyze the contributions of a community to resolve the challenges of a specific topic. In the case of Android testing, such a review is missing.

Several works in the literature have, however, attempted to provide an overview of the field via surveys or general systematic mappings on mobile application testing techniques. For example, the systematic mapping of Sein et al. [138] addresses all together Android, iOS, Symbian, Silverlight, and Windows. The authors have provided a higher-level categorization of techniques into five groups, which are as follows:
1) usability testing;
2) test automation;
3) context-awareness;
4) security;
5) general category.
Méndez-Porras et al. [139] have provided another mapping, focusing on a more narrowed field, namely automated testing of mobile apps. They discuss two major challenges for automating the testing process of mobile apps, which are an appropriate set of test cases and an appropriate set of devices to perform the testing. Our work, with this SLR, goes in-depth to cover different technical aspects of the literature on specifically Android app testing (as well as test objectives, targets, and publication venues).

Other related works have discussed directly the challenges of testing Android apps in general. For example, Amalfitano et al. [140] analyzed specifically the challenges and open issues of testing Android apps, where they have summarized suitable and effective principles, guidelines, models, techniques, and technologies related to testing Android apps. They enumerate existing tools and frameworks for automated testing of Android apps. They typically summarize the issues of software testing regarding nonfunctional requirements, including performance, stress, security, compatibility, usability, accessibility, etc.

Gao et al. [141] presented a study on mobile testing-as-a-service (MTaaS), where they discussed the basic concepts of performing MTaaS. Besides, the motivations, distinct features, requirements, test environments, and existing approaches are also discussed. Moreover, they have also discussed the current issues, needs, and challenges of applying MTaaS in practice.

More recently, Starov et al. [142] performed a state-of-the-art survey to look into a set of cloud services for mobile testing. Based on their investigation, they divided the cloud services of mobile testing into three subcategories, which are as follows:
1) device clouds (mobile cloud platforms);
2) services to support application lifecycle management;
3) tools to provide processing according to some testing techniques.
They also argue that it is essential to migrate the testing process to the clouds, which would make teamwork possible. Besides, it can also reduce the testing time and development costs.

Muccini et al. [143] conducted a short study on the challenges and future research directions for testing mobile apps. Based on their study, they find that 1) mobile apps are so different from traditional ones that they require different and specialized techniques in order to test them, and 2) there seem to be many challenges. As an example, performance, security, reliability, and energy are strongly affected by the variability of the testing environment.

Janicki et al. [144] surveyed the obstacles and opportunities in deploying model-based GUI testing of mobile apps. Unlike conventional automatic test execution, model-based testing goes one step further by considering the automation of test generation phases as well. Based on their studies, they claim that the most valuable kind of research need (as future work) is to perform a comparative experiment on using conventional test and model-based automation, as well as exploratory and script-based manual testing, to evaluate them concurrently on the same system and thus to measure the success of those approaches.

Finally, the literature includes several surveys [136], [145]–[147] on Android, which cover some aspects of Android testing. As an example, Tam et al. [136] have studied the evolution of Android malware and Android analysis techniques, where various Android-based testing approaches such as A3E have been discussed.

IX. CONCLUSION

We report in this paper on an SLR performed on the topic of Android app testing. Our review has explored 103 papers that were published in major conferences, workshops, and journals in the software engineering, programming language, and security domains. We have then proposed a taxonomy of the related research exploring several dimensions, including the objectives (i.e., what functional or nonfunctional concerns are addressed by the approaches) that were pursued and the techniques (i.e., what type of testing methods—mutation, concolic, etc.) that were leveraged. We have further explored the assessments presented in the literature, highlighting the lack of established benchmarks to clearly monitor the progress made in the field. Finally, beyond quantitative summaries, we have provided a discussion on future challenges and proposed new research directions of Android testing research for further ensuring the quality of apps with regard to compatibility issues, vulnerability-inducing updates, etc.

APPENDIX

The full list of examined primary publications is enumerated in Table A1.
TABLE A1
FULL LIST OF EXAMINED PUBLICATIONS
TABLE A1
(CONTINUED)
REFERENCES [5] H. Wang, H. Li, L. Li, Y. Guo, and G. Xu, “Why are Android apps
removed from Google play? A large-scale empirical study,” in Proc.
[1] L. Li, T. F. Bissyandé, J. Klein, and Y. Le Traon, “An investigation into 15th Int. Conf. Mining Softw. Repositories, 2018, pp. 231–242.
the use of common libraries in Android apps,” in Proc. 23rd IEEE Int. [6] L. Li, J. Gao, T. F. Bissyandé, L. Ma, X. Xia, and J. Klein, “Character-
Conf. Softw. Anal., Evolution, Reeng., 2016, pp. 403–414. ising deprecated Android APIs,” in Proc. 15th Int. Conf. Mining Softw.
[2] L. Li et al., “Androzoo++: Collecting millions of Android apps and their Repositories, 2018, pp. 254–264.
metadata for the research community,” 2017, arXiv:1709.05281. [7] L. Li, T. F. Bissyandé, Y. Le Traon, and J. Klein, “Accessing inaccessible
[3] P. S. Kochhar, F. Thung, N. Nagappan, T. Zimmermann, and D. Lo, Android APIs: An empirical study,” in Proc. 32nd Int. Conf. Softw.
“Understanding the test automation culture of app developers,” in Proc. Maintenance Evolution, 2016, 411–422.
8th Int. Conf. Softw. Testing, Verification Validation, 2015, pp. 1–10. [8] N. Mirzaei, J. Garcia, H. Bagheri, A. Sadeghi, and S. Malek, “Reducing
[4] L. Li, “Mining androzoo: A retrospect,” in Proc. Doctoral Symp. 33rd combinatorics in GUI testing of Android applications,” in Proc. Int. Conf.
Int. Conf. Softw. Maintenance Evolution, 2017, pp. 675–680. Softw. Eng., 2016, pp. 559–570.
[9] Y. Hu, I. Neamtiu, and A. Alavi, “Automatically verifying and reproduc- [34] Q. Sun, L. Xu, L. Chen, and W. Zhang, “Replaying harmful data races
ing event-based races in Android apps,” in Proc. Int. Symp. Softw. Testing in Android apps,” in Proc. IEEE Int. Symp. Softw. Rel. Eng. Workshop,
Anal., 2016, pp. 377–388. 2016, pp. 160–166.
[10] L. Clapp, O. Bastani, S. Anand, and A. Aiken, “Minimizing GUI event [35] X. Wu, Y. Jiang, C. Xu, C. Cao, X. Ma, and J. Lu, “Testing Android apps
traces,” in Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2016, via guided gesture event generation,” in Proc. Asia-Pacific Softw. Eng.
pp. 422–434. Conf., 2016, pp. 201–208.
[11] K. Mao, M. Harman, and Y. Jia, “Sapienz: Multi-objective automated [36] H. Zhang, H. Wu, and A. Rountev, “Automated test generation for detec-
testing for Android applications,” in Proc. Int. Symp. Softw. Testing Anal., tion of leaks in Android applications,” in Proc. IEEE 11th Int. Workshop
2016, pp. 94–105. Automat. Softw. Test, 2016, pp. 64–70.
[12] X. Zeng et al., “Automated test input generation for Android: Are we [37] R. Jabbarvand, A. Sadeghi, H. Bagheri, and S. Malek, “Energy-aware
really there yet in an industrial case?” in Proc. ACM SIGSOFT Int. Symp. test-suite minimization for Android apps,” in Proc. Int. Symp. Softw.
Found. Softw. Eng., 2016, pp. 987–992. Testing Anal., 2016, pp. 425–436.
[13] F. Dong et al., “Frauddroid: Automated ad fraud detection for Android [38] J. Qian and D. Zhou, “Prioritizing test cases for memory leaks in Android
apps,” in Proc. 26th ACM Joint Eur. Softw. Eng. Conf. Symp. Found. applications,” J. Comput. Sci. Technol., vol. 31, pp. 869–882, 2016.
Softw. Eng., 2018. [39] M. Ermuth and M. Pradel, “Monkey see, monkey do: Effective generation
[14] L. Li et al., “IccTA: Detecting inter-component privacy leaks in Android of GUI tests with inferred macro events,” in Proc. 25th Int. Symp. Softw.
apps,” in Proc. IEEE 37th Int. Conf. Softw. Eng., 2015, pp. 280–291. Testing and Anal., 2016, pp. 82–93.
[15] L. Gomez, I. Neamtiu, T. Azim, and T. Millstein, “RERAN: Timing-and [40] T. Zhang, J. Gao, O.-E.-K. Aktouf, and T. Uehara, “Test model and
touch-sensitive record and replay for Android,” in Proc. Int. Conf. Softw. coverage analysis for location-based mobile services,” in Proc. Int. Conf.
Eng., 2013, pp. 72–81. Softw. Eng. Knowl. Eng., 2015, pp. 80–86.
[16] L. Li, T. F. Bissyandé, H. Wang, and J. Klein, “CiD: Automating the [41] T. Zhang, J. Gao, J. Cheng, and T. Uehara, “Compatibility testing service
detection of API-related compatibility issues in Android apps,” in Proc. for mobile applications,” in Proc. IEEE Symp. Service-Oriented Syst.
ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2018, pp. 153–163. Eng., 2015, pp. 179–186.
[17] L. Wei, Y. Liu, and S. Cheung, “Taming Android fragmentation: Char- [42] M. Wan, Y. Jin, D. Li, and W. G. J. Halfond, “Detecting display energy
acterizing and detecting compatibility issues for Android apps,” in Proc. hotspots in Android apps,” in Proc. IEEE 8th Int. Conf. Softw. Testing,
31st IEEE/ACM Int. Conf. Automated Softw. Eng., 2016, pp. 226–237. Verification Validation, 2015, pp. 1–10.
[18] N. Mirzaei, S. Malek, C. S. Psreanu, N. Esfahani, and R. Mahmood, [43] Š. Packevičius, A. Ušaniov, Š. Stanskis, and E. Bareiša, “The testing
“Testing Android apps through symbolic execution,” in Proc. ACM SIG- method based on image analysis for automated detection of UI defects
SOFT Softw. Eng. Notes, 2012, pp. 1–5 . intended for mobile applications,” in Proc. Int. Conf. Inf. Softw. Technol.,
[19] B. Kitchenham and S. Charters, “Guidelines for performing system- 2015, pp. 560–576.
atic literature reviews in software engineering,” Univ. Durham, Durham, [44] K. Knorr and D. Aspinall, “Security testing for Android mhealth
U.K., EBSE Tech. Rep., EBSE-2007-01, 2007. apps,” in Proc. Softw. Testing, Verification Validation Workshops, 2015,
[20] P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, and M. Khalil, pp. 1–8.
“Lessons from applying the systematic literature review process within [45] R. Hay, O. Tripp, and M. Pistoia, “Dynamic detection of inter-application
the software engineering domain,” J. Syst. Softw., vol. 80, no. 4, pp. 571– communication vulnerabilities in Android,” in Proc. Int. Symp. Softw.
583, 2007. Testing Anal., 2015, pp. 118–128.
[21] P. H. Nguyen, M. Kramer, J. Klein, and Y. Le Traon, “An extensive [46] G. d. C. Farto and A. T. Endo, “Evaluating the model-based testing
systematic review on the model-driven development of secure systems,” approach in the context of mobile applications,” Electron. Notes Theor.
Inf. Softw. Technol., vol. 68, pp. 62–81, 2015. Comput. Sci., vol. 314, pp. 3–21, 2015.
[22] L. Li et al., “Static analysis of Android apps: A systematic literature [47] P. Bielik, V. Raychev, and M. T. Vechev, “Scalable race detection for An-
review,” Inf. Softw. Technol., vol. 88, pp. 67–95, 2017. droid applications,” in Proc. ACM SIGPLAN Int. Conf. Object-Oriented
[23] Y. Zhauniarovich, A. Philippov, O. Gadyatskaya, B. Crispo, and F. Mas- Program., Syst., Lang. Appl., 2015, pp. 332–348.
sacci, “Towards black box testing of Android apps,” in Proc. 10th Int. [48] D. Amalfitano, A. R. Fasolino, P. Tramontana, B. D. Ta, and A. M.
Conf. Availability, Rel. Secur., 2015, pp. 501–510. Memon, “MobiGUITAR: Automated model-based testing of mobile
[24] C.-C. Yeh and S.-K. Huang, “CovDroid: A black-box testing coverage apps,” IEEE Softw., vol. 32, no. 5, pp. 53–59, Sep./Oct. 2015.
system for Android,” in Proc. IEEE 39th Annu. Comput. Softw. Appl. [49] O.-E.-K. Aktouf, T. Zhang, J. Gao, and T. Uehara, “Testing location-
Conf., 2015, vol. 3, pp. 447–452. based function services for mobile applications,” in Proc. IEEE Symp.
[25] C. Yang, G. Yang, A. Gehani, V. Yegneswaran, D. Tariq, and G. Gu, Service-Oriented Syst. Eng., 2015, pp. 308–314.
“Using provenance patterns to vet sensitive behaviors in Android apps,” [50] M. Xia, L. Gong, Y. Lyu, Z. Qi, and X. Liu, “Effective real-time An-
in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 58–77. droid application auditing,” in Proc. IEEE Symp. Secur. Privacy, 2015,
[26] L. Malisa, K. Kostiainen, M. Och, and S. Capkun, “Mobile applica- pp. 899–914.
tion impersonation detection using dynamic user interface extraction,” [51] B. Hassanshahi, Y. Jia, R. H. Yap, P. Saxena, and Z. Liang, “Web-to-
in Proc. Eur. Symp. Res. Comput. Secur., 2016, pp. 217–237. application injection attacks on Android: Characterization and detec-
[27] K. Moran, M. Linares-Vásquez, C. Bernal-Cárdenas, C. Vendome, and tion,” in Proc. 20th Eur. Symp. Res. Comput. Secur., 2015, pp. 577–598.
D. Poshyvanyk, “Automatically discovering, reporting and reproducing [52] I. C. Morgado and A. C. Paiva, “Testing approach for mobile applications
Android application crashes,” in Proc. IEEE Int. Conf. Softw. Testing, through reverse engineering of UI patterns,” in Proc. Int. Conf. Automated
Verification Validation, 2016, pp. 33–44. Softw. Eng. Workshop, 2015, pp. 42–49.
[28] J. C. J. Keng, L. Jiang, T. K. Wee, and R. K. Balan, “Graph-aided directed [53] L. Deng, N. Mirzaei, P. Ammann, and J. Offutt, “Towards mutation
testing of Android applications for checking runtime privacy behaviours,” analysis of Android apps,” in Proc. Int. Conf. Softw. Testing, Verification
in Proc. IEEE 11th Int. Workshop Automat. Softw. Test, 2016, pp. 57–63. Validation Workshops, 2015, pp. 1–10.
[29] Y. Kang, Y. Zhou, M. Gao, Y. Sun, and M. R. Lyu, “Experience report: [54] A. R. Espada, M. del Mar Gallardo, A. Salmerón, and P. Merino,
Detecting poor-responsive UI in Android applications,” in Proc. IEEE “Runtime verification of expected energy consumption in smartphones,”
27th Int. Symp. Softw. Rel. Eng., 2016, pp. 490–501. Model Checking Softw., vol. 9232, pp. 132–149, 2015.
[30] Y. Hu and I. Neamtiu, “Fuzzy and cross-app replay for smartphone apps,” [55] P. Zhang and S. G. Elbaum, “Amplifying tests to validate exception
in Proc. IEEE 11th Int. Workshop Automat. Softw. Test, 2016, pp. 50–56. handling code: An extended study in the mobile application domain,” in
[31] H. Tang, G. Wu, J. Wei, and H. Zhong, “Generating test cases to expose Proc. Int. Conf. Softw. Eng., 2014, Art. no. 32.
concurrency bugs in Android applications,” in Proc. IEEE 31st Int. Conf. [56] R. N. Zaeem, M. R. Prasad, and S. Khurshid, “Automated genera-
Automated Softw. Eng., 2016, pp. 648–653. tion of oracles for testing user-interaction features of mobile apps,” in
[32] Y. Kang, Y. Zhou, H. Xu, and M. R. Lyu, “DiagDroid: Android perfor- Proc. IEEE 7th Int. Conf. Softw. Testing, Verification, Validation, 2014,
mance diagnosis via anatomizing asynchronous executions,” in Proc. Int. pp. 183–192.
Conf. Found. Softw. Eng., 2016, pp. 410–421. [57] C.-C. Yeh, H.-L. Lu, C.-Y. Chen, K.-K. Khor, and S.-K. Huang, “CRAX-
[33] M. Gómez, R. Rouvoy, B. Adams, and L. Seinturier, “Reproducing Droid: Automatic Android system testing by selective symbolic execu-
context-sensitive crashes of mobile apps using crowdsourced monitor- tion,” in Proc. IEEE 8th Int. Conf. Softw. Secur. Rel.-Companion, 2014,
ing,” in Proc. Int. Conf. Mobile Softw. Eng. Syst., 2016, pp. 88–99. pp. 140–148.