
Automated Testing of Android Apps: A Systematic Literature Review

Pingfan Kong, Li Li, Jun Gao, Kui Liu, Tegawendé F. Bissyandé, and Jacques Klein

Abstract—Automated testing of Android apps is essential for app users, app developers, and market maintainer communities alike. Given the widespread adoption of Android and the specificities of its development model, the literature has proposed various testing approaches for ensuring that not only functional but also nonfunctional requirements are satisfied. In this paper, we aim at providing a clear overview of the state-of-the-art works around the topic of Android app testing, in an attempt to highlight the main trends, pinpoint the main methodologies applied, and enumerate the challenges faced by the Android testing approaches as well as the directions where the community effort is still needed. To this end, we conduct a systematic literature review during which we eventually identified 103 relevant research papers published in leading conferences and journals until 2016. Our thorough examination of the relevant literature has led to several findings and highlighted the challenges that Android testing researchers should strive to address in the future. After that, we further propose a few concrete research directions where testing approaches are needed to solve recurrent issues in app updates, continuous increases of app sizes, as well as the Android ecosystem fragmentation.

Index Terms—Android, automated testing, literature review, survey.

I. INTRODUCTION

ANDROID smart devices have become pervasive after gaining tremendous popularity in recent years. As of July 2017, Google Play, the official app store, is distributing over three million Android applications (i.e., apps), covering over 30 categories ranging from entertainment and personalization apps to education and financial apps. Such popularity among developer communities can be attributed to the accessible development environment based on the familiar Java programming language as well as the availability of libraries implementing diverse functionalities [1]. The app distribution ecosystem around the official store and other alternative stores such as Anzhi and AppChina is further attractive for users to find apps and for organizations to market their apps [2].

Unfortunately, the distribution ecosystem of Android is porous to poorly tested apps [3]–[5]. Yet, as reported by Kochhar [3], error-prone apps can significantly impact user experience and lead to a downgrade of their ratings, eventually harming the reputation of app developers and their organizations [5]. It is thus becoming more and more important to ensure that Android apps are sufficiently tested before they are released on the market. However, manual testing is laborious, time-consuming, and error-prone; the ever-growing complexity and the enormous number of Android apps call for scalable, robust, and trustworthy automated testing solutions.

Fig. 1. Process of testing Android apps.

Android app testing aims at testing the functionality, usability, and compatibility of apps running on Android devices [6], [7]. Fig. 1 illustrates a typical working process. At Step (1), the target app is installed on an Android device. Then, in Step (2), the app is analyzed to generate test cases; this step is optional, as some testing techniques such as automated random testing do not need any preknowledge for generating test cases. Subsequently, in Step (3), these test cases are sent to the Android device to exercise the app. In Step (4), execution behavior is observed and collected from all sorts of perspectives. Finally, in Step (5), the app is uninstalled and relevant data is wiped. Note that installation of the target app is sometimes not a necessity; e.g., frameworks like Robolectric allow tests to run directly in a JVM.

In fact, Fig. 1 could describe the workflow of testing almost any software besides Android apps. Android app testing, however, falls in a unique context and often cannot directly reuse general testing techniques [8]–[13]. There are several differences with traditional (e.g., Java) application testing that motivate research on Android app testing. We enumerate and consider for our review a few common challenges.

First, although apps are developed in Java, traditional Java-based testing tools are not immediately usable on Android apps, since most control-flow interactions in Android are governed by specific event-based mechanisms such as intercomponent communication (ICC) [14]. To address this first challenge, several new testing tools have been specifically designed to take Android specificities into account. For example, RERAN [15] was proposed for testing Android apps through a timing- and touch-sensitive record-and-replay mechanism, in an attempt to capture, represent, and replay complicated nondiscrete gestures such as the circular bird swipe with increasing slingshot tension in Angry Birds.
Second, Android fragmentation, in terms of the diversity of available OS versions and target devices (e.g., screen size varieties), is becoming more acute, as testing strategies now have to take into account different execution contexts [16], [17].

Third, the Android ecosystem attracts a massive number of apps, requiring scalable approaches to testing. Furthermore, these apps do not generally come with open source code, which may constrain the testing scenarios.

Finally, it is challenging to generate a perfect coverage of test cases in order to find faults in Android apps. Traditional test case generation approaches based on symbolic execution and tools such as Symbolic Pathfinder are challenged by the fact that Android apps are available in Dalvik bytecode, which differs from Java bytecode. In other words, traditional Java-based symbolic execution approaches cannot be directly applied to tackle Android apps. Furthermore, the event-driven feature, as well as framework libraries, poses further obstacles to the systematic generation of test cases [18].

Given the variety of challenges in testing Android apps, it is important for this field, which has already produced a significant amount of approaches, to reflect on what has already been solved, and on what remains to tackle. To the best of our knowledge, there is no related literature review or survey summarizing the topic of Android testing. Thus, we attempt to meet this need through a comprehensive study. Concretely, we undertake a systematic literature review (SLR), carefully following the guidelines proposed by Kitchenham et al. [19] and the lessons learned from applying SLR within the software engineering domain by Brereton et al. [20]. To achieve our goal, we have searched and identified a set of relevant publications from four well-known repositories including the ACM Digital Library and from major testing-related venues such as ISSTA and ICSE. Then, we have performed a detailed overview of the current state of research in testing Android apps, focusing on the types and phases of the testing approaches applied as well as on a trend analysis in research directions. Eventually, we summarize the limitations of state-of-the-art approaches and highlight potential new research directions.

The main contributions of this paper are as follows.

1) We build a comprehensive repository tracking the research community effort to address the challenges in testing Android apps. In order to enable an easy navigation of the state-of-the-art, thus enabling and encouraging researchers to push the current frontiers in Android app testing, we make all collected and built information publicly available at http://lilicoding.github.io/TA2Repo/.

2) We analyze in detail the key aspects in testing Android apps and provide a taxonomy for clearly summarizing and categorizing all related research works.

3) Finally, we investigate the current state-of-the-art, enumerate the salient limitations, and pinpoint a few directions for furthering the research in Android testing.

The rest of the paper is organized as follows: Section II depicts the methodology of this SLR, including a general overview and the detailed reviewing processes of our approach. In Section III, we present the results of our selected primary publications, along with a preliminary trend and statistic analysis of those collected publications. Later, we introduce our data extraction strategy and the corresponding findings in the following two sections: Sections IV and V. After that, we discuss the trends we observed and the challenges the community should attempt to address in Section VI, and enumerate the threats to validity of this SLR in Section VII. A comparison of this paper with related literature studies is given in Section VIII, and finally we conclude this SLR in Section IX.

II. METHODOLOGY OF THIS SLR

We now introduce the methodology applied in this SLR. An SLR follows a well-defined strategy to systematically identify, examine, synthesize, evaluate, and compare all available literature works on a specific topic, resulting in a reliable and replicable report [19], [21], [22].

Fig. 2. Process of the SLR.

Fig. 2 illustrates the process of our SLR. At the beginning, we define relevant research questions (cf. Section II-A) to frame our investigations. The following steps are unfolded to search and consolidate the relevant literature, before extracting data for answering the research questions and finalizing the report. Concretely, to harvest all relevant publications, we identify a set of search keywords and apply them in two separate processes: first, online repository search and, second, major venues search (we rely on the China Computer Federation (CCF) ranking of computer science venues). All results are eventually merged for further reviewing (cf. Section II-B). Next, we apply some exclusion criteria on the merged list of publications, to exclude irrelevant papers (e.g., papers not written in English) or less relevant papers (e.g., short papers), in order to focus on a small, but highly relevant, set of primary publications (cf. Section II-C). Finally, we have developed various metrics and reviewed the selected primary publications against these metrics through full-paper examination. After the examination, we crosschecked the extracted results to ensure their correctness, and eventually we report on the findings to the research community (cf. Section II-D).
A. Initial Research Questions

Given the common challenges enumerated in Section I, which have motivated several research lines in Android apps, we investigate several research questions to highlight how and which challenges have been focused on in the literature. In particular, with regard to the fact that Android has programming specificities (e.g., event-based mechanisms, GUI), we categorize the test concerns targeted by the research community. With regard to the challenge of ensuring scalability, we study the test levels that are addressed in research works. With regard to the challenge of generating test cases, we investigate in detail the fundamental testing techniques leveraged. Finally, with regard to the fragmentation of the Android ecosystem, we explore the extent of validation schemes for research approaches. Overall, we note that testing Android apps is a broad activity that can target a variety of functional and nonfunctional requirements and verification issues, leverage different techniques, and focus on different granularity levels and phases. Our investigation thus starts with the following related research questions.

1) RQ1: What are the test concerns? With this research question, we survey the various objectives sought by Android app testing researchers. In general, we investigate the testing objectives at a high level to determine what requirements (e.g., security, performance, defects, and energy) the literature addresses. We look more in-depth into the specificities of Android programming, to enumerate the priorities that are tackled by the community, including which concerns (e.g., GUI and the ICC mechanism) are factored into the design of testing strategies.

2) RQ2: Which test levels are addressed? With the second research question, we investigate the levels (i.e., when the tests are relevant in the app development process) that research works target. The community could indeed benefit from knowing to what extent regression testing is (or is not) developed for apps that are now commonly known to evolve rapidly.

3) RQ3: How are the testing approaches built? In the third research question, we process detailed information on the design and implementation of test approaches. In particular, we investigate the fundamental techniques (e.g., concolic testing or mutation testing) leveraged, as well as the amount of input information (i.e., to what extent the tester should know about the app prior to testing) that approaches require to perform.

4) RQ4: To what extent are the testing approaches validated? Finally, the fourth research question investigates the metrics, datasets, and procedures in the literature for measuring the effectiveness of state-of-the-art approaches. Answers to this question may shed light on the gaps in the research agenda of Android testing.

B. Search Strategy

We now detail the search strategy that we applied to harvest literature works related to Android app testing.

Identification of search keywords: Our review focuses on two key aspects: Testing and Android. Since a diversity of terms may be used by authors to refer, broadly or precisely, to any of these aspects, we rely on the extended set of keywords identified in Table I. Our final search string is then constructed as a conjunction of these two categories of keywords (search string = cat1 & cat2), where each category is represented as a disjunction of its keywords (cat = kw1 | kw2 | kw3).

TABLE I. SEARCH KEYWORDS

Online repository search: We use the search string on online literature databases to find and collect relevant papers. We have considered four widely used repositories for our work: ACM Digital Library (http://dl.acm.org/), IEEE Xplore Digital Library (http://ieeexplore.ieee.org/Xplore/home.jsp), SpringerLink (http://link.springer.com), and ScienceDirect (http://www.sciencedirect.com). The "advanced" search functionality of the four selected online repositories is known to be inaccurate, usually resulting in a huge set of irrelevant publications that noise the final paper set [22]. Indeed, those irrelevant publications do not really match our keywords criteria; for example, they may not contain any of the keywords in the Test category. Thus, we developed scripts (combining Python and Shell) to perform offline matching verification on the papers yielded by those search engines, where the scripts follow exactly the same criteria that we used for the online repository search. For example, regarding the keywords enumerated in the Test category, if none of them is present in a publication, the scripts will mark that publication as irrelevant and subsequently exclude it from the candidate list.

Major venues search: Since we only consider a few repositories for search, the coverage can be limited, given that a few conferences such as NDSS (the Network and Distributed System Security Symposium) and SEKE (the International Conference on Software Engineering and Knowledge Engineering) do not host their proceedings in the aforementioned repositories. Thus, to mitigate the threat to validity of not including all relevant papers, we further explicitly search in the proceedings of all major venues in computer science. We have chosen the comprehensive CCF ranking of venues (http://www.ccf.org.cn/sites/ccf/paiming.jsp; we only take into account the software engineering and security categories, where, from what we have observed, the majority of papers related to testing Android apps appear) and leveraged the DBLP repository (http://dblp.uni-trier.de) to collect the document object identifiers of the publications in order to crawl abstracts and all publication metadata. Since this search process considers major journal and conference venues, the resulting set of literature papers should be a representative collection of the state-of-the-art.
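To make the offline matching step concrete, the following sketch re-implements its core check in Java (the authors' actual scripts were written in Python and Shell, and the full keyword lists appear in Table I; the keywords below are an illustrative subset): a publication is kept only if its title and abstract match at least one keyword from each category.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical re-implementation of the offline matching check:
// search string = cat1 & cat2, where each category is a disjunction of keywords.
public class KeywordMatcher {

    // Illustrative keyword subsets; the actual sets are listed in Table I.
    static final List<String> TEST_KEYWORDS =
            List.of("test", "testing", "detect", "verify", "check");
    static final List<String> ANDROID_KEYWORDS =
            List.of("android", "mobile", "smartphone", "app");

    static boolean containsAny(String text, List<String> keywords) {
        for (String kw : keywords) {
            if (text.contains(kw)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the publication matches the conjunctive search string. */
    static boolean isCandidate(String titleAndAbstract) {
        String text = titleAndAbstract.toLowerCase(Locale.ROOT);
        return containsAny(text, TEST_KEYWORDS) && containsAny(text, ANDROID_KEYWORDS);
    }

    public static void main(String[] args) {
        System.out.println(isCandidate("Automated Testing of Android Apps"));  // true
        System.out.println(isCandidate("A Survey of Cloud Storage Systems"));  // false
    }
}
```

A filter of this shape is what reduces the raw search hits to the candidate list that is then reviewed manually, as described next.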
C. Exclusion Criteria

After executing our search based on the provided keywords, a preliminary manual scan showed that the results were rather coarse-grained, since they included a number of irrelevant or less relevant publications that nonetheless matched the keywords (the keywords were found, for example, to be mentioned only in the related-work sections of the identified papers). It is thus necessary to perform a fine-grained inclusion/exclusion in order to focus on a consistent and reliable set of primary publications and reduce the eventual effort in further in-depth examination. For this SLR, we have applied the following exclusion criteria.

1) Papers that are not written in English are filtered out, since English is the common language spoken in the worldwide scientific peer-reviewing community.

2) Short papers are excluded, mainly because such papers are often work-in-progress or idea papers: on the one hand, short papers are generally not mature, and, on the other hand, many of them will eventually appear later in a full-paper format. In the latter case, the mature works are likely to already be included in our final set. In this paper, we take a given publication as a short paper when it has four pages or fewer in IEEE/ACM-like double-column format, or eight pages or fewer in LNCS-like single-column format, as short papers are likely to be four pages in double-column format and eight pages in single-column format. (Note that we have kept one short paper entitled "GuiDiff: a regression testing tool for graphical user interface," because it is very relevant to our study and does not have an extended version released in the following years.)

3) Papers that are irrelevant to testing Android apps are excluded. Our search keywords indeed included broad terms such as mobile and smartphone, as we aimed at finding all papers related to Android even when the term "Android" was not specifically included in the title and abstract. By doing so, we have excluded papers that only deal with mobile apps for other platforms such as iOS and Windows.

4) Duplicated papers are removed. It is quite common for authors to publish an extended version of their conference paper at a journal venue. However, these papers share most of their ideas and approach steps; considering both of them would result in a biased weighting of the metrics in the review. To mitigate this, we identify duplicate papers by first comparing paper titles, abstracts, and authors, and then manually checking further when a given pair of records shares a major part of their contents. We filter out the least recent publication when duplication is confirmed.

5) Papers that conduct comparative evaluations, including surveys on different approaches of testing Android apps, are excluded. Such papers indeed do not introduce new technical contributions for testing Android apps.

6) Papers in which the testing approach targets the operating system, networks, or hardware, rather than mobile apps, are excluded.

7) Papers that assess existing testing methods are also filtered out (for example, [23] and [24] proposed tools and algorithms for measuring the code coverage of testing methods). The publications that they discuss are supposed to be already included in our search results.

8) Papers demonstrating how to set up environments and platforms to retrieve runtime data from Android apps are excluded. These papers are also important for Android app testing, but they do not focus on new testing methodology.

9) Finally, some of our keywords (e.g., "detection" of issues, "testing" of apps) have led to the retrieval of irrelevant literature works that must be excluded. We have mainly identified two types of such papers: the first includes papers that perform detection of malicious apps using machine learning (and not testing); the second includes papers that describe the building of complex platforms, adopting existing mature testing methodologies.

We refer to all collected papers that remain after the application of the exclusion criteria as primary publications. These publications are the basis for extracting review data.
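The mechanical part of these criteria can be expressed as a simple filter. The following is a minimal, hypothetical sketch of criteria 1 and 2 plus topical relevance; judgment-based criteria such as duplicate detection or distinguishing "new methodology" papers required manual review and are not captured here.

```java
// Hypothetical predicate distilling the mechanical exclusion criteria
// (language, short-paper length, platform relevance).
public class ExclusionFilter {

    static final class Paper {
        final String language;       // e.g., "en"
        final int pages;
        final boolean doubleColumn;  // IEEE/ACM-like format
        final boolean targetsAndroidApps;

        Paper(String language, int pages, boolean doubleColumn, boolean targetsAndroidApps) {
            this.language = language;
            this.pages = pages;
            this.doubleColumn = doubleColumn;
            this.targetsAndroidApps = targetsAndroidApps;
        }
    }

    /** Criterion 1: non-English; criterion 2: short paper; criterion 3: off-topic. */
    static boolean isExcluded(Paper p) {
        if (!"en".equals(p.language)) return true;
        int shortPaperLimit = p.doubleColumn ? 4 : 8;  // <=4 pp double, <=8 pp single column
        if (p.pages <= shortPaperLimit) return true;
        return !p.targetsAndroidApps;
    }

    public static void main(String[] args) {
        System.out.println(isExcluded(new Paper("en", 12, true, true)));  // false: kept
        System.out.println(isExcluded(new Paper("en", 4, true, true)));   // true: short paper
    }
}
```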

D. Review Protocol

Concretely, the review is conducted in two phases: First, we perform an abstract review and a quick full-paper scan to filter out irrelevant papers based on the exclusion criteria defined above. At the end of this phase, the set of primary publications is known. Subsequently, we perform a full review of each primary publication and extract the relevant information that is necessary for answering all of our research questions.

In practice, we have split our primary publications among all the coauthors to conduct the data extraction step. We have further crosschecked all the extracted results: when some results are in disagreement, informal discussions are conducted until a consensus is reached.

III. PRIMARY PUBLICATIONS SELECTION

TABLE II. SUMMARY OF THE SELECTION OF PRIMARY PUBLICATIONS

Table II summarizes the statistics of collected papers during the search phase. Overall, our repository search and major venues search yielded in total 9259 papers. Following the exclusion criteria in Section II, the set of papers satisfying the matching requirements immediately drops from 9259 to 472. We then manually went through the title and abstract of each paper to further dismiss those matching the exclusion criteria. After this step, the set of papers was reduced to 255 publications. Subsequently, we went through the full content of the papers in the set, leading to the exclusion of 84 more papers. Finally, after discussion among the authors on the rest of the set, we reached a consensus on considering 103 publications as relevant primary publications. Table A1 (in the Appendix) enumerates the details of those 103 publications.

It is noteworthy that around 4% of the final primary publications were found exclusively by the major venues search, meaning that they cannot be found in well-known online repositories such as IEEE and ACM. This result, along with our previous experience [22], suggests that repository search is necessary but not sufficient for harvesting review publications. Other steps (e.g., top venues search based on Google Scholar impact factor [22] or the CCF ranking) should be taken in complement to ensure reliable coverage of state-of-the-art papers.

Fig. 3. Word cloud based on the venue names of selected primary publications.

Fig. 3 presents a word cloud based on the venue names of the selected primary publications. The more papers selected from a venue, the bigger its name appears in the word cloud. Not surprisingly, the recurrently targeted venues are mainly testing-related conferences such as ISSTA, ICST, ISSRE, etc.

Fig. 4. Number of publications in each year.

Fig. 4 illustrates the trend in the number of publications across the years we have considered. From this figure, we can observe that the number of papers tackling the problem of testing Android apps increased gradually to reach a peak in 2014. Afterwards, the pace of developing new testing techniques has stabilized.

Fig. 5. Distribution of examined publications through published venue types and domains. (a) Venue types. (b) Venue domains.

We further look into the selected primary publications through their published venue types and domains. Fig. 5(a) and (b) illustrates these statistics, respectively. Over 90% of examined papers are published in conferences and workshops (which are usually co-located with top conferences), while only 10% of papers are published in journals. These findings are in line with the current situation where intense competition in Android research forces researchers to make their works available as fast as possible. We further find that over 80% of examined papers are published in software engineering and programming language venues, showing that testing Android apps is mainly a concern of the software engineering community. Nevertheless, as shown by several papers published in the proceedings of security venues, testing is also a valuable approach to address security issues in Android apps.

IV. TAXONOMY OF ANDROID TESTING RESEARCH

To extract relevant information from the literature, our SLR must focus on specific characteristics eventually described in each publication. To facilitate this process in a field that explores a large variety of approaches, we propose to build a taxonomy of Android testing. Such a taxonomy eventually helps to gain insights into the state-of-the-art by answering the research questions proposed in Section II-A.

By searching for answers to the aforementioned research questions in each publication, we are able to make a systematic assessment of the literature with a schema for classifying and comparing different approaches. Fig. 6 presents a high-level view of the taxonomy diagram spreading in four dimensions (i.e., Test Objectives, Test Targets, Test Levels, and Test Techniques) associated with the first three research questions: Test Objectives and Test Targets relate to RQ1 (test concerns), Test Levels to RQ2 (test levels), and Test Techniques to RQ3 (test approaches); RQ4 explores the validation of testing approaches, which is not summarized in the taxonomy.
Fig. 6. Taxonomy of Android app testing.

Test objectives: This dimension summarizes the targeted objectives of our examined testing-related publications. We have enumerated overall six recurring testing objectives, such as bug/defect detection.

Test targets: This dimension summarizes the representative targets on which testing approaches focus. In particular, for testing Android apps, the GUI/Event and ICC/interapplication communication (IAC) mechanisms are recurrently targeted. For simplicity, we regroup all the other targets, such as normal code analysis, into General.

Test levels: This dimension checks the different levels (also known as phases) at which the test activities are performed. Indeed, it is common knowledge that software testing is very important and has to be applied at many levels, such as unit testing, integration testing, etc. Android apps, as a specific type of software, also need to go through a thorough testing process before being released to public markets. In this dimension, we sum up the targeted testing phases/levels of the examined approaches, to understand what the state-of-the-art has focused on so far.

Test techniques: Finally, the fourth dimension focuses on the fundamental methodologies (e.g., fuzzing or mutation) that are followed to perform the tests, as well as the testing environments (e.g., on emulated hardware) and testing types (e.g., black-box testing).

V. LITERATURE REVIEW

We now report on the findings of this SLR in light of the research questions that we raised in Section II-A.

A. What Concerns do the Approaches Focus on?

Our review investigates both the objectives that testing approaches seek to achieve and the app elements that are targeted by the test cases. Test objectives focus on problems that can be located anywhere in the code, while test targets focus on specific app elements that normally involve only certain types of code (e.g., functionality).

1) Test Objectives: Android testing research has tackled various objectives, including the assessment of apps against nonfunctional properties, such as app efficiency in terms of energy consumption, and functional requirements, such as the presence of bugs. We discuss in this section some recurrent test objectives from the literature.

Concurrency: Android apps expose a concurrency model that combines multithreading and asynchronous event-based dispatch, which may lead to subtle concurrency errors because of unforeseen thread interleaving coupled with nondeterministic reordering of asynchronous tasks. These error-prone features are however useful and increasingly becoming common in the development of efficient and feature-rich apps. To mitigate concurrency issues, several works have been proposed, notably for detecting races such as data races, event-based races, etc. in Android apps. As an example, Maiya et al. [62] have built DroidRacer, which identifies data races (i.e., read and write operations that happen in parallel) by computing the happens-before relation on execution traces that are generated systematically by running test scenarios against Android apps. Bielik et al. [47] later proposed a novel algorithm for scaling the inference of happens-before relations. Hu et al. [9] presented a work for verifying and reproducing event-based races, where they found that both imprecise Android component modeling and implicit happens-before relations could result in false positives when detecting potential races.
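The following plain-Java sketch illustrates the kind of event-based race discussed above. A single-threaded executor stands in for Android's main-thread event loop, and a worker thread posts a "result" event whose ordering with respect to a "destroy" event is not fixed by any happens-before relation; depending on the interleaving, the handler may dereference already-cleared state. This is an illustrative analogy, not code from any of the surveyed tools.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class EventRaceDemo {
    private static String boundData = "attached"; // state wiped by the "destroy" event

    public static void main(String[] args) throws InterruptedException {
        // Single-threaded executor standing in for Android's main-thread event loop.
        ExecutorService mainLoop = Executors.newSingleThreadExecutor();

        // Worker thread: finishes at an arbitrary time, then posts its result event.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep((long) (Math.random() * 20));
            } catch (InterruptedException ignored) {
            }
            // May run before or after the destroy event below: a NullPointerException
            // is raised only under one of the two orderings.
            mainLoop.execute(() -> System.out.println("result: " + boundData.length()));
        });
        worker.start();

        Thread.sleep(10);
        mainLoop.execute(() -> boundData = null); // "destroy" event clears the state

        worker.join(); // ensure the result event has been posted before shutting down
        mainLoop.shutdown();
        mainLoop.awaitTermination(1, TimeUnit.SECONDS);
    }
}
```

Running the program repeatedly sometimes prints the result and sometimes crashes, which is exactly the nondeterminism that happens-before inference tools such as DroidRacer aim to expose systematically.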

Security: As shown by Li et al. [22], the Android research community is extensively working on providing tools and approaches for solving various security problems for Android apps. Some of these works involve app testing, e.g., to observe defective behavior [57] and malicious behavior [79] and to track data leaks [75]. For example, Yan et al. [78] have built a novel and comprehensive approach for the detection of resource leaks using test criteria based on neutral cycles: sequences of GUI events that should have a "neutral" effect and should not increase the usage of resources. Hay et al. [45] dynamically detected interapplication communication vulnerabilities in Android apps.

Performance: Android apps are sensitive to performance issues. When a program thread becomes expensive, the system may stop app execution after warning on the user interface that the "Application [is] Not Responding." The literature includes several contributions highlighting issues related to the performance of Android apps, such as poor responsiveness [29] and exception handling [55]. Yang et al. [74], for example, proposed a systematic testing approach to uncover and quantify common causes of poor responsiveness of Android apps. Concretely, they explicitly extend the delay of typical problematic operations, using a test amplification approach, to demonstrate the effects of expensive actions that can be observed by users.

Energy: One of the biggest differences between traditional PCs and portable devices is the fact that portable devices may run on battery power, which can get depleted during app usage. A number of research works have investigated energy consumption hotspots arising from software design defects or unwanted service execution (e.g., advertisements), or have leveraged energy fingerprints to detect mobile malware. As an example, Wan et al. [42] presented a technique for detecting display energy hotspots to guide developers in improving the energy efficiency of their apps. Since each activity performed on a battery-powered device drains a certain amount of energy from it, if the normal energy consumption is known for a device, any additionally used energy should be flagged as abnormal.

Compatibility: Android apps often suffer from compatibility issues, where a given app can run successfully on a device, characterized by a range of OS versions, while failing on others [85]. This is mainly due to the fragmentation of the Android ecosystem brought by its open source nature. Every vendor, theoretically, can have its own customized system (e.g., for supporting specific low-level hardware), and the screen sizes of its released devices can vary as well. To address compatibility problems, there is a need to devise scalable and efficient approaches for performing compatibility testing before releasing an app to markets. Indeed, as pointed out by Vilkomir et al. [59], it is expensive and time-consuming to consider testing all device variations. The authors thus proposed to address the issue with a combinatorial approach, which attempts to select an optimal set of mobile devices for practical testing. Zhang et al. [41] leveraged a statistical approach to optimize the compatibility testing strategy, where the test sequence is generated by a K-means statistical algorithm.

Bug/Defect: Like most software, Android apps are often buggy, usually leading to runtime crashes. Due to the high competition among apps in the Android ecosystem, defect identification is critical, since defects can be detrimental to user rating and adoption [86]. Indeed, researchers in this field leverage various testing techniques such as fuzzing testing, mutation testing, and search-based testing to dynamically explore Android apps to pinpoint defective behavior [57], GUI bugs [84], intent defects [72], crashing faults [11], etc. (Terminologically, the aforementioned objectives could also be categorized as bug/defect problems (e.g., concurrency issues). To make the summarization more meaningful, we only flag publications as bug/defect when their main focus is bug/defect problems, e.g., when they address the gap between an app's misbehavior and the developer's original design.)

TABLE III. TEST OBJECTIVES IN THE LITERATURE

Table III characterizes the publications selected for our SLR in terms of the objectives discussed above. Through our in-depth examination, the most considered testing objective is bug/defect, accounting for 23.3% of the selected publications.

2) Test Targets: Test approaches in software development generally target core functionality code. Since Android apps are written in Java, the literature on Android app testing has focused on Android specificities, mainly on how to address GUI testing with a complex event mechanism, as well as intercomponent and interapplication communications.

GUI/Event: Android implements an event-driven graphical user interface system, making Android app testing challenging, since apps intensively interact with user inputs, introducing uncertainty and nondeterminism. It is generally complicated to model the UI/system events, because doing so not only needs knowledge of the set of GUI widgets and their supported actions (e.g., click for buttons) but also requires knowledge of system events (e.g., receiving a phone call), which however are usually unknown in advance. Consequently, it is generally difficult to assemble a valid set of input event sequences for a given Android app with respect to coverage, precision, and compactness test criteria [87]. The Android testing community has proposed many approaches to address this challenge. For example, Android-GUITAR, an extension of the GUITAR tool [88], was proposed to model the structure and execution behavior of Android GUIs through a formalism called GUI forests and event-flow graphs. Dynodroid [89] applies a dynamic approach to generate inputs by instrumenting the Android framework to record the reaction to events.

ICC/IAC: ICC and IAC (IAC is actually ICC where the communicating components are from different apps) enable a loose coupling among components [90], [91], thus reducing the complexity of developing Android apps by providing a generic means to reuse existing functionality (e.g., obtaining the contact list). Unfortunately, ICC/IAC also come with a number of security issues, among which the potential for component hijacking, broadcast injection, etc. [92]. Researchers have thus investigated various testing approaches to highlight such issues in Android apps. IntentDroid [45], for instance, performs comprehensive IAC security testing for inferring Android IAC integrity vulnerabilities. It utilizes lightweight platform-level instrumentation, implemented through debug breakpoints, to recover IAC-relevant app-level behavior. IntentFuzzer [58], on the other hand, leverages fuzz testing techniques to detect capability leaks (e.g., permission escalation attacks) in Android apps.

General: For all other publications that do not address the above two popular targets, the category General applies. Publications with targets like normal code analysis are grouped into this category.
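As an illustration of this kind of ICC testing, the sketch below throws semi-random Intents at a chosen component, in the spirit of IntentFuzzer. The component and extra names are hypothetical; a real harness would enumerate exported components from the app's manifest and monitor the system log for uncaught exceptions.

```java
import android.content.ActivityNotFoundException;
import android.content.ComponentName;
import android.content.Context;
import android.content.Intent;
import java.util.Random;

// A minimal Intent-fuzzing sketch (assumed names; not the IntentFuzzer code itself).
public class IntentFuzzSketch {
    private static final Random RANDOM = new Random();

    public static void fuzzComponent(Context context, String pkg, String activityClass) {
        for (int i = 0; i < 100; i++) {
            Intent intent = new Intent();
            intent.setComponent(new ComponentName(pkg, activityClass));
            intent.setAction("action-" + RANDOM.nextInt(5));   // unexpected action string
            intent.putExtra("id", RANDOM.nextInt());           // random primitive extra
            if (RANDOM.nextBoolean()) {
                intent.putExtra("payload", (String) null);     // missing/null extra
            }
            intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK);
            try {
                context.startActivity(intent);                 // the receiver may crash
            } catch (ActivityNotFoundException | SecurityException e) {
                // Component not found or not exported: this target is not attackable.
            }
        }
    }
}
```

Crashes triggered by such malformed Intents indicate components that fail to validate their inputs, which is precisely the class of ICC defect the surveyed approaches report.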

TABLE IV. TEST TARGETS IN THE LITERATURE

Table IV characterizes the test targets discussed above. The most frequently addressed testing target is GUI/Event, accounting for 45.6% of the selected publications. Meanwhile, only 12 publications target ICC/IAC, and a total of 44 publications are regrouped under the General category.

Insights from RQ1—on Targets and Objectives

– "Bug/defect" has been the most trending concern in the Android research community. "Compatibility" testing, which is necessary for detecting issues that plague the fragmented Android ecosystem, remains understudied. Similarly, we note that because mobile devices are quickly getting more powerful, developers build increasingly complex apps with services exploiting hardware multicore capabilities. Therefore, the community should invest more effort in approaches for concurrency testing.

– Our review has also confirmed that the GUI is of paramount importance in modern software development for guaranteeing a good user experience. In Android apps, GUI actions and reactions are intertwined with the app logic, increasing the challenges of analyzing app code for defects. For example, modeling GUI behavior while taking into account potential runtime interruptions by system events (e.g., an incoming phone call) is necessary, yet not trivial. These challenges have created opportunities in Android research: as our literature review shows, most test approaches target the GUI or the event mechanism. The community now needs to focus on transforming the approaches into scalable tools that will perform deeper security analyses and accurate defect identification in order to improve the overall quality of apps distributed in markets.

B. Which Test Levels are Addressed?

Development of Android apps involves the classical steps of traditional software development. Therefore, there are opportunities in various phases to perform tests with specific emphasis and purpose. The software testing community commonly acknowledges four levels of software testing [127], [128]. Our literature review has identified that Android researchers have proposed approaches considering unit/regression testing, integration testing, and system testing. Acceptance testing, which involves end-users evaluating whether the app complies with their needs and requirements, still faces a lack of research effort in the literature.

TABLE IV. TEST TARGETS IN THE LITERATURE — TABLE V. RECURRENT TESTING PHASES

Unit testing is usually applied at the beginning of the development of Android apps; unit tests are usually written by developers and can be taken as a type of white-box testing. Unit testing intends to ensure that every functionality, which could be represented as a function or a component, works properly (i.e., in accordance with the test cases). The main goal of unit testing is to verify that the implementation works as intended. Regression testing consists in re-executing previously executed test cases to ensure that subsequent updates of the app code have not impacted the original program behavior, allowing issues (if present) to be resolved as quickly as possible. Usually, regression testing is based on unit testing: it re-executes all the unit test cases every time a piece of code is changed. As an example, Hu et al. [84] applied unit testing to automatically explore GUI bugs, where JUnit, a unit testing framework, is leveraged to automate the generation of unit test cases.

Integration testing: Integration testing combines all units within an app (iteratively) to test them as a group. The purpose of this phase is to infer interface defects among units or functions, and to determine how effectively the units interact. For example, Yang et al. [58] proposed a tool called IntentFuzzer to test the capability problems involved in ICC.

System testing: System testing is the first phase in which the app is tested as a whole. The goal of this phase is to assess whether the outlined requirements and quality standards have been fulfilled. Usually, system testing is done in a black-box style and conducted by independent testers who have no knowledge of the apps to be tested. As an example, Mao et al. [11] proposed a testing tool named Sapienz that combines several approaches, including fuzzing testing and search-based testing, to systematically explore faults in Android apps.

Table V summarizes the aforementioned test phases, where the most recurrently applied testing phase is system testing (accounting for nearly 80% of the selected publications), followed by unit testing and integration testing, respectively.
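A minimal JUnit 4 example of the unit/regression level described above is sketched below. The RatingFormatter unit and its expected outputs are invented for illustration; re-running such a suite after every code change is what makes it serve regression testing.

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class RatingFormatterTest {

    // Unit under test: formats a star rating for display in the app's UI (hypothetical).
    static class RatingFormatter {
        static String format(double stars) {
            if (stars < 0 || stars > 5) {
                throw new IllegalArgumentException("stars out of range: " + stars);
            }
            return String.format(java.util.Locale.ROOT, "%.1f / 5", stars);
        }
    }

    @Test
    public void formatsWithOneDecimal() {
        assertEquals("4.5 / 5", RatingFormatter.format(4.5));
    }

    @Test(expected = IllegalArgumentException.class)
    public void rejectsOutOfRangeRating() {
        RatingFormatter.format(6.0);
    }
}
```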

Insights from RQ2—on Test Levels

– The large majority of approaches reviewed in this SLR are about testing the whole app against given test criteria. This correlates with the test methodologies detailed below. Unit and regression testing, which would help developers assess individual functionalities in a white-box testing scenario, are limited to a few approaches.

C. How are the Test Approaches Built?

Our review further investigates the approaches in depth to characterize the methodologies they leverage, the types of tests that are implemented, as well as the tool support they have exploited. In this paper, we refer to test technique as a broad concept describing all the technical aspects related to testing, while we constrain the term test methodology to specifically describe the concrete methodology that a test approach applies.

1) Test Methodologies:

TABLE VI. TEST METHOD EMPLOYED IN THE LITERATURE

Table VI enumerates all the testing methodologies we observed in our examination.

Model-based testing is a testing methodology that goes one step further than traditional methodologies by automatically generating test cases based on a model that describes the functionality of the system under test. Although such a methodology incurs a substantial, usually manual, effort to design and build the model, the eventual test approach is often extensive, since test cases can be automatically generated and executed. Our review has revealed that model-based testing is the most common methodology used in the Android testing literature: 63% of publications involve some model-based testing steps. Takala et al. [123] presented comprehensive documentation of their experiences in applying model-based GUI testing to Android apps. They typically discuss how model-based testing and test automation are implemented, how apps are modeled, as well as how tests are designed and executed.
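The following toy sketch illustrates the model-based idea: screens and events form an event-flow model, from which test cases (event sequences) are enumerated up to a bounded length. The model here is hand-written and hypothetical; the surveyed approaches derive such models automatically, e.g., by GUI ripping or static analysis.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class ModelBasedGenerator {

    // state -> (event -> successor state); an invented event-flow model.
    static final Map<String, Map<String, String>> MODEL = Map.of(
            "Login",    Map.of("clickSignIn", "Home", "clickHelp", "Help"),
            "Home",     Map.of("clickSettings", "Settings", "back", "Login"),
            "Settings", Map.of("back", "Home"),
            "Help",     Map.of("back", "Login"));

    /** Enumerates all event sequences of the given length starting from `state`. */
    static List<List<String>> generate(String state, int length) {
        List<List<String>> tests = new ArrayList<>();
        walk(state, length, new ArrayDeque<>(), tests);
        return tests;
    }

    private static void walk(String state, int remaining,
                             Deque<String> path, List<List<String>> out) {
        if (remaining == 0) {
            out.add(new ArrayList<>(path)); // one complete test case
            return;
        }
        for (Map.Entry<String, String> edge : MODEL.getOrDefault(state, Map.of()).entrySet()) {
            path.addLast(edge.getKey());
            walk(edge.getValue(), remaining - 1, path, out);
            path.removeLast();
        }
    }

    public static void main(String[] args) {
        generate("Login", 3).forEach(System.out::println); // each line is one event sequence
    }
}
```

The manual effort the methodology incurs lies in obtaining a faithful MODEL; once it exists, generation and execution of the sequences can be fully automated.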

Search-based testing uses metaheuristic search techniques to generate software tests [129], with the aim of detecting as many bugs as possible, especially the most critical ones, in the system under test. In [105], Mahmood et al. developed an evolutionary testing framework for Android apps. Evolutionary testing is a form of search-based testing, where an individual corresponds to a test case, and a population comprising many individuals is evolved according to certain heuristics to maximize code coverage. Their technique thus tackles the common shortcoming of using evolutionary techniques for system testing. In order to generate test suites in an effective and efficient way, Amalfitano et al. [102] proposed a novel search-based testing technique based on the combination of genetic and hill-climbing techniques.

Random testing is a software testing technique where programs are tested by generating random, independent inputs. Results of the output are compared against software specifications to verify whether the test output is a pass or a fail [130]. In the absence of specifications, program exceptions are used to detect test case failures. Random testing is also employed by almost all other test suite generation methodologies and serves as a fundamental technique. Random testing has been used in several literature works [89], [97], [103], [108], [126].

Fuzzing testing is a testing technique that applies invalid, unexpected, or random data as inputs to a testing object. It is commonly used to test for security problems in software or computer systems. The main focus then shifts to monitoring the program for exceptions such as crashes, failing built-in code assertions, or potential memory leaks. A number of research papers (e.g., [23], [84]) have explored this type of testing via automated or semiautomated fuzzing. Fuzzing testing is slightly different from random testing, as it mainly embraces, usually on purpose, unexpected and invalid inputs and focuses on monitoring crashes/exceptions of the tested apps, while random testing does not need to conform to any such software specifications.

A/B testing provides a means for comparing two variants of a testing object, and hence determining which of the two variants is more effective. A/B testing is recurrently used for statistical hypothesis testing. In [112], Adinata et al. applied A/B testing to test mobile apps, where they solved three challenges of applying A/B testing, including element composition, variant delivery, and internet connection. Holzmann et al. [109] conducted A/B testing through a multivariate testing tool.

Concolic testing is a hybrid software verification technique that performs symbolic execution, a classical technique that treats program variables as symbolic variables, along a concrete execution path (testing on particular inputs). Anand et al. [122] proposed a concolic testing technique, CONTEST, to alleviate the path explosion problem. They develop a concolic-testing algorithm to generate sequences of events. Checking the subsumption condition between event sequences allows the algorithm to trim redundant event sequences, thereby alleviating path explosion.

Mutation testing is used to evaluate the quality of existing software tests. It is performed by selecting a set of mutation operators and then applying them to the source program, one operator at a time, for each relevant program location. The result of applying one mutation operator to the program is called a mutant. If the test suite is able to detect the change (i.e., one of the tests fails), then the mutant is said to be killed. In order to realize end-to-end system testing of Android apps in a systematic manner, Mahmood et al. [105] proposed EvoDroid, an evolutionary approach for system testing of apps, in which two types of mutation (namely, input genes and event genes) are leveraged to identify a set of test cases that maximize code coverage. Mutation testing-based approaches are, however, not common in the Android literature.
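A self-contained sketch of the mutation-testing idea is given below: a relational-operator mutant (>= replaced by >) survives unless some test input exercises the boundary that distinguishes it from the original. The discount-eligibility logic is invented purely for illustration.

```java
public class MutationDemo {

    static boolean eligibleOriginal(int purchases) { return purchases >= 10; }

    // Mutant produced by the operator ">=" -> ">" at the same program location.
    static boolean eligibleMutant(int purchases) { return purchases > 10; }

    public static void main(String[] args) {
        int[] testInputs = {0, 10, 25}; // the test suite's input values
        boolean killed = false;
        for (int input : testInputs) {
            if (eligibleOriginal(input) != eligibleMutant(input)) {
                System.out.println("mutant killed by input " + input); // input 10
                killed = true;
            }
        }
        if (!killed) {
            System.out.println("mutant survived: the suite misses the boundary case");
        }
    }
}
```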
Overall, our review has shown that the literature often combines several methodologies to improve test effectiveness. In [108], Chen and Xu combined model-based testing with random testing. Finally, EvoDroid [105] is a framework that explores model-based, search-based, and mutation testing techniques.

2) Test Types: In general, there are three types of testing, namely white-box testing, black-box testing, and grey-box testing.

TABLE VII. COMMON TEST TYPES

Table VII summarizes these testing types, emphasizing the ideal tester (the software developer or a third party) and whether knowledge of implementation details is fully/partially/not required.

White-box testing is a scenario in which the software is examined based on knowledge of its implementation details. It is usually applied by the software developers in early development stages when performing unit testing. Another common usage scenario is to perform thorough tests once all software components are assembled (known as regression testing). In this SLR, when an approach requires app source (or byte) code knowledge, whether obtained directly or via reverse engineering, we consider it a white-box approach.

Black-box testing, on the other hand, is a scenario where the internal design/implementation of the tested object is not required. Black-box testing is often conducted by third-party testers, who have no relationship with the developers of the tested objects. If an Android app testing process only requires the installation of the targeted app, we reasonably put it under this category.

Grey-box testing is a tradeoff between white-box and black-box testing. It does not require the testers to have the full knowledge of the source code that white-box testing needs. Instead, it only needs the testers to know some limited specifications, such as how the system components interact. For the investigations of our SLR, if a testing approach requires extracting some knowledge (e.g., from the Android manifest configuration) to guide its tests, we consider it a grey-box testing approach.

Fig. 7. Breakdown of examined publications regarding their applied testing types.

Fig. 7 illustrates the distribution of test types applied by the examined testing approaches. White-box testing is the least used type, far behind black-box and grey-box testing. This is expected, because Android apps are usually compiled and distributed in APK format, so testers in most scenarios have no access to source code. We also note that one publication can make use of more than one testing type; this is why the sum of the three types in Fig. 7 is larger than 103.

3) Test Environments: Unlike static analysis of Android apps [22], testing requires actually running apps in an execution environment such as a real device or an emulator.

Real devices have a number of advantages: they can be used to test apps w.r.t. compatibility aspects [41], [59], [63], energy consumption [42], [68], [71], and poor responsiveness issues [29], [74]. Unfortunately, using real devices is not efficient, since they cannot scale in terms of execution time and resources (several devices may be required).

Emulators, on the contrary, can be scalable. When deployed on the cloud, an emulator can grant a tester great computing resources and carry out parallel tests at a very large scale [79]. Unfortunately, emulators are ineffective for security-relevant tests, since some malware have the functionality to detect whether they are running on an emulator; if so, they may decide to refrain from exposing their malicious intention [131]. Emulators also introduce huge overhead when mimicking real-life sensor inputs, e.g., requiring alteration of the apps under testing at the source code level [101].

Emulator + real device setups can be leveraged together to test Android apps. For example, one can first use an emulator to launch large-scale app testing for preselecting a subset of relevant apps, and then resort to real devices for more accurate testing.

Fig. 8. Venn diagram of testing environment.

As can be seen from Fig. 8, real devices are largely used, by 68 publications in our final list. Only 38 publications used emulators, despite the fact that they are cheap. A total of 15 publications chose both environments to avoid the disadvantages of either. Subtracting these 15 publications, we can calculate that 23 publications focused solely on emulators, while 53 publications selected real devices as their only environment.
Real device has a number of advantages: they can be used for Android apps because of their expressiveness on smartphone
to test apps w.r.t compatibility aspects [41], [59], [63], energy features, RERAN supports sophisticated GUI gestures and com-
consumption [42], [68], [71], and the poor responsiveness is- plex sensor events. Moreover, RERAN achieves accurate timing
sue [29], [74]. Unfortunately, using real devices is not efficient, requirements among various input events. A3 E [115], for exam-
since it cannot scale in terms of execution time and resources ple, uses RERAN to record its targeted and depth-first explo-
(several devices may be required). ration for systematic testing of Android apps. Those recorded
Emulator, on the contrary, can be scalable. When deployed explorations can later be replayed so that to benefit debuggers
on the cloud, using the emulator can grant a tester great com- in quickly localizing the exact event stream that has led to the
puting resources and carry out parallel tests at a very large crash.
scale [79]. Unfortunately, emulators are ineffective for security- Robotium is an open-source test framework, which has full
relevant tests, since some malware have the functionality to support for native and hybrid apps. It also eases the way to
detect whether they are running on an emulator. If so, they may write powerful and robust automatic black-box UI tests of An-
decide to refrain from exposing their malicious intention [131]. droid apps. SIG-Droid [99], for example, leverages Robotium
Emulators also introduce huge overhead when mimicking real- to execute its generated test cases (with the help of symbolic
life sensor inputs, e.g., requiring altering the apps under testing execution). We have found during our review that Robotium
at source code level [101]. were most frequently leveraged by state-of-the-art testing ap-
Emulator + real device can be leveraged together to test An- proaches.
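As an illustration of the style of tests Robotium supports, here is a minimal, hypothetical Java test case (LoginActivity, the widget indices, and the displayed strings are assumptions, not taken from SIG-Droid or any other surveyed work):

```java
import android.test.ActivityInstrumentationTestCase2;
import com.robotium.solo.Solo;

// Hypothetical black-box UI test: Solo drives the GUI through the
// instrumentation framework without needing access to source code.
public class LoginActivityTest
        extends ActivityInstrumentationTestCase2<LoginActivity> {
    private Solo solo;

    public LoginActivityTest() {
        super(LoginActivity.class);
    }

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        solo = new Solo(getInstrumentation(), getActivity());
    }

    public void testSuccessfulLogin() {
        solo.enterText(0, "alice");     // first EditText on the screen
        solo.enterText(1, "secret");    // second EditText
        solo.clickOnButton("Sign in");
        // Oracle: the welcome text must appear within the default timeout.
        assertTrue(solo.waitForText("Welcome, alice"));
    }

    @Override
    protected void tearDown() throws Exception {
        solo.finishOpenedActivities();
        super.tearDown();
    }
}
```

Because Solo locates widgets at runtime, such tests need no access to the app's internals, which is consistent with Robotium's black-box positioning.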
Robolectric is a unit testing framework, which simulates the Android execution environment (either on a real device or on an emulator) in a pure Java environment. The main advantage of doing so is improved testing efficiency, because tests running inside a JVM are much faster than tests running on an Android device (or even an emulator), where it usually takes minutes to build, deploy, and launch an app. Sadeh et al. [124] have effectively used the Robolectric framework to conduct unit testing for their calculator application. They have found that it is rather easy to write test cases with this framework, which
requires only a few extra steps and abstractions. Because testers do not need to maintain a set of fake objects and interfaces, it is even preferable for complex apps.
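The following is a minimal sketch of such a JVM-hosted unit test, in the spirit of the calculator example; CalculatorActivity, its R.id identifiers, and the use of the Robolectric 3.x setupActivity API are assumptions made for illustration.

```java
import static org.junit.Assert.assertEquals;

import android.widget.TextView;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.robolectric.Robolectric;
import org.robolectric.RobolectricTestRunner;

// Hypothetical Robolectric test: the Activity is created and exercised
// inside a plain JVM, with no device or emulator involved.
@RunWith(RobolectricTestRunner.class)
public class CalculatorActivityTest {

    @Test
    public void additionUpdatesResultView() {
        CalculatorActivity activity =
            Robolectric.setupActivity(CalculatorActivity.class);

        // Drive the GUI as if a user typed "2 + 3 =".
        activity.findViewById(R.id.digit_2).performClick();
        activity.findViewById(R.id.op_plus).performClick();
        activity.findViewById(R.id.digit_3).performClick();
        activity.findViewById(R.id.op_equals).performClick();

        TextView result = (TextView) activity.findViewById(R.id.result);
        assertEquals("5", result.getText().toString());
    }
}
```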
Sikuli uses visual technology to automate GUI testing through
screenshot images. It is particularly useful when there is no easy
way to obtain the app source code or the internal structure of
graphic interfaces. Lin et al. [106], [113] leveraged Sikuli in
their work to enable record-and-replay testing of Android apps,
where the user interactions are saved beforehand in Sikuli test
formats (as screenshot images).
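For illustration, the snippet below is a small, hypothetical example written against Sikuli's Java API (the .png screenshots are assumed to have been captured beforehand, as in the record-and-replay setting described above):

```java
import org.sikuli.script.FindFailed;
import org.sikuli.script.Screen;

// Hypothetical image-driven GUI automation: widgets are located by
// screenshot matching, so neither source code nor the internal widget
// hierarchy of the app is needed.
public class SikuliReplayExample {
    public static void main(String[] args) throws FindFailed {
        Screen screen = new Screen();
        // Click the button identified by its previously saved screenshot.
        screen.click("play_button.png");
        // Type into the field identified by another screenshot.
        screen.type("search_box.png", "hello");
        // The oracle can likewise be visual: wait for an expected image
        // to appear within 10 seconds.
        screen.wait("result_banner.png", 10);
    }
}
```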
Insights from RQ3—on Used Techniques
– Given the complexity of interactions among components
in Android apps as well as with the operating system,
it is not surprising that most approaches in the literature
resort to “model-based” techniques, which build models
for capturing the overall structure and behavior of apps to
facilitate testing activities (e.g., input generation, execution
scenarios selection, etc.).
– The unavailability of source code for market apps makes white-box techniques less attractive than grey-box and black-box testing for assessing apps in the wild. Nevertheless, our SLR shows that the research community has not sufficiently explored testing approaches that would directly benefit app developers during the development phase.
– Tool support for building testing approaches is abundant. The use of the Robotium open-source test framework by numerous approaches once again demonstrates the importance of making tools available to stimulate research.

D. To What Extent are the Approaches Validated?

Several aspects must be considered when assessing the effectiveness of a testing approach. We consider in this SLR the measurements performed on code coverage as well as on accuracy. We also investigate the use of a ground truth to validate performance scores, as well as the size of the experimental dataset.

Coverage is a key aspect for estimating how well the program is tested. Larger coverage generally correlates with higher possibilities of exposing potential bugs and vulnerabilities, as well as of uncovering malicious behavior. Numerous coverage metrics are leveraged by state-of-the-art works. For example, for evaluating Code Coverage, metrics such as LoC (Lines of Code) [11], [102], [105], Block [97], Method [108], [115], and Branch [114] have been proposed in our community. In order to profile the Accuracy of testing approaches, other coverage metrics are also proposed in the literature, such as bugs [42] and vulnerabilities [45] (e.g., how many known vulnerabilities can the evaluated testing approach cover?). Table IX enumerates the coverage metrics used in the literature, where LoC appears to be the most frequently used metric.

TABLE IX. ASSESSMENT METRICS (E.G., FOR COVERAGE, ACCURACY)

Ground truth refers to a reference dataset where each element is labeled. In this SLR, we consider two types of ground truths. The first is related to malware detection approaches: the ground truth then contains apps labeled as benign or malicious. As an example, the Drebin [132] dataset has recurrently been leveraged as ground truth to evaluate testing approaches [133]. The second is related to vulnerability and bug detection: the ground truth represents code that is flagged as vulnerable or buggy based on the observation of bug reports submitted by end users or bug-fix histories committed by developers [55], [84].

The Dataset Size is the number of apps tested in the experimental phase. We can see from Fig. 9 that most works (ignoring outliers) carried out experiments on no more than 100 apps, with a median number of 8 apps. By comparison, in an SLR of static analysis of Android apps [22], the median and maximum numbers of evaluated apps are, respectively, 374 and 318 515, far bigger than the numbers observed for testing approaches. This result is
somehow expected, as testing approaches (or dynamic analysis approaches) are generally not scalable.

Fig. 9. Distribution of the number of tested apps (outliers are removed).

Insights from RQ4—on Approach Validation
Although literature works always provide an evaluation section giving evidence (often through comparison) that their approaches are effective, their reproducibility is still challenged by the fact that there is a lack of established ground truths and benchmarks. Yet, reproducibility is essential to ensure that the field is indeed progressing based on a baseline performance, instead of relying on subjective observations by authors and on datasets with variable characteristics.

VI. DISCUSSION

Research on Android app testing has been prolific in the past years. Our discussion will focus on the trends that we observed while performing this SLR, as well as on the challenges that the community should still attempt to address.

A. Trend Analysis

The development of the different branches in the taxonomy is disparate.

Fig. 10 illustrates the trend in testing types over the years. Together, black-box and grey-box testing are involved in 90% of the research works. Their evolution is thus reflected by the overall evolution of research publications (cf. Fig. 4). White-box testing remains low in all years.

Fig. 10. Trend of testing types. (a) Black-box. (b) White-box. (c) Grey-box.

Fig. 11 presents the evolution over time of works addressing different test levels. Unit/regression and integration testing phases include a low, but stable, number of works every year. Overall, system testing has been heavily used in the literature and has even doubled between 2012 and 2014. System testing of Android apps is favored since app execution is done on a specific virtual machine environment with numerous runtime dependencies: it is not straightforward to isolate a single block for unit/regression testing or to test the integration of two components without interference from other components. Nevertheless, with the increasing use of code instrumentation [14], there are new opportunities to eventually slice Android apps for performing more grey-box and white-box testing.

Fig. 11. Trend of testing levels. (a) Unit/regression. (b) Integration. (c) System.

Trend analysis of testing methods in Fig. 12 confirms that model-based testing dominates the literature of Android app testing, and its evolution is reflected in the overall evolution of testing approaches. Most approaches indeed start by constructing a GUI model or a call graph to generate efficient test cases. In the last couple of years, mutation testing has been appearing in the literature, as have search-based techniques.

Fig. 12. Trend of testing methods. (a) Model-based. (b) Search-based. (c) Random. (d) Fuzzing. (e) Concolic. (f) Mutation.

With regard to testing targets, Fig. 13(a)–(b) shows that graphical user interfaces, as well as the event mechanism, are continuously at the core of research approaches. Since Android Activities (i.e., the UIs) are the main entry points for executing test cases, the community will likely continue to develop black-box and grey-box test strategies that increase interactions with the GUI to improve code coverage. Intercomponent and interapplication communications, on the other hand, have been popular targets around 2014.

With regard to testing objectives, Fig. 13(c)–(h) shows that security concerns have attracted a significant amount of research, although the output has been decreasing in the last couple of years. Bug/defect identification, however, has somewhat stabilized.

Fig. 13. Trend of testing targets and objectives. (a) GUI/Event. (b) ICC/IAC. (c) Concurrency. (d) Security. (e) Performance. (f) Energy. (g) Compatibility. (h) Bug/defect.

B. Evaluation of Authors

Android testing is a new field of research, which has attracted several contributions over the years due to the multiple opportunities that it offers for researchers to apply theoretical advances in the domain of software testing. We emphasize the attractiveness of the field by showing in Fig. 14 the evolution of single authors contributing to research approaches. We count
in each year, the Total Authors who participated in at least one of our selected publications, the New Authors who had had no selected publication until that year, and the number of Stayed Authors who had publications selected both in that year and in the years to come. Overall, the figures raise several interesting findings, which are as follows.
1) Every year, the community of Android testing research authors is almost entirely renewed.
2) Only a limited number of researchers publish again on the theme after one publication.
These facts may suggest that research in Android app testing is often governed by opportunities. Furthermore, challenges (e.g., building a sound GUI event model) quickly arise, making authors lose interest in pursuing this research direction. Although we believe that having the topic within reach of a variety of authors from other backgrounds is good for bringing new ideas and cross-fertilization, the maturity of the field will require commitment from more authors staying in the field.

Fig. 14. Trend in community authors. "New Authors" and "Stayed Authors" indicate the number of authors who enter the field (no relevant publications before) and who have stayed in the field (they keep publishing in the following years).

C. Research Output Usability

In the course of our investigations for performing the review, we have found that the research community on Android app testing seldom contributes reusable tools (e.g., implementations of approaches for GUI testing), let alone open-source testing tools. Yet, the availability of such tools is necessary not only to limit the effort needed in subsequent works but also to encourage true progress beyond the state of the art.

Despite the fact that most testing approaches are not made publicly available, it is nevertheless gratifying to observe that some of them have been leveraged in industry. For example, the research tool TEMA has now been integrated into the RATA project,16 where researchers (from Tampere University of Technology) and practitioners (from Intel Finland, OptoFidelity, and VTT) work together to provide robot-assisted test automation for mobile apps. Another research tool, named SAPIENZ, has led to a start-up called MaJiCKe, which was recently acquired by Facebook London, SAPIENZ becoming the core component of Facebook's testing solutions for mobile apps.

16 http://wiki.tut.fi/RATA/WebHome

D. Open Issues and Future Challenges

Although the publications we chose all have their own solid contributions, some authors posed open issues and future challenges to call for more research attention to the domain. We collected these concerns and summarize them as follows.
1) Satisfying Fastidious Preconditions: One recurrently discussed issue is how to generate test cases that can appropriately satisfy preconditions such as logging in to an app. When oracles generate events to traverse the activities of Android apps, some particular activities are extremely hard to reach. A publicly known condition is tapping the same button seven consecutive times in order to trigger developer mode [12], [99]. Another example is breaking through a login page, which requires a particular combination of user account and password. Both preconditions are clearly not easy to satisfy during automated testing of Android apps.
2) Modeling Complex Events (e.g., gestures or n-user events): In addition to simple events such as clicks, Android also involves quite a lot of complex events, such as user gestures (swipe, long press, zoom in/out, spin, etc.) and system events (network connectivity; events coming from light, pressure, and temperature sensors; GPS; fingerprint recognizers; etc.). All these events can introduce nondeterministic behaviors if they are not properly modeled. Unfortunately, at the moment, most of our reviewed papers only tackle simple events like clicks, leaving other events untouched [67], [101] (a sketch of how a gesture can be synthesized at a low level is given after this list).
3) Bridging Incompatible Instruction Sets: To improve the performance of Android apps, Google provides a toolset, the Android Native Development Kit, allowing app developers to implement time-intensive tasks in C/C++. Those tasks are closely dependent on the CPU instruction set (e.g., Intel or ARM) and hence can only be launched on the matching instruction set (e.g., tasks compiled for the ARM architecture can only be executed on ARM-based devices). However, as most mobile devices nowadays are assembled with ARM
chips, while most PCs running Android emulators are assembled with Intel chips, running ARM-based emulators on Intel-based PCs is extremely slow; this gap has caused problems for emulator-based testing approaches [95].
4) Evaluating Testing Approaches Fairly: Frequently, researchers complain about the fact that our community has not provided a reliable coverage estimator to approximate the coverage (e.g., code coverage) of testing approaches and to fairly compare them [12], [29], [41], [43]. Although some outstanding progress has been made in developing estimation tools [23], our SLR still indicates that there does not exist any universally accepted tool that supports fair comparison among testing approaches. We, therefore, urge our fellow researchers to appropriately resolve this open issue and subsequently contribute to our community a reliable artefact benefiting many aspects of future research studies.
5) Addressing Usability Defects: The majority of research studies focus on functional defects of Android apps. Usability defects have not attracted as much attention, although users are concerned by them [53]. Usability defects, like poor responsiveness [74], are a major drawback of Android apps and receive massive complaints from users. Bad view organization on the screen arising from incompatibility, as well as repeatedly imprecise recognition of user gestures, also imply a bad user experience.
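To illustrate the second open issue above, the following minimal Java sketch (an assumption, not a surveyed approach) synthesizes a plain swipe as a timed sequence of low-level MotionEvents through the Instrumentation API; richer gestures and sensor events require even more intricate modeling.

```java
import android.app.Instrumentation;
import android.os.SystemClock;
import android.view.MotionEvent;

// Even a simple swipe is a timed DOWN/MOVE.../UP sequence, which is why
// complex gestures are much harder to model than single clicks.
public class SwipeInjector {
    // Injects a horizontal swipe from (startX, y) to (endX, y) in `steps` moves.
    public static void swipe(Instrumentation inst,
                             float startX, float endX, float y, int steps) {
        long downTime = SystemClock.uptimeMillis();
        inst.sendPointerSync(MotionEvent.obtain(
            downTime, downTime, MotionEvent.ACTION_DOWN, startX, y, 0));
        for (int i = 1; i <= steps; i++) {
            float x = startX + (endX - startX) * i / steps;
            inst.sendPointerSync(MotionEvent.obtain(
                downTime, SystemClock.uptimeMillis(),
                MotionEvent.ACTION_MOVE, x, y, 0));
        }
        inst.sendPointerSync(MotionEvent.obtain(
            downTime, SystemClock.uptimeMillis(),
            MotionEvent.ACTION_UP, endX, y, 0));
    }
}
```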
E. New Research Directions

In light of the SLR summary of the state of the art and considering the new challenges reported in the literature, there are opportunities for exploring new testing applications to improve the quality of Android apps and/or increase confidence in using them safely. We now enumerate three example directions.

1) Validation of App Updates: Android app developers regularly update their apps for various reasons, including keeping them attractive to their user base.17 Unfortunately, recent studies [134] have shown that updates of Android apps often come with more security vulnerabilities and functional defects. In this context, the community could investigate and adapt regression techniques for identifying defect-prone or unsafe updates. To accelerate the identification of such issues in updates, one can consider exploring approaches based on behavioral equivalence, e.g., using "record and replay" test-case generation techniques.

17 https://savvyapps.com/blog/how-often-should-you-update-your-app

2) Accounting for the Ecosystem Fragmentation: As previously highlighted, the fragmentation of the Android ecosystem (with a high variety of operating system versions on which a given app will be running, as well as a diversity of hardware specifications) is a serious challenge for performing tests that can expose all issues that a user might encounter on a specific device runtime environment. There is still room to investigate test optimization and prioritization for Android to cover a majority of devices and operating system versions. For example, on top of modeling apps, researchers could consider modeling the framework (and its variabilities) and account for it during test execution.

3) Code Prioritization Versus Test Prioritization: Finally, we note that Android apps are becoming larger and larger in size, including obsolete code for functionalities that are no longer needed, or code kept to account for the diversity of devices (and their OS versions). For example, in large companies, because of developer rotation, "dead" code/functionality may remain hidden in plain sight in app code, with development teams not risking its removal. As a result, the effort spent in maintaining those apps increases continuously, and consequently the testing effort required to verify the functional correctness of those apps also grows. To alleviate this problem, we argue that testing such apps clearly necessitates optimizing the selection of code that must be tested in priority. Test-case prioritization must then be performed in conjunction with a code optimization process to focus on actively used code w.r.t. user interactions with the app.

VII. THREATS TO VALIDITY

We have identified the following threats to validity in our study.

On potential misses of literature—We have not considered for our review books and Master's or Ph.D. dissertations related to Android testing. This threat is mitigated by the fact that the content of such publications is eventually presented in peer-reviewed venues, which we have considered. We have also considered only publications written in English. Nevertheless, while searching with the compiled English keywords, we also found a few papers written in other languages, such as German and Chinese. The number of such non-English papers remains, however, significantly small compared with the collected English literature, suggesting that our SLR is likely complete. Last but not least, although we have refined our search keywords several times, it is still possible that some synonyms were missed. To mitigate this, we believe that natural language processing could be leveraged to disclose such synonyms; we therefore consider it as future work toward engineering sound keywords for supporting SLRs.

On data extraction errors—Given that papers are often imprecise with information related to the aspects that we have investigated, the extracted data may not have been equally reliable for all approaches, and data aggregation can still include several errors, as warned by Turner et al. [135] for such studies. We have nevertheless strived to mitigate this issue by applying a crosschecking mechanism on the extracted results, following the suggestion of Brereton et al. [20]. To further alleviate this, we plan to validate our extracted results with their original authors.

On the representativeness of data sources and metrics—We have implemented the "major venues search" based on the venue ranking provided by the CCF. This ranking is not only potentially biased toward a specific community of researchers but may also change from one year to another. A replication of this study based on other rankings may lead to a different primary publications set, although the overall findings will likely remain
the same, since most major venues continue to be so across years and across ranking systems.

The aspects and metrics investigated in this approach may also not be exhaustive or representative of everything that characterizes testing. Nevertheless, these metrics have been collected from the testing literature to build the taxonomy and are essential for comparing approaches.

VIII. RELATED WORK

Mobile operating systems, in particular the open-source Android platform, have been fertile ground for research in software engineering and security. Several surveys and reviews have been performed on approaches for securing [136], [137] or statically analyzing [22] Android apps. An SLR is indeed important to analyze the contributions of a community to resolve the challenges of a specific topic. In the case of Android testing, such a review was missing.

Several works in the literature have, however, attempted to provide an overview of the field via surveys or general systematic mappings on mobile application testing techniques. For example, the systematic mapping of Sein et al. [138] addresses all together Android, iOS, Symbian, Silverlight, and Windows. The authors have provided a higher-level categorization of techniques into five groups, which are as follows:
1) usability testing;
2) test automation;
3) context-awareness;
4) security;
5) general category.
Méndez-Porras et al. [139] have provided another mapping, focusing on a narrower field, namely automated testing of mobile apps. They discuss two major challenges for automating the testing process of mobile apps, which are an appropriate set of test cases and an appropriate set of devices to perform the testing. Our work, with this SLR, goes in-depth to cover different technical aspects of the literature on specifically Android app testing (as well as test objectives, targets, and publication venues).

Other related works have discussed directly the challenges of testing Android apps in general. For example, Amalfitano et al. [140] analyzed specifically the challenges and open issues of testing Android apps, where they have summarized suitable and effective principles, guidelines, models, techniques, and technologies related to testing Android apps. They enumerate existing tools and frameworks for automated testing of Android apps. They notably summarize the issues of software testing regarding nonfunctional requirements, including performance, stress, security, compatibility, usability, accessibility, etc.

Gao et al. [141] presented a study on mobile testing-as-a-service (MTaaS), where they discussed the basic concepts of performing MTaaS. Besides, the motivations, distinct features, requirements, test environments, and existing approaches are also discussed. Moreover, they have also discussed the current issues, needs, and challenges of applying MTaaS in practice.

More recently, Starov et al. [142] performed a state-of-the-art survey looking into a set of cloud services for mobile testing. Based on their investigation, they divided the cloud services for mobile testing into three subcategories, which are as follows:
1) device clouds (mobile cloud platforms);
2) services to support application lifecycle management;
3) tools to provide processing according to some testing techniques.
They also argue that it is essential to migrate the testing process to the cloud, which would make teamwork possible. Besides, it can also reduce testing time and development costs.

Muccini et al. [143] conducted a short study on the challenges and future research directions for testing mobile apps. Based on their study, they find that 1) mobile apps are so different from traditional ones that they require different and specialized techniques in order to test them, and 2) there seem to be many challenges. As an example, performance, security, reliability, and energy are strongly affected by the variability of the testing environment.

Janicki et al. [144] surveyed the obstacles and opportunities in deploying model-based GUI testing of mobile apps. Unlike conventional automatic test execution, model-based testing goes one step further by considering the automation of test generation phases as well. Based on their studies, they claim that the most valuable research need (as future work) is to perform a comparative experiment using conventional and model-based automation, as well as exploratory and script-based manual testing, evaluated concurrently on the same system, so as to measure the success of those approaches.

Finally, the literature includes several surveys [136], [145]–[147] on Android, which cover some aspects of Android testing. As an example, Tam et al. [136] have studied the evolution of Android malware and Android analysis techniques, where various Android-based testing approaches such as A3E have been discussed.

IX. CONCLUSION

We report in this paper on an SLR performed on the topic of Android app testing. Our review has explored 103 papers that were published in major conferences, workshops, and journals in the software engineering, programming language, and security domains. We have then proposed a taxonomy of the related research, exploring several dimensions, including the objectives (i.e., what functional or nonfunctional concerns are addressed by the approaches) that were pursued and the techniques (i.e., what types of testing methods—mutation, concolic, etc.) that were leveraged. We have further explored the assessments presented in the literature, highlighting the lack of established benchmarks to clearly monitor the progress made in the field. Finally, beyond quantitative summaries, we have provided a discussion on future challenges and proposed new research directions of Android testing research for further ensuring the quality of apps with regard to compatibility issues, vulnerability-inducing updates, etc.

APPENDIX

The full list of examined primary publications is enumerated in Table A1.
TABLE A1. FULL LIST OF EXAMINED PUBLICATIONS
REFERENCES

[1] L. Li, T. F. Bissyandé, J. Klein, and Y. Le Traon, "An investigation into the use of common libraries in Android apps," in Proc. 23rd IEEE Int. Conf. Softw. Anal., Evolution, Reeng., 2016, pp. 403–414.
[2] L. Li et al., "Androzoo++: Collecting millions of Android apps and their metadata for the research community," 2017, arXiv:1709.05281.
[3] P. S. Kochhar, F. Thung, N. Nagappan, T. Zimmermann, and D. Lo, "Understanding the test automation culture of app developers," in Proc. 8th Int. Conf. Softw. Testing, Verification Validation, 2015, pp. 1–10.
[4] L. Li, "Mining androzoo: A retrospect," in Proc. Doctoral Symp. 33rd Int. Conf. Softw. Maintenance Evolution, 2017, pp. 675–680.
[5] H. Wang, H. Li, L. Li, Y. Guo, and G. Xu, "Why are Android apps removed from Google play? A large-scale empirical study," in Proc. 15th Int. Conf. Mining Softw. Repositories, 2018, pp. 231–242.
[6] L. Li, J. Gao, T. F. Bissyandé, L. Ma, X. Xia, and J. Klein, "Characterising deprecated Android APIs," in Proc. 15th Int. Conf. Mining Softw. Repositories, 2018, pp. 254–264.
[7] L. Li, T. F. Bissyandé, Y. Le Traon, and J. Klein, "Accessing inaccessible Android APIs: An empirical study," in Proc. 32nd Int. Conf. Softw. Maintenance Evolution, 2016, pp. 411–422.
[8] N. Mirzaei, J. Garcia, H. Bagheri, A. Sadeghi, and S. Malek, "Reducing combinatorics in GUI testing of Android applications," in Proc. Int. Conf. Softw. Eng., 2016, pp. 559–570.
[9] Y. Hu, I. Neamtiu, and A. Alavi, "Automatically verifying and reproducing event-based races in Android apps," in Proc. Int. Symp. Softw. Testing Anal., 2016, pp. 377–388.
[10] L. Clapp, O. Bastani, S. Anand, and A. Aiken, "Minimizing GUI event traces," in Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2016, pp. 422–434.
[11] K. Mao, M. Harman, and Y. Jia, "Sapienz: Multi-objective automated testing for Android applications," in Proc. Int. Symp. Softw. Testing Anal., 2016, pp. 94–105.
[12] X. Zeng et al., "Automated test input generation for Android: Are we really there yet in an industrial case?" in Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2016, pp. 987–992.
[13] F. Dong et al., "Frauddroid: Automated ad fraud detection for Android apps," in Proc. 26th ACM Joint Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., 2018.
[14] L. Li et al., "IccTA: Detecting inter-component privacy leaks in Android apps," in Proc. IEEE 37th Int. Conf. Softw. Eng., 2015, pp. 280–291.
[15] L. Gomez, I. Neamtiu, T. Azim, and T. Millstein, "RERAN: Timing- and touch-sensitive record and replay for Android," in Proc. Int. Conf. Softw. Eng., 2013, pp. 72–81.
[16] L. Li, T. F. Bissyandé, H. Wang, and J. Klein, "CiD: Automating the detection of API-related compatibility issues in Android apps," in Proc. ACM SIGSOFT Int. Symp. Softw. Testing Anal., 2018, pp. 153–163.
[17] L. Wei, Y. Liu, and S. Cheung, "Taming Android fragmentation: Characterizing and detecting compatibility issues for Android apps," in Proc. 31st IEEE/ACM Int. Conf. Automated Softw. Eng., 2016, pp. 226–237.
[18] N. Mirzaei, S. Malek, C. S. Păsăreanu, N. Esfahani, and R. Mahmood, "Testing Android apps through symbolic execution," in Proc. ACM SIGSOFT Softw. Eng. Notes, 2012, pp. 1–5.
[19] B. Kitchenham and S. Charters, "Guidelines for performing systematic literature reviews in software engineering," Univ. Durham, Durham, U.K., EBSE Tech. Rep., EBSE-2007-01, 2007.
[20] P. Brereton, B. A. Kitchenham, D. Budgen, M. Turner, and M. Khalil, "Lessons from applying the systematic literature review process within the software engineering domain," J. Syst. Softw., vol. 80, no. 4, pp. 571–583, 2007.
[21] P. H. Nguyen, M. Kramer, J. Klein, and Y. Le Traon, "An extensive systematic review on the model-driven development of secure systems," Inf. Softw. Technol., vol. 68, pp. 62–81, 2015.
[22] L. Li et al., "Static analysis of Android apps: A systematic literature review," Inf. Softw. Technol., vol. 88, pp. 67–95, 2017.
[23] Y. Zhauniarovich, A. Philippov, O. Gadyatskaya, B. Crispo, and F. Massacci, "Towards black box testing of Android apps," in Proc. 10th Int. Conf. Availability, Rel. Secur., 2015, pp. 501–510.
[24] C.-C. Yeh and S.-K. Huang, "CovDroid: A black-box testing coverage system for Android," in Proc. IEEE 39th Annu. Comput. Softw. Appl. Conf., 2015, vol. 3, pp. 447–452.
[25] C. Yang, G. Yang, A. Gehani, V. Yegneswaran, D. Tariq, and G. Gu, "Using provenance patterns to vet sensitive behaviors in Android apps," in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., 2016, pp. 58–77.
[26] L. Malisa, K. Kostiainen, M. Och, and S. Capkun, "Mobile application impersonation detection using dynamic user interface extraction," in Proc. Eur. Symp. Res. Comput. Secur., 2016, pp. 217–237.
[27] K. Moran, M. Linares-Vásquez, C. Bernal-Cárdenas, C. Vendome, and D. Poshyvanyk, "Automatically discovering, reporting and reproducing Android application crashes," in Proc. IEEE Int. Conf. Softw. Testing, Verification Validation, 2016, pp. 33–44.
[28] J. C. J. Keng, L. Jiang, T. K. Wee, and R. K. Balan, "Graph-aided directed testing of Android applications for checking runtime privacy behaviours," in Proc. IEEE 11th Int. Workshop Automat. Softw. Test, 2016, pp. 57–63.
[29] Y. Kang, Y. Zhou, M. Gao, Y. Sun, and M. R. Lyu, "Experience report: Detecting poor-responsive UI in Android applications," in Proc. IEEE 27th Int. Symp. Softw. Rel. Eng., 2016, pp. 490–501.
[30] Y. Hu and I. Neamtiu, "Fuzzy and cross-app replay for smartphone apps," in Proc. IEEE 11th Int. Workshop Automat. Softw. Test, 2016, pp. 50–56.
[31] H. Tang, G. Wu, J. Wei, and H. Zhong, "Generating test cases to expose concurrency bugs in Android applications," in Proc. IEEE 31st Int. Conf. Automated Softw. Eng., 2016, pp. 648–653.
[32] Y. Kang, Y. Zhou, H. Xu, and M. R. Lyu, "DiagDroid: Android performance diagnosis via anatomizing asynchronous executions," in Proc. Int. Conf. Found. Softw. Eng., 2016, pp. 410–421.
[33] M. Gómez, R. Rouvoy, B. Adams, and L. Seinturier, "Reproducing context-sensitive crashes of mobile apps using crowdsourced monitoring," in Proc. Int. Conf. Mobile Softw. Eng. Syst., 2016, pp. 88–99.
[34] Q. Sun, L. Xu, L. Chen, and W. Zhang, "Replaying harmful data races in Android apps," in Proc. IEEE Int. Symp. Softw. Rel. Eng. Workshop, 2016, pp. 160–166.
[35] X. Wu, Y. Jiang, C. Xu, C. Cao, X. Ma, and J. Lu, "Testing Android apps via guided gesture event generation," in Proc. Asia-Pacific Softw. Eng. Conf., 2016, pp. 201–208.
[36] H. Zhang, H. Wu, and A. Rountev, "Automated test generation for detection of leaks in Android applications," in Proc. IEEE 11th Int. Workshop Automat. Softw. Test, 2016, pp. 64–70.
[37] R. Jabbarvand, A. Sadeghi, H. Bagheri, and S. Malek, "Energy-aware test-suite minimization for Android apps," in Proc. Int. Symp. Softw. Testing Anal., 2016, pp. 425–436.
[38] J. Qian and D. Zhou, "Prioritizing test cases for memory leaks in Android applications," J. Comput. Sci. Technol., vol. 31, pp. 869–882, 2016.
[39] M. Ermuth and M. Pradel, "Monkey see, monkey do: Effective generation of GUI tests with inferred macro events," in Proc. 25th Int. Symp. Softw. Testing Anal., 2016, pp. 82–93.
[40] T. Zhang, J. Gao, O.-E.-K. Aktouf, and T. Uehara, "Test model and coverage analysis for location-based mobile services," in Proc. Int. Conf. Softw. Eng. Knowl. Eng., 2015, pp. 80–86.
[41] T. Zhang, J. Gao, J. Cheng, and T. Uehara, "Compatibility testing service for mobile applications," in Proc. IEEE Symp. Service-Oriented Syst. Eng., 2015, pp. 179–186.
[42] M. Wan, Y. Jin, D. Li, and W. G. J. Halfond, "Detecting display energy hotspots in Android apps," in Proc. IEEE 8th Int. Conf. Softw. Testing, Verification Validation, 2015, pp. 1–10.
[43] Š. Packevičius, A. Ušaniov, Š. Stanskis, and E. Bareiša, "The testing method based on image analysis for automated detection of UI defects intended for mobile applications," in Proc. Int. Conf. Inf. Softw. Technol., 2015, pp. 560–576.
[44] K. Knorr and D. Aspinall, "Security testing for Android mhealth apps," in Proc. Softw. Testing, Verification Validation Workshops, 2015, pp. 1–8.
[45] R. Hay, O. Tripp, and M. Pistoia, "Dynamic detection of inter-application communication vulnerabilities in Android," in Proc. Int. Symp. Softw. Testing Anal., 2015, pp. 118–128.
[46] G. d. C. Farto and A. T. Endo, "Evaluating the model-based testing approach in the context of mobile applications," Electron. Notes Theor. Comput. Sci., vol. 314, pp. 3–21, 2015.
[47] P. Bielik, V. Raychev, and M. T. Vechev, "Scalable race detection for Android applications," in Proc. ACM SIGPLAN Int. Conf. Object-Oriented Program., Syst., Lang. Appl., 2015, pp. 332–348.
[48] D. Amalfitano, A. R. Fasolino, P. Tramontana, B. D. Ta, and A. M. Memon, "MobiGUITAR: Automated model-based testing of mobile apps," IEEE Softw., vol. 32, no. 5, pp. 53–59, Sep./Oct. 2015.
[49] O.-E.-K. Aktouf, T. Zhang, J. Gao, and T. Uehara, "Testing location-based function services for mobile applications," in Proc. IEEE Symp. Service-Oriented Syst. Eng., 2015, pp. 308–314.
[50] M. Xia, L. Gong, Y. Lyu, Z. Qi, and X. Liu, "Effective real-time Android application auditing," in Proc. IEEE Symp. Secur. Privacy, 2015, pp. 899–914.
[51] B. Hassanshahi, Y. Jia, R. H. Yap, P. Saxena, and Z. Liang, "Web-to-application injection attacks on Android: Characterization and detection," in Proc. 20th Eur. Symp. Res. Comput. Secur., 2015, pp. 577–598.
[52] I. C. Morgado and A. C. Paiva, "Testing approach for mobile applications through reverse engineering of UI patterns," in Proc. Int. Conf. Automated Softw. Eng. Workshop, 2015, pp. 42–49.
[53] L. Deng, N. Mirzaei, P. Ammann, and J. Offutt, "Towards mutation analysis of Android apps," in Proc. Int. Conf. Softw. Testing, Verification Validation Workshops, 2015, pp. 1–10.
[54] A. R. Espada, M. del Mar Gallardo, A. Salmerón, and P. Merino, "Runtime verification of expected energy consumption in smartphones," Model Checking Softw., vol. 9232, pp. 132–149, 2015.
[55] P. Zhang and S. G. Elbaum, "Amplifying tests to validate exception handling code: An extended study in the mobile application domain," in Proc. Int. Conf. Softw. Eng., 2014, Art. no. 32.
[56] R. N. Zaeem, M. R. Prasad, and S. Khurshid, "Automated generation of oracles for testing user-interaction features of mobile apps," in Proc. IEEE 7th Int. Conf. Softw. Testing, Verification, Validation, 2014, pp. 183–192.
[57] C.-C. Yeh, H.-L. Lu, C.-Y. Chen, K.-K. Khor, and S.-K. Huang, "CRAXDroid: Automatic Android system testing by selective symbolic execution," in Proc. IEEE 8th Int. Conf. Softw. Secur. Rel.-Companion, 2014, pp. 140–148.
[58] K. Yang, J. Zhuge, Y. Wang, L. Zhou, and H. Duan, "Intentfuzzer: Detecting capability leaks of Android applications," in Proc. ACM Symp. Inf., Comput. Commun. Secur., 2014, pp. 531–536.
[59] S. Vilkomir and B. Amstutz, "Using combinatorial approaches for testing mobile applications," in Proc. Int. Conf. Softw. Testing, Verification, Validation Workshops, 2014, pp. 78–83.
[60] H. Shahriar, S. North, and E. Mawangi, "Testing of memory leak in Android applications," in Proc. IEEE 15th Int. Symp. High-Assurance Syst. Eng., 2014, pp. 176–183.
[61] S. Salva and S. R. Zafimiharisoa, "APSET, an Android application security testing tool for detecting intent-based vulnerabilities," Int. J. Softw. Tools Technol. Transfer, vol. 17, pp. 201–221, 2014.
[62] P. Maiya, A. Kanade, and R. Majumdar, "Race detection for Android applications," in Proc. ACM SIGPLAN Conf. Programm. Lang. Des. Implementation, 2014, pp. 316–325.
[63] J. Huang, "AppACTS: Mobile app automated compatibility testing service," in Proc. IEEE 2nd Int. Conf. Mobile Cloud Comput., Services, Eng., 2014, pp. 85–90.
[64] C. Hsiao et al., "Race detection for event-driven mobile applications," in Proc. ACM SIGPLAN Conf. Programm. Lang. Des. Implementation, 2014, pp. 326–336.
[65] C. Guo, J. Xu, H. Yang, Y. Zeng, and S. Xing, "An automated testing approach for inter-application security in Android," in Proc. 9th Int. Workshop Automat. Softw. Test, 2014, pp. 8–14.
[66] T. Griebe and V. Gruhn, "A model-based approach to test automation for context-aware mobile applications," in Proc. 29th Annu. ACM Symp. Appl. Comput., 2014, pp. 420–427.
[67] P. Costa, M. Nabuco, and A. C. R. Paiva, "Pattern based GUI testing for mobile applications," in Proc. Int. Conf. Quality Inf. Commun. Technol., 2014, pp. 66–74.
[68] A. Banerjee, L. K. Chong, S. Chattopadhyay, and A. Roychoudhury, "Detecting energy bugs and hotspots in mobile apps," in Proc. 22nd ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2014, pp. 588–598.
[69] T. Vidas, J. Tan, J. Nahata, C. L. Tan, N. Christin, and P. Tague, "A5: Automated analysis of adversarial Android applications," in Proc. 4th ACM Workshop Secur. Privacy Smartphones Mobile Devices, 2014, pp. 39–50.
[70] G. Suarez-Tangil, M. Conti, J. E. Tapiador, and P. Peris-Lopez, "Detecting targeted smartphone malware with behavior-triggering stochastic models," in Proc. Eur. Symp. Res. Comput. Secur., 2014, pp. 183–201.
[71] M. Linares-Vásquez, G. Bavota, C. Bernal-Cárdenas, R. Oliveto, M. Di Penta, and D. Poshyvanyk, "Mining energy-greedy API usage patterns in Android apps: An empirical study," in Proc. 11th Workshop Conf. Mining Softw. Repositories, 2014, pp. 1–11.
[72] R. Sasnauskas and J. Regehr, "Intent fuzzer: Crafting intents of death," in Proc. Joint Int. Workshop Dyn. Anal. Softw. Syst. Performance Testing, Debugging, Analytics, 2014, pp. 1–5.
[73] S. Zhao, X. Li, G. Xu, L. Zhang, and Z. Feng, "Attack tree based Android malware detection with hybrid analysis," in Proc. Int. Conf. Trust, Secur. Privacy Comput. Commun., 2014, pp. 380–387.
[74] S. Yang, D. Yan, and A. Rountev, "Testing for poor responsiveness in Android applications," in Proc. Int. Workshop Eng. Mobile-Enabled Syst., 2013, pp. 1–6.
[75] S. T. A. Rumee and D. Liu, "DroidTest: Testing Android applications for leakage of private information," Int. J. Inf. Secur., vol. 7807, pp. 341–353, 2013.
[76] V. Nandakumar, V. Ekambaram, and V. Sharma, "Appstrument—A unified app instrumentation and automated playback framework for testing mobile applications," in Proc. Int. Conf. Mobile Ubiquitous Syst.: Netw. Services, 2013, pp. 474–486.
[77] A. Avancini and M. Ceccato, "Security testing of the communication among Android applications," in Proc. Int. Workshop Automat. Softw. Test, 2013, pp. 57–63.
[78] D. Yan, S. Yang, and A. Rountev, "Systematic testing for resource leaks in Android applications," in Proc. Int. Symp. Softw. Rel. Eng., 2013, pp. 411–420.
[79] R. Mahmood, N. Esfahani, T. Kacem, N. Mirzaei, S. Malek, and A. Stavrou, "A whitebox approach for automated security testing of Android applications on the cloud," in Proc. Int. Workshop Automat. Softw. Test, 2012, pp. 22–28.
[80] D. Franke, S. Kowalewski, C. Weise, and N. Prakobkosol, "Testing conformance of life cycle dependent properties of mobile applications," in Proc. IEEE 12th Int. Conf. Softw. Testing, Verification Validation, 2012, pp. 241–250.
[81] K. B. Dhanapal et al., "An innovative system for remote and automated testing of mobile phone applications," in Proc. Service Res. Innov. Inst. Global Conf., 2012, pp. 44–54.
[82] C. Zheng et al., "SmartDroid: An automatic system for revealing UI-based trigger conditions in Android applications," in Proc. 2nd ACM Workshop Secur. Privacy Smartphones Mobile Devices, 2012, pp. 93–104.
[83] A. K. Maji, F. A. Arshad, S. Bagchi, and J. S. Rellermeyer, "An empirical study of the robustness of inter-component communication in Android," in Proc. Int. Conf. Dependable Syst. Netw., 2012, pp. 1–12.
[84] C. Hu and I. Neamtiu, "Automating GUI testing for Android applications," in Proc. Int. Workshop Automat. Softw. Test, 2011, pp. 77–83.
[85] L. Wei, Y. Liu, and S.-C. Cheung, "Taming Android fragmentation: Characterizing and detecting compatibility issues for Android apps," in Proc. 31st IEEE/ACM Int. Conf. Automated Softw. Eng., 2016, pp. 226–237.
[86] H. Khalid, M. Nagappan, and A. Hassan, "Examining the relationship between FindBugs warnings and end user ratings: A case study on 10,000 Android apps," IEEE Softw., vol. 33, no. 4, pp. 34–39, Jul./Aug. 2016.
[87] W. Yang, M. R. Prasad, and T. Xie, "A grey-box approach for automated GUI-model generation of mobile applications," in Proc. Int. Conf. Fundamental Approaches Softw. Eng., 2013, pp. 250–265.
[88] A. M. Memon, I. Banerjee, and A. Nagarajan, "GUI ripping: Reverse engineering of graphical user interfaces for testing," in Proc. 10th Workshop Conf. Reverse Eng., 2003, vol. 3, p. 260.
[89] A. Machiry, R. Tahiliani, and M. Naik, "Dynodroid: An input generation system for Android apps," in Proc. Joint Meeting Eur. Softw. Eng. Conf. ACM SIGSOFT Symp. Found. Softw. Eng., 2013, pp. 224–234.
[90] D. Octeau et al., "Combining static analysis with probabilistic models to enable market-scale Android inter-component analysis," in Proc. 43th Symp. Principles Programm. Lang., 2016, pp. 469–484.
[91] L. Li, A. Bartel, T. F. Bissyandé, J. Klein, and Y. Le Traon, "ApkCombiner: Combining multiple Android apps to support inter-app analysis," in Proc. 30th IFIP Int. Conf. ICT Syst. Secur. Privacy Protection, 2015, pp. 513–527.
[92] L. Li, A. Bartel, J. Klein, and Y. Le Traon, "Automatically exploiting potential component leaks in Android applications," in Proc. 13th Int. Conf. Trust, Secur. Privacy Comput. Commun., 2014, p. 10.
[93] K. Jamrozik, P. von Styp-Rekowsky, and A. Zeller, "Mining sandboxes," in Proc. IEEE/ACM 38th Int. Conf. Softw. Eng., 2016, pp. 37–48.
[94] Y.-M. Baek and D.-H. Bae, "Automated model-based Android GUI testing using multi-level GUI comparison criteria," in Proc. Int. Conf. Automated Softw. Eng., 2016, pp. 238–249.
[95] Z. Qin, Y. Tang, E. Novak, and Q. Li, "MobiPlay: A remote execution based record-and-replay tool for mobile applications," in Proc. IEEE/ACM 38th Int. Conf. Softw. Eng., 2016, pp. 571–582.
[96] Y. L. Arnatovich, M. N. Ngo, T. H. B. Kuan, and C. Soh, "Achieving high code coverage in Android UI testing via automated widget exercising," in Proc. 23rd Asia-Pacific Softw. Eng. Conf., 2016, pp. 193–200.
[97] H. Zhu, X. Ye, X. Zhang, and K. Shen, "A context-aware approach for dynamic GUI testing of Android applications," in Proc. 39th IEEE Annu. Comput. Softw. Appl. Conf., 2015, pp. 248–253.
[98] K. Song, A.-R. Han, S. Jeong, and S. D. Cha, "Generating various contexts from permissions for testing Android applications," in Proc. Int. Conf. Softw. Eng. Knowl. Eng., 2015, pp. 87–92.
[99] N. Mirzaei, H. Bagheri, R. Mahmood, and S. Malek, "Sig-Droid: Automated system input generation for Android applications," in Proc. IEEE 26th Int. Symp. Softw. Rel. Eng., 2015, pp. 461–471.
[100] B. Jiang, P. Chen, W. K. Chan, and X. Zhang, "To what extent is stress testing of Android TV applications automated in industrial environments?" IEEE Trans. Rel., vol. 65, no. 3, pp. 1223–1239, Sep. 2016.
[101] T. Griebe, M. Hesenius, and V. Gruhn, "Towards automated UI-tests for sensor-based mobile applications," in Proc. Int. Conf. Intell. Softw. Methodologies, Tools Techn., 2015, pp. 3–17.
[102] D. Amalfitano, N. Amatucci, A. R. Fasolino, and P. Tramontana, "AGRippin: A novel search based testing technique for Android applications," in Proc. Int. Workshop Softw. Development Lifecycle Mobile, 2015, pp. 5–12.
[103] C. Q. Adamsen, G. Mezzetti, and A. Møller, "Systematic execution of Android test suites in adverse conditions," in Proc. Int. Symp. Softw. Testing Anal., 2015, pp. 83–93.
[104] I. C. Morgado and A. C. Paiva, "The iMPAcT tool: Testing UI patterns on mobile applications," in Proc. 30th IEEE Int. Conf. Automated Softw. Eng., 2015, pp. 876–881.
[105] R. Mahmood, N. Mirzaei, and S. Malek, "EvoDroid: Segmented evolutionary testing of Android apps," in Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2014, pp. 599–609.
[106] Y.-D. Lin, J. F. Rojas, E. T.-H. Chu, and Y.-C. Lai, "On the accuracy, efficiency, and reusability of automated test oracles for Android devices," IEEE Trans. Softw. Eng., vol. 40, no. 10, pp. 957–970, Oct. 2014.
[107] C.-J. Liang et al., "Caiipa: Automated large-scale mobile app testing through contextual fuzzing," in Proc. 20th Annu. Int. Conf. Mobile Comput. Netw., 2014, pp. 519–530.
[108] X. Li, Y. Jiang, Y. Liu, C. Xu, X. Ma, and J. Lu, "User guided automation for testing mobile apps," in Proc. 21st Asia-Pacific Softw. Eng. Conf., 2014, pp. 27–34.
[109] C. Holzmann and P. Hutflesz, "Multivariate testing of native mobile applications," in Proc. 12th Int. Conf. Advances Mobile Comput. Multimedia, 2014, pp. 85–94.
[110] X. Chen and Z. Xu, "Towards automatic consistency checking between web application and its mobile application," in Proc. Int. Conf. Softw. Eng. Knowl. Eng., 2014, pp. 53–58.
[111] D. Amalfitano et al., "Improving code coverage in Android apps testing by exploiting patterns and automatic test case generation," in Proc. Int. Workshop Long-Term Ind. Collaboration Softw. Eng., 2014, pp. 29–34.
[112] M. Adinata and I. Liem, "A/B test tools of native mobile application," in Proc. 24th Int. Conf. Data Softw. Eng., 2014, pp. 1–6.
[113] Y.-D. Lin, E. T.-H. Chu, S.-C. Yu, and Y.-C. Lai, "Improving the accuracy of automated GUI testing for embedded systems," IEEE Softw., vol. 31, no. 1, pp. 39–45, Jan./Feb. 2014.
[114] W. Choi, G. Necula, and K. Sen, "Guided GUI testing of Android apps with minimal restart and approximate learning," in Proc. ACM SIGPLAN Int. Conf. Object Oriented Programm. Syst. Lang. Appl., 2013, pp. 623–640.
[115] T. Azim and I. Neamtiu, "Targeted and depth-first exploration for systematic testing of Android apps," in Proc. ACM SIGPLAN Int. Conf. Object Oriented Programm. Syst. Lang. Appl., 2013, pp. 641–660.
[116] D. Amalfitano, A. R. Fasolino, P. Tramontana, and N. Amatucci, "Considering context events in event-based testing of mobile applications," in Proc. IEEE 6th Int. Conf. Softw. Testing, Verification Validation Workshops, 2013, pp. 126–133.
[117] A. Corradi, M. Fanelli, L. Foschini, and M. Cinque, "Context data distribution with quality guarantees for Android-based mobile systems," Secur. Commun. Netw., vol. 6, pp. 450–460, 2013.
[118] S. Bauersfeld, "GUIdiff—A regression testing tool for graphical user interfaces," in Proc. Int. Conf. Softw. Testing, Verification Validation, 2013, pp. 499–500.
[119] C. S. Jensen, M. R. Prasad, and A. Møller, "Automated testing with targeted event sequence generation," in Proc. Int. Symp. Softw. Testing Anal., 2013, pp. 67–77.
[120] H. v. d. Merwe, B. v. d. Merwe, and W. Visser, "Verifying Android applications using Java PathFinder," in Proc. ACM SIGSOFT Softw. Eng. Notes, 2012, pp. 1–5.
[121] H.-K. Kim, "Hybrid mobile testing model," in Proc. Int. Conf. Adv. Softw. Eng. Appl. Disaster Recovery Business Continuity, 2012, pp. 42–52.
[122] S. Anand, M. Naik, M. J. Harrold, and H. Yang, "Automated concolic testing of smartphone apps," in Proc. ACM SIGSOFT Int. Symp. Found. Softw. Eng., 2012, Art. no. 59.
[123] T. Takala, M. Katara, and J. Harty, "Experiences of system-level model-based GUI testing of an Android application," in Proc. 4th IEEE Int. Conf. Softw. Testing, Verification Validation, 2011, pp. 377–386.
[124] B. Sadeh, K. Ørbekk, M. M. Eide, N. C. Gjerde, T. A. Tønnesland, and S. Gopalakrishnan, "Towards unit testing of user interface code for Android mobile applications," in Proc. Int. Conf. Softw. Eng. Comput. Syst., 2011, pp. 163–175.
[125] D. Amalfitano, A. R. Fasolino, and P. Tramontana, "A GUI crawling-based technique for Android mobile application testing," in Proc. IEEE 4th Int. Conf. Softw. Testing, Verification Validation Workshops, 2011, pp. 252–261.
[126] Z. Liu, X. Gao, and X. Long, "Adaptive random testing of mobile application," in Proc. 2nd Int. Conf. Comput. Eng. Technol., 2010, pp. v2-297–v2-301.
[127] M. G. Limaye, Software Testing. New York, NY, USA: McGraw-Hill, 2009.
[128] Software Testing Fundamentals, "Software testing levels." [Online]. Available: http://softwaretestingfundamentals.com/software-testing-levels/. Accessed on: Aug. 2018.
[129] W. Afzal, R. Torkar, and R. Feldt, "A systematic review of search-based testing for non-functional system properties," Inf. Softw. Technol., vol. 51, pp. 957–976, 2009.
[130] R. Hamlet, "Random testing," in Encyclopedia of Software Engineering. Hoboken, NJ, USA: Wiley, 1994.
[131] T. Vidas and N. Christin, "Evading Android runtime analysis via sandbox detection," in Proc. 9th ACM Symp. Inf. Comput. Commun. Secur., 2014, pp. 447–458.
[132] D. Arp, M. Spreitzenbarth, M. Hubner, H. Gascon, K. Rieck, and C. Siemens, "Drebin: Effective and explainable detection of Android malware in your pocket," in Proc. Netw. Distrib. Syst. Secur., 2014, pp. 23–26.
[133] M. Spreitzenbarth, F. Freiling, F. Echtler, T. Schreck, and J. Hoffmann, "Mobile-sandbox: Having a deeper look into Android applications," in Proc. 28th Annu. ACM Symp. Appl. Comput., 2013, pp. 1808–1815.
[134] V. F. Taylor and I. Martinovic, "To update or not to update: Insights from a two-year study of Android app evolution," in Proc. ACM Asia Conf. Comput. Commun. Secur., 2017, pp. 45–57.
[135] M. Turner, B. Kitchenham, D. Budgen, and O. Brereton, "Lessons learnt undertaking a large-scale systematic literature review," in Proc. 12th Int. Conf. Eval. Assessment Softw. Eng., vol. 8, 2008.
[136] K. Tam, A. Feizollah, N. B. Anuar, R. Salleh, and L. Cavallaro, "The evolution of Android malware and Android analysis techniques," ACM Comput. Surv., vol. 49, no. 4, 2017, Art. no. 76.
[137] M. Xu et al., "Toward engineering a secure Android ecosystem: A survey of existing techniques," ACM Comput. Surv., vol. 49, no. 2, 2016, Art. no. 38.
[138] S. Zein, N. Salleh, and J. Grundy, "A systematic mapping study of mobile application testing techniques," J. Syst. Softw., vol. 117, pp. 334–356, 2016.
[139] A. Méndez-Porras, C. Quesada-López, and M. Jenkins, "Automated testing of mobile applications: A systematic map and review," in Proc. 18th Ibero-Amer. Conf. Softw. Eng., Lima, Peru, 2015, pp. 195–208.
[140] D. Amalfitano, A. R. Fasolino, P. Tramontana, and B. Robbins, "Testing Android mobile applications: Challenges, strategies, and approaches," Advances Comput., vol. 89, no. 6, pp. 1–52, 2013.
[141] J. Gao, W.-T. Tsai, R. Paul, X. Bai, and T. Uehara, "Mobile testing-as-a-service (MTaaS)—Infrastructures, issues, solutions and needs," in Proc. 15th Int. Symp. High-Assurance Syst. Eng., 2014, pp. 158–167.
[142] O. Starov, S. Vilkomir, A. Gorbenko, and V. Kharchenko, "Testing-as-a-service for mobile applications: State-of-the-art survey," in Dependability Problems of Complex Information Systems, W. Zamojski and J. Sugier, Eds. Berlin, Germany: Springer, 2015, pp. 55–71.
[143] H. Muccini, A. D. Francesco, and P. Esposito, "Software testing of mobile applications: Challenges and future research directions," in Proc. 7th Int. Workshop Automat. Softw. Test, 2012, pp. 29–35.
[144] M. Janicki, M. Katara, and T. Pääkkönen, "Obstacles and opportunities in deploying model-based GUI testing of mobile software: A survey," Softw. Testing, Verification Rel., vol. 22, no. 5, pp. 313–341, 2012.
[145] A. Sadeghi, H. Bagheri, J. Garcia, and S. Malek, "A taxonomy and qualitative comparison of program analysis techniques for security assessment of Android software," IEEE Trans. Softw. Eng., vol. 43, no. 6, pp. 492–530, Jun. 2017.
[146] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, "A survey of app store analysis for software engineering," IEEE Trans. Softw. Eng., vol. 43, no. 9, pp. 817–847, Sep. 2017.
[147] P. Yan and Z. Yan, "A survey on dynamic mobile malware detection," Softw. Quality J., pp. 1–29, 2017.

Authors' photographs and biographies not available at the time of publication.
