ASE22 Industry Fastbot
ABSTRACT
We introduce a reusable automated model-based GUI testing technique for Android apps to accelerate the testing cycle. Our key insight is that the knowledge of event-activity transitions from the previous testing runs, i.e., executing which events can reach which activities, is valuable for guiding the follow-up testing runs to quickly cover major app functionalities. To this end, we propose (1) a probabilistic model to memorize and leverage this knowledge during testing, and (2) a model-based guided testing strategy (enhanced by a reinforcement learning algorithm). We implemented our technique as an automated testing tool named Fastbot2. The evaluation on two popular industrial apps (with billions of user installations), Douyin and Toutiao, shows that Fastbot2 outperforms the state-of-the-art testing tools (Monkey, Ape and Stoat) in both activity coverage and fault detection in the context of continuous testing. To date, Fastbot2 has been deployed in the CI pipeline at ByteDance for nearly two years, and 50.8% of the developer-fixed crash bugs were reported by Fastbot2, which significantly improves app quality. Fastbot2 has been made publicly available to benefit the community at: https://github.com/bytedance/Fastbot_Android.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging; • Computing methodologies → Reinforcement learning.

ACM Reference Format:
Zhengwei Lv, Chao Peng, Zhao Zhang, Ting Su, Kai Liu, and Ping Yang. 2022. Fastbot2: Reusable Automated Model-based GUI Testing for Android Enhanced by Reinforcement Learning. In 37th IEEE/ACM International Conference on Automated Software Engineering (ASE ’22), October 10–14, 2022, Rochester, MI, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3551349.3559505

∗ Chao Peng and Ting Su are the corresponding authors.
† Kai Liu was a research intern at ByteDance when this work was conducted.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ASE ’22, October 10–14, 2022, Rochester, MI, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9475-8/22/10. . . $15.00
https://doi.org/10.1145/3551349.3559505

1 INTRODUCTION
Mobile apps have drastically increased in number over the recent years [1]. Ensuring app quality is crucial to keeping user loyalty and maintaining business success. To this end, automated GUI testing has become an attractive and cost-effective solution [2, 10, 11].
In practice, an industrial app undergoes frequent updates to catch up with the changing user demands. At ByteDance, we release new updates of major apps on a weekly basis. As a result, continuous testing becomes crucial for quick feedback on app quality (e.g., doing smoke testing) whenever a new internal version is built. However, simply adopting existing testing tools [10], although feasible, is inefficient and ineffective, as they simply rerun each version from scratch and do not leverage the knowledge from previous testing runs to accelerate GUI testing in the current run.
To fill this important gap, we introduce a reusable automated GUI testing technique. Our key idea is to leverage model-based testing (MBT). Among existing testing solutions, MBT is recognized for its unique model construction phase, which is ideal for storing and leveraging the prior knowledge. However, we face two major technical challenges in putting our idea into practice.
The first challenge is how to effectively store the knowledge from previous testing runs. Our key insight is that the knowledge of event-activity transitions, i.e., executing which events reaches which activities, is valuable for guiding the follow-up testing to quickly cover core activities. Thus, we propose a probabilistic model as the basis of MBT to memorize such knowledge from each testing run. This model stores a set of event-activity transitions, each of which records the historical probability of an event to reach an activity. Moreover, to tackle the complexity of industrial apps, we introduce the concept of a hyper-event to represent events in this model, which helps balance model scalability and accuracy.
The second challenge is how to effectively leverage the prior knowledge to guide GUI testing. Classic MBT methods traverse the model to generate GUI events (i.e., GUI tests). However, one prominent problem is that such GUI tests are likely to be broken due to the unawareness of the connectivity between different GUI events. To overcome this issue, our key insight is to employ the probabilistic model to achieve on-the-fly, guided model-based testing. Specifically, the probabilistic model (which stores the knowledge of event-activity transitions) provides one-step guidance about which events on the current GUI page could be selected to quickly reach
Figure 1: Fastbot2’s workflow in each testing run. The arrows denote different steps, which are annotated by “a1”, “b1”, etc.
those not-yet-covered activities in the current testing run. Moreover, to further improve performance, we develop a reinforcement learning algorithm to provide multi-step guidance (also informed by the probabilistic model), which aims to reach those deep activities requiring executing multiple sequential events.
We implemented our technique as an automated testing tool named Fastbot2. Our evaluation on two popular industrial apps, Douyin and Toutiao, shows that Fastbot2 outperforms the two state-of-the-art MBT tools, Stoat [9] and Ape [5], and the random testing tool Monkey [4] in both activity coverage and bug finding in the context of continuous testing. To sum up, our work makes two major contributions: (1) We propose a reusable automated model-based GUI testing technique enhanced by reinforcement learning to satisfy the practical needs of continuous testing, which has not been considered by prior work. (2) Our implementation, Fastbot2, outperforms the state-of-the-art. It has also been successfully deployed in the CI pipeline at ByteDance and received positive feedback on its ability to improve app quality.

2 FASTBOT2
Figure 1 shows Fastbot2’s workflow. Fastbot2 takes as input an APK file of the app and outputs coverage and crash reports. Fastbot2 includes two major phases. The first phase does setup before testing: decompiling the APK to gather static widget text labels (Step “a1”), installing the app on a pool of mobile devices (Step “a2”), and loading the historical data if available to populate the probabilistic model (Step “a3”, cf. Section 2.2.1). The second phase does guided GUI exploration (cf. Section 2.3). Fastbot2 dumps the current GUI page from the app (Step “b1”), identifies and abstracts available hyper-events from the current page (Step “b2”, cf. Section 2.2.2), selects the event which is likely to increase activity coverage (Step “b3”), executes this UI event (Step “b4”) and updates the historical testing data, the probabilistic model (Step “b5”) and the reinforcement learning agent (Step “b6”). These steps (“b1”∼“b6”) will be iteratively conducted until the time budget is used up.

2.1 An Illustrative Example
Figure 2(a) gives an illustrative example taken from Toutiao, a popular daily news app. We use this example to ease the exposition of our approach in Sections 2.2 and 2.3. In Toutiao, users can view news on Activity 1, click the news title (e.g., “Title 1”) to reach Activity 2 to view the content, click the news author (“Author”) to view all news from this author on Activity 3 and watch the live video on Activity 4 if available by clicking the profile photo.

[Figure 2 graphic omitted; recoverable content follows.]
(a) App Under Test (Toutiao): Activity 1 (Home), Activity 2 (News Info), Activity 3 (Author Homepage), Activity 4 (Livestream).
(b) Probabilistic Model. M (initial): e1 → Activity 2 (60%), Activity 5 (40%); e2 → Activity 3 (100%); e3 → Activity 1 (90%), Activity 5 (10%); e4 → Activity 1 (100%). M’ (after executing e1 and e5): e1 → Activity 2 (63.6%), Activity 5 (36.4%); e2 → Activity 3 (100%); e3 → Activity 1 (90%), Activity 5 (10%); e4 → Activity 1 (100%); e5 → Activity 3 (100%).
(c) Q-value Update: a Q table holding one Q-value per hyper-event, updated from the rewards of the executed events.
Figure 2: Activity Transition Example from Toutiao App

2.2 Probabilistic Model and Hyper-events
2.2.1 Probabilistic Model. We propose a probabilistic model as the basis of MBT to memorize the knowledge of event-activity transitions from previous testing runs. Specifically, this model $M$ is formally defined as a 3-tuple $M = (\mathcal{E}, \mathcal{A}, \delta)$, where
• $\mathcal{E}$ is the set of hyper-events created from UI widgets.
• $\mathcal{A}$ is the set of activities of the app under test.
• $\delta$ is the transition function, i.e., $\mathcal{E} \to \mathcal{P}(\mathcal{A} \times [0, 1])$. $\mathcal{P}$ is a powerset function and each transition is of the form $e \to (A, p)$, meaning that the probability of a hyper-event $e$ reaching an app activity $A$ is $p$, where $e \in \mathcal{E}$ and $A \in \mathcal{A}$.
The probabilistic model $M$ is constructed from historical exploration data. The probability of reaching activity $A_i$ by executing the hyper-event $e$ (denoted by $P(e, A_i)$) is calculated by:
$$P(e, A_i) = \frac{N(e, A_i)}{N(e)} \qquad (1)$$
where $N(e, A_i)$ denotes the number of times $e$ reached $A_i$, and $N(e)$ denotes the total number of times $e$ was executed in all the previous testing runs. If the hyper-event $e$ can reach $k$ activities (e.g., $A_1, \ldots, A_i, \ldots, A_k$), $\sum_{i=1}^{k} P(e, A_i) = 1$ holds.

Example. Figure 2(b) gives an example of the initial probabilistic model (see the left part) loaded from previous testing runs before starting the current testing run. For example, Activity 2 can be reached by executing the hyper-event e1 on Activity 1; e1 can reach Activity 2 and Activity 5 with probability values 60% and 40%, respectively.

2.2.2 Hyper-events. We propose the concept of hyper-event to represent the events in the probabilistic model. A hyper-event is created from each UI widget according to its properties. Specifically, we only consider the following four properties of a widget: the activity which the widget belongs to, the widget’s text¹, resource-id, and the supported action types (e.g., click, long click). In other words, if some widgets have the same four properties, we assume they have similar functionality and only one hyper-event will be created. We ignore all the other minor widget properties (e.g., a widget’s type) when creating hyper-events with the aim of balancing between model scalability and accuracy.

Example. In Figure 2(a), on Activity 1, Fastbot2 will create only one hyper-event “e1” for the widgets named “Title 1” and “Title 2” because these two widgets have identical widget properties: the same activity, the same empty text (“Title 1” and “Title 2” are the texts dynamically loaded from the app server without static text labels), the same resource-id, and the same click action type. In this way, we will create three hyper-events, i.e., “e1” (representing “Title 1” and “Title 2”), “e2” (representing “Author 1” and “Author 2”), “e3” (representing “Remarks 1” and “Remarks 2”), on Activity 1.

2.3 Model-based Guided UI Exploration
The key idea of Fastbot2 is to reuse the prior knowledge stored in the probabilistic model to effectively guide GUI testing. To achieve this, the key step is to decide which UI event on the current GUI page should be selected so as to quickly increase activity coverage. This step corresponds to Step “b3” in Figure 1. Specifically, given a GUI page, Fastbot2 extracts the available hyper-events, and selects the event² to be executed based on the two synergistically combined strategies: (1) model-based event selection (cf. Section 2.3.1), and (2) learning-based event selection (cf. Section 2.3.2).

2.3.1 Model-based event selection. Model-based event selection contains two modes, i.e., model expansion and model exploitation.

Model expansion. If some hyper-events from the current GUI page have not been included in the probabilistic model $M$, Fastbot2 will activate this mode to randomly select one not-yet-executed hyper-event. This situation may occur because the previous testing runs may not cover all hyper-events or some new app features have been added in the current tested app version. This mode can help expand the model and prioritize exploring potentially new features.

Model exploitation. If all the hyper-events from the current GUI page have been included in the probabilistic model $M$, Fastbot2 will activate this mode to select an event with higher probability to cover those not-yet-covered activities in the current testing run (which were covered in the previous testing runs). Let $\mathcal{A}_t$ be the set of already covered activities in the current testing run and $\mathcal{E}_c$ be the set of hyper-events from the current GUI page. The expectation of improving activity coverage by executing $e_i$ ($e_i \in \mathcal{E}_c$) can be computed as $\mathbb{E}(e_i) = \sum_{A \notin \mathcal{A}_t} P(e_i, A)$, $0 \le i \le |\mathcal{E}_c|$.
Here, $\mathbb{E}(e_i)$ represents the expected probability that those not-yet-covered activities in the current testing run will be covered after the hyper-event $e_i$ is executed. The higher $\mathbb{E}(e_i)$, the more likely it is to improve activity coverage. Thus, Fastbot2 in this mode selects the hyper-event $e_i$ by probability $P_M(e_i)$:
$$P_M(e_i) = \frac{\exp\left(\mathbb{E}(e_i) / \alpha\right)}{\sum_{e_j \in \mathcal{E}_c} \exp\left(\mathbb{E}(e_j) / \alpha\right)} \qquad (2)$$
where $\alpha$ is a hyperparameter which adjusts the randomness of this mode. This equation is adapted from the softmax formula. We also require that $e_i$ should be selected no more than $K$ times to ensure fairness. In practice, we set $\alpha$ as 0.8 and $K$ as 1. By using the probabilistic model as prior information, the model exploitation mode can quickly improve activity coverage in a short time.

Example. In Figure 2(a), three hyper-events are available on Activity 1. Since all these three events have been included in the probabilistic model $M$ (see the left part of Figure 2(b)), Fastbot2 activates the model exploitation mode to select events. According to $M$, events e1 and e2 are more likely to reach unexplored activities (i.e., Activity 2, 3, 5), while event e3 has 90% probability to stay in Activity 1. Thus, Fastbot2 is likely to select e1 or e2. Assume e1 is selected and then Activity 2 is covered. In Activity 2, event e4 (the back button) has 100% probability to return to Activity 1, while events e5 and e6 have not been included in $M$. At this time, Fastbot2 activates the model expansion mode and randomly selects e5 or e6. Assume e5 is selected and then Activity 3 is covered. Meanwhile, $M$ is updated by adding e5→Activity 3 with probability value 100% (see the right part of Figure 2(b)).

2.3.2 Learning-based event selection. However, the probabilistic model can only express one-step guidance information. Fortunately, reinforcement learning can extend this one-step guidance into multi-step guidance information.

Q-table expansion. The key component of the RL agent is the Q-table, which contains the Q-values (which indicate the possibility of executing each hyper-event to reach a new activity). During testing, no matter which event selection strategy is used, the Q-value of the selected hyper-event $e_t$ on the current GUI page, i.e., $Q(e_t)$, is updated to $Q(e_t) + \alpha (G_{t,t+n} - Q(e_t))$, where $G_{t,t+n}$ is the n-step cumulative reward calculated by an N-step Sarsa method [3]:
$$G_{t,t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n} Q(e_{t+n}) \qquad (3)$$
Here, $\gamma$ is the discount factor. $r_{t+1}$ is the immediate reward earned after the event $e_t$ is executed, which is defined as

¹ Here, the text means the static text labels stored in the resource files of the APK file. If the text is dynamically loaded from the app server, we treat its text as empty.
² After we decide which hyper-event should be selected, if the selected hyper-event represents multiple UI widgets, we will randomly pick one UI widget to exercise.
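As an illustration of Equations (1) and (2), the following Python sketch builds a small probabilistic model from transition counts and computes the model-exploitation selection probabilities. It is a minimal sketch under stated assumptions, not Fastbot2's implementation: the class and function names are hypothetical, and the raw counts are invented so that they reproduce the percentages shown for Activity 1 in Figure 2(b). The "at most K selections" fairness rule is left out for brevity.

```python
import math
import random


class ProbabilisticModel:
    """Hypothetical container for hyper-event -> activity transition counts."""

    def __init__(self):
        self.counts = {}  # event -> {activity: number of times reached}

    def record(self, event, activity, times=1):
        reached = self.counts.setdefault(event, {})
        reached[activity] = reached.get(activity, 0) + times

    def prob(self, event, activity):
        # Eq. (1): P(e, A) = N(e, A) / N(e)
        reached = self.counts.get(event, {})
        total = sum(reached.values())
        return reached.get(activity, 0) / total if total else 0.0

    def expectation(self, event, covered):
        # E(e): sum of P(e, A) over activities A not yet covered in this run
        reached = self.counts.get(event, {})
        return sum(self.prob(event, a) for a in reached if a not in covered)


def selection_probs(model, events, covered, alpha=0.8):
    # Eq. (2): softmax over E(e) / alpha (alpha = 0.8 as stated in the text)
    weights = {e: math.exp(model.expectation(e, covered) / alpha) for e in events}
    z = sum(weights.values())
    return {e: w / z for e, w in weights.items()}


def choose_event(probs):
    # Draw one hyper-event according to the softmax probabilities
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Assumed counts, chosen only to match the initial model in Figure 2(b).
model = ProbabilisticModel()
model.record("e1", "Activity 2", 6)
model.record("e1", "Activity 5", 4)
model.record("e2", "Activity 3", 5)
model.record("e3", "Activity 1", 9)
model.record("e3", "Activity 5", 1)

p_before = model.prob("e1", "Activity 2")
probs = selection_probs(model, ["e1", "e2", "e3"], covered={"Activity 1"})
picked = choose_event(probs)

# Executing e1 once more and reaching Activity 2 updates the model.
model.record("e1", "Activity 2")
p_after = model.prob("e1", "Activity 2")
```

With these assumed counts, e1 and e2 receive equal selection probability and e3 a lower one, which matches the narrative that Fastbot2 is likely to pick e1 or e2 on Activity 1. The update from 6/10 to 7/11 is one way the change from 60% to 63.6% in Figure 2(b) could arise.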
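The Q-value update built on Equation (3) can be sketched in Python as follows. This is a hedged illustration only: the function names, the learning rate, the discount factor, and the reward values are all assumptions, since the reward definition is not part of the text reproduced here. The return is computed in the standard n-step form, i.e., the discounted sum of the n observed rewards plus the discounted Q-value of the event selected n steps later.

```python
def n_step_return(rewards, q_bootstrap, gamma):
    """G = r_{t+1} + gamma * r_{t+2} + ... + gamma^n * Q(e_{t+n}).

    `rewards` holds the n rewards observed after executing e_t;
    `q_bootstrap` is the current Q-value of the event chosen at step t+n.
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * q_bootstrap


def update_q(q_table, event, rewards, bootstrap_event, alpha, gamma):
    # Q(e_t) <- Q(e_t) + alpha * (G - Q(e_t))
    g = n_step_return(rewards, q_table.get(bootstrap_event, 0.0), gamma)
    old = q_table.get(event, 0.0)
    q_table[event] = old + alpha * (g - old)
    return q_table[event]


# Placeholder values: two observed rewards (n = 2), then bootstrap from e7.
q_table = {"e1": 0.0, "e7": 1.0}
new_q = update_q(q_table, "e1", rewards=[1.0, 0.0], bootstrap_event="e7",
                 alpha=0.5, gamma=0.5)
```

Here the return is 1.0 + 0.5 * 0.0 + 0.25 * 1.0 = 1.25, so the Q-value of e1 moves halfway from 0.0 toward 1.25. Keeping the rewards as an input list avoids committing to any particular reward function.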
[Figure 3 plots omitted; only axis and legend text survives. Panels (a) and (b): number of covered activities per app version. Panels (c) and (d): number of revealed crashes per app version. Legend entries: Fastbot, APE, Stoat, Monkey, Fastbot-Accum., APE-Accum., Stoat-Accum., Monkey-Accum.]
(a) Activity Coverage for Toutiao (b) Activity Coverage for Douyin (c) Uncovered Crashes of Toutiao (d) Uncovered Crashes of Douyin
Figure 3: Testing results between Fastbot2 and other tools in terms of activity coverage and bug finding.
(a) Toutiao (b) Douyin
Figure 4: Differences of accumulative activity coverage.

3.2 RQ2: Ablation Study
Figure 5(a) and (b) show the activity coverage of Toutiao and Douyin achieved by different testing strategies within 1-hour testing. Fastbot2 (denoted by “RL+PM”) achieves 31.5% coverage for Toutiao, which is higher than both the model-based event selection strategy alone (28.5%, denoted by “PM Only”) and the learning-based event selection strategy alone (29.0%, denoted by “RL Only”). Similarly, Fastbot2 achieves higher coverage for Douyin (20.8%) than the model-based strategy alone (18.6%) and the learning-based strategy alone (19.2%). The result indicates that both the model-based and learning-based event selection strategies contribute to Fastbot2’s overall performance in improving activity coverage.

(a) Toutiao (b) Douyin
Figure 5: Comparing Fastbot2’s internal strategies.

4 INDUSTRIAL DEPLOYMENT
To date, Fastbot2 has been deployed in the Continuous Integration pipeline at ByteDance for nearly two years. Fastbot2 is automatically triggered by nightly builds to obtain quick feedback on app quality when new code changes occur. We have received positive feedback from app development teams. For example, among all the developer-fixed bugs for Toutiao from September 1 to October 31, 2021, 50.8% of these bugs were uncovered by Fastbot2. Additionally, Fastbot2 can cover 80% of the hot-spot activities in Toutiao that are frequently visited by online users. These results corroborate Fastbot2’s strong effectiveness.

5 RELATED WORK
We focus on discussing the industrial practice of automated Android GUI testing. Facebook Sapienz [6] adopts search-based testing to improve code coverage and fault detection. They also extract information from crowd-based testing to enhance Sapienz [7]. WeChat WCTester [12, 13] adopts Monkey-based random testing. WCTester allows human testers to specify blacklisted widgets and define GUI event sequences to improve coverage. However, our work is significantly different from the prior work. First, Fastbot2 mainly adopts model-based testing (enhanced by a learning-based algorithm). Second, Fastbot2 reuses the knowledge from the historical exploration data to resolve the practical needs of continuous testing, which have not been considered by prior work.

6 CONCLUSION
This paper presents a reusable automated model-based GUI testing technique for Android enhanced by reinforcement learning to satisfy the practical needs of continuous testing. Our implementation Fastbot2 outperforms the three state-of-the-art testing tools in both activity coverage and bug finding in the scenario of continuous testing on two popular apps Douyin and Toutiao. Fastbot2 has been successfully deployed in the CI pipeline at ByteDance and received positive feedback on its ability to improve app quality.

REFERENCES
[1] AppBrain. 2022. Retrieved June 3, 2022 from https://www.appbrain.com/stats/number-of-android-apps
[2] Shauvik Roy Choudhary, Alessandra Gorla, and Alessandro Orso. 2015. Automated test input generation for android: Are we there yet?. In ASE. IEEE, 429–440.
[3] Kristopher De Asis, J Hernandez-Garcia, G Holland, and Richard Sutton. 2018. Multi-step reinforcement learning: A unifying algorithm. In AAAI, Vol. 32.
[4] Google. 2021. UI/Application Exerciser Monkey. Retrieved March 3, 2021 from https://developer.android.com/studio/test/monkey
[5] Tianxiao Gu, Chengnian Sun, Xiaoxing Ma, Chun Cao, Chang Xu, Yuan Yao, Qirun Zhang, Jian Lu, and Zhendong Su. 2019. Practical GUI testing of Android applications via model abstraction and refinement. In ICSE. IEEE, 269–280.
[6] Ke Mao, Mark Harman, and Yue Jia. 2016. Sapienz: Multi-objective automated testing for android applications. In ISSTA. 94–105.
[7] Ke Mao, Mark Harman, and Yue Jia. 2017. Crowd intelligence enhances automated mobile testing. In ASE. IEEE, 16–26.
[8] Minxue Pan, An Huang, Guoxin Wang, Tian Zhang, and Xuandong Li. 2020. Reinforcement learning based curiosity-driven testing of Android applications. In ISSTA. 153–164.
[9] Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In ESEC/FSE. 245–256.
[10] Ting Su, Jue Wang, and Zhendong Su. 2021. Benchmarking Automated GUI Testing for Android against Real-World Bugs. In ESEC/FSE. to appear.
[11] Wenyu Wang, Dengfeng Li, Wei Yang, Yurui Cao, Zhenwen Zhang, Yuetang Deng, and Tao Xie. 2018. An empirical study of android test generation tools in industrial cases. In ASE. IEEE, 738–748.
[12] Xia Zeng, Dengfeng Li, Wujie Zheng, Fan Xia, Yuetang Deng, Wing Lam, Wei Yang, and Tao Xie. 2016. Automated test input generation for android: Are we really there yet in an industrial case?. In ESEC/FSE. 987–992.
[13] Haibing Zheng, Dengfeng Li, Beihai Liang, Xia Zeng, Wujie Zheng, Yuetang Deng, Wing Lam, Wei Yang, and Tao Xie. 2017. Automated test input generation for android: Towards getting there in an industrial case. In ICSE-SEIP. IEEE, 253–262.