20
Including Non-functional Testing in Build and Continuous Integration Cycles
THIS CHAPTER WAS CONTRIBUTED BY AMIR ROZENBERG, SR. DIRECTOR OF PRODUCT MANAGEMENT AT PERFECTO
AMIR ROZENBERG is a Senior Director of Product Management at Perfecto, the leading digital, Cloud, and automation company. As such, he is responsible for the core product strategy of Perfecto. Amir has pioneered many practices to extend the company’s offering across the whole application life cycle. He has vast experience in the digital industry, with expertise in areas such as infrastructure, applications, application delivery, and monitoring. Amir has led many successful synergies with technology partners to optimize the value delivered to Perfecto’s customers. Prior to working at Perfecto, Mr. Rozenberg headed the mobile monitoring practice
at Compuware. Prior to Compuware, he co-founded Adva Mobile, a direct-to-fan mobile marketing startup, and held various leadership positions at Groove Mobile, Nextcode Corp., and others.
INTRODUCTION
Most software vendors have learned the hard way how difficult it is to verify compliance with non-functional requirements. While this is a fundamental challenge for every software delivery organization, in practice teams fail at it with remarkable consistency.
This chapter explores this highly important topic in more detail. It walks you through the trends and the biggest challenges in performance testing today, along with good practices that make the work more manageable. Real-world examples and case studies illustrate the crucial relationship between performance and user experience, as well as its impact in terms of money. Last but not least, the chapter offers valuable insights into the considerable pain of guaranteeing great user experience and strong end-user engagement for mobile apps.
MODERN TESTING OF NON-FUNCTIONAL REQUIREMENTS
The traditional approach to non-functional testing places such activities at the end of the software delivery process, sometimes even beyond it. The net effect? These critical tests often never run in time to confirm the delivery readiness of upcoming releases.
We can definitely do better. New ways of improving non-functional coverage emerge every day. Modern testing frameworks and Cloud Labs allow some of these tests to execute earlier in the SDLC, thus smoothing the transitions from development to functional and then to non-functional testing.
Yes, technology provides the means for early identification and resolution of non-functional defects that could become showstoppers for a release, thus reducing costs, unnecessary effort, and risk. Technology is just a tool, though. On the road to success you will need to stick to agile principles. Furthermore, you need to constantly monitor, fine-tune, and promote them so that they are aligned, well understood, and trusted by all of
the development and testing teams involved.
EVOLUTION OF PERFORMANCE TESTING
The world of performance testing has changed dramatically in the last few years. It is no longer about testing the service API layer or stressing a single server with a bit of load. Distributed Cloud architectures, increased integration with and consumption of 3rd party services, and different types of content (static, dynamic, streaming, large, or extra large in volume) mean just one thing: modern applications operate in fragmented, siloed, and latency-rich environments. These factors often cause serious flaws in user experience (UX), which is crucial for client-facing applications. Consequently, in many cases UX should also be taken into account and measured during load testing. That is easy to say, but difficult to achieve.
TRADITIONAL APPROACHES AND THEIR SHORTCOMINGS
In order to ensure the right test coverage, one would need to run massive tests with thousands or millions of virtual users, which can realistically only happen in a staging environment (just before production). Isn’t such an approach a good recipe for release showstoppers? There are several challenges with this traditional approach:
1. Maintaining different environments (such as test, staging, etc.) is expensive.
2. The time it takes to create and update such environments to the latest build is simply not available anymore.
3. As mentioned many times in this book, finding issues just before an upcoming release is very expensive.
Therefore, we need to be able to scale our performance testing architecture up and down, both in test scope and in the required target environment. Ideally, such tests would execute as part of nightly continuous integration builds. This way, any major performance regression introduced the day before can be detected and resolved the day after, before it is too late.
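As a concrete illustration of such a nightly check, the sketch below is an assumption, not something taken from the chapter: it runs a lightweight latency probe against a hypothetical staging endpoint and fails the build if the median response time regresses by more than 20% against the previous night's baseline.

```python
# Minimal sketch of a nightly performance smoke check that could run in CI.
# The endpoint URL, baseline file, and 20% regression budget are assumptions.
import json
import statistics
import time
from pathlib import Path

import requests

ENDPOINT = "https://staging.example.com/api/catalog"   # hypothetical service under test
BASELINE = Path("perf_baseline.json")                  # stores last night's median latency
REGRESSION_BUDGET = 1.20                               # fail if 20% slower than baseline

def measure_median_latency(samples: int = 30) -> float:
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.get(ENDPOINT, timeout=10)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def main() -> None:
    median = measure_median_latency()
    if BASELINE.exists():
        baseline = json.loads(BASELINE.read_text())["median_s"]
        if median > baseline * REGRESSION_BUDGET:
            raise SystemExit(
                f"Performance regression: median {median:.3f}s vs baseline {baseline:.3f}s"
            )
    BASELINE.write_text(json.dumps({"median_s": median}))
    print(f"Nightly check passed, median latency {median:.3f}s")

if __name__ == "__main__":
    main()
```

A real nightly job would delegate the load generation to a proper tool such as JMeter; the point is only that the comparison against yesterday's baseline can run unattended inside CI.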
ON THE ROAD TO SCALABLE ARCHITECTURE FOR PERFORMANCE TESTS
You do not always need to test the application under heavy load. Sometimes it will collapse simply because of a severe, brand-new bug causing excessive memory usage (or a leak). And, speaking of memory, you could measure and track memory usage on a daily basis. You would not be interested in absolute values, but in consumption trends on a daily, weekly, and per-release basis. This knowledge could be used to define and align clear thresholds of what is acceptable and what is not. Those thresholds could be propagated to the backlog of software development teams as done criteria for each developed feature or user story, thus effectively avoiding serious performance regressions as the software evolves. In addition, such an approach can help pinpoint potential memory leaks or inefficient implementations that could turn into bottlenecks later on. The same applies to measuring, tracking, and analyzing other important performance metrics, such as CPU load. Resolving these kinds of issues today guarantees that they will not cost the organization a fortune tomorrow.
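For instance, a daily memory trend could be captured with a short script along the following lines. This is only a sketch under assumptions: an Android app, a hypothetical package name com.example.app, and a 250 MB threshold agreed with the team; the parsing of the dumpsys output may need adjusting for a given Android version.

```python
# Sketch of tracking an Android app's memory footprint per nightly build via adb,
# so consumption trends can be compared day over day.
# Package name, threshold, and log location are illustrative assumptions.
import csv
import datetime
import re
import subprocess
from pathlib import Path

PACKAGE = "com.example.app"          # hypothetical app under test
TREND_LOG = Path("memory_trend.csv") # one row per nightly run
THRESHOLD_MB = 250                   # "done" criterion agreed with the team

def total_pss_mb(package: str) -> float:
    """Read the TOTAL PSS value (in kB) reported by dumpsys meminfo and convert to MB."""
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "meminfo", package],
        capture_output=True, text=True, check=True
    ).stdout
    match = re.search(r"TOTAL\s+(\d+)", out)
    if not match:
        raise RuntimeError("Could not parse dumpsys meminfo output")
    return int(match.group(1)) / 1024.0

def main() -> None:
    pss = total_pss_mb(PACKAGE)
    with TREND_LOG.open("a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), f"{pss:.1f}"])
    if pss > THRESHOLD_MB:
        raise SystemExit(f"Memory budget exceeded: {pss:.1f} MB > {THRESHOLD_MB} MB")
    print(f"Memory within budget: {pss:.1f} MB")

if __name__ == "__main__":
    main()
```

Plotting the resulting CSV per build quickly exposes a creeping memory footprint long before it becomes a production incident.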
What if you need to run this same test suite in your pre-production environment,
taking into consideration real-world constraints and workloads? This leads us to
the conclusion that the target test architecture needs to be highly configurable
by design in order to scale well.
Then, what about user experience? Can we also measure and track the impact of development changes on UX while putting different loads on the tested application? Depending on our use cases and the nature of the tested application, we can achieve this if we have an orchestration layer that is able to generate load while also running carefully selected UI tests against the tested application in parallel.
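A minimal sketch of that idea follows, assuming a web application reachable at a hypothetical staging URL and a 5-second UX budget; in practice the background load would come from JMeter or a Cloud load generator rather than a handful of local threads.

```python
# Illustrative sketch of the orchestration idea: generate background load and,
# in parallel, time a user-facing transaction with Selenium. The URL and the
# 5-second UX budget are assumptions for the example.
import threading
import time

import requests
from selenium import webdriver

TARGET = "https://staging.example.com"   # hypothetical application under test
UX_BUDGET_S = 5.0

def generate_load(stop: threading.Event, workers: int = 20) -> None:
    """Very rough load generator; a real setup would delegate this to JMeter/Taurus."""
    def worker():
        while not stop.is_set():
            try:
                requests.get(TARGET, timeout=10)
            except requests.RequestException:
                pass
    for _ in range(workers):
        threading.Thread(target=worker, daemon=True).start()

def measure_ux_under_load() -> float:
    stop = threading.Event()
    generate_load(stop)
    driver = webdriver.Chrome()
    try:
        start = time.perf_counter()
        driver.get(TARGET)                 # the UI transaction we care about
        elapsed = time.perf_counter() - start
    finally:
        driver.quit()
        stop.set()
    return elapsed

if __name__ == "__main__":
    elapsed = measure_ux_under_load()
    assert elapsed < UX_BUDGET_S, f"Page load under load took {elapsed:.1f}s"
    print(f"UX check passed: {elapsed:.1f}s under load")
```

The essential design choice is that the UI measurement and the load generation share one orchestrated test run, so the UX numbers are always captured under a known, reproducible load.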
OPEN SOURCE WILL SAVE THE DAY — JMETER AND TAURUS
You can harness the real power of existing tools and frameworks and scale according to your needs by introducing scenarios that are complex from a tester’s perspective, such as using a technology backbone to run functional and non-functional tests in parallel so that you can correlate results from performance
and UI tests, for instance. The scalability and configuration capabilities of such a testing architecture can be ensured by a framework like Taurus.1 Since it plays nicely with JMeter2 and Selenium WebDriver, we are able to execute complex, highly configurable, and context-dependent test scenarios by mixing different types of tests and obtaining timely and valuable insights at any point of the SDLC.
When the build matures, the number of virtual users applied to the infrastructure can easily grow using scalable solutions from vendors like BlazeMeter (now part of CA). Such a solution can scale up the number of virtual users thanks to global Clouds and multiple scripts, thus allowing testers to observe the true scalability of the application under test. By plugging in and triggering Selenium WebDriver scripts while the application is effectively being “loaded up”, one can also measure, quantify, track, and analyze the way UX is affected by any change within the source control management system.
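To make the idea tangible, here is a sketch of how such a combined run could be driven from one Taurus configuration. The URLs, file names, load figures, and pass/fail threshold are illustrative assumptions, not the chapter's actual setup, and the snippet presumes Taurus is installed (pip install bzt) together with a local JMeter and Selenium toolchain.

```python
# Minimal sketch of mixing a JMeter-style load scenario with a Selenium script
# in one Taurus run. All names, URLs, and numbers are placeholders.
import subprocess
from pathlib import Path

TAURUS_CONFIG = """
execution:
- scenario: api-load          # JMeter executor generates the load
  executor: jmeter
  concurrency: 50
  ramp-up: 1m
  hold-for: 5m
- scenario: ui-journey        # Selenium executor measures UX in parallel
  executor: selenium
  iterations: 10

scenarios:
  api-load:
    requests:
    - https://staging.example.com/api/catalog
  ui-journey:
    script: ui_journey.py     # ordinary Selenium WebDriver test module

reporting:
- module: passfail
  criteria:
  - avg-rt>2s for 30s, stop as failed
"""

def main() -> None:
    config = Path("combined.yml")
    config.write_text(TAURUS_CONFIG)
    subprocess.run(["bzt", str(config)], check=True)   # Taurus CLI entry point

if __name__ == "__main__":
    main()
```

Because the whole run is a single command, the same configuration can be triggered from a nightly CI job and later pointed at a Cloud provider such as BlazeMeter for larger virtual-user counts.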
USER EXPERIENCE TESTING: WIND TUNNEL AND USER CONDITION TESTING
As digital applications proliferate and their functionality expands, end users’ expectations of reliable, consistent, and responsive behavior grow higher and higher. Numerous articles discuss the penalties vendors incur over poor user experience, regardless of whether it is their own fault or caused by 3rd parties. End users cannot tell the difference, and they do not need to. Mature software delivery organizations understand that. They also understand that UX can be severely affected by circumstances outside the control of software developers.
MAJOR PITFALLS FOR GREAT MOBILE USER EXPERIENCE
The very nature of mobile apps makes them extremely vulnerable to a whole range of flaws caused by the underlying carrier and network infrastructure. In terms of remote connectivity, these include poor network quality, roaming artifacts introduced by the local carrier network, server downtime, and many more. There might be a whole lot happening while users explore an app, as is also evident in Figure 108.
1 Taurus web page — http://gettaurus.org
2 JMeter open-source web page — http://jmeter.apache.org
If you put mobile devices into the equation, you might draw the conclusion that mobile apps sometimes function only due to some kind of dark magic that could lose its effect at any point in time. There is a plethora of possible application failures related to improper usage of shared system resources (memory, storage, CPU, camera, geo-location, sensors, etc.), application conflicts on the device, improper interaction, or the unpredictable behavior of a defective sensor, to name just a few. Would you care to imagine how the different versions of mobile OS and their peculiar constraints might make you stumble and fall?
Figure 108: Real User Journey Example — Multiple Notifications and Background Services
THE THREE STEPS TO SUCCESS
The road to superior user experience is steep and hard to walk. There is hope,
though. In order to tackle some of the aforementioned challenges, organizations can choose to follow the staged solution below. It consists of three steps.
1. Identify the scenarios that may harm the UX of the tested app
You may not know everything in advance, but you can safely start by analyzing the behavior of the app under poor network conditions. What would you discover if you started a number of background applications running concurrently on the devices? Would they somehow affect the UX of the app under test? Load tests are another technique you can employ at any time in order to identify low-performing and poorly scaling application functionality. And if your application really underperforms, why not source some input from your end users? Ideally, you will find evidence of such issues in your application log, in user-contributed reviews published in the corresponding app store, and/or in social networks. The most important thing to remember is that defining potentially harmful scenarios is not a one-time activity. You need to iterate on it. Gradually, you will ensure that all important performance indicators are taken into account as the app evolves.
2. Define thresholds and enforce Key Performance Indicators (KPIs) for UX.
Incorporate them into every test script so that, for example, the time needed to launch the app and log in can be measured and compared with previous test runs. Take your time to define sane timeouts for notable events and interactions with the app as part of your test suite. If the application takes 20 seconds to launch while you are chasing 3–5 seconds, you will appreciate the notification from a suddenly failing test (see the sketch below). This is also one way to drive the alignment of software and testing teams along the SDLC.
3. Ensure that your testing activities are in line with the needs and behavioral patterns of end users.
As illustrated in Figure 109, you could pick 2–3 personas that resemble your users and parametrize your testing strategies with their needs in mind. Repetitive testing using personas as part of your test suites, integrated and measured along the SDLC, allows detecting and resolving issues early. It also provides a continuous and objective indication of the quality of the application throughout the sprint. No doubt, this will turn out to be invaluable information for project leads, stakeholders, and decision makers in general.
Figure 109: Connecting a Test Scenario to a Possible Target Persona
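The following sketch shows one way to enforce such UX KPIs inside any UI test script. The step names and the 5-second and 3-second budgets are illustrative assumptions, and the driver calls shown in the usage comment are hypothetical.

```python
# Sketch of enforcing UX KPIs inside a test script: each key step is timed and
# compared against an agreed threshold, so a 20-second launch fails loudly when
# the team is chasing 3-5 seconds. Thresholds and step names are illustrative.
import time
from contextlib import contextmanager

KPI_SECONDS = {
    "app_launch": 5.0,   # agreed launch budget
    "login": 3.0,        # agreed login budget
}

@contextmanager
def kpi(step: str):
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    budget = KPI_SECONDS[step]
    assert elapsed <= budget, f"KPI breach: {step} took {elapsed:.1f}s (budget {budget:.1f}s)"
    print(f"{step}: {elapsed:.1f}s (budget {budget:.1f}s)")

# Usage inside any UI test (Selenium, Appium, etc.):
#
#   with kpi("app_launch"):
#       driver.launch_app()        # hypothetical driver call
#   with kpi("login"):
#       login_page.sign_in(user)   # hypothetical page object
```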
PERSONAS GO THROUGH WIND TUNNEL
Perfecto offers a type of testing called “Wind Tunnel”. It is a unique offering in terms of tooling and test execution support. You can use “Wind Tunnel” to define environmental profiles named “personas”. Essentially, a persona summarizes the characteristics of a distinct group of end users. The solution gives you the opportunity to manage alternate workflows based on the configuration of different personas. This way, you can emulate and verify the quality of core scenarios and interaction patterns relevant to a given group of users of your app. There is also a catalog of typical ready-made personas available with “Wind Tunnel”. They serve many types of mobile apps well and ease the whole personification exercise on your end. Feel free to re-use them as appropriate.
TROUBLESHOOTING LATENT AND INCORRECTLY RENDERED CONTENT
There is a unique challenge for native, and sometimes hybrid, mobile apps, namely the detection and troubleshooting of incorrect content rendered by mobile applications. This is difficult because the network traffic is usually encrypted. From a technical perspective, typical website traffic does not pose the same problem: one can easily detect that an overly large image was downloaded to a smartphone with a small screen. With native mobile apps, this is often close to impossible. The ability to examine the network traffic between the app and its backend(s) is key for many roles within a software delivery organization. Mobile developers and functional, performance, and security testers rely on it in order to:
1. Assess, optimize and validate the efficiency of downloading and rendering
the correct media objects necessary to fit the device screen — not too small
and not too large. In many cases, it may be reasonable to download such
objects only once and cache them for some time. We may also need to
download these objects according to a predefined order.
2. Ensure user content is sent in an encrypted manner and to the correct
backend server.
3. Guarantee efficient interaction with 3rd party service APIs.
A traditional approach to addressing this challenge has been to provision team access to tools such as the Charles proxy. The problem with this approach is that many organizations object to incorporating a Charles proxy, or any man-in-the-middle facility, into their corporate network.
Solutions like Perfecto have recently added the ability to produce a HAR file from every script execution in the Cloud. This practice could also be adopted by others. The HAR file3 is a layer 7 (HTTP) decrypted network traffic log containing the device or page interaction with the various backend system(s). Every piece of downloaded content can be easily examined, as demonstrated in Figure 110. The timestamp and duration of each download phase (DNS time, connect time, download time, etc.), as well as request and response message bodies and headers, are just some of the interesting insights bundled with HAR files. Since the solution is deployed in the Cloud, it is very easy for mobile developers and testers to use this feature extensively.
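Because a HAR file is plain JSON, the checks listed above can be automated with a few lines of code. The sketch below makes some assumptions (a local file named session.har and a 500 kB image budget) and simply prints its findings; a real suite would turn them into assertions.

```python
# Sketch of inspecting a HAR file exported from a cloud test run: flag
# unencrypted requests and oversized images, and print per-request timings.
# File name and the 500 kB image budget are assumptions.
import json

MAX_IMAGE_BYTES = 500_000

with open("session.har", encoding="utf-8") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    request = entry["request"]
    response = entry["response"]
    url = request["url"]

    # 2. User content should travel encrypted and to the expected backend.
    if url.startswith("http://"):
        print(f"Unencrypted traffic: {url}")

    # 1. Media objects should fit the device screen, not blow past the budget.
    content = response.get("content", {})
    mime = content.get("mimeType", "")
    size = content.get("size", 0)
    if mime.startswith("image/") and size > MAX_IMAGE_BYTES:
        print(f"Oversized image ({size} bytes): {url}")

    # 3. Timings expose slow DNS, connect, or download phases per request.
    timings = entry.get("timings", {})
    print(f"{url}: dns={timings.get('dns')}ms connect={timings.get('connect')}ms "
          f"receive={timings.get('receive')}ms")
```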
3 Har File — https://en.wikipedia.org/wiki/.har
Figure 110: An Example of a HAR File
TESTING BATTERY DRAIN AS PART OF NON-FUNCTIONAL TESTING
Teams often focus on the usability, functionality, performance, and security of an app. If the app does what it was designed to do, it often gets pushed to production as is.
Figure 111: Battery Drain as Part of App Quality
RESOURCE CONSUMPTION AS AN APP QUALITY PRIORITY
Let us review one of the most popular native mobile apps of 2016, namely Pokemon Go. This app alone requires constant GPS location services while active. It keeps the screen fully lit (when in the foreground), operates the camera, plays sounds, and renders 3D graphics content.
Research shows that this kind of resource consumption drains the battery of a fully charged Android device in 2 hours and 40 minutes on average; that is, this is the time required for the battery indicator to drop from 100% to 0%. Of course, end users typically have at least ten other apps running in the background, so in practice the battery drains considerably faster than estimated above.
Recent research done by AVAST4 in Q3 2016 distinguishes two groups of “greedy” apps. They are ranked in Figure 112 and Figure 113.
Figure 112: Top 10 Performance-Draining Apps That Run at Startup
Figure 113: Top 10 Battery-Draining Apps That Run at Startup
4 AVAST Research — https://blog.avast.com/these-top-10-most-performance-draining-android-apps-might-shock-you
General awareness of this pitfall across all mobile apps seems to be the first step in the right direction. Let us now take the second step and ask the question discussed through the end of this chapter: how can we detect deviations in terms of reduced battery life? Furthermore, how can we be sure that they are caused by the app under test?
HOW TO TEST AN APP FOR BATTERY DRAIN
Teams need to know as much as possible about their end users; this is a clear requirement of the market today. From a battery drain testing perspective, the test environment needs to mimic the one used by target end users. This often implies a careful selection of devices (with different battery states and health) and versions of the mobile OS, as well as a typical mix of popular apps installed and running in the background. In addition, testing should be conducted under real network conditions (2G, 3G, LTE, Wi-Fi, roaming, etc.).
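One pragmatic way to approximate this on a real Android device is to reset and then dump the operating system's battery statistics around a scripted scenario via adb. The sketch below is an assumption-laden illustration: the package name is hypothetical, the 15-minute scenario is arbitrary, and the simple monkey launch stands in for a real automated journey.

```python
# Sketch of approximating battery-drain measurement on a real Android device
# with adb: reset the battery stats, exercise the app, then dump the per-app
# consumption report. Package name and scenario duration are assumptions.
import subprocess
import time

PACKAGE = "com.example.app"      # hypothetical app under test
SCENARIO_SECONDS = 15 * 60       # how long to exercise the app

def adb(*args: str) -> str:
    return subprocess.run(["adb", *args], capture_output=True, text=True, check=True).stdout

def main() -> None:
    adb("shell", "dumpsys", "batterystats", "--reset")        # start from a clean slate
    adb("shell", "monkey", "-p", PACKAGE, "-c",
        "android.intent.category.LAUNCHER", "1")              # launch the app
    time.sleep(SCENARIO_SECONDS)                              # run the scripted scenario here
    report = adb("shell", "dumpsys", "batterystats", PACKAGE)
    print(report)                                             # inspect estimated mAh, wakelocks, etc.

if __name__ == "__main__":
    main()
```

Comparing the reported per-app consumption between a clean device and a device loaded with typical background apps yields exactly the kind of trend data the rest of this section argues for.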
Test Against Multiple Devices
Device hardware differs across manufacturers, and each battery obviously has a different capacity. Over time, a device’s battery chemistry degrades, which negatively affects its performance, how long it lasts without recharging, and more. This is why a mix of new and legacy devices with different battery capacities needs to be available and taken into consideration in any mobile device lab. This is a general requirement for mobile app quality and a very specific one in the context of battery testing.
Listen to the Market and End Users
Since the market is constantly changing, the “known state” and quality of your app, including its consumption of battery and other system resources, may degrade as well. This could happen because your app has never been run on “unknown” devices or because a new version of the underlying OS was just released. We have witnessed plenty of examples, including the recent iOS 10.2 release.5
There are free native apps leveraged by mobile teams to obtain valuable battery stats in real time, such as Carat.6 You can make good use of some of these.
5 iOS 10.2 battery drain reports — http://www.forbes.com/sites/gordonkelly/2016/12/21/apple-ios-10-2-iphone-battery-problems/#5e2a4f1d4f2c
If you detect battery consumption degradation caused by a major release of the underlying OS only after that release has shipped, you risk losing clients because of your app’s unacceptable behavior. A better approach is to plan for running manual and automated tests against beta versions of the OS (vendors often provide these). That way, you have the chance to detect any major functional or non-functional discrepancy early, prior to the General Availability (GA) of the corresponding OS version. Another feedback channel for mobile teams is users’ reviews in app stores and social networks. Yes, it works. The only issue with this approach is that the feedback reaches you a little too late and may also hurt your ability to convert new users and retain existing ones, as presented in Figure 114. Continuously enriching your automated test suites on a refreshed device lab will reduce the risks and help you identify issues earlier in the SDLC, prior to production releases. Make sure to incorporate such tests (or a subset of them) as part of your CI cycle, thus enhancing test coverage and reducing risks.
Figure 114: The Power of User Reviews
6 Carat battery drain app — http://lifehacker.com/5918671/carat-tells-your-which-apps-suck-up-your-battery-power
SUMMARY
Modern performance testing is difficult but extremely important. Poor performance and scalability have a markedly negative impact on user experience and profits. Fortunately, technology-wise, these challenges can be tackled, although doing so also requires organizational support, strong discipline, and culture. Do not forget about personas; they can contribute a lot to an amazing UX.
As for testing app battery drain, there are not really any good automation methods facilitating it. Therefore, the general recommendation is to bring a plethora of devices in various conditions (as already explained above) into your lab and measure battery drain through native apps installed on those devices. At first, the tests should run against the app on a clean device; then you would usually repeat the same exercise on a real end user device.