Engineering Techniques
Performance and load tests produce a sea of data that can be overwhelming to analyze. Fortunately,
there are a few methodical practices you can use to do this efficiently.
Derived from my 17 years of experience performance-testing and performance-tuning mobile, web,
and Internet of Things (IoT) apps, the 10 best practices listed below should help any performance
engineer get started.
1. Identify tier-based engineering transactions
In the typical performance test harness, load scripts contain transactions or ordered API calls that
represent a user workflow. If you are creating a performance harness for an IoT application, the script
will contain transactions and logic/behaviors representing a device.
Engineering scripts contain a single transaction that targets a specific tier of your deployment. By
spotting degradation in an engineering transaction, you can isolate the tier of the deployment on
which you need to concentrate your efforts.
To do this, you want to identify which transactions hit which tiers. If you have trouble doing so, ask
your development or supporting infrastructure team for help.
Every deployment is unique, but here are some examples of the tiers and problems you may
encounter:
Web tier: A transaction that GETs a static non-cached file.
App tier: A transaction that executes a method and creates objects but stops there and does not go to
the database tier.
Database tier: A transaction that requires a query from the database.
Make each of these engineering transactions its own script so you can individually graph out each
engineering transaction's hit rate (transactions per second, or TPS) and response time values. Use a constant think time (15
seconds, for example) before each engineering transaction to space out the intervals of execution and
create a consistent sampling rate.
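For illustration, if your harness happened to be written in Locust (just one possible tool; the endpoint path, transaction name, and 15-second think time below are placeholder assumptions), a web-tier engineering script could be as small as this sketch:

```python
# Minimal sketch of a single web-tier engineering script in Locust.
# The path and transaction name are illustrative; use a static, non-cached
# file that your web tier actually serves.
from locust import HttpUser, constant, task

class WebTierProbe(HttpUser):
    wait_time = constant(15)  # constant think time for a consistent sampling rate

    @task
    def get_static_file(self):
        # Stops at the web tier: no app-server method, no database query.
        self.client.get("/static/logo.png", name="ENG_web_static_get")
```

Create one such script per tier (app tier, database tier, and so on), each containing its own single transaction.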
2. Identify and validate monitored KPIs
Front-end KPIs show the current capacity by correlating user load, TPS, response time, and error rate.
Monitored KPIs tell the entire story of why an application starts to degrade at a certain workload
level. Hit rates and free resources are two illuminating KPIs for every hardware or software server.
The hit rate will trend with the workload. As the workload increases in a ramping load test, so does
the hit rate.
Here are examples of hit rates you can monitor:
Operating system: TCP connection rate
Web server: Requests per second
Messaging: Enqueue/dequeue count
Database: Queries per second
Remember that each deployment is unique, so you need to decide what qualifies as a good hit rate
per server for you, and then hook up the required monitoring.
I tend to monitor the free resources KPI because, unlike with used resources, free resources trend
inversely to the workload. Because of that, you can easily identify bottlenecks on a graph. (But you'll
have to go with used resources if a free-resource counter isn't exposed.) Whichever resource is your target, if
it has queuing strategies, be sure to add a queued counter to show waiting requests.
Here are examples of free resources you can monitor:
OS: CPU average idle
Web server: Waiting requests
App server: Free worker threads
Messaging: Enqueue/dequeue wait time
Database: Free connections in thread pool
To determine relevant monitored KPIs and hook them in, start by studying an architectural diagram of
the deployment. Every touch point where the data is received or transformed is a potential bottleneck
and therefore a candidate for monitoring. The more relevant monitored KPIs you have, the clearer the
performance story.
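As one way to hook monitoring in, here is a minimal Python sketch that samples an OS-level free resource (CPU idle) and a simple hit-rate proxy (established TCP connections) on a fixed interval. It assumes the psutil library is available on the monitored host; a real setup would also pull web-server, app-server, messaging, and database counters from their own admin interfaces.

```python
# Sketch of a lightweight KPI poller. psutil, the file name, the 15-second
# interval, and the one-hour duration are all assumptions to adapt.
import csv
import time

import psutil

INTERVAL_SECONDS = 15      # match the load tool's sampling interval
DURATION_SECONDS = 3600    # size this to your test length

with open("monitored_kpis.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["elapsed_s", "cpu_idle_pct", "established_tcp_conns"])
    start = time.time()
    while time.time() - start < DURATION_SECONDS:
        cpu_idle = psutil.cpu_times_percent(interval=1).idle          # free resource
        tcp_conns = sum(1 for c in psutil.net_connections(kind="tcp")
                        if c.status == psutil.CONN_ESTABLISHED)       # hit-rate proxy
        writer.writerow([round(time.time() - start), cpu_idle, tcp_conns])
        f.flush()
        time.sleep(INTERVAL_SECONDS - 1)  # the CPU sample above already took 1 second
```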
Now it’s time to prove your monitored KPIs’ worth. Assuming you have built a rock-solid performance
test harness, spin up a load test using both the user workflow scripts and those engineering
scripts.
Set up a slow-ramping test (for example, one that adds one user every 45 seconds up to, say, 200
virtual users). Once the test is complete, graph all your monitored KPIs and make sure that they have
either a direct or inverse relationship to the TPS/workload reported by your load tool. Have patience,
and graph everything; the information you collect from this test is extremely valuable in isolating
bottlenecks. You are exercising the application in order to validate that your monitored KPIs trend
with the workload. If the KPI doesn’t budge or make sense, toss it out.
Also, set up your monitoring interval to collect three values per sustained load. In this case, since you
are adding a user every 45 seconds, you want to have the load tool sample every 15 seconds. The
reason: Three values will graph as a plateau, whereas a single value will graph as a peak. Plateaus are
trends.
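Once such a ramping test completes, one quick way to check that each monitored KPI trends with the workload is to correlate it against TPS. The sketch below assumes pandas and CSV exports with elapsed_s, tps, and one column per KPI; map the column names to whatever your load tool and monitors actually produce.

```python
# Sketch: flag KPIs that have neither a direct nor an inverse relationship to TPS.
import pandas as pd

load = pd.read_csv("load_tool_results.csv")   # assumed columns: elapsed_s, tps
kpis = pd.read_csv("monitored_kpis.csv")      # assumed: elapsed_s plus one column per KPI

merged = pd.merge_asof(load.sort_values("elapsed_s"),
                       kpis.sort_values("elapsed_s"),
                       on="elapsed_s")

for column in kpis.columns.drop("elapsed_s"):
    r = merged["tps"].corr(merged[column])
    # Strongly positive r suggests a direct relationship (hit rates); strongly
    # negative r suggests an inverse one (free resources). Near zero: toss it out.
    print(f"{column:30s} correlation with TPS: {r:+.2f}")
```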
Catch unanticipated resources. Perhaps not all of the resources will be caught during the review of
the architecture diagram, so spin up a fast-ramping load test. You don’t care about the test
results here; this is just an investigation to see what processes and operating system activities spin up. If
you notice an external process and have no idea what it is doing, ask! It could be a KPI candidate to
add to your harness.
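One lightweight way to do that investigation, assuming psutil is installed on the host under test, is to snapshot per-process CPU usage while the fast ramp runs and flag anything unfamiliar. The 5 percent threshold and 5-second window below are arbitrary.

```python
# Sketch: list processes that consume noticeable CPU during the investigation test.
import time

import psutil

procs = list(psutil.process_iter(["pid", "name"]))
for p in procs:
    try:
        p.cpu_percent(None)            # prime the per-process CPU counters
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

time.sleep(5)                          # measure over a 5-second window

for p in procs:
    try:
        usage = p.cpu_percent(None)
        if usage > 5.0:                # arbitrary "worth asking about" threshold
            print(p.info["pid"], p.info["name"], f"{usage:.1f}% CPU")
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
```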
3. Reduce the number of transactions you analyze
Now that you are getting into the analysis phase, you need to significantly reduce the number of
transactions that you graph and use for analysis. Trying to analyze hundreds of tagged business
transactions isn't efficient.
All of these business transactions are using shared resources of the deployment, so pick just a few to
avoid analysis paralysis. But which ones? That depends on the characteristics of your application.
From the results of your upcoming targeted single-user load test (I will describe this shortly), choose
your landing page, the login, the business transaction that has the highest response time, and the
transaction with the lowest response time.
Also include and graph all of the engineering transactions. The number of engineering
transactions depends on how many tiers there are in the deployment: Five tiers equals
five engineering transactions.
Now, instead of analyzing all transactions executing in a load test that emulates a realistic load,
graph only a subset. The graph of response times will be less chaotic and far easier to analyze. When
you are creating performance reports, however, you still need to include response times for all of the
business transactions.
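A rough sketch of that trimming step, assuming a results CSV with elapsed_s, transaction, and response_time_ms columns and an ENG_ naming prefix for engineering transactions (all placeholder conventions), might look like this:

```python
# Sketch: graph only the landing page, login, slowest and fastest business
# transactions, plus every engineering transaction.
import matplotlib.pyplot as plt
import pandas as pd

results = pd.read_csv("load_test_results.csv")

is_engineering = results["transaction"].str.startswith("ENG_")
business = results[~is_engineering]
engineering = results[is_engineering]

avg = business.groupby("transaction")["response_time_ms"].mean()
keep = {"LandingPage", "Login", avg.idxmax(), avg.idxmin()}   # hypothetical names
keep |= set(engineering["transaction"].unique())

subset = results[results["transaction"].isin(keep)]
for name, group in subset.groupby("transaction"):
    plt.plot(group["elapsed_s"], group["response_time_ms"], label=name)
plt.xlabel("Elapsed time (s)")
plt.ylabel("Response time (ms)")
plt.legend()
plt.show()
```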
4. Wait for the test to complete before analyzing
It’s funny to watch business stakeholders during a load test. It usually goes like this: The stakeholders
concentrate on the orange response time line, the test ramps up slowly and methodically, and then
they exclaim, “Whoa, look at those lightning-fast response times! I told you we had overcapacity. We
didn’t even need to pay for all this hardware. Such a waste.”
Then, as response times start to deviate, the stakeholders get nervous. They speculate about the
cause of bottlenecks, but they have no evidence to support their theories. They point fingers at
groups responsible for certain tiers of the deployment. You calm them down by noting that the values
are in milliseconds, but they get restless again quickly.
If response times start to exceed three seconds, they get more worried still. There's no finger-pointing
this time—it didn't go over so well the last time—but there's a lot of loud sighing and what looks like
praying.
Then response times spike, and the stakeholders jump up, insisting that the app has crashed,
demanding that someone be fired, and wondering why they are paying for an elastic cloud
deployment that was supposed to solve all their scalability limitations. (Ah, yes, that magical cloud.)
All a performance engineer can do at this point is to calmly explain the value of running performance
tests prior to going live, and that watching a test in progress serves only to verify that it is executing as
planned. It's not the time or the place to analyze.
The best approach is to design a methodical load test to answer a specific engineering question, kick it
off, make sure it's behaving as expected, and then go to lunch and let the monitoring tool do its
automated job. Don't just sit there, observing each data point as it arrives; the results and the trends
will be far easier to interpret after the test has completed, so relax.
5. Ensure reproducible results
For every test scenario, run the same load test three times to completion. For these three test
executions, do not tweak or change anything within your performance test harness: not the runtime
settings, not the code in the load scripts, not the duration of the test, not the ramp schedule, and
absolutely not the target web application environment. Only allow data resets or server recycles, and
only to bring the environment back to the baseline between test runs.
The “magic of three” will save you a ton of wasted hours chasing red herrings. It will reduce the data
you need to analyze by removing irreproducible results.
Yes, the magic of three requires that you run more tests. But because these are automated tests, you
simply press start. The time it takes to run those three tests is tiny compared to the time you could
spend analyzing irreproducible results. So run every test scenario three times, and conduct a
preliminary analysis to validate that the results or the TPS plateau at the same elapsed time.
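That preliminary check can be as simple as comparing the elapsed time at which TPS plateaus in each run. The sketch below assumes per-run CSVs with elapsed_s and tps columns and treats anything within 5 percent of peak TPS as the plateau; both are assumptions to tune.

```python
# Sketch: compare where the TPS plateau begins across three identical runs.
import pandas as pd

def plateau_start(csv_path, tolerance=0.05):
    run = pd.read_csv(csv_path)                 # assumed columns: elapsed_s, tps
    peak = run["tps"].max()
    near_peak = run[run["tps"] >= peak * (1 - tolerance)]
    return near_peak["elapsed_s"].iloc[0]       # first time TPS is within 5% of peak

for path in ["run1.csv", "run2.csv", "run3.csv"]:
    print(path, "TPS plateau begins at", plateau_start(path), "seconds")
# Widely different values mean the results are not reproducible yet.
```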
If your results are erratic, stop there. Are you sure you built a rock-solid performance test harness? Is
the target application code half-baked and throwing errors? You need a pristine, quality-assured build
in order to conduct efficient load testing.
Once your results are reproducible across three runs, you will have the confidence you need to invest
your valuable time in analysis.
6. Ramp up your load
Targeted workloads will make the analysis process much easier. Here's how to get started ramping up
your load.
Run ghost tests
Begin by running ghost tests, which check the system without executing any load scripts. A ghost test
generates no real user activity; the system is left alone to do its housekeeping while the monitored
KPIs collect metrics.
You might be surprised at the number of resources your deployment uses even without user load. It’s
better to know that now than try to differentiate user load from system load later in your project. Use
this test to calibrate your monitored KPIs and establish resource usage patterns.
I recommend running the ghost test three times a day. If you find that every half hour a job that
crunches the database server kicks off, isolate and understand this activity before executing realistic
load tests.
Move to single-user load tests
Assign a single user to execute each user workflow script and each engineering script, and start them
all at once. If you have 23 scripts, you should have 23 users executing. Remember: Run it three times
to assure reproducible results.
This test is a benchmark to show the minimum response time achievable under a single-user
load, which is your best-case scenario. Transactions’ minimum response time values are your
transaction response time floor. You also use the results of this test to identify business transactions
with the highest and lowest response times.
Create concurrent user load scenarios
Move on to your concurrent test scenarios: Create a slow-ramping staircase scenario that allows for
the capturing of three monitored KPI values for each set load. In other words, configure the slow
ramp of users to sustain a duration before adding the next set of users. Your goal is to capture at least
three KPI metric values during the duration of the sustained load.
For example, if you are ramping by 10 or 100 users at a time and collecting KPIs at 15-second
intervals, then run each set load for a minimum of 45 seconds before ramping to the next one. Yes,
this elongates the test (by slowing the ramp) but the results are much easier to interpret. Use that
magic number three again. It excludes anomalies. A spiking KPI metric that isn’t sustained isn’t a
trend.
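If your load tool supports programmable load shapes, the staircase can be expressed directly. Here is a minimal Locust sketch, with the 10-user step, 45-second dwell, and 200-user cap as illustrative values only:

```python
# Sketch of a slow-ramping staircase: add 10 users, hold for 45 seconds (three
# 15-second KPI samples), repeat until 200 users, then stop.
from locust import LoadTestShape

class StaircaseShape(LoadTestShape):
    step_users = 10       # users added per step
    step_duration = 45    # seconds to sustain each step
    max_users = 200

    def tick(self):
        step = int(self.get_run_time()) // self.step_duration
        users = (step + 1) * self.step_users
        if users > self.max_users:
            return None                    # returning None ends the test
        return users, self.step_users      # (target user count, spawn rate)
```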
Living by the law of halves and doubles when performance testing greatly simplifies your performance
engineering approach. Start off with the goal of achieving half the target load, or peak users. If the
application scales at half the load, you can double it to the target load. If it does not scale, reduce the
load by half again. Do this over and over, if need be. Keep reducing by half until you get a scalable
test, even if that’s just 10 users and your goal was 10,000!
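Expressed as a tiny helper, with your own pass/fail criteria deciding whether the application "scaled" at the current load, the law of halves and doubles looks roughly like this:

```python
# Sketch of the law of halves and doubles.
def next_target(current_users: int, scaled: bool, goal_users: int) -> int:
    if scaled:
        return min(current_users * 2, goal_users)  # double toward the goal
    return max(current_users // 2, 1)              # halve until it scales

# Example: the goal is 10,000 users, so start at 5,000.
# next_target(5_000, scaled=False, goal_users=10_000)  -> 2,500
# next_target(5_000, scaled=True,  goal_users=10_000)  -> 10,000
```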
7. Use visualization to spot anomalies
If you know what a perfectly scalable application looks like, you can spot anomalies quickly. So study
that architectural diagram or whiteboard that shows what should happen in a perfectly scalable
application, and compare it to your test results.
What should happen? What does and does not happen? The answers to these questions will tell you
where to focus your attention.
For example, as user load increases, you should see an increase in the web server’s requests per
second, a dip in the web server machine’s CPU idle, an increase in the app server’s active sessions, a
decrease in free worker threads, a decrease in the app server’s operating system CPU idle, a decrease
in free database thread pool connections, an increase in database queries per second, a decrease in
the database machine’s CPU idle, and so on—you get the picture.
Is that what you see in your test results?
By using the power of visualization you can drastically reduce investigation time because you can
quickly spot a condition that does not represent a scalable application.
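One way to make those relationships jump out is to normalize each series to a 0-to-1 range and overlay them on a single chart. This sketch assumes pandas, matplotlib, and illustrative column names in a merged results file:

```python
# Sketch: overlay user load, a hit rate, and two free resources on one chart.
import matplotlib.pyplot as plt
import pandas as pd

# Assumed columns: elapsed_s, users, web_req_per_s, app_free_threads, db_cpu_idle
df = pd.read_csv("merged_results.csv")

def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

for column in ["users", "web_req_per_s", "app_free_threads", "db_cpu_idle"]:
    plt.plot(df["elapsed_s"], normalize(df[column]), label=column)
plt.xlabel("Elapsed time (s)")
plt.ylabel("Normalized value (0 to 1)")
plt.legend()
plt.show()
# In a scalable run, users and web_req_per_s climb together while the free
# resources fall together; a line that flattens early deserves a closer look.
```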
8. Look for KPI trends and plateaus to identify bottlenecks
As resources are reused or freed (as with JVM garbage collection or thread pools), there will be dips
and rises in KPI values. Concentrate on the trends of the values, and don’t get caught up on the
deviations. Use your analytical eye to determine the trend. You have already proved that each of your
KPIs tracks with the increase in workload, so you should not be worried about chasing red herrings
here. Just concentrate on the bigger picture—the trends.
A solid technique for identifying the first occurring bottleneck is to graph the minimum
response times from the front-end KPIs. Use a fine-grained time granularity to identify the first occurring
increase from its floor. That lift in the minimum response time won’t deviate much, because once
there is a saturation of a resource, the floor is just not achievable anymore. It’s pretty precise.
Pinpoint the moment in elapsed time that this behavior first occurred.
Be aware that TPS or hits per second will plateau as the deployment approaches the first occurring
bottleneck, and response times will degrade immediately following. Error rates are a cascading
symptom.
Your job is simply to identify the first occurring graphed plateau in the monitored hit rate
KPIs that precedes the minimum response time degradation. (This is why I advocate collecting
three monitored metrics per sustained load. One data point value gives you a peak in a graph, but
three data points give you a plateau. Plateaus are gold mines.) Use the elapsed time of the load test.
The first occurring plateau in a hit rate indicates a limitation in throughput.
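A rough way to automate that search, assuming KPI samples every 15 seconds and illustrative column names, is to find the first window of three consecutive flat values in each hit-rate KPI:

```python
# Sketch: locate the first occurring plateau (three flat samples) in each hit rate.
import pandas as pd

df = pd.read_csv("merged_results.csv")   # assumed: elapsed_s plus hit-rate columns

def first_plateau(series, elapsed, flat_pct=0.02, samples=3):
    """Elapsed time where the hit rate first holds flat (within flat_pct) for
    `samples` consecutive points after reaching at least half its peak."""
    values = series.to_numpy()
    floor = values.max() * 0.5           # ignore the quiet start of the ramp
    for i in range(len(values) - samples + 1):
        window = values[i:i + samples]
        if window.min() >= floor and (window.max() - window.min()) <= flat_pct * window.max():
            return elapsed.iloc[i]
    return None

for kpi in ["web_req_per_s", "db_queries_per_s"]:
    print(kpi, "first plateaus at", first_plateau(df[kpi], df["elapsed_s"]), "seconds")
# The hit rate that plateaus earliest, just before minimum response times lift
# off their floor, points at the tier with the throughput limitation.
```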
Once you've located the server with the limitation, graph out all of its free resources. A free resource
doesn’t need to be totally depleted for it to affect performance.