Top 10 performance engineering techniques that work
Performance and load tests produce a sea of data that can be overwhelming to analyze. Fortunately, there are a few methodical practices you can use to do this efficiently.
Derived from my 17 years of experience performance-testing and performance-tuning mobile, web, and Internet of Things (IoT) apps, the 10 best practices listed below should help any performance engineer get started.
1. Identify tier-based engineering transactions
In the typical performance test harness, load scripts contain transactions or ordered API calls that represent a user workflow. If you are creating a performance harness for an IoT application, the script will contain transactions and logic/behaviors representing a device.
Engineering scripts contain a single transaction that targets a specific tier of your deployment. By spotting degradation in an engineering transaction, you can isolate the tier of the deployment on which you need to concentrate your efforts.
To do this, you want to identify which transactions hit which tiers. If you have trouble doing so, ask your development or supporting infrastructure team for help.
Every deployment is unique, but here are some examples of tiers and an engineering transaction that targets each one:
- Web tier: A transaction that GETs a static non-cached file.
- App tier: A transaction that executes a method and creates objects but stops there and does not go to the database tier.
- Database tier: A transaction that requires a query from the database.
Make each of these engineering transactions its own script so you can individually graph out each engineering transaction's hit rate (TPS) and response time values. Use a constant think time (15 seconds, for example) before each engineering transaction to space out the intervals of execution and create a consistent sampling rate.
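As a rough illustration of the constant think time described above, here is a minimal Python sketch of an engineering script runner. The names `run_engineering_transaction` and `web_tier_transaction` are hypothetical, the transaction body is a placeholder, and a real load tool would supply its own pacing and timing mechanisms.

```python
import time
from typing import Callable, List, Tuple


def run_engineering_transaction(
    transaction: Callable[[], None],
    iterations: int,
    think_time_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[float, float]]:
    """Run one engineering transaction repeatedly with a constant think time.

    Returns (elapsed_seconds_since_start, response_time_seconds) samples.
    """
    samples = []
    start = time.perf_counter()
    for _ in range(iterations):
        sleep(think_time_s)  # constant think time before each transaction
        t0 = time.perf_counter()
        transaction()
        t1 = time.perf_counter()
        samples.append((t0 - start, t1 - t0))
    return samples


def web_tier_transaction() -> None:
    """Placeholder (assumption): a GET of a static, non-cached file goes here."""
    pass
```

Each tier gets its own callable and its own run, so the response times and hit rate of each engineering transaction can be graphed separately.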
2. Monitored KPIs
Front-end KPIs show the current capacity by correlating user load, TPS, response time, and error rate. Monitored KPIs tell the entire story of why an application starts to degrade at a certain workload level. Hit rates and free resources are two illuminating KPIs for every hardware or software server.
The hit rate will trend with the workload. As the workload increases in a ramping load test, so does the hit rate.
Here are examples of hit rates you can monitor:
- Operating system: TCP connection rate
- Web server: Requests per second
- Messaging: Enqueue/dequeue count
- Database: Queries per second
Remember that each deployment is unique, so you need to decide what qualifies as a good hit rate per server for you, and then hook up the required monitoring.
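Many servers expose hit counts as cumulative counters rather than per-second rates. Assuming that is the case for your server (it may not be), a sketch like the following turns periodic counter readings into a hit rate you can graph. The function name and the sample layout are illustrative only.

```python
from typing import List, Tuple


def hit_rates(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert (elapsed_s, cumulative_count) readings into per-second rates.

    Each output point is stamped with the end of the interval it covers.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            continue  # skip out-of-order or duplicate readings
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates
```

For example, readings of 0, 150, and 450 requests taken 15 seconds apart give rates of 10 and 20 requests per second.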
I tend to monitor the free resources KPI because, unlike with used resources, free resources trend inversely to the workload. Because of that, you can easily identify bottlenecks on a graph. (But you'll have to go with used resources if free resources aren't counted.) Whichever resource is your target, if it has queuing strategies, be sure to add a queued counter to show waiting requests.
Here are examples of free resources you can monitor:
- OS: CPU average idle
- Web server: Waiting requests
- App server: Free worker threads
- Messaging: Enqueue/dequeue wait time
- Database: Free connections in thread pool
To determine relevant monitored KPIs or hook them in, start by studying an architectural diagram of the deployment. Every touch point where the data is received or transformed is a potential bottleneck and therefore a candidate for monitoring. The more relevant monitored KPIs you have, the clearer the performance story.
Now it’s time to prove your monitored KPIs’ worth. Assuming you have built a rock-solid performance test harness, it’s time to spin up a load test using both the user workflow and those engineering scripts.
Set up a slow-ramping test (for example, one that adds one user every 45 seconds up to, say, 200 virtual users). Once the test is complete, graph all your monitored KPIs and make sure that they have either a direct or inverse relationship to the TPS/workload reported by your load tool. Have patience, and graph everything; the information you collect from this test is extremely valuable in isolating bottlenecks. You are exercising the application in order to validate that your monitored KPIs trend with the workload. If the KPI doesn’t budge or make sense, toss it out.
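One way to check whether a KPI has a direct or inverse relationship to the workload is a simple correlation against the TPS series. The sketch below uses Pearson correlation with a cutoff of 0.8; both the choice of statistic and the threshold are assumptions for illustration, not part of the method described here, and a visual check of the graphs remains the primary tool.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either series never changes."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a KPI that does not budge cannot trend with the load
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def kpis_that_trend(
    tps: List[float], kpis: Dict[str, List[float]], threshold: float = 0.8
) -> Dict[str, float]:
    """Keep KPIs whose correlation with TPS is strongly direct or inverse."""
    kept = {}
    for name, series in kpis.items():
        r = pearson(tps, series)
        if abs(r) >= threshold:
            kept[name] = r
    return kept
```

A positive value suggests a direct relationship (a hit rate), a negative value an inverse one (a free resource), and anything dropped from the result is a candidate to toss out.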
Also, set up your monitoring interval to collect three values per sustained load. In this case, since you are adding a user every 45 seconds, you want to have the load tool sample every 15 seconds. The reason: Three values will graph as a plateau, whereas a single value will graph as a peak. Plateaus are trends.
Catch unanticipated resources. Perhaps not all of the resources will be caught during the review of the architecture diagram, so spin up a fast-ramping load test. Again, you don’t care about the results; this is just an investigation to see what processes and operating system activities spin up. If you notice an external process and have no idea what it is doing, ask! It could be a KPI candidate to add to your harness.
3. Reduce the number of transactions you analyze
Now that you are getting into the analysis phase, you need to significantly reduce the number of transactions that you graph and use for analysis. Trying to analyze hundreds of tagged business transactions isn't efficient.
All of these business transactions are using shared resources of the deployment, so pick just a few to avoid analysis paralysis. But which ones? That depends on the characteristics of your application.
From the results of your upcoming targeted single-user load test (I will describe this shortly), choose your landing page, the login, the business transaction that has the highest response time, and the transaction with the lowest response time.
Also include and graph all of the engineering transactions. The number of engineering transactions depends on how many tiers there are in the deployment: Five tiers equals five engineering transactions.
Now, instead of analyzing all transactions executing in a load test that emulates a realistic load, graph only a subset. The graph of response times will be less chaotic and far easier to analyze. When you create performance reports, however, you still need to include response times for all of the business transactions.
4. Wait for the test to complete before analyzing
It’s funny to watch business stakeholders during a load test. It usually goes like this: The stakeholders concentrate on the orange response time line, the test ramps up slowly and methodically, and then they exclaim, “Whoa, look at those lightning-fast response times! I told you we had overcapacity. We didn’t even need to pay for all this hardware. Such a waste.”
Then, as response times start to deviate, the stakeholders get nervous. They speculate about the cause of bottlenecks, but they have no evidence to support their theories. They point fingers at groups responsible for certain tiers of the deployment. You calm them down by noting that the values are in milliseconds, but they get restless again quickly.
If response times start to exceed three seconds, they get more worried still. There's no finger-pointing this time—it didn't go over so well the last time—but there's a lot of loud sighing and what looks like praying.
Then response times spike, and the stakeholders jump up, insisting that the app has crashed, demanding that someone be fired, and wondering why they are paying for an elastic cloud deployment that was supposed to solve all their scalability limitations. (Ah, yes, that magical cloud.)
All a performance engineer can do at this point is to calmly explain the value of running performance tests prior to going live and that the tests are intended only to verify that the app is executing as planned. It's not the time or the place to analyze.
The best approach is to design a methodical load test to answer a specific engineering question, kick it off, make sure it's behaving as expected, and then go to lunch and let the monitoring tool do its automated job. Don't just sit there, observing each data point as it arrives; the results and the trends will be far easier to interpret after the test has completed, so relax.
5. Ensure reproducible results
For every test scenario, run the same load test three times to completion. For these three test executions, do not tweak or change anything within your performance test harness: not the runtime settings, not the code in the load scripts, not the duration of the test, not the ramp schedule, and absolutely not the target web application environment. Only allow data resets or server recycles, and only to bring the environment back to the baseline between test runs.
The “magic of three” will save you a ton of wasted hours chasing red herrings. It will reduce the data you need to analyze by removing irreproducible results.
Yes, the magic of three requires that you run more tests. But because these are automated tests, you simply press start. The time it takes to run those three tests is tiny compared to the time you could spend analyzing irreproducible results. So run every test scenario three times, and conduct a preliminary analysis to validate that the results, such as the TPS plateau, occur at the same elapsed time in each run.
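The preliminary analysis can be as simple as comparing the elapsed time at which TPS plateaued in each of the three runs. In this sketch the tolerance of 45 seconds (one sustained load step in the earlier example) is an assumption, as is the idea that you have already read the plateau time off each run.

```python
from typing import List


def is_reproducible(plateau_times_s: List[float], tolerance_s: float = 45.0) -> bool:
    """True when at least three runs plateau within tolerance_s of each other."""
    if len(plateau_times_s) < 3:
        return False  # the "magic of three" needs three completed runs
    return max(plateau_times_s) - min(plateau_times_s) <= tolerance_s
```

Three runs that plateau at 900, 915, and 930 seconds would pass; a run that plateaus at 1,400 seconds would flag the set as erratic.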
If your results are erratic, stop there. Are you sure you built a rock-solid performance test harness? Is the target application code half-baked and throwing errors? You need a pristine, quality-assured build in order to conduct efficient load testing.
Once your results are reproducible across all three runs, you will have the confidence you need to invest your valuable time in analysis.
6. Ramp up your load
Targeted workloads will make the analysis process much easier. Here's how to get started ramping up your load.
Run ghost tests
Begin by running ghost tests, which check the system without executing load scripts. A ghost test has no real user activity, but it is important: The system is left alone to do housekeeping. What's important is that the monitored KPIs are collecting metrics.
You might be surprised at the number of resources your deployment uses even without user load. It’s better to know that now than try to differentiate user load from system load later in your project. Use this test to calibrate your monitored KPIs and establish resource usage patterns.
I recommend running the ghost test three times a day. If you find that every half hour a job that crunches the database server kicks off, isolate and understand this activity before executing realistic load tests.
Move to single-user load tests
Assign a single user to each user workflow script and each engineering script, and start all of them at once. If you have 23 scripts, you should have 23 users executing. Remember: Three runs assure reproducible results.
This test is a benchmark to show the minimum response time achievable under a single-user load, which is your best-case scenario. Transactions’ minimum response time values are your transaction response time floor. You also use the results of this test to identify business transactions with the highest and lowest response times.
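Here is a small sketch of how the single-user results might be reduced to a response time floor per transaction, and how the slowest and fastest transactions could be picked from it. Ranking by the floor is an assumption; you could equally rank by average. The transaction names in the usage note are made up.

```python
from typing import Dict, List, Tuple


def response_time_floors(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Minimum observed response time per transaction (the floor)."""
    return {name: min(times) for name, times in results.items() if times}


def slowest_and_fastest(floors: Dict[str, float]) -> Tuple[str, str]:
    """Names of the transactions with the highest and lowest floor."""
    slowest = max(floors, key=floors.get)
    fastest = min(floors, key=floors.get)
    return slowest, fastest
```

Given response times for hypothetical "landing," "login," "search," and "ping" transactions, the two returned names join the landing page and login in the subset you graph.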
Create concurrent user load scenarios
Move on to your concurrent test scenarios: Create a slow-ramping staircase scenario that allows for the capturing of three monitored KPI values for each set load. In other words, configure the slow ramp of users to sustain a duration before adding the next set of users. Your goal is to capture at least three KPI metric values during the duration of the sustained load.
For example, if you are ramping by 10 or 100 users at a time and collecting KPIs at 15-second intervals, then run each set load for a minimum of 45 seconds before ramping to the next one. Yes, this elongates the test (by slowing the ramp), but the results are much easier to interpret. Use that magic number three again. It excludes anomalies. A spiking KPI metric that isn’t sustained isn’t a trend.
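The arithmetic behind the staircase is simply the sampling interval multiplied by three. A sketch of a schedule generator follows; the function name and the tuple layout are illustrative, and your load tool will have its own way of expressing a ramp.

```python
from typing import List, Tuple


def staircase_schedule(
    step_users: int,
    max_users: int,
    sample_interval_s: int = 15,
    samples_per_step: int = 3,
) -> List[Tuple[int, int, int]]:
    """Return (start_elapsed_s, users, hold_s) for each sustained load step."""
    if step_users <= 0:
        raise ValueError("step_users must be positive")
    hold_s = sample_interval_s * samples_per_step  # 15 s x 3 samples = 45 s
    schedule = []
    users = 0
    start = 0
    while users < max_users:
        users = min(users + step_users, max_users)
        schedule.append((start, users, hold_s))
        start += hold_s
    return schedule
```

Ramping by 10 users to a peak of 30 yields three steps held for 45 seconds each, starting at 0, 45, and 90 seconds.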
Living by the law of halves and doubles when performance testing greatly simplifies your performance engineering approach. Start off with the goal of achieving half the target load, or peak users. If the application scales to half the load, you can double it to the target load. If it does not scale, reduce the load by half again. Do this over and over, if need be. Keep reducing by half until you get a scalable test, even if that’s just 10 users and your goal was 10,000!
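The halving part of this rule can be sketched as a short loop. Here `scales` is a hypothetical callback standing in for "run the load test at this user count and judge whether it scaled," which in practice is a full test run plus analysis, not a function call.

```python
from typing import Callable, List, Tuple


def halves_and_doubles(
    target: int, scales: Callable[[int], bool]
) -> List[Tuple[int, bool]]:
    """Start at half the target load and keep halving until a test scales.

    Returns the (load, scaled) history; the last entry is the load to
    double from on the way back up toward the target.
    """
    load = max(target // 2, 1)
    history = []
    while True:
        ok = scales(load)
        history.append((load, ok))
        if ok or load == 1:
            return history
        load = max(load // 2, 1)
```

With a target of 10,000 users and an application that only scales up to 700, the loop tries 5,000, 2,500, and 1,250 before finding a scalable test at 625 users.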
7. Use visualization to spot anomalies
If you know what a perfectly scalable application looks like, you can spot anomalies quickly. So study that architectural diagram or whiteboard that shows what should happen in a perfectly scalable application, and compare it to your test results.
What should happen? What does and does not happen? The answers to these questions will tell you where to focus your attention.
For example, as user load increases, you should see an increase in the web server’s requests per second, a dip in the web server machine’s CPU idle, an increase in the app server’s active sessions, a decrease in free worker threads, a decrease in the app server’s operating system CPU idle, a decrease in free database thread pool connections, an increase in database queries per second, a decrease in the database machine’s CPU idle, and so on—you get the picture.
Is that what you see in your test results?
By using the power of visualization you can drastically reduce investigation time because you can quickly spot a condition that does not represent a scalable application.
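The comparison between what should happen and what did happen can also be written down as a table of expected directions. The sketch below is deliberately crude: it compares only the first and last sample of each KPI, and the KPI names are invented for illustration. It complements the graphs rather than replacing them.

```python
from typing import Dict, List

# Expected direction as user load increases: +1 rises, -1 falls.
# KPI names are hypothetical; use whatever your monitoring exposes.
EXPECTED = {
    "web_requests_per_s": +1,
    "web_cpu_idle_pct": -1,
    "app_active_sessions": +1,
    "app_free_worker_threads": -1,
    "db_free_connections": -1,
    "db_queries_per_s": +1,
}


def direction(series: List[float]) -> int:
    """Return +1, -1, or 0 based on the first and last sample only."""
    delta = series[-1] - series[0]
    return (delta > 0) - (delta < 0)


def anomalies(
    observed: Dict[str, List[float]], expected: Dict[str, int] = EXPECTED
) -> List[str]:
    """KPIs that were monitored but did not move in the expected direction."""
    return sorted(
        name
        for name, want in expected.items()
        if name in observed and direction(observed[name]) != want
    )
```

A free worker thread count that stays flat while requests per second climb would be reported as an anomaly worth a closer look.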
8. Look for KPI trends and plateaus to identify bottlenecks
As resources are reused or freed (as with JVM garbage collection or thread pools), there will be dips and rises in KPI values. Concentrate on the trends of the values, and don’t get caught up on the deviations. Use your analytical eye to determine the trend. You have already proved that each of your KPIs tracks with the increase in workload, so you should not be worried about chasing red herrings here. Just concentrate on the bigger picture—the trends.
A solid technique for identifying the first occurring bottleneck is to graph the minimum response times from the front-end KPIs. Use granularity to analyze and identify the first occurring increase from its floor. That lift in the minimum response time won’t deviate much, because once there is a saturation of a resource, the floor is just not achievable anymore. It’s pretty precise. Pinpoint the moment in elapsed time that this behavior first occurred.
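Finding the first lift from the floor can be approximated in code. This sketch flags the first point at which the minimum response time stays more than 10% above the floor for three consecutive samples; the margin and the three-sample rule are assumptions chosen to ignore a single stray value.

```python
from typing import List, Optional, Tuple


def first_lift(
    samples: List[Tuple[float, float]],
    floor: float,
    margin: float = 0.10,
    sustain: int = 3,
) -> Optional[float]:
    """Elapsed time where the minimum response time first leaves its floor.

    samples are (elapsed_s, min_response_time_s). A lift counts only when
    `sustain` consecutive samples exceed floor * (1 + margin).
    """
    limit = floor * (1 + margin)
    run_start = None
    run_len = 0
    for elapsed, value in samples:
        if value > limit:
            if run_len == 0:
                run_start = elapsed
            run_len += 1
            if run_len >= sustain:
                return run_start
        else:
            run_len = 0  # a return to the floor resets the run
    return None
```

The floor itself would come from the single-user load test, and the returned elapsed time is the moment to look for in the monitored KPI graphs.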
Be aware that TPS or hits per second will plateau as the deployment approaches the first occurring bottleneck, and response times will degrade (that is, increase) immediately afterward. Error rates are cascading symptoms.
Your job is simply to identify the first occurring graphed plateau in the monitored hit rate KPIs that precedes the minimum response time degradation. (This is why I advocate collecting three monitored metrics per sustained load. One data point value gives you a peak in a graph, but three data points give you a plateau. Plateaus are gold mines.) Use the elapsed time of the load test. The first occurring plateau in a hit rate indicates a limitation in throughput.
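One possible interpretation of "first occurring plateau" in a staircase test is the first load step whose average hit rate fails to rise over the previous step. The sketch below applies that reading to several hit rate KPIs and reports the earliest one. The 2% minimum gain and the data layout are assumptions.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Steps = List[Tuple[float, List[float]]]  # (step_start_elapsed_s, samples)


def first_plateau(steps: Steps, min_gain: float = 0.02) -> Optional[float]:
    """Elapsed time of the first step whose mean hit rate stops increasing."""
    previous = None
    for elapsed, values in steps:
        current = mean(values)
        if previous is not None and current < previous * (1 + min_gain):
            return elapsed
        previous = current
    return None


def earliest_plateau(kpis: Dict[str, Steps]) -> Optional[Tuple[str, float]]:
    """The (kpi_name, elapsed_s) pair with the earliest plateau, if any."""
    found = [
        (elapsed, name)
        for name, steps in kpis.items()
        if (elapsed := first_plateau(steps)) is not None
    ]
    if not found:
        return None
    elapsed, name = min(found)
    return name, elapsed
```

The KPI returned here only points at the server to investigate; the free resources on that server still have to be graphed to find the limitation.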
Once you've located the server with the limitation, graph out all of its free resources. A free resource doesn’t need to be totally depleted for it to affect performance.
The first plateau will indicate a root cause—either a soft or a hard limitation. Soft limitations are configurable (for example, max thread pools). Hard limitations are hardware (such as CPU).
The vast majority of bottlenecks I have uncovered are soft limitations, and no amount of hardware will fix that. While tuning can increase scalability, it’s a balancing act. You want to tune to increase throughput without saturating the server hardware.
It’s the alleviation of soft limitations that allows applications to efficiently scale up and out in cloud deployments, saving companies significant operating expenses. I recommend load testing for peak load conditions, and then noting which resources have spun up to accommodate the workload. Then dedicate those resources to your deployment. Pay for it now, and only use the elastic cloud for surges beyond your anticipated peak load.
Remember, it's important to isolate the first occurring KPI. Don’t stop at the first plateau you stumble upon and declare victory, because that could be a symptom, not a root cause. A premature conclusion will cost you hours of wasted time in configuration changes and retesting, only to see that degradation happens at the same time, meaning the load is encountering the same bottlenecks.
Note: If two or more KPIs appear to plateau at nearly the same time, you can usually see which plateau occurred first by overlaying the KPI graphs to get a clearer visualization. If that doesn't work, design a new load test that slows the ramp as it approaches the same peak capacity load. The slower ramp collects more data points around the bottleneck, which makes the results clearer.
9. Don’t lose sight of engineering transactions
Remember those engineering transaction scripts? These can be gold mines for uncovering scalability issues. Sometimes, if I don’t have back-end monitoring, I’ll rely on the data from these scripts alone. But together with monitoring, they tell a very accurate performance story.
The engineering transactions are executing on a sampling rate, so graph them in correlation with the user load. I usually name my transactions according to the tier that they reach: for example, "WEB," "APP," "MESSAGING," and "DB."
Use your analysis skills to see which engineering transaction starts to degrade first. Both the hit rate and the response times will reveal where you need to concentrate your efforts.
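As a rough aid to that analysis, the sketch below estimates when each tier's engineering transaction left its floor for good and reports the tier that did so first. The 10% margin and the "stays degraded through the end of the test" rule are assumptions, and the tier names follow the naming convention mentioned above.

```python
from typing import Dict, List, Optional, Tuple

Series = List[Tuple[float, float]]  # (elapsed_s, response_time_s)


def degradation_start(samples: Series, margin: float = 0.10) -> Optional[float]:
    """Start of the trailing run of samples above floor * (1 + margin)."""
    floor = min(rt for _, rt in samples)
    limit = floor * (1 + margin)
    start = None
    for elapsed, rt in reversed(samples):
        if rt > limit:
            start = elapsed  # keep walking back while still degraded
        else:
            break
    return start


def first_degrading_tier(series_by_tier: Dict[str, Series]) -> Optional[Tuple[str, float]]:
    """The (tier, elapsed_s) whose engineering transaction degrades first."""
    starts = []
    for tier, samples in series_by_tier.items():
        start = degradation_start(samples)
        if start is not None:
            starts.append((start, tier))
    if not starts:
        return None
    start, tier = min(starts)
    return tier, start
```

If the "DB" transaction degrades at 15 seconds while "APP" holds until 30, the database tier is where to concentrate first.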
10. Increase granularity for better clarity
Granularity is vital for both KPI monitoring and visualization analysis. Often, if a test runs very long, the load tool will use a coarser sampling interval when graphing the results. In effect, the tool aggregates data and presents only averaged samples in graphs.
Aggregated data is not optimal for analysis, so analyze the raw, absolute data in order to understand the scalability limitations. The coarser interval makes your graphs look cleaner, which looks good when reporting to upper management. For performance engineers, however, those cleaner graphs present skewed results.
To be clear, a real plateau consists of multiple data points. At a coarse resolution, plateaus may be disguised as peaks, and peaks hide the essential pattern that is a plateau.
Simply changing the data resolution (from 256 seconds to 15, for example) will drastically change the graph’s visual. Presto! Peaks become plateaus. Yes, the graph will look a heck of a lot busier, but squint and you can see the trend amid the noise.
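To see why resolution matters, it helps to reproduce what a tool does when it aggregates. This sketch averages raw samples into fixed-width buckets; it is an assumption about how a typical tool aggregates, not a description of any particular product.

```python
from typing import Dict, List, Tuple


def downsample(
    samples: List[Tuple[float, float]], bucket_s: float
) -> List[Tuple[float, float]]:
    """Average (elapsed_s, value) samples into buckets of bucket_s seconds."""
    buckets: Dict[int, List[float]] = {}
    for elapsed, value in samples:
        buckets.setdefault(int(elapsed // bucket_s), []).append(value)
    return [
        (index * bucket_s, sum(values) / len(values))
        for index, values in sorted(buckets.items())
    ]
```

Three sustained 15-second samples that form a plateau in the raw data collapse into a single averaged point in a 256-second bucket, which is exactly the information you need in order to find the bottleneck.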
If your tool can’t graph at the resolution you need, export the raw data and create your own graph. Yes, this is a manual process, but you won't end up spending your precious time chasing red herrings.
Also, run longer tests. Everyone is in a hurry to do a day’s work in an hour, but these things take time if you want to do them well. Make the analysis easier by slowing the ramp and having more KPI data points.
Only through testing do you cut risk
The most important result of any performance project is to isolate and expose the resource that limits your scalability. The job is not done until you have achieved this goal.
Even if your application currently scales to your target workload, identify that next bottleneck and put it on the radar, even if there's no need to eliminate it currently. Doing so will save you valuable time as workload increases.
Finally, performance testing is an iterative process. Every new build or environment change creates a requirement to validate the current scalability of an application. Even seasoned performance engineers who monitor production and suspect a root-cause bottleneck need a methodical performance harness to prove that their tuning will solve the scalability issue. So before changing production configurations, first validate the impact of the changes with a realistic performance test.
Whether you are load testing every build, a new application deployment, a new feature, new infrastructure, or new architecture, any change introduces a risk. Methodical performance testing will mitigate that risk for you.