Performance reality check: 4 ways to keep up with unexpected surges

Christopher Null, Freelance writer

From online retail to finance, and from streaming entertainment to healthcare providers and agencies, all manner of businesses and government entities have seen their web traffic and app activity explode. COVID-19 has changed everything for many organizations, and it did so overnight, with no time to prepare.

Microsoft says the use of its Teams collaboration software in Italy increased by 775% in March alone, sending the company scrambling to increase capacity while implementing temporary restrictions and quotas for Azure customers to prevent systems from going offline.

While performance is increasingly considered during the software design phase, performance engineering is more typically an iterative process that begins after a service or app has launched. If the system bogs down under heavy load, engineers figure out what they can do to make it better, whether that's redesigning code through component decoupling or simply deploying cloud services such as caching systems and content delivery networks. Performance improves, and everyone relaxes.

But enterprises such as Instacart, Zoom, Amazon, and Netflix just don't have the luxury of time in the new COVID-19 reality. Last month, Netflix usage hit an all-time high, and it's still climbing. Analysts and users alike are increasingly worried about whether it and other internet services can handle the load.

So far they have, through a combination of savvy business decisions and clever engineering. TechBeacon spoke with several tech leaders to suss out the best practices keeping these services afloat amid a huge and sudden crush of activity, and what other organizations can learn from how they're coping.

Here are four best practices from the front lines of performance engineering.


1. Embrace simplicity

James Pulley, co-host of the performance testing podcast PerfBytes, harkens back to another event that changed much overnight. During 9/11, the major online news organizations "had to move to a highly simplified web page because they were getting so much traffic," he said.

"They got rid of fancy graphics to reduce the number of bytes they were sending. Today, the same strategy is being employed at a Netflix level, reducing bit rates at the top end." 
—James Pulley

Sure enough, the company has voluntarily cut its bandwidth usage in Europe. That not only keeps streaming from overloading public networks; it also helps Netflix's own servers avoid a meltdown.

Byte creep is a real problem, Pulley said. The average size of a web page grew from about 1MB in 2011 to 3MB in 2017, and it's still on the rise. As such, the first step organizations need to take to weather a traffic crush is to cut the fat.

"If you have six web beacons, go to one. If you have a style sheet full of artifacts, get all of that out."
—James Pulley

Minify and compress (gzip) wherever possible; you want the data you send to be as small as it can be. And make sure your cache model is refined and working: if you see a file that hasn't changed in five years still being requested every five days, "you're wasting bandwidth," he added. Finally, be more aggressive about returning resources to the pool by shortening user sessions.
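Pulley's trimming advice is easy to quantify. The sketch below (the sample payload and function name are illustrative, not from any cited system) uses Python's standard library to gzip a response body and attach a long-lived Cache-Control header so clients and CDNs stop re-requesting unchanged files:

```python
import gzip

def compress_response(body: bytes, max_age: int = 86400) -> tuple[bytes, dict]:
    """Gzip a response body and build caching headers.

    Returns the compressed payload plus headers telling clients
    (and CDNs) to cache the asset instead of re-requesting it.
    """
    compressed = gzip.compress(body, compresslevel=9)
    headers = {
        "Content-Encoding": "gzip",
        "Cache-Control": f"public, max-age={max_age}",
    }
    return compressed, headers

# A repetitive payload (think: a style sheet full of duplicated rules)
# compresses dramatically; minifying it first would shrink it further still.
page = b"body { margin: 0; } " * 500
small, headers = compress_response(page)
print(len(page), "->", len(small))
```

The same idea applies whether compression happens in the app, the web server, or the CDN edge; the point is that every byte not sent is capacity reclaimed.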

2. Treat data like code

Matt Yeh, senior director of product marketing at DataOps platform vendor Delphix, said that one way to scale your development cycles to handle increased demand is by focusing on agility in the data tier. This means getting teams to treat data the same way they treat code.

Feeding and refreshing test data for cloud-based test environments from an on-premises production instance "is harder than many leaders think," he said. Without the ability to automate data delivery as part of a DevOps workflow in the cloud, "a data agility problem is created that drives bad behavior and slows the delivery pipeline."

Data agility is the most important factor for application development, because while code is being built and deployed via self-service, "the data tier still remains rigid and resistant to the same level of speed and automation at most organizations," Yeh said.

Make your tools and processes address the data tier in the same way teams have improved the flow of their application code.

"You can eliminate data-related waste, rework, and wait times."
—Matt Yeh
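What "treating data like code" can look like in practice: the sketch below (the table, row counts, and masking scheme are hypothetical) provisions a masked subset of a production database into a test fixture via a repeatable script, the kind of automated data delivery a DevOps workflow can call on demand instead of relying on manual copies:

```python
import sqlite3

def provision_test_fixture(prod: sqlite3.Connection,
                           test: sqlite3.Connection,
                           sample_rows: int = 100) -> int:
    """Copy a masked subset of 'production' data into a test database.

    Because the fixture is produced by a script, it can be versioned,
    reviewed, and rerun just like application code.
    """
    test.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
    rows = prod.execute(
        "SELECT id, email FROM customers ORDER BY id LIMIT ?", (sample_rows,)
    ).fetchall()
    # Mask PII on the way through, so test environments never see real emails.
    masked = [(cid, f"user{cid}@example.test") for cid, _ in rows]
    test.executemany("INSERT INTO customers VALUES (?, ?)", masked)
    test.commit()
    return len(masked)

# Stand-in 'production' database for the sketch.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
prod.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"real{i}@corp.com") for i in range(500)])

test_db = sqlite3.connect(":memory:")
count = provision_test_fixture(prod, test_db)
print(count)
```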


3. Take a 'shift-left' approach to quality

The Rule of 10 holds that a defect becomes roughly 10 times harder and more expensive to fix with each successive phase of the delivery lifecycle in which it survives. To address this, said Tal Weiss, CTO of software quality tool vendor OverOps, many organizations are starting to understand the merits of adopting a shift-left approach to quality.

This means that by increasing quality measures taken in the development and testing phases of software delivery, you can significantly reduce the odds of production issues. This mindset can be applied to the data tier as much as it can be to code.
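The Rule of 10's arithmetic, sketched with illustrative (not empirical) dollar figures:

```python
def fix_cost(base_cost: float, phase_index: int) -> float:
    """Rule-of-10 sketch: each later lifecycle phase multiplies
    the cost of fixing a defect by roughly 10."""
    return base_cost * 10 ** phase_index

# Hypothetical phases; a $1 fix in requirements becomes $10,000 in production.
phases = ["requirements", "design", "development", "testing", "production"]
for i, phase in enumerate(phases):
    print(f"{phase}: ${fix_cost(1.0, i):,.0f}")
```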

4. Embrace performance engineering tools

"Performance engineering and testing doesn't necessarily need to change, but it definitely needs to happen," said Guillaume Quintard, software engineer for content delivery network vendor Varnish Software.

Performance engineering teams need to know where their breaking point is and have a proactive plan in place for avoiding an outage when that point is reached.

"As an industry, we've had these tools for a long time."
—Guillaume Quintard
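One way to find that breaking point ahead of time is to ramp load until a latency SLO is breached. The sketch below substitutes a hypothetical latency model for a real measured system; in practice the inner function would be replaced by actual load-test measurements:

```python
def simulated_latency_ms(concurrent_users: int, capacity: int = 200) -> float:
    """Stand-in for a measured response time: flat until the service
    saturates, then degrading sharply. A toy model, not a real system."""
    base = 50.0
    if concurrent_users <= capacity:
        return base
    return base * (concurrent_users / capacity) ** 2

def find_breaking_point(slo_ms: float = 200.0, step: int = 10) -> int:
    """Ramp simulated load until latency would breach the SLO; the last
    compliant load level is the breaking point to plan around."""
    users = step
    while simulated_latency_ms(users + step) <= slo_ms:
        users += step
    return users

print(find_breaking_point())
```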

Mike Canney, business development and product strategy professional at performance analytics tool vendor Accedian, expanded on that thought. Performance engineering teams typically have a very good understanding from a systems perspective, but "what's missing is a real understanding of the network and application visibility," he said.

To get a complete picture of network performance, you have to understand things like bandwidth and latency, he said. "You need to be able to see how having multiple users impacts bandwidth, and how much bandwidth and latency impacts response time." 

Canney also noted that sheer bandwidth isn't always the issue. There are many variables that can affect a network.

"Before you throw money at expanding bandwidth, make sure that's actually the problem by utilizing network visibility tools."
—Mike Canney
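A back-of-the-envelope model makes Canney's point concrete: response time has a latency term and a bandwidth term, and for small payloads the latency term dominates, so buying more bandwidth barely helps. All numbers below are illustrative:

```python
def response_time_ms(payload_kb: float, bandwidth_mbps: float,
                     rtt_ms: float, round_trips: int = 1) -> float:
    """Toy response-time model: latency cost per round trip plus the
    time to push the payload through the pipe.

    1 Mbps is one kilobit per millisecond, so kilobits divided by
    Mbps comes out directly in milliseconds.
    """
    transfer_ms = payload_kb * 8 / bandwidth_mbps
    return rtt_ms * round_trips + transfer_ms

# A bloated 3MB page vs. a trimmed 1MB one, at 10 Mbps with 50ms RTT:
print(response_time_ms(3000, 10, 50, round_trips=4))
print(response_time_ms(1000, 10, 50, round_trips=4))
# For a small 100KB payload, latency dominates; 10x the bandwidth barely helps:
print(response_time_ms(100, 10, 50), response_time_ms(100, 100, 50))
```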

Other factors to weigh

It's also important to understand business transactions—measuring the unit of work that is actually performed in each transaction. For example, a single search on a retail website may ultimately create thousands of micro-transactions as the item is located, inventory is measured, the offering of upsells or related goods is weighed, and so on.

"Once you understand the business transactions and what goes into them, it's much easier to do network modeling and predictive analysis."
—Mike Canney
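To see how that fan-out adds up, here's a toy model of the micro-transactions behind a single retail search; the step names and counts are invented for illustration, but the multiplication is the point:

```python
# Hypothetical backend fan-out per step of one user-visible search.
FAN_OUT = {
    "catalog_lookup": 40,    # shards queried for matching items
    "inventory_check": 40,   # per-item stock queries
    "pricing": 40,           # price and promotion lookups
    "upsell_scoring": 120,   # related-goods candidates evaluated
}

def micro_transactions(user_transactions: int) -> int:
    """One business transaction multiplies into many backend calls;
    modeling that fan-out is the first step toward predictive analysis."""
    return user_transactions * sum(FAN_OUT.values())

# 1,000 searches per minute becomes 240,000 backend calls per minute.
print(micro_transactions(1000))
```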

Ultimately, all organizations need to take a long-term approach to performance engineering. 

"Testing is more relevant than ever, but relying on traditional tests is no longer an option."
—Tal Weiss

Building in automated quality gates and feedback loops will allow for thorough, fast testing that doesn't hold up release timelines. This can be done by leveraging a variety of automated testing methods within your CI/CD pipeline, such as static and dynamic code analysis.
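As a minimal sketch of such a gate, the snippet below (a toy check, not a real analysis tool) uses Python's ast module to reject code that fails to parse or uses a bare except, the kind of rule a CI pipeline could enforce before merge:

```python
import ast

def quality_gate(source: str) -> list[str]:
    """Toy static-analysis gate: report code that fails to parse or
    that uses a bare 'except:' (a common latent-bug pattern)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            problems.append(f"bare except at line {node.lineno}")
    return problems

snippet = "try:\n    risky()\nexcept:\n    pass\n"
issues = quality_gate(snippet)
print(issues)  # a CI job would fail the build when this list is non-empty
```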

Process and procedure trump everything

Goutham Belliappa, vice president of AI engineering at Capgemini, sums up the matter by noting that the overarching issue surrounding traffic surges is not usually a technical one but a human one.

"COVID-19 has forced many companies to go digital, even those that thought they were unable to. For many businesses, bandwidth is being slammed, but the problem is more of a process and procedure one." 
—Goutham Belliappa

For example, VPNs hastily rolled out to employees who were never expected to work from home have only added to the burden on IT departments that are themselves no longer in the office.

The bottom line: The tools needed to weather the storm are already in most organizations' collective arsenals, but development teams need to more fully utilize them and better understand how the data tier affects application performance.

The good news is that it's easy to take some smart first steps. Reducing data overhead and minimizing data transfer, even if only temporarily, are essential to keeping your services up and running during this crisis.
