A major software failure is embarrassing, but need not be career-ending. Think of it as an opportunity to learn how to recover from failure, how to prevent another crash from happening, and how to face your team.

A QA professional's guide to surviving a software failure

"What's the worst that could happen?"

That sentiment may seem funny when someone says it right before loading up on the sriracha sauce or running their first 5K. But QA professionals know all too well that sending software to production without catching major bugs can lead to a genuine worst-case scenario. Millions of dollars, or even lives, can be lost due to a software failure. Sometimes, even an isolated issue can leave a QA professional feeling they would rather face zombies than their peers on the development team. But while a serious crash can be a blow to any QA professional's ego, it doesn't have to be the end of the world.

Even the best QA experts have dealt with software failures for years. They know that introducing humans into the equation can result in error. Fortunately, if there's a lesson to be learned from failure, and you're quick to propose solutions, it's possible to recover from a software disaster stronger than ever.

First and foremost, don't dwell on it

Software failures are inevitable, and there's no sense in beating yourself up over them, says Olga Nikulina, QA engineer at B2B e-commerce software vendor NuOrder. "Dwelling on failure or letting the team down can be counterproductive and isolating, neither of which is valuable for moving forward and recovering," she says.

Just as all successes are the development team's efforts coming to fruition, all failures fall on the shoulders of the development team equally. The team needs to rally to examine what caused the failure, how it was overlooked, and how to fix both the issue and the process for future projects, Nikulina says.

Log the mistakes, and use them to your advantage

Nikulina advises fellow QA professionals to log major failures as detailed test cases for future software pre-releases. This creates a stronger test base. "Over time, getting familiar with the software and with its areas of potential weaknesses — because no software is free of those — builds an additional level of instinct-based, ad hoc testing," she says.
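One way to picture a failure log that doubles as a test base is sketched below. It is only an illustration, not Nikulina's or NuOrder's actual process: the `LoggedFailure` record, the `INC-` ticket numbers, and the `apply_discount` function under test are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedFailure:
    """One past failure, recorded as a repeatable test case."""
    ticket: str        # hypothetical incident identifier
    description: str   # what went wrong in production
    inputs: tuple      # arguments that triggered the failure
    expected: object   # the result the software should have produced


def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test: price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


# Every major failure gets an entry, so it is re-checked before each release.
FAILURE_LOG = [
    LoggedFailure("INC-001", "100% discount produced a negative price", (1999, 100), 0),
    LoggedFailure("INC-002", "0% discount changed the price", (1999, 0), 1999),
]


def run_regressions(func, log):
    """Replay every logged failure and return the tickets that fail again."""
    regressions = []
    for case in log:
        if func(*case.inputs) != case.expected:
            regressions.append(case.ticket)
    return regressions


if __name__ == "__main__":
    print(run_regressions(apply_discount, FAILURE_LOG))
```

An empty list means none of the logged failures has come back; any ticket in the list points straight to the original incident description.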

Communication between the development team and QA is also critical for recovery. By staying in touch with the development team working on the software, QA professionals are better poised to catch omissions or errors in the next iteration. "Collaborating with team members is probably the most valuable way to test a feature, because the expression 'two heads are better than one' really holds true," Nikulina says.

Overhaul how testing is viewed in the organization

QA professionals are already very familiar with this scenario: the timeline for a new software release gets extended, development runs late, and as a result, testing gets pushed to the bottom of the priority list and squeezed in at the last second. If testing was neglected before the launch, a major failure should serve as a wake-up call.

For example, a large telecommunications company was launching a mobile application with a social media component, and all services were coming from the back-end database. Part of the process involved security testing before the site was launched, according to Tony Rems, CTO of test tool vendor Appvance. However, while the application was still in development, and not even close to being finished, the organization wanted to run testing. Rems advised against this, but was overruled by developers who wanted to be able to check off the "security testing" box on the list.

"As organizations get very large, they have checkboxes around things that have to get done, but not rigor as to why it has to get done," Rems says. Quality and security need to be integrated into every aspect when it comes to building software, not just as checkboxes, but in an integrated fashion, he adds.

Another time, Rems was working on an e-commerce implementation for a Fortune 10 company. The go-live date coincided with an ad campaign, and there was no room to push back. Rems's team worked on its half of the application and had it ready for deployment. The other team wasn't ready but pushed its portion live anyway. The software went down two hours later and couldn't be brought back up. "It turned out that they had skipped testing entirely," he says. The other team had architected the application so that it couldn't scale beyond two simultaneous users.

Recovering from that software failure required a 48-hour redevelopment marathon to go live one hour before the ad campaign launched, and the experience underscored the importance of testing. "When those corners get cut, you see the risk to the software," Rems says.

Make sure testing keeps up with development

With the increasing speed of application deployment, particularly with agile and continuous delivery becoming part of the parlance, it's imperative that testing keep pace. "When I've worked with companies moving to DevOps...they talk about increasing velocity," Rems says. "The thing that hasn't changed in 20 years is testing." QA professionals are often still using the same tools for unit tests, functional tests, database tests, and other tests. In some cases, it might make sense to build your own testing tools, he adds.

"The [companies] you don't hear about having failures have built the capabilities themselves, and do a huge amount of testing as part of their continuous integration," says Rems. For example, Google and Amazon have built custom testing tools into their software development lifecycles and continuous integration processes. They look at quality as a necessary investment to prevent downtime and issues.

Eliminate human error as much as possible

Testing is important, but it can be worthless when human error creeps into the test code itself. Walter O'Brien, founder of Scorpion Computer Services, advocates removing humans from the testing process as much as possible. "The main take on a failure is that you have to do a root cause analysis: was it human error or a system error?" he says. The rate of human error can be as high as three percent, he adds.

Test automation is one way to eliminate human error. Write programs that run tests covering all possible scenarios, as in a chess game. That way, you can push one button and have all the tests run overnight.
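As a rough sketch of what running all possible scenarios can look like, the example below enumerates every combination of a few inputs and checks one rule against each. The checkout rule, the input values, and the function names are assumptions made up for illustration. Exhaustive enumeration like this is only practical when the input space is small.

```python
import itertools

# Hypothetical input dimensions for a checkout feature.
ROLES = ("guest", "member", "admin")
CART_STATES = ("empty", "has_items")
PAYMENTS = ("none", "card", "expired_card")


def can_checkout(role, cart, payment):
    """Hypothetical function under test."""
    return cart == "has_items" and payment == "card"


def find_violations(checkout_fn):
    """Try every combination of inputs and return the ones that break the rule."""
    violations = []
    for role, cart, payment in itertools.product(ROLES, CART_STATES, PAYMENTS):
        allowed = checkout_fn(role, cart, payment)
        # Rule: checkout must never succeed with an empty cart or without a valid card.
        if allowed and (cart == "empty" or payment != "card"):
            violations.append((role, cart, payment))
    return violations


if __name__ == "__main__":
    # 3 roles x 2 cart states x 3 payment states = 18 scenarios per run.
    print(find_violations(can_checkout))
```

Because the scenarios are generated rather than typed out by hand, nobody can forget a combination, which is the kind of human error the automation is meant to remove.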

By running those tests repeatedly, you can catch bugs that would otherwise slip through unnoticed. For example, one of O'Brien's clients was building security for a nuclear control system with all the usual authentication procedures: username, password, and a fingerprint to log in. O'Brien's testing program found that the software worked the first time around, but on the second log-in attempt the application remembered the credentials and would allow anyone to use it.
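The kind of bug described above only shows up when a test is run more than once against the same system. The sketch below is a simplified stand-in, not O'Brien's testing program or the client's software: it uses only a username and password (no fingerprint), and the `LeakyLogin` and `StatelessLogin` classes are hypothetical.

```python
class LeakyLogin:
    """Hypothetical buggy login: after one success it stops checking credentials."""

    def __init__(self, valid_credentials):
        self._valid = set(valid_credentials)
        self._remembered = False

    def login(self, user, password):
        if self._remembered:
            return True  # bug: the earlier success is reused for everyone
        ok = (user, password) in self._valid
        self._remembered = ok
        return ok


class StatelessLogin:
    """Hypothetical correct login: every attempt is checked on its own."""

    def __init__(self, valid_credentials):
        self._valid = set(valid_credentials)

    def login(self, user, password):
        return (user, password) in self._valid


def second_attempt_leaks(system):
    """Log in with valid credentials, then retry with bad ones on the same system.

    Returns True if the bad second attempt was accepted.
    """
    if not system.login("alice", "correct-horse"):
        raise RuntimeError("valid credentials were rejected on the first attempt")
    return system.login("mallory", "wrong-password")


if __name__ == "__main__":
    creds = [("alice", "correct-horse")]
    print(second_attempt_leaks(LeakyLogin(creds)))
    print(second_attempt_leaks(StatelessLogin(creds)))
```

A test that creates a fresh system for each attempt would pass both classes; reusing the same instance across attempts is what exposes the leak.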

Look at all parts of development

In another instance, a team installing O'Brien's software put an automated stock trading system that was supposed to be in testing mode into production. The system was robotically trading real money. After turning off the system, the team discovered that a setting had been changed in the configuration manager, something that's often viewed as a side hobby of developers, O'Brien says.

But in reality, configuration manager errors account for 30 percent of bugs, issues, and downtime, O'Brien says. A perfectly coded piece of software that's passed all tests with flying colors can be installed upside down and make the developers look bad, which is why a dedicated configuration manager role is needed, he adds.
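Configuration can be tested like code. The sketch below shows one possible startup check that compares the deployed settings with the environment the team intended. It is an assumption-laden illustration, not a description of O'Brien's system: the `TradingConfig` fields, the `check_config` function, and the example.com broker URLs are all hypothetical.

```python
from dataclasses import dataclass

ALLOWED_MODES = ("test", "production")


@dataclass(frozen=True)
class TradingConfig:
    """Hypothetical settings for an automated trading system."""
    mode: str        # "test" or "production"
    broker_url: str  # endpoint that orders are sent to


def check_config(config, intended_mode):
    """Return a list of problems; an empty list means the config matches intent."""
    problems = []
    if config.mode not in ALLOWED_MODES:
        problems.append(f"unknown mode: {config.mode!r}")
    if config.mode != intended_mode:
        problems.append(f"mode is {config.mode!r} but {intended_mode!r} was intended")
    # Assumed convention for this sketch: test brokers have "sandbox" in the URL.
    if config.mode == "test" and "sandbox" not in config.broker_url:
        problems.append("test mode must point at a sandbox broker")
    return problems


if __name__ == "__main__":
    deployed = TradingConfig(mode="production", broker_url="https://live.broker.example.com")
    for problem in check_config(deployed, intended_mode="test"):
        print("refusing to start:", problem)
```

Running a check like this before the system starts trading turns a silent settings change into a visible failure.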

Ultimately, the only successful way to handle software failure is to learn from it. That means finding out what went wrong, and figuring out what can be done to make sure it never happens again. It's not a career-ending move if there's a plan, and experts agree that testing should definitely be included in that plan.
