How to prevail over airline-grade system outages

Last week, Delta Air Lines’ operations were crippled by a power outage that caused a catastrophic systems failure and massive disruption to the company's passengers and employees. Power was restored by midmorning, but hundreds of flights still had to be canceled, with cancellations and delays continuing for several days. This type of problem is not unique to Delta. Less than three weeks ago, Southwest Airlines canceled more than 1,000 flights in the wake of a similar system outage. Although the underlying causes of these system failures were different, the outcomes were largely the same: grounded planes, stranded passengers, and news articles asking the same questions. Why couldn’t this be prevented, and what could the airlines’ IT teams have done differently to minimize the impact on customers and front-line employees?

In a video published on social media by Delta News Hub, Delta CEO Ed Bastian apologized to passengers for the outage and assured them that the airline was doing everything it could to restore operations as quickly as possible. As a QA professional who has lived through a major and very public systems outage, I can relate to the "all hands on deck" response from Delta. The good news is that it's not the end of the world: You, the QA professional, will survive, if you play your cards right. Here's how to prepare for the unexpected and survive a catastrophic hardware or software system failure without doing too much damage to your company's brand and customer loyalty.

Failure is inevitable. What's your plan?

It’s easy to be a Monday-morning quarterback and say that with proper redundancy measures in place none of this would have ever happened. It’s just as easy for software quality teams to blame software failures on network outages and shrug off any responsibility for the systems not coming back online promptly after power had been restored. Let’s face it: System failures will happen, and having a plan for troubleshooting and recovering from production outages in a timely manner is as important as putting in place the preventative measures to avoid those outages in the first place.

How I lived through a massive systems outage

Shortly after I joined T-Mobile USA as head of IT QA and testing, we faced a very public Sidekick data outage that resulted in an estimated 800,000 Sidekick smartphone users in the United States temporarily losing personal data, including emails, address books, and photos. The root cause was a partner’s hardware failure, and at the time the incident was described as the biggest disaster in cloud computing history. Fortunately, T-Mobile not only made it through the incident but successfully rebuilt customer confidence and trust.

In hindsight, the outage was a blessing in disguise. The T-Mobile IT team used the failure as a wake-up call to begin modernizing IT systems, ultimately helping the carrier secure its current position as one of the leading mobile service providers. The lessons that I learned from that experience will benefit any company, whether you're an airline, a telecom, or in another industry.

1. First things first

First, take a deep breath, then focus on getting the systems back up and running as soon as possible. Take stock of what business-critical processes are impacted by the outage and what systems need to be brought back up and in which order to minimize impact on the customers and front-line employees. Establish a Dev/Test/Ops command center with an all-hands-on-deck mandate with clear leadership, then break into smaller teams for technical troubleshooting and reviewing logs and the knowledge base. It helps to get everyone’s cellphone number so that you can reach the right people quickly when you need them. If needed, plan for your team to work in shifts until the immediate problem is resolved and operations restored.
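Working out which systems to bring back up, and in which order, is essentially a dependency-ordering problem. The following is a minimal sketch of that idea in Python, using the standard library's graphlib module (Python 3.9 or later). The system names and the dependency map are hypothetical and chosen only for illustration; a real inventory would come from your own runbooks.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists the systems that must be
# running before it can be restarted. Names are illustrative only.
DEPENDS_ON = {
    "database": set(),
    "auth": {"database"},
    "crew_scheduling": {"database"},
    "reservations": {"database", "auth"},
    "check_in_kiosks": {"reservations"},
}


def restore_waves(depends_on):
    """Group systems into waves; everything in a wave can restart in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # systems whose dependencies are up
        waves.append(ready)
        sorter.done(*ready)  # mark them restored so dependents become ready
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(restore_waves(DEPENDS_ON), start=1):
        print(f"Wave {number}: {', '.join(wave)}")
```

Each wave could be handed to one of the smaller troubleshooting teams, with the next wave starting only once the previous one is confirmed healthy.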

2. Provide updates frequently to executives and communications teams

Assemble the most knowledgeable people in a virtual command center and appoint a liaison to provide updates to executives every 15 minutes. Keep them informed of any progress, and of any obstacles that need to be removed. Open communication and regular progress reports as you recover from the failure will help you set expectations and plan for customer and employee communications.

One other thing: Keep nontechnical executives and communications teams off the technical bridge, if at all possible, to avoid distraction from the immediate goal of restoring the system. Keep them informed, but keep them out of the detailed technical interactions.

3. Troubleshoot now, diagnose later

A system-wide outage is not the time to finger-point or assign blame, nor is it the time to dive into deep diagnostics to try to find the root cause. Focus on getting up and running ASAP. The rest can wait until after the crisis has passed. As you bring the system back up, tell the team to be diligent about taking notes. You will need them to review what has changed, and in what order. Keep your runbook for all your core systems handy to prevent errors as teams rush to fix immediate issues.
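Notes taken during recovery are most useful when they capture who changed what, on which system, and when. As a rough sketch of one way to structure such notes, here is a small append-only incident log in Python. The class names, field names, and example entries are all hypothetical; a shared document or chat channel with timestamps would serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    at: datetime   # when the action was taken
    who: str       # person or team that took it
    system: str    # system that was touched
    action: str    # free-text description of the change


@dataclass
class IncidentLog:
    entries: list = field(default_factory=list)

    def record(self, who, system, action, at=None):
        """Append a note; the timestamp defaults to the current UTC time."""
        entry = Entry(at or datetime.now(timezone.utc), who, system, action)
        self.entries.append(entry)
        return entry

    def timeline(self):
        """Return all notes in chronological order for the later review."""
        return sorted(self.entries, key=lambda entry: entry.at)

    def changes_to(self, system):
        """Return the chronological notes for a single system."""
        return [entry for entry in self.timeline() if entry.system == system]


if __name__ == "__main__":
    log = IncidentLog()
    log.record("ops", "database", "failed over to standby")
    log.record("dev", "reservations", "restarted service")
    for entry in log.timeline():
        print(entry.at.isoformat(), entry.who, entry.system, entry.action)
```

Keeping the log append-only during the crisis means nothing is rewritten in the heat of the moment, and the chronological view is available once the diagnosis begins.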

4. Don’t let a good catastrophe go to waste

Once the crisis is behind you, identify what went wrong in the first place, how it could have been prevented, and what could have been done differently in handling the situation. While the executives are dealing with the aftermath of the calamity on the customer-facing front (offering future spending credits, vouchers, rewards, or other perks), focus on recognizing your team and rewarding them for the extraordinary effort they put forth to help recover from disaster.

Then focus on the future: Start thinking about how you can prevent such a crisis from happening again. Create and review an inventory of your software and hardware systems. Identify the systems that are most prone to failures and consider modernizing legacy applications and untangling some of the excess complexity in your core hardware and software infrastructure.

Hardware failures shouldn’t bring your software down. Your systems shouldn’t be that fragile. But in some industries, such as air travel, they are. Different companies deal with this differently. For example, Amazon and Google use open-source DevOps tools similar to Netflix's Chaos Monkey, which randomly bring systems down to test how well their redundancy works in different circumstances.
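The idea of randomly bringing components down to check redundancy can be illustrated with a toy simulation. The sketch below is not Chaos Monkey and does not use its API; the Replica, Service, and chaos_round names are hypothetical and exist only to show the principle: take one replica offline at random, then verify that the service still answers.

```python
import random


class Replica:
    """A stand-in for one redundant instance of a service."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Tries each replica in turn, failing only if all of them are down."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("all replicas are down")


def chaos_round(service, rng):
    """Take one random replica down and report whether the service survived."""
    victim = rng.choice(service.replicas)
    victim.up = False
    try:
        service.handle("health-check")
        survived = True
    except RuntimeError:
        survived = False
    finally:
        victim.up = True  # restore the replica after the experiment
    return victim.name, survived


if __name__ == "__main__":
    service = Service([Replica("replica-a"), Replica("replica-b")])
    rng = random.Random()
    for _ in range(5):
        name, survived = chaos_round(service, rng)
        print(f"took down {name}: service survived = {survived}")
```

A service with a single replica fails every round in this simulation, which is the kind of fragility such experiments are meant to expose before a real outage does.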

What did you learn from your failures?

These stories of outages and disruptions aren't specific to airlines, telcos, or any other vertical market or industry. Failure can happen anytime, to anyone. The key to recovery is to identify your own risks and vulnerabilities, improve your systems for the future, and find ways to bounce back without damaging your company’s reputation or the bottom line.

Has your company survived a major outage or systems failure? What lessons did you learn from your experience, and how have you improved your systems' stability? Weigh in below with your own stories and advice.

