Epic IT Ops fails: The 5 worst blunders of 2017

Ericka Chickowski, Tech Journalist, Freelancer

According to analyst estimates, the cost of IT downtime in North American organizations adds up to somewhere north of $700 billion per year. Breaches and other cybercrimes cost the global economy $450 billion in 2016. And though the cost of poor application and web performance is hard to aggregate, SaaS providers lost an average of $195,000 in 2016 due to performance issues, according to a survey conducted by the digital-performance analytics company Catchpoint.

That's why we've gathered up some of the most harrowing public examples of IT Ops failures witnessed this year. By reviewing the mistakes and mishaps of others you can glean some valuable lessons—even if it is simply developing an awareness that the cost estimates for these issues are probably not overinflated. If anything, they may be coming in a bit low.


1. Amazon AWS goes dark

One of the biggest dumpster fires in IT ops land resulted from the mother of all outages, when Amazon AWS saw a huge swath of its S3 buckets in the eastern US region go down for five hours in late February of this year. The outage brought down sites and affected web performance at some of the biggest brands on the web, including Slack, Trello, Autodesk, Twilio, Citrix, Expedia, Nest, Adobe, and dozens of others. To make matters worse, Amazon was unable to update its dashboard to let customers know what was going on.

The cascading effects of such an outage at a major cloud infrastructure provider were mind-boggling. According to one estimate by Cyence, business interruption losses amounted to about $150 million for S&P 500 companies and $160 million for US financial services companies that used the affected S3 infrastructure.

The lesson: Don't put all your eggs in one cloud infrastructure basket, said Steve Brown, founding partner of IT services firm Rutter Networking Technologies. Organizations should seek ways to increase the redundancy of cloud workloads and ensure that those workloads are deployed on robust machines in different geographical regions, either through a single cloud provider or a multi-cloud architecture.

This is a call to arms for better workload distribution, Brown added. "To minimize the risk of an outage due to overloading, you can distribute the workload across multiple redundant systems in what is known as an active-active high availability (HA) cluster," he said in a blog post. "You can do this by putting redundant systems behind a load balancer, the device responsible for distributing the workload. In AWS, workload distribution is achieved through what Amazon calls Elastic Load Balancing."
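The active-active idea Brown describes can be illustrated with a minimal sketch. This is not how Elastic Load Balancing is implemented; it is a toy round-robin balancer, and the class name, method names, and backend labels are all hypothetical. It only shows the core behavior: every healthy backend takes traffic, and a backend marked down is skipped so the rest keep serving.

```python
class RoundRobinBalancer:
    """Toy active-active balancer (hypothetical): all healthy backends share traffic."""

    def __init__(self, backends):
        self.backends = list(backends)     # fixed rotation order
        self.healthy = set(self.backends)  # backends currently eligible for traffic
        self._next = 0                     # index of the next backend to try

    def mark_down(self, backend):
        """Take a backend out of rotation, e.g. after a failed health check."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Return a known backend to rotation."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, skipping any that are marked down."""
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


# Hypothetical backends spread across regions.
lb = RoundRobinBalancer(["east-1", "west-1", "eu-1"])
lb.mark_down("east-1")  # simulate a regional outage
survivors = [lb.pick() for _ in range(4)]
```

With one region marked down, requests keep flowing to the remaining two, which is the point of spreading workloads across regions rather than relying on a single one.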

2. Heads roll after Equifax data breach

When your CIO, CSO, and CEO "retire" in the wake of an IT failure, you know it was bad. This year's Equifax breach was one of the worst, and not just because it pushed the highest levels of executive power out the door. The credit-reporting agency exposed sensitive financial information for more than 140 million people when its systems were breached through a well-known and very patchable vulnerability in the Apache Struts framework that was running on some of Equifax's systems.

When he went in front of Congress to answer for the breach, Equifax's former CEO tried to throw a single, unnamed security worker under the bus, claiming that his failure to patch the system put the entire business at risk.

But anyone with any security knowledge knows that fundamental security blocking and tackling, such as patch management, should never fall on one person's shoulders. And mistakes made in one area, such as patch management, should be mitigated by multiple controls in other areas.

For example, why wasn't Equifax engaging in proper network segmentation so that a single system intrusion didn't turn into a network-wide breach disaster? Clearly, the underlying problem here was a lack of security processes and culture in a company whose core business is aggregating, analyzing, and safeguarding data about millions of people.

And that was just the tip of the iceberg in this disaster. There were irregularities that had some executives selling off company stock before the breach news went public, and once things were in the open, the company asked consumers to relinquish their right to sue when they signed up for security monitoring services. This wasn't just an IT ops fail. It was a complete business fail from top to bottom.

"In the case of Equifax, it not only appears as though they had their head in the sand from a cybersecurity perspective, but also from a governance and breach response perspective as well. The CEO and his team of internal and external providers bungled every step of the response: messaging, PR, consumer protection communications and offers, and everything else imaginable," said Chris Pierson, CSO and general counsel for financial platform vendor Viewpost.

"The breach is a shining example of what happens when you do not prepare for data breach response ahead of time, do not adequately [test] your responses, and do not have that single incident commander leading the charge." 
Chris Pierson


3. British Airways outage pulls plug on service

Airlines are notorious for running some of their most critical business and booking systems on older, legacy platforms. As a result, large-scale outages are not infrequent in the industry, and it's common to see examples of IT ops mayhem in the travel industry every year. In fact, this year started out with just such a case from Delta, when an outage of several hours caused the airline to cancel hundreds of flights.

But nothing quite takes the cake like the British Airways outage last summer, caused by an engineer who first accidentally pulled the plug on the power supply at a London data center and then caused a surge while reconnecting it.

The carnage on this one was epic, causing more than 700 flight cancellations over the course of three days that left 75,000 passengers stranded. Estimates pegged the total costs from this fail at more than $100 million.

This particular IT operations catastrophe "provides a bonanza of lessons for execs everywhere," said Forrester Research's Naveen Chhabra in his blog. The most fundamental: Even though infrastructure resilience is expensive, it is crucial to build redundancies into the systems that are the lifeblood of your company.

Management doesn't want to burden the company's budgets with investments in building redundancy, but they're inevitably shocked by the whopping damage disruption causes, Chhabra said.

"Direct passenger compensation alone will exceed BA's predicted cost savings from not implementing enough redundancies."
Naveen Chhabra

4. GitLab loses production data

A hot startup in DevOps land, GitLab has garnered tens of millions in venture funds to fuel its web-based Git repository management platform. But early this year, the company showed that it was still in learning mode when it suffered an 18-hour outage and lost 300 GB of customer data when backup procedures didn't go according to plan. After a backup failed, the company had to recover to a six-hour-old snapshot, and anything created during that six-hour window was lost.

On the plus side, the company was extremely transparent throughout the recovery process, sharing its in-depth post-mortem processes, plans for improving backup and recovery procedures, and lessons learned from the ordeal.

The lesson that others can learn from GitLab is that to create a daring IT environment, organizations need the procedural backstops that can help them recover when mistakes are made.

"An ideal environment is one in which you can make mistakes but easily and quickly recover from them with minimal to no impact," the GitLab team wrote in its post-mortem. "This in turn requires you to be able to perform these procedures on a regular basis, and make it easy to test and roll back any changes."

For GitLab, that means developing procedures that allow developers to test database migrations, and to create more resilient recovery procedures across the infrastructure. Hopefully, that will include not just the procedures themselves but regular stress testing of them to ensure that catastrophes don't occur.
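Regular testing of backups can start with something very small. The sketch below is an assumption-laden illustration, not GitLab's tooling: the function name, the tuple layout, and the one-hour threshold are all hypothetical. It checks that the newest backup is recent enough to meet a target recovery point and that it is not empty.

```python
from datetime import datetime, timedelta


def check_backups(backups, now, max_age=timedelta(hours=1), min_bytes=1):
    """Return a list of problems; an empty list means the check passed.

    backups: list of (finished_at: datetime, size_bytes: int) tuples.
    max_age: hypothetical recovery point target; data newer than the
             latest backup is what would be lost in a restore.
    """
    if not backups:
        return ["no backups found"]

    problems = []
    # Only the most recent backup matters for the recovery point.
    finished_at, size_bytes = max(backups, key=lambda b: b[0])
    age = now - finished_at
    if age > max_age:
        problems.append(f"latest backup is {age} old, exceeds {max_age}")
    if size_bytes < min_bytes:
        problems.append("latest backup is empty")
    return problems


# A six-hour-old snapshot, like the one GitLab had to restore from,
# fails a one-hour recovery point target.
now = datetime(2017, 2, 1, 0, 0)
stale = [(datetime(2017, 1, 31, 18, 0), 10**9)]
stale_problems = check_backups(stale, now)
```

Running a check like this on a schedule, and alerting when it returns anything, turns a silent backup failure into a visible one before a restore is needed. It does not replace periodically restoring a backup to prove it is usable.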

5. Accenture's Amazon S3 fail exposes credentials, API data

What is a management and technology consulting firm but its intellectual property and its relationship with its customers? Accenture put both assets at risk with a forehead-slapping cloud leak caused by insecurely configured Amazon S3 buckets.

Discovered this fall, the leaky buckets left a smorgasbord of the type of data that makes cybercriminals and corporate spies salivate. This included "API data, authentication credentials, certificates, decryption keys, customer information, and more data that could have been used to attack both Accenture and its clients," according to Dan O'Sullivan, cyber-resilience analyst with UpGuard, the security firm that uncovered and disclosed the leak. All of this data revolved around access to the Accenture Cloud Platform, the firm's cloud management system.

"Taken together, the significance of these exposed buckets is hard to overstate. In the hands of competent threat actors, these buckets, accessible to anyone stumbling across their URLs, could have exposed both Accenture and its thousands of top-flight corporate customers to malicious attacks that could have done an untold amount of financial damage," O'Sullivan said.

For clients of Accenture or other big consulting firms, the incident is a wake-up call and a reminder that outsourcing data and systems to the "experts" doesn't necessarily protect you from big lapses in security. For the tech industry overall, the lesson is that S3 buckets are no safer than any other default infrastructure option, on-premises or off. You need careful configuration management to ensure the security of AWS infrastructure, and the onus is on you, the customer, to do so.
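One piece of that configuration management can be automated. The sketch below is a minimal illustration under stated assumptions: it takes a dictionary shaped like the response of S3's GetBucketAcl operation (a "Grants" list whose entries carry a "Grantee" and a "Permission") and flags grants to the two predefined public groups. The function name and sample data are hypothetical, it makes no AWS calls, and it does not inspect bucket policies, which are a separate way a bucket can become public.

```python
# Predefined S3 group URIs that extend access beyond the bucket owner.
PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def public_grants(acl):
    """Return (group_uri, permission) pairs for grants to public groups.

    acl: dict assumed to mirror a GetBucketAcl response.
    """
    found = []
    for grant in acl.get("Grants", []):
        grantee = grant.get("Grantee", {})
        if grantee.get("Type") == "Group" and grantee.get("URI") in PUBLIC_GROUPS:
            found.append((grantee["URI"], grant.get("Permission")))
    return found


# Hypothetical ACL: the owner has full control, and anyone can read.
sample_acl = {
    "Grants": [
        {"Grantee": {"Type": "CanonicalUser", "ID": "owner-id"},
         "Permission": "FULL_CONTROL"},
        {"Grantee": {"Type": "Group",
                     "URI": "http://acs.amazonaws.com/groups/global/AllUsers"},
         "Permission": "READ"},
    ]
}
exposed = public_grants(sample_acl)
```

A non-empty result means the bucket's ACL exposes it to a public group and deserves a closer look.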

Unfortunately, S3-related incidents are increasingly common. Most recently, a Pentagon contractor exposed a huge repository of intelligence, scraped from over a billion social media posts, in much the same fashion as the Accenture leak. 

Get back to risk-management basics 

If there's one common denominator in all of these epic fails, it is this: There's no substitute for good risk management. You must thoroughly assess risks, build in controls to mitigate them, and develop solid incident response and recovery plans to reduce the impact should disaster strike.