In June 2015, the largest provider of video, high-speed Internet, and voice services to US-based residential customers and small to medium-sized businesses experienced a widespread outage affecting millions of its customers.

Major Internet outage shows the value of capacity planning and IT Ops

Todd DeCapua, Technology leader, speaker & author, CSC

This article is part of an ongoing series of Performance Retrospectives that assess real-world application performance issues in the recent past, analyze what might have happened, and offer up best practices that just might help you avoid similar problems.

How reliant are your home and business on the Internet? What impact would you experience if you had poor service all day, then a total loss for three hours in the evening? This performance retrospective examines a capacity planning incident at the largest provider of video, high-speed Internet, and voice services to US-based residential customers and small to medium-sized businesses.

What happened

Comcast XFINITY customers in California, Washington State, and Arizona experienced poor Internet service all day on Monday, June 1, 2015, and a total loss of service from 6:30 pm to 9:30 pm Pacific Time. During the outage, most customers were unable to learn what was actually happening, and some said they waited 45 minutes to get through to the call center. Reader comments on coverage from outlets such as GeekWire show how frustrated customers were.

In response, Comcast felt compelled to "make it right" by offering a $5.00 credit to its customers. The company explained: "This $5 credit offer is being extended to our customers with active XFINITY Internet service in California, Washington State, and Tucson, Arizona who were affected by the issue with our DNS servers that was first reported on 6/1/15."

Customers had to call the number provided to request the $5.00 credit.

Why it happened

This outage calls into question the resiliency and reliability of modern software and hardware systems. When one component fails, its load shifts elsewhere, creating multiple points of potential overload, and overload appears to be the root cause of this incident: specifically, an overloaded DNS server within the Comcast Internet backbone. The impact of such incidents can be broad. In this case, the problem affected customers across a large swath of the western United States.

The business impact

According to the 10-Q filed on May 4, 2015, Comcast has 22.4 million high-speed Internet customers. To calculate the business impact, let's conservatively estimate that 1 million subscribers (roughly 4.5 percent) were affected by this outage and will request the $5 credit. That would cost Comcast $5 million in subscription credits from this one outage. This figure doesn't include other costs, such as additional call center load and customer churn, or any effect on shareholder value.
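The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the subscriber count and credit amount come from the article, the 1 million figure is the article's own assumption, and the function names are illustrative.

```python
# Figures quoted in the article.
TOTAL_SUBSCRIBERS = 22_400_000   # high-speed Internet customers per the 10-Q
CREDIT_USD = 5.00                # credit offered per affected customer


def affected_share(affected: int, total: int = TOTAL_SUBSCRIBERS) -> float:
    """Fraction of the subscriber base that was affected."""
    return affected / total


def credit_cost(affected: int, credit: float = CREDIT_USD) -> float:
    """Total cost of credits if every affected customer claims one."""
    return affected * credit


# Assumption from the article: 1 million customers claim the credit.
affected = 1_000_000
print(f"Share of subscribers: {affected_share(affected):.1%}")  # 4.5%
print(f"Credit cost: ${credit_cost(affected):,.0f}")            # $5,000,000
```

Changing `affected` shows how sensitive the total is to the claim rate, which is the least certain input here because customers had to call in to request the credit.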

Takeaways

Comcast Cable is the nation's largest provider of video, high-speed Internet, and voice services (cable services) to residential customers under the XFINITY brand. It also provides similar services (and more) to small and medium-sized businesses.

In a corporate press release, Mark Muehl from Comcast addressed "What happened" and "What we are doing about it." Comcast confirmed a hardware failure and a software routing failure, along with the overload of a local DNS server.

When failure occurs at a major Internet provider, the results are often catastrophic. Understanding and testing capacity, disaster recovery, and performance are all essential. In this example, it appears the root cause was an overloaded local DNS server, which caused many customers to experience service interruptions. Are your servers redundant? Where would your capacity fail in this scenario?
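One way to start answering those two questions is a simple headroom check: does peak load still fit if one or more servers drop out? The sketch below is a minimal illustration, not a description of Comcast's setup. The function name, the per-server capacities, and the peak load are all hypothetical numbers chosen for the example.

```python
from typing import Sequence


def survives_failures(capacities: Sequence[float],
                      peak_load: float,
                      failures: int = 1) -> bool:
    """Return True if peak_load still fits after losing the largest servers.

    Worst case is assumed: the `failures` highest-capacity servers go down.
    """
    keep = max(len(capacities) - failures, 0)
    remaining = sorted(capacities)[:keep]   # drop the largest `failures` servers
    return sum(remaining) >= peak_load


# Hypothetical pool: three DNS servers, each good for 50,000 queries/second.
pool = [50_000, 50_000, 50_000]
peak = 120_000  # hypothetical evening peak, queries/second

print(survives_failures(pool, peak, failures=0))  # True: 150,000 >= 120,000
print(survives_failures(pool, peak, failures=1))  # False: 100,000 < 120,000
```

In this made-up example the pool handles the peak when healthy but not after a single failure, which is the kind of gap that capacity and disaster recovery testing is meant to surface before an outage does.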

See "Testing fail: How performance engineering can help dev avoid disaster" for another example and several considerations for preventing these kinds of failures.
