How a systems failure took Bloomberg customers offline

Todd DeCapua, Technology leader, speaker & author, CSC

This article is part of an ongoing series of Performance Retrospectives that assess real-world application performance issues in the news, analyze what might have happened, and offer up best practices that just might help you avoid similar problems.

Shortly after 8:00 AM on Friday, April 17, 2015, Bloomberg's trading terminals went dark after a major systems failure and stayed offline for two hours.

What happened

"We experienced a combination of hardware and software failures in the network, which caused an excessive volume of network traffic," Bloomberg said in a statement about the systems failure. "This led to customer disconnections as a result of the machines being overwhelmed." The company added that "multiple redundant systems" failed to prevent the disruption.

Why it happened

This outage highlights the complexity of modern composite applications and their dependencies on underlying services. When those services become unstable or fail, the impact can spread widely across internal users and customers.

The business impact

The systems failure took down Bloomberg's trading platform, data service, and chat platform. It affected financial markets around the world, exacerbating a spike in volatility in European stocks and causing some debt sales to be postponed. Bloomberg suffered negative publicity. A Financial Times article carried the headline "Bloomberg's global outage paralyses investors" and noted that 315,000 customers rely on the service. Bloomberg operates in a competitive market. Last year, the financial data services firm increased its market share to 32 percent of the $26.5 billion market. Outages can send customers to competitors.

It's difficult to fully measure what the outage cost Bloomberg subscribers in total business losses. However, assuming that each of its 315,000 subscribers works the typical 264 days a year, works an eight-hour day (the day length these figures imply), and pays $20,000 per year for the subscription, the two-hour outage works out to $18.94 per subscriber, or about $5.97 million in total, in paid subscription costs alone.
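The arithmetic behind that estimate can be reproduced with a short sketch. The subscriber count, fee, working days, and outage length come from the paragraph above; the eight-hour working day is an assumption, chosen because it is the day length that reproduces the quoted figures. The function and constant names are illustrative only.

```python
# Inputs quoted in the article.
SUBSCRIBERS = 315_000
ANNUAL_FEE_USD = 20_000
WORK_DAYS_PER_YEAR = 264
OUTAGE_HOURS = 2

# Assumption: the article does not state the length of a working day.
# Eight hours is the value that reproduces its $18.94 figure.
WORK_HOURS_PER_DAY = 8


def outage_cost_per_subscriber(annual_fee, work_days, hours_per_day, outage_hours):
    """Return the share of the annual fee that falls inside the outage window."""
    hourly_cost = annual_fee / (work_days * hours_per_day)
    return hourly_cost * outage_hours


per_subscriber = outage_cost_per_subscriber(
    ANNUAL_FEE_USD, WORK_DAYS_PER_YEAR, WORK_HOURS_PER_DAY, OUTAGE_HOURS
)
total = per_subscriber * SUBSCRIBERS

print(f"Per subscriber: ${per_subscriber:.2f}")      # Per subscriber: $18.94
print(f"All subscribers: ${total / 1e6:.2f} million")  # All subscribers: $5.97 million
```

Changing `WORK_HOURS_PER_DAY` or `OUTAGE_HOURS` shows how sensitive the estimate is to those two assumptions; a longer trading day lowers the per-subscriber figure proportionally.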

This does not include other costs to subscribers, such as disruption on the trading floor, the inability to make informed decisions, and compromised communication between traders who were unable to execute timely transactions. More broadly, "The lack of price visibility was blamed for accelerating a sell-off in European shares, while trading volumes in German government-bond futures contracts fell by around a third," according to a Reuters story.

Takeaways: Test for resiliency

The initial hardware and software failures, and the failure of multiple redundant systems to contain them, point to a weakness in the system's resiliency. Both performance and resiliency should be tested and hardened to prevent this type of outage from recurring. There are several ways you can do this today. Modern testing practices use lifecycle virtualization to quickly and inexpensively recreate production conditions and dependencies in a pre-production or disaster recovery environment. This lets testers run "what if" scenarios easily while observing the resiliency of individual applications and of the end-to-end system.
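As a rough illustration of the "what if" idea, the sketch below replaces a real dependency with a virtual one whose failure rate can be dialed up, then counts how a client behaves under each condition. Everything here is hypothetical: `VirtualService`, `fetch_with_fallback`, `run_scenario`, and the quote payload are invented for the example and do not represent Bloomberg's systems or any particular service virtualization product.

```python
import random


class VirtualService:
    """Hypothetical stand-in for a real dependency, with a tunable failure rate."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so each scenario is repeatable
        self.calls = 0

    def get_quote(self, symbol):
        self.calls += 1
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("simulated dependency failure")
        return {"symbol": symbol, "price": 100.0}


def fetch_with_fallback(service, symbol, cache, max_attempts=3):
    """Try the dependency a bounded number of times, then fall back to the cache."""
    for _ in range(max_attempts):
        try:
            quote = service.get_quote(symbol)
            cache[symbol] = quote
            return quote, "live"
        except ConnectionError:
            continue
    if symbol in cache:
        return cache[symbol], "stale"
    return None, "unavailable"


def run_scenario(failure_rate, requests=1000):
    """Run one 'what if' scenario and summarize what the client experienced."""
    service = VirtualService(failure_rate, seed=42)
    cache = {}
    outcomes = {"live": 0, "stale": 0, "unavailable": 0}
    for _ in range(requests):
        _, status = fetch_with_fallback(service, "XYZ", cache)
        outcomes[status] += 1
    outcomes["calls_to_dependency"] = service.calls
    return outcomes


if __name__ == "__main__":
    for rate in (0.0, 0.5, 1.0):
        print(f"failure_rate={rate}: {run_scenario(rate)}")
```

Two things are worth watching in the output. The split between `live`, `stale`, and `unavailable` shows whether the client degrades gracefully, and `calls_to_dependency` shows how much extra traffic the retries generate: with a fully failed dependency, three attempts per request triple the load on it. A real test would point the same scenarios at virtualized versions of the actual services and add latency and partial-failure conditions.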

*Image source: Gforsythe (Own work) [CC0], via Wikimedia Commons
