Storm clouds

How do you recover when your cloud is the disaster?

The Amazon Web Services (AWS) S3 outage in February 2017 naturally prompted discussion about the pros and cons of using a single cloud service provider (CSP). But just as significant is the issue of unforeseen risk to a company’s IaaS/PaaS disaster recovery plan. Specifically, what happens when the CSP experiences a widespread disaster?

Part of the allure of cloud services is not having to worry about disaster recovery (DR). Yet, during the S3 outage, having an independent DR plan would have come in handy for Apple, Snapchat, and other companies reportedly affected.

Every organization has the ultimate responsibility of planning for as many likely disaster scenarios as can be imagined. The S3 outage should spur all cloud customers to consider how DR planning for IaaS/PaaS and DRaaS (DR as a service) fits current needs, and reassess the risks presented by their CSP.

Here are three essential questions that can assist with a risk assessment.


1. How do you initiate a DR plan when the incident involves the cloud provider?

The classic conundrum of when to initiate (or “invoke”) a DR plan is further complicated by the cloud. Every DR plan has, at its heart, a definition of a “disaster.” The time and expense of initiating the plan must be weighed against the chances that the production system will be restored in a reasonable amount of time.

This decision is obvious if your data center has become a smoking crater. But what do you do if your CSP is in its fourth hour of downtime for production systems and there’s no end in sight?

Know the transparency of your CSP
The transparency provided by your CSP during an outage is an important factor in DR decision making. DR initiation can be a difficult call, even for services hosted in your own data center, and a potential lack of visibility into the problem can heighten the challenge. If your CSP has a history of high transparency during outages, then your plan can support shorter recovery time objectives (RTOs), because decision making should occur faster.

Compare cost, speed of recovery with business importance of the service
A second factor is the speed and cost of executing the plan, compared with your business impact analysis (BIA) for the service. Not every service is created equal; BIAs, which typically have board-level visibility, will provide a sense of the tolerance for downtime that is appropriate for the service. Cloud technologies often reduce both recovery time and cost, simplifying the decision. For example, a low-cost, high-speed failover for a business-critical service would result in a short fuse for initiation.
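The trade-off described above can be sketched as a simple decision rule. This is only an illustration, not a prescribed method: the `ServiceProfile` fields, the cost figures, and the `should_invoke_dr` function are hypothetical names, and the estimate of remaining outage time is assumed to come from your own team's judgment and whatever the CSP reports.

```python
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    """Hypothetical BIA inputs for one service (all figures are illustrative)."""
    name: str
    downtime_cost_per_hour: float  # business loss per hour of outage
    failover_cost: float           # one-time cost of invoking the DR plan
    failover_hours: float          # time needed to bring up the recovery site
    rto_hours: float               # recovery time objective from the BIA


def should_invoke_dr(profile, hours_down, expected_hours_remaining):
    """Return True when invoking the DR plan looks better than waiting.

    expected_hours_remaining is a best guess at how much longer the provider
    will be down; low provider transparency makes this guess less reliable.
    """
    # Rule 1: waiting would breach the RTO, but failing over now would not.
    waiting_breaches_rto = hours_down + expected_hours_remaining > profile.rto_hours
    failover_meets_rto = hours_down + profile.failover_hours <= profile.rto_hours
    if waiting_breaches_rto and failover_meets_rto:
        return True

    # Rule 2: otherwise compare the cost of waiting with the cost of failing over.
    cost_of_waiting = expected_hours_remaining * profile.downtime_cost_per_hour
    cost_of_failover = (profile.failover_cost
                        + profile.failover_hours * profile.downtime_cost_per_hour)
    return cost_of_failover < cost_of_waiting


checkout = ServiceProfile("checkout", 10000.0, 5000.0, 1.0, 4.0)
reports = ServiceProfile("reports", 100.0, 5000.0, 1.0, 48.0)
print(should_invoke_dr(checkout, hours_down=1.0, expected_hours_remaining=4.0))
print(should_invoke_dr(reports, hours_down=4.0, expected_hours_remaining=4.0))
```

With these made-up numbers, the business-critical service with a cheap, fast failover has the "short fuse," while the low-impact reporting service can afford to wait out the same outage.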

It is best to do this analysis in advance of a disaster, when it can be done with thought and clarity, rather than under the emotional pressure of a crisis. That analysis may lead you toward a more robust architecture for avoiding downtime. For mission-critical services already running in Amazon EC2, AWS recommends launching instances in separate Availability Zones to protect applications from failure of a single location, but this strategy would not have prevented an impact from the S3 outage, which affected the service across an entire region (US-EAST-1). It may help to set up cross-region replication, yet this still leads to the next question.
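For the cross-region replication mentioned above, the following is a minimal sketch that assumes the boto3 S3 client and its `put_bucket_replication` call. The rule ID, bucket names, and role ARN are hypothetical placeholders, and the sketch assumes versioning is already enabled on both the source and destination buckets.

```python
def build_replication_config(role_arn, destination_bucket_arn):
    """Build a ReplicationConfiguration dict that replicates the whole bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to copy objects
        "Rules": [
            {
                "ID": "dr-cross-region-copy",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


def enable_replication(s3_client, source_bucket, role_arn, destination_bucket_arn):
    """Apply the rule to the source bucket using the client passed in."""
    s3_client.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=build_replication_config(
            role_arn, destination_bucket_arn
        ),
    )
```

A call might look like `enable_replication(boto3.client("s3"), "prod-assets", role_arn, "arn:aws:s3:::dr-assets")`, where both bucket names are invented for illustration and the destination bucket lives in a different region. A rule like this applies to objects written after it is enabled, and the application still needs a way to redirect reads to the replica when the primary region is unavailable.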

2. Does it make sense to use the same CSP for IaaS/PaaS and DRaaS?

Amazon’s dominant position in infrastructure as a service (IaaS) means that organizations have put many of their eggs in the AWS basket. Single-sourcing a cloud provider brings price-negotiating advantages and reduces the complexities inherent in a multi-cloud architecture.

But widespread natural disasters, such as a major hurricane or earthquake, could challenge a CSP or DRaaS provider to meet demand. With the S3 outage, we’ve seen that self-inflicted human error can cause widespread outages, too.

Nevertheless, with an increase in hybrid cloud implementations, the lines are becoming blurred.

Joseph George, vice president of product management for Global Recovery Services at Sungard Availability Services, notes that “a key element to successful recovery is ensuring that recovery assets are sufficiently separated from the production environment—both geographically, so that you are not exposed in the event of a regional disaster, and logically, so that the production and recovery networks are sufficiently isolated (in the event of a security compromise).

“There are clear advantages to leveraging a service provider that can manage hybrid dedicated/always-on servers with cloud recovered servers,” he adds, “as well as provide post-recovery production services, provided they maintain the above physical and logical separation between production and recovery assets.”

Whether you choose single-source or multi-source, to assess risk, ask the DRaaS provider these questions before you commit:

  • How do you plan for supporting a significant widespread event that causes an instant spike in demand for DR services?
  • What are the terms and conditions for the customer’s right to terminate? (particularly useful if a test does not go well)
  • How many tests per year do you allow, and of what types (e.g., full, partial, surprise, planned)?
  • What are the terms and conditions for detecting, notifying, and remediating data breaches in the service provider’s cloud? (recovery from ransomware is driving this concern)

3. How do you decide what services are best for DRaaS vs. other DR solutions?

With hybrid IT scenarios, where services are sourced from physical or virtual machines, and public or private clouds, this may become more of a question of load balancing when disaster strikes, rather than devising a separate DR plan.

Since DRaaS has become mainstream, according to Gartner, it is fair to say that DRaaS can be considered for most use cases. The most common scenario is on-premises, mission-critical services, where there is a need for an off-site warm standby environment, which can be more cost-effective than a dedicated environment.

As the recent S3 outage reminds us, IaaS/PaaS services can also be considered for DRaaS, if there is concern about the CSP living up to its availability targets.

The responsibility, ultimately, is yours

Public cloud services offload many of the operational tasks necessary for IT service support, but the service provider’s DR plan isn’t always clear. What is clear is that the responsibility for IT service continuity management, including disaster recovery planning and execution, remains within the IT organization.

Let this latest AWS outage prompt further discussion on your team about any hidden risks in your current plan.