Accidental hybrid clouds: What to do when your migration is stuck

When a cloud migration fails, it's not pretty. Systems are left in limbo: some have been fully migrated to the cloud environment, while others are still running on premises because they couldn't be moved.

This mix of on-premises and cloud workloads results in what I call an "accidental hybrid cloud." While creating an accidental hybrid cloud is relatively easy, leaving it is hard. You can't complete the migration due to technical, financial, or other reasons, yet you can't reverse the migration because you’ve already invested time, money, and engineering resources. What do you do?

Here’s how accidental hybrid clouds happen, how you can tackle them, and how to avoid them if your team is currently planning or under way with a cloud migration.

Why companies choose hybrid cloud

A hybrid cloud is a deployment model where applications, services, and systems are distributed across a combination of public cloud, private cloud, and on-premises infrastructure.

This is a common setup in industries where the demand for scalability, security, and privacy is high, such as financial services, government, and healthcare. According to a recent Everest Group survey, some 72% of respondents described their cloud strategy as hybrid-first.

Combining different types of infrastructure lets you leverage each environment's unique benefits, such as:

Using public clouds for greater scalability, reduced maintenance effort, and lower upfront costs
Using private clouds to isolate workloads, guarantee resource availability, and manage security more effectively
Using on-premises infrastructure to completely control sensitive systems and data, comply with regulations, or use specialized infrastructure that public/private clouds can't provide

How accidental hybrid clouds happen

If your cloud migration plan requires migrating workloads in stages (for example, moving individual services), then your teams are effectively running a hybrid cloud. And if your migration stalls or fails before it's completed, the result is an accidental hybrid cloud. There are several common causes of accidental hybrid clouds.

The cloud platform doesn't support your workloads

Making sure a cloud platform can support your workloads doesn't just mean checking software compatibility; performance and throughput are also critical factors.

For example, when migrating a database, you need to ensure that your cloud provider not only supports your database engine, but that it can meet or exceed the level of performance and throughput that you're getting on premises. This is especially true as you scale to serve more users. Any hit to performance can have a noticeable, cascading effect on your application as a whole.

You run into unexpected complexities

New environments come with steep learning curves. There are concepts to master, tools to download, best practices to adopt, and new interfaces to use. Cloud platforms introduce complex concepts such as availability zones and regions, virtual private clouds, infrastructure as code (IaC), command-line tools, management consoles, observability and monitoring, and more.

Each of these concepts consumes engineering time, adds costs, and increases the number of potential failure points.

You run into budget constraints

Accurately budgeting for cloud infrastructure is an ongoing challenge. More than half of companies of all sizes spend over $1.2 million annually on the public cloud, and the top priority for most of these organizations is optimizing costs, according to the 2021 Flexera State of the Cloud Report.

Even the most meticulously designed environments can leave you with a surprise bill. While there are cost-saving strategies, it's hard to know what the final cost will be until you get the invoice, and an unexpectedly large bill can easily bring a migration to a halt.

You have more systems to migrate than you thought

Maintaining a comprehensive, up-to-date inventory of systems is difficult. Applications are always changing, and in a microservices architecture in particular, services spin up and down constantly. It's easy to overlook a critical service by accident, or migrate it before its dependencies.

You can mitigate this by using a tool to automatically detect and catalog your hosts and services, but even then, you run the risk of forgetting a critical piece of your infrastructure.
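One way to reduce the risk of migrating a service before its dependencies is to derive the migration order from a dependency map. The following is a minimal Python sketch, not a full discovery tool: the service names and the DEPENDENCIES map are hypothetical, and it assumes you already have an inventory (from a catalog tool or otherwise) that records what each service depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical inventory: each service maps to the services it depends on.
DEPENDENCIES = {
    "web-frontend": {"orders-api", "auth-service"},
    "orders-api": {"orders-db", "auth-service"},
    "auth-service": {"users-db"},
    "orders-db": set(),
    "users-db": set(),
}


def find_unlisted(dependencies):
    """Return services that are depended on but missing from the inventory."""
    referenced = set().union(*dependencies.values())
    return referenced - set(dependencies)


def migration_order(dependencies):
    """Return an order in which every service comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print("Missing from inventory:", find_unlisted(DEPENDENCIES) or "none")
    print("Suggested order:", migration_order(DEPENDENCIES))
```

The find_unlisted check flags anything that other services rely on but that nobody recorded, which is the kind of gap that tends to surface midway through a migration.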

How you can mitigate and prevent accidents

To avoid an accidental hybrid cloud, prepare your systems, services, and applications to handle the unique conditions and failure modes present in cloud environments. This is easier said than done, as many of the problems that crop up during migrations are unexpected and difficult to predict. However, there is a method you can use to proactively uncover these issues: chaos engineering.

Chaos engineering is the practice of intentionally experimenting on a system, observing how it responds, and using your observations to improve its resilience. It allows you to proactively test and validate the operational behaviors of your systems and applications so that you can feel more confident migrating them to a new environment. This includes the ability to simulate conditions in the new environment before you even begin your migration.

Here are a few ways that chaos engineering helps overcome migration hurdles.

Identify potential failure modes before migrating

Cloud environments can introduce unpredictable conditions that aren't always present on premises, such as added network latency and performance bottlenecks caused by under-provisioning.

These can cause unexpected failure modes in your applications. With chaos engineering, you can proactively test for, uncover, and address these failure modes so that your applications are cloud-ready before you begin a migration.

For example, consider a common problem: a dropped network connection between two services. This could be due to a faulty network switch, a temporary DNS outage, or one of the services failing. You can mitigate this by adding resilience mechanisms to your services, such as fallbacks, retries, timeouts, and alternate network routes, or by replicating each service. But how do you know whether they'll work in production?

Using chaos engineering, you can run a black-hole experiment to drop network traffic between your services. This simulates an outage, allowing you to verify whether your resilience mechanisms work as expected. If so, you can feel confident that your service can withstand these issues in your new environment. If not, you may need to make more changes before migrating.
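To make this concrete, here is a minimal Python sketch of the kind of retry-and-fallback mechanism a black-hole experiment would exercise. Everything in it is illustrative: call_with_retry, make_flaky_dependency, and the price lookup are hypothetical names, and the "black hole" is simulated in-process rather than injected at the network level as a real experiment would do.

```python
import time


class BlackHoleError(ConnectionError):
    """Raised by the simulated dependency while traffic is being dropped."""


def call_with_retry(operation, fallback, attempts=3, base_delay=0.01,
                    sleep=time.sleep):
    """Call operation; retry on connection failures, then use the fallback."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt < attempts - 1:
                # Exponential backoff between attempts.
                sleep(base_delay * 2 ** attempt)
    return fallback()


def make_flaky_dependency(failures):
    """Return a fake remote call that drops the first `failures` requests."""
    state = {"calls": 0}

    def fetch_price():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise BlackHoleError("traffic dropped")
        return "live-price"

    return fetch_price, state


if __name__ == "__main__":
    # Short outage: the retries recover and return the live value.
    fetch, _ = make_flaky_dependency(failures=2)
    print(call_with_retry(fetch, fallback=lambda: "cached-price"))

    # Long outage: the retries are exhausted and the fallback is used.
    fetch, _ = make_flaky_dependency(failures=10)
    print(call_with_retry(fetch, fallback=lambda: "cached-price"))
```

A real experiment answers the same two questions against live services: does the caller recover from a short outage, and does it degrade gracefully during a long one?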

Prepare for network delays and outages

Modern cloud-native architectures require fast, low-latency network connections. These aren't always available, though, especially once your environment grows to multiple availability zones and regions. Large, distributed deployments can run into network latency issues, and if you haven't built your applications to tolerate this latency, they can cascade into system-wide outages.

Chaos engineering helps here, too. By running latency experiments, you can simulate high-latency, low-throughput conditions across any type of network traffic, and control the amount of added latency down to the millisecond. This can uncover all kinds of unexpected issues: for example, adding just 20ms of latency to a database can decrease the throughput of a web application by over 80%.
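A back-of-the-envelope model shows why a small per-query delay can have such a large effect. The Python sketch below assumes a single worker that handles requests one at a time and issues its database queries sequentially. The figures (40ms of base processing, 10 queries per request) are hypothetical and chosen only for illustration; they are not the measurements behind the number quoted above.

```python
def requests_per_second(base_ms, queries_per_request, added_latency_ms=0.0):
    """Throughput of one worker that handles requests one at a time."""
    total_ms = base_ms + queries_per_request * added_latency_ms
    return 1000.0 / total_ms


# Hypothetical workload: 40ms per request, 10 sequential queries each.
before = requests_per_second(base_ms=40, queries_per_request=10)
after = requests_per_second(base_ms=40, queries_per_request=10,
                            added_latency_ms=20)
drop = 1 - after / before

print(f"{before:.1f} req/s -> {after:.1f} req/s ({drop:.0%} lower)")
# prints: 25.0 req/s -> 4.2 req/s (83% lower)
```

Because the delay is paid once per query, the more round trips a request makes, the more a few milliseconds of added latency hurts.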

Running latency experiments is especially important when migrating an application from a co-located, on-premises environment to a distributed cloud environment. When on premises, the expectation is that latency will be relatively low. Cloud environments may not introduce significant amounts of latency, but testing your applications against even a small amount of added latency can help you avoid a future outage.

Right-size capacity

Right-sizing infrastructure is a tough balancing act. Provision too much and your monthly costs skyrocket; provision too little and you might experience a bottleneck during traffic surges. Even if you leverage auto-scaling, you still need to tweak and adjust your thresholds to ensure you can scale quickly during periods of peak demand. Just a few minutes of downtime during a busy traffic period can cost hundreds of thousands of dollars in lost sales.

With chaos engineering, you can use resource experiments to proactively test your ability to scale by simulating high-traffic events. For example, let's say you've configured your cloud infrastructure to automatically scale when CPU usage exceeds a certain threshold. Normally you'd need to pick a threshold based on what you think might happen, and hope that it will be enough to handle real-world conditions.

Instead, you can run a CPU experiment to simulate heavy load and trigger the threshold yourself. You can then monitor your infrastructure to see how quickly it scales, and tweak your thresholds based on these observations. You can also use this process to optimally size your cloud infrastructure, set monitors and alerts, and feel confident that your systems can scale to meet real-world traffic demands.
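The effect of the threshold can be explored with a toy model before running a real experiment. The Python sketch below is a deliberately simplified simulation, not a model of any particular cloud provider's auto-scaler: simulate_scaling, the load trace, the per-instance capacity, and the three-tick boot time are all hypothetical assumptions.

```python
def simulate_scaling(threshold, load_per_tick, capacity_per_instance=100.0,
                     boot_ticks=3, start_instances=2):
    """Return (final_instance_count, ticks_spent_saturated) for a load trace."""
    instances = start_instances
    pending = []  # ticks remaining until each launched instance is ready
    saturated_ticks = 0
    for load in load_per_tick:
        # Instances whose boot time has elapsed join the pool.
        pending = [t - 1 for t in pending]
        instances += sum(1 for t in pending if t <= 0)
        pending = [t for t in pending if t > 0]

        cpu = load / (instances * capacity_per_instance)
        if cpu >= 1.0:
            saturated_ticks += 1
        # Launch one instance when over the threshold and none is booting.
        if cpu > threshold and not pending:
            pending.append(boot_ticks)
    return instances, saturated_ticks


# Hypothetical traffic surge: load ramps up, then holds at its peak.
LOAD = [100, 140, 180, 220, 260, 300, 340, 380, 400, 400, 400, 400]

if __name__ == "__main__":
    for threshold in (0.6, 0.9):
        instances, saturated = simulate_scaling(threshold, LOAD)
        print(f"threshold {threshold:.0%}: {instances} instances, "
              f"{saturated} saturated ticks")
```

In this toy trace, the lower threshold starts scaling earlier and spends fewer ticks saturated, at the cost of running more instances. A real CPU experiment lets you measure the same trade-off on your actual infrastructure rather than guessing at boot times and capacity.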

Get back on track

Accidentally ending up in a hybrid cloud doesn't mean the end of your cloud journey. Cloud migrations are difficult to navigate, and they rarely happen flawlessly. Chaos engineering can help bring your migration back on track by preparing your systems for the unique and unpredictable challenges present in cloud environments.
