Cloud complexity: Why it's happening—and how to deal with it

With cloud computing on track to become the mother of all shifts, especially in regard to IT's approaches to development and operations, we yet again face the issue of conversion mistakes, this time a hundredfold greater than in previous moves to distributed computing and the web.

While it's certainly an obvious problem, the solutions are not so obvious. The devil is truly in the details. Cloud complexity is the result of rapid acceleration of cloud migration and net-new development, without forethought about the complexity this brings to operations.

Here's how to approach cloud complexity right now, some categories of tooling that are available, and approaches to consider going forward that don't add to the complexity issue.

Understanding the current complex state

The problem is that IT does not eliminate endpoints (technology platforms such as servers, databases, etc.) as it moves workloads into the cloud. That is, when applications move to the cloud, the associated legacy servers, databases, and other platforms continue to exist in some form.

They're used less, but they must still be managed. That adds to the complexity in IT, with public, private, and multi-cloud deployments layered on top of legacy systems.

Another core issue is heterogeneity. In the past, enterprise architects wandered the hallways looking for system development projects that did not comply with platform and development standards.

"Compliance" was the single most talked about topic during enterprise architecture conferences in days gone by. In some respects, those talks were successful, considering that today we have IBM shops, Microsoft shops, open-source shops, and so on.

Common patterns were followed. It was fairly easy to retrofit common security, governance, and management/monitoring layers—at least, those that were considered best practices at the time.

Today, things are different. The cloud makes it extremely easy to pick whatever type, brand, and function of a service you feel is best of breed. Dev teams are encouraged to work fast with aggressive sprints, decoupled from other dev teams and from any centralized IT compliance.

Enterprise architects have been replaced by cloud architects, whose theme is to use whatever technology they feel is the best fit. Multi-cloud is king, and microarchitectures will define enterprise architecture through event-driven decoupled sprints, and not long-term centralized planning.

How to manage it all

It's time to learn how to leverage technology and approaches as force multipliers that either arrest the negative effects of cloud complexity or reduce its impact on operations budgets, outages, and breaches. The two areas to address are operations and security.

Operations must work smarter

CloudOps is seeing the complexity issue firsthand. In most companies, systems were built around different technology stacks, approaches, cloud platforms, and brands. Everything got tossed over the wall to CloudOps, and that team is expected to successfully run the systems for years.

While you would have thought that DevOps (which translates to the tight coupling of the development and operations teams) would have solved this problem, the reality is that dev teams may still not know where the ops teams sit in the building.

Moreover, many enterprises have centralized ops to reduce costs. Embedding small ops teams with dev means spending more on ops resources, and complexity compounds the cost: more things to operate means more money needed to run cloud-based applications and infrastructure.

Many in enterprise IT find that problematic. In some respects, IT pays lip service to DevOps, and small DevOps teams account for only a small portion of how applications and data stores are built, deployed, and operated.

The core of the solution is to work smarter and simplify the views of the systems. You could require that all ops teams leverage the native ops consoles, such as those provided by AWS, Microsoft, and Google.

However, if you add all your cloud-native databases to that list—along with performance monitoring, IoT, serverless, and the machine-learning systems that need to be managed—you quickly have two dozen or more systems that need to be managed using native tooling. All of these are different and require specific skills.
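To see how quickly the count climbs, here is a minimal sketch that enumerates the native tooling surfaces an ops team would face. The three providers are the ones named above; the eight service categories are illustrative assumptions, not a real inventory.

```python
from itertools import product

# Providers named in the article; the category list below is an
# illustrative assumption, not an inventory of any real deployment.
PROVIDERS = ["AWS", "Microsoft", "Google"]
CATEGORIES = [
    "ops console",
    "relational database",
    "analytics database",
    "key-value database",
    "performance monitoring",
    "IoT",
    "serverless",
    "machine learning",
]


def native_tool_surfaces(providers, categories):
    """Return every (provider, category) pair that needs its own native tool."""
    return list(product(providers, categories))


surfaces = native_tool_surfaces(PROVIDERS, CATEGORIES)
print(len(surfaces))  # 3 providers x 8 categories = 24
```

Each pair in that list stands for a separate console, skill set, and upgrade cycle, which is why the count matters more than any single tool.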

Simplify your management strategy

IT often issues the battle cries of "Abstraction!" and "Automation!" as a means to reduce the complexity of heterogeneous technology stacks. But it's really more about simplifying how things are managed, rather than specific approaches or tools.

That said, the tools in the marketplace that should help you simplify operations include cloud management platforms (CMPs), cloud service brokers (CSBs), resource governance, service governance, cost governance, multi-platform monitoring, and multi-platform management.

Some would consider this an exercise in just extending the heterogeneous on-premises management tools you've used for years. But cloud-based systems have special requirements, including new cloud-native interfaces, purpose-built databases, and the use of identity for security and governance.

Most relevant is the fact that you're multiplying all of those systems by three major cloud brands, all with their own proprietary takes on how cloud services should exist.

Use common security layers

Security is perhaps the scariest aspect of cloud complexity. We know that complexity breeds vulnerabilities, as platforms, databases, storage, and other cloud services miss updates and fixes.

The real threat is how many resources you need to employ to remove enough risk to be acceptable. In many cases, this is beyond what enterprises can afford.

When you had five full-time security operations (SecOps) staffers working with 300 virtual on-premises servers using three types of platforms (such as LAMP and Windows NT), and 20 databases using three brands of databases (such as Oracle and DB2), you could fairly easily keep up with those on-premises SecOps needs.

Now the same five SecOps staffers are tasked with keeping track of 200 to 400 virtual servers that expand and contract automatically (that is, they are elastic) across three different brands of public cloud providers, and 40 databases that are purpose-built for machine learning, analytics, IoT, and high-speed transactions.

All of this must be managed with no common security services, such as identity and access management, that span all public cloud ecosystems.

Do you think susceptibility to breach has increased? Indeed it has, at least twentyfold. Both scenarios support roughly the same number of business processes, and have the same value to the business.

Of course, you can always increase the number of resources, perhaps moving to 10 full-time SecOps staffers. But for most companies out there, such a drastic increase in the SecOps budget is not an option. After all, didn't cloud promise cost savings?

The only path to success here is the use of common security layers. These are technologies that work across cloud brands and across the cloud services within those brands, such as F5, Ping Identity, and Splunk.

The tradeoff is that they are often not as effective as cloud-native tools because they must be all things to all clouds and all cloud services. Taking a least-common-denominator approach could mean having visibility and management capabilities into only subsets of the native platforms.
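The least-common-denominator effect can be sketched as a set intersection. The cloud names and capability names below are hypothetical placeholders; real platforms expose different and far longer feature lists.

```python
# Hypothetical capabilities exposed by each cloud's native security tooling.
NATIVE_CAPABILITIES = {
    "cloud_a": {"audit_logs", "role_policies", "key_rotation", "anomaly_alerts"},
    "cloud_b": {"audit_logs", "role_policies", "key_rotation"},
    "cloud_c": {"audit_logs", "role_policies", "network_flow_logs"},
}


def common_capabilities(capabilities_by_cloud):
    """Capabilities a cross-cloud layer can offer uniformly on every cloud."""
    sets = list(capabilities_by_cloud.values())
    return set.intersection(*sets) if sets else set()


def uncovered(capabilities_by_cloud):
    """Per cloud, the native capabilities the common layer leaves out."""
    common = common_capabilities(capabilities_by_cloud)
    return {cloud: caps - common for cloud, caps in capabilities_by_cloud.items()}


shared = common_capabilities(NATIVE_CAPABILITIES)
gaps = uncovered(NATIVE_CAPABILITIES)
print(sorted(shared))
print({cloud: sorted(missing) for cloud, missing in gaps.items()})
```

Anything that lands in the gaps is visible only through the native tools, which is the subset problem described above.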

As management and monitoring tools progress using new approaches such as AIOps or operational tools that leverage machine learning, most of these limitations should go away.

Close the complexity-management talent gap

The lack of talent available to deal with cloud complexity is another key contributor to the problem. There are a few factors to think about here, including the fact that few people have experience managing architectures as complex as the ones we have now.

Those who want to take this on as a career path need to do two things:

Get better at spinning plates of very different technologies, on very different brands of public clouds. This issue won't go away anytime soon. You'll need bright people around who are operationally focused to keep the systems running, and to keep outages and breaches at a minimum.

Get better at removing cloud complexity using automation and abstraction. Spinning plates will only work until you have too many plates to spin, and then they all come crashing down. The same goes for cloud complexity. There is a tipping point where the risk goes up at a much steeper angle, and the ROI for cloud computing becomes neutral or negative.

You need architects and CloudOps teams that can think proactively about how to leverage technology as a weapon to go to war with cloud complexity. This means using some very pragmatic approaches, and searching for existing or emerging technology to work smarter.

The goal is to remove most of the complexity for the ops teams and enhance your ability to take new and migrated cloud workloads into production.

This is doable; you've been here before

The way out of the cloud complexity conundrum is both simple and difficult. It's simple in that we know what to do. The path to eliminate native system complexity from databases, platforms, operating systems, containers, HPC, and specialized systems means leveraging tools to manage and monitor those systems.

You need to use common abstractions that are consistent from system to system, database to database, platform to platform.
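As a rough illustration of what a common abstraction can look like, the sketch below puts one interface in front of two very different kinds of resources. Every class, field, threshold, and resource name here is hypothetical, and the dictionaries stand in for metrics that real adapters would fetch from each platform's native API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthStatus:
    """One status shape shared by every kind of resource."""
    resource_id: str
    healthy: bool
    detail: str


class ResourceMonitor(ABC):
    """Common abstraction: ops code depends only on this interface."""

    @abstractmethod
    def check(self, resource_id: str) -> HealthStatus:
        ...


class DatabaseMonitor(ResourceMonitor):
    """Hypothetical adapter; the dict stands in for live replication metrics."""

    def __init__(self, lag_seconds: dict[str, float], max_lag: float = 5.0):
        self._lag_seconds = lag_seconds
        self._max_lag = max_lag

    def check(self, resource_id: str) -> HealthStatus:
        lag = self._lag_seconds[resource_id]
        return HealthStatus(resource_id, lag <= self._max_lag, f"replication lag {lag}s")


class ServerMonitor(ResourceMonitor):
    """Hypothetical adapter; the dict stands in for live CPU metrics."""

    def __init__(self, cpu_load: dict[str, float], max_load: float = 0.9):
        self._cpu_load = cpu_load
        self._max_load = max_load

    def check(self, resource_id: str) -> HealthStatus:
        load = self._cpu_load[resource_id]
        return HealthStatus(resource_id, load <= self._max_load, f"CPU load {load:.0%}")


def find_unhealthy(targets: list[tuple[ResourceMonitor, str]]) -> list[HealthStatus]:
    """Check every target through the shared interface; keep the failures."""
    statuses = [monitor.check(resource_id) for monitor, resource_id in targets]
    return [status for status in statuses if not status.healthy]


db = DatabaseMonitor({"orders-db": 12.0, "users-db": 0.4})
servers = ServerMonitor({"web-1": 0.35, "web-2": 0.97})
targets = [(db, "orders-db"), (db, "users-db"), (servers, "web-1"), (servers, "web-2")]
problems = find_unhealthy(targets)
for status in problems:
    print(status.resource_id, status.detail)
```

The point of the design is that `find_unhealthy` never learns which platform it is talking to, so adding another database or cloud means writing one more adapter rather than retraining the ops team on another console.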

The hard part (some say impossible) is to actually implement a series of solutions that systemically remove complexity from multi-cloud and even single-cloud deployments.

In many cases, you'll need to commit 10% of the IT budget to make it happen. That means something else won't get done, such as that ERP upgrade that's already years late. Or you'll simply have to spend more money, and that requires an explanation to investors, leadership, and other stakeholders.

The good news is that this is a solvable problem. You can buckle down and fix the issues after complexity becomes a problem, or you can be more proactive and avoid the complexity in the first place. If yours is like most organizations, however, you'll need to fix the problems after the fact. Traditionally, that's the approach we take in IT.
