
4 ways to use AI for better cloud ops efficiency

From on-premises data centers to cloud and converged architecture, IT operations has evolved rapidly over the last decade. Companies such as Amazon, Microsoft, and Google have removed much of the heavy lifting involved in building data centers, installing servers, and managing networks, storage, and more.

This has led to a more widespread embrace of the DevOps philosophy, which saves time, improves performance, and bridges the gap between engineers and IT operations. However, DevOps hasn’t fully delivered what was expected of it, since engineers still have to respond to many of the alerts about issues and events in their infrastructure.

But what if we could let humans solve the newer, complex problems while we let machines resolve known, repetitive, and identifiable problems? Enter a new philosophy: AI Ops.

The increasing adoption of cloud (soon, 80% of all IT budgets will be committed to cloud solutions) and the emergence of artificial intelligence (AI) and machine-learning (ML) technologies are allowing companies to use intelligent software automation to make decisions on known problems, predict issues, and provide diagnostic information to reduce the operational overhead for engineers. Over the next 18 to 24 months, AI Ops will make the practice of waking up engineers in the middle of the night for downtime and other issues part of a bygone era.

Below are the top four areas of cloud IT operations where AI Ops promises to deliver the biggest improvements.


Managing cloud costs

The No. 1 cloud priority for enterprises today is optimizing costs. Cloud costs are causing massive headaches for finance, product, and engineering teams because of dynamic provisioning, auto-scaling, and a lack of garbage collection for unused cloud resources.

With AI Ops, machine intelligence and AI technologies can detect cost spikes, provide deep visibility into who used what, and help companies deploy intelligent automation to address these issues. For example, an enterprise can automate the purchasing process of Amazon EC2 Reserved Instances through simple code, with the help of AWS Lambda. Another cost-saving initiative with AI Ops is automating the power cycle of development instances, by turning them off during the weekend and turning them back on at the start of the week.
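The weekend power cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a production scheduler: the working-week boundaries (`WEEK_START_HOUR`, `WEEK_END_HOUR`), the function names, and the instance IDs are all assumptions made for the example. The code only decides what should happen; the actual calls to a cloud API are left out (on AWS, for instance, the resulting IDs could be passed to the boto3 EC2 client's `stop_instances` and `start_instances`, typically from a scheduled AWS Lambda function).

```python
from datetime import datetime

# Hypothetical policy: development instances run from Monday 07:00
# until Friday 19:00 and stay off for the rest of the weekend.
WEEK_START_HOUR = 7
WEEK_END_HOUR = 19


def desired_state(now: datetime) -> str:
    """Return 'running' or 'stopped' for a development instance at `now`."""
    weekday = now.weekday()  # Monday == 0, Sunday == 6
    if weekday >= 5:  # Saturday or Sunday
        return "stopped"
    if weekday == 0 and now.hour < WEEK_START_HOUR:  # early Monday morning
        return "stopped"
    if weekday == 4 and now.hour >= WEEK_END_HOUR:  # Friday evening
        return "stopped"
    return "running"


def plan_actions(instances: dict, now: datetime) -> dict:
    """Map instance ID to 'start' or 'stop' where the current state differs from the policy.

    `instances` maps an instance ID to its current state ('running' or 'stopped').
    """
    target = desired_state(now)
    actions = {}
    for instance_id, state in instances.items():
        if state != target:
            actions[instance_id] = "stop" if target == "stopped" else "start"
    return actions
```

Run on a schedule (say, hourly), `plan_actions` returns an empty dictionary most of the time and a short list of instances to stop on Friday evening or start on Monday morning. Keeping the decision separate from the API calls makes the policy easy to test and to change.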

Ensuring cloud security compliance

How can companies ensure that every cloud resource is provisioned with the appropriate security compliance configuration, all the while meeting regulatory requirements such as PCI-DSS, ISO 27001, and even HIPAA?

AI Ops helps companies stay compliant and reduce business risk by using the real-time event and configuration data that cloud providers expose. It can issue near-instant alerts (within milliseconds) to inform whoever provisioned the resource, and even take actions such as shutting down machines when a compliance requirement isn’t met.
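One way to picture this is as a set of rules evaluated against each configuration event as it arrives. The sketch below is illustrative only: the rule names, the shape of the `resource` dictionary, and the two action levels ("alert" and "shutdown") are assumptions for the example and do not correspond to any particular provider's event format or to the actual text of PCI-DSS, ISO 27001, or HIPAA.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the resource is compliant
    action: str                    # "alert" or "shutdown" on violation


# Hypothetical rules; a real rule set would be derived from the applicable standard.
RULES = [
    Rule("storage-encrypted", lambda r: r.get("encrypted", False), "shutdown"),
    Rule("no-open-ssh", lambda r: "0.0.0.0/0" not in r.get("ssh_sources", []), "shutdown"),
    Rule("owner-tag-present", lambda r: "owner" in r.get("tags", {}), "alert"),
]


def evaluate(resource: dict, rules=RULES) -> list:
    """Return a (rule name, action) pair for every rule the resource violates."""
    return [(rule.name, rule.action) for rule in rules if not rule.check(resource)]


def strongest_action(violations: list) -> str:
    """Pick the most severe action requested by any violated rule."""
    if not violations:
        return "none"
    if any(action == "shutdown" for _, action in violations):
        return "shutdown"
    return "alert"
```

A resource that opens SSH to the whole internet would come back with `strongest_action` of "shutdown", while one that merely lacks an owner tag would only trigger an alert to the provisioner.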

Reducing alert fatigue

Addressing critical problems is a complex process that involves many departments: the traditional network operations center, IT support team, and engineers. As the number of problems increases, so does the difficulty in managing and resolving them. Fortunately, many alerts are simply noise, caused by known events or identifiable patterns.

With AI Ops, AI and ML can filter out unnecessary alerts, suppress duplicates, and automate actions for known events and identifiable patterns, making alert management more concise. How? With the help of anomaly detection powered by machine intelligence and embedded business logic.
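As a rough illustration of the two ideas above, the following sketch suppresses duplicate alerts within a time window and flags a metric value as anomalous using a simple z-score. The z-score is a deliberately basic stand-in for the machine-learning models an AI Ops product would use, and the alert fields (`source`, `check`, `ts`), the five-minute window, and the threshold of three standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def suppress_duplicates(alerts, window_seconds=300):
    """Keep an alert only if no alert with the same (source, check) key
    was kept within the previous `window_seconds`."""
    last_kept = {}  # key -> timestamp of the most recent alert that was kept
    kept = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["source"], alert["check"])
        previous = last_kept.get(key)
        if previous is None or alert["ts"] - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert["ts"]
    return kept


def is_anomalous(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history` (a simple z-score test)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold
```

Only alerts that survive `suppress_duplicates` and whose metric passes `is_anomalous` would need to reach a human; the rest can be logged or handled automatically.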

Intelligently automating operations

The engineers responsible for managing production operations, from the IT Ops era through the DevOps era, have been frustrated with static tooling. Machine intelligence and deep learning make dynamic tooling possible, enabling the creation of automated remediation actions and alert diagnostics, so that teams can focus on using code as a mechanism for resolving problems. Simple automated remedies such as these can save hours after every deployment by handling failures gracefully.
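Using code as a mechanism for resolving problems can be as simple as a lookup table that maps known alert types to remediation functions and escalates everything else to a person. The sketch below assumes hypothetical alert types (`disk_full`, `service_down`) and handlers that only return a description of what they would do; real handlers would call the relevant cluster or cloud APIs.

```python
from typing import Callable

# Registry of known problems and the code that resolves them.
REMEDIATIONS: dict = {}


def remediation(alert_type: str) -> Callable:
    """Decorator that registers a handler for one known alert type."""
    def register(func: Callable) -> Callable:
        REMEDIATIONS[alert_type] = func
        return func
    return register


@remediation("disk_full")
def clean_temp_files(alert: dict) -> str:
    # Placeholder action: a real handler would free disk space on the host.
    return f"cleaned temp files on {alert['host']}"


@remediation("service_down")
def restart_service(alert: dict) -> str:
    # Placeholder action: a real handler would restart the named service.
    return f"restarted {alert['service']} on {alert['host']}"


def handle(alert: dict) -> dict:
    """Run the registered remediation, or escalate to a human if there is
    no handler for this alert type or the handler fails."""
    handler = REMEDIATIONS.get(alert["type"])
    if handler is None:
        return {"status": "escalated", "detail": "no known remediation"}
    try:
        return {"status": "remediated", "detail": handler(alert)}
    except Exception as exc:
        return {"status": "escalated", "detail": f"remediation failed: {exc}"}
```

The fallback path matters as much as the handlers: anything new or unexpected still reaches an engineer, which matches the division of labor the article argues for, with machines taking the known problems and humans the novel ones.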

In addition, AI and ML in an AI Ops practice can intelligently automate other areas of operations, including deployment (with cluster management and auto-healing tooling), application performance management (showing not just what is happening but why it is happening and what caused it), log management (real-time streaming of log events and automatic detection of relevant anomalies based on the application stack), and incident management (suppressing noise from different alerting systems and providing diagnostics so engineers can get to the root cause faster).

Get out front

With the constant evolution of cloud and IT offerings, new challenges are introduced every day. Companies must use AI Ops, along with technologies such as artificial intelligence and machine learning, to transform cloud operations and ease infrastructure management.

Is your team tapping into AI or machine learning? Share what you have learned in the comments section below.
