Predictive analytics: How to bring fortune telling to CloudOps

Data analytics is supposed to improve IT operations—both on premises and in the cloud—by examining data as it spins out of your systems. Predictive analytics tools are supposed to spot trends that will lead to hardware or software failures, as well as determine the root causes of problems—and automatically correct those problems with no human intervention. That's the ideal. 

Here's the reality: The fact that some tools leverage tactical analytics for IT operations monitoring is just part of the story. What's often missing is the ability to compare overall system performance data to benchmarks and historical performance data, and to compare it against outside influences such as projected data growth, security attacks, or even the weather.  
 
What operational decisions would you make if you had near-perfect information? What if you could draw conclusions by using your data more effectively? With the integration of technologies such as machine learning and deep learning, we're just now starting to see the potential for this technology. For example, if you use predictive analytics to determine that a 20% spike in usage is likely in the next 30 days, you can automatically provision more servers and expand security. 

It's all about giving IT visibility into cloud operations in order to make predictions about usage—and respond automatically. Here's how it works.

It's all about the data

The trick is to bring predictive analytics to ops. You can do this with a set of tools that can both gather information over time and respond to conclusions reached by the ops tools' analytical engines. 

To better understand this concept, consider a quick case study.
 
Let's say that you're running 100 cloud applications on a public cloud platform that's made up of 50 object storage instances, 100 compute server instances, and 10 databases, all working together to serve the 100 enterprise applications in the cloud. 

There are two types of data to consider here.

Internal ops data

This includes data that relates to both past and current ops metrics: I/O usage, response times, system errors, users' time spent in each application, CPU saturation, and everything else that can be monitored before the data is stored away.  

This means that we have a good understanding of what happened in the past, as well as what's happening right now. The objective is to leverage this data to help predict what will happen in the future, and to provide better, proactive ops for the present.

External ops data 

This involves data related to events that occur outside the systems' scope but that could still affect the cloud-based systems. And this is where things get a bit weird. Here we gather data such as past and current weather, employee sick days, economic data points—anything related to the way we leverage the systems over time. 

Take the weather, for example. Some businesses are up when the weather is good, some are up when the weather is bad. You need to understand those relationships and what they mean to the systems' load. 
 
The same case can be made for gathering economic data, which can be trended with internal ops data to determine correlations. Then it's just a matter of doing predictive analytics on the economic data. You want to correlate the economic data—say, the rise or fall of interest rates—with ops data to determine future system needs.
 
It's important to look at all relevant external data and how that data is related to the internal ops data. This could involve, say, as many as 20 different external data streams that relate to 100 different internal data streams.

Making predictions

As the name "predictive analytics" implies, the idea is to predict the future based on past data, to determine trends that have a high probability of repeating in the future. These trends are determined either through humans who set up the data analytics, including models and results, or by using AI-based analytics, where the AI model rather than humans identifies the trends.  
 
Taking the AI path, you would gather internal systems data for a few months, and then use an AI process to look at that data in the context of some external data points, such as weather, economic trends, etc. Then you can determine the system's ability to make predictions from those relationships.  
 
External data can relate to internal data in three basic ways: The two can be uncorrelated, negatively correlated, or positively correlated. If they are uncorrelated, the external data has no bearing on the internal data. Weather data, for example, may have no influence on future system usage patterns for a facility that manufactures 3-D printers.
 
"Positively correlated" means the external and internal data move in the same direction. For example, as economic indicators go up, usage of a popular retail site could go up. A negative correlation is an inverse relationship: One goes up as the other goes down. This might happen when economic indicators trend down and system usage trends up, as it could for a manufacturer seeing higher demand for DIY products.  
 
However, it's not enough to simply spot the relationship between external and internal data. You also need to project, from historical data, the trends the external data is likely to show in the future. Then it's a matter of taking those projected trends and, using the correlations you've identified, estimating the effect they will have on your systems. 
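
As a rough illustration of that two-step idea, here is a minimal Python sketch using only the standard library. It labels the correlation between an external series and an internal series, projects the external series forward with a straight-line trend, and then maps the projected value onto internal load. The function names, the 0.5 cutoff, and the sample figures are all assumptions made up for this example; real ops data is rarely this linear, and a production system would use a proper forecasting or machine learning library.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (slope, intercept) of the best-fit line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no trend to correlate with
    return cov / sqrt(var_x * var_y)


def classify(r, threshold=0.5):
    """Label a coefficient; the 0.5 cutoff is an arbitrary assumption."""
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    return "uncorrelated"


def predict_internal(external_history, internal_history, periods_ahead):
    """Project the external series forward, then map it onto internal load."""
    # Step 1: straight-line trend of the external series over period index 0, 1, 2, ...
    periods = list(range(len(external_history)))
    trend_slope, trend_intercept = least_squares(periods, external_history)
    future_external = trend_intercept + trend_slope * (periods[-1] + periods_ahead)
    # Step 2: linear relationship between external values and internal load.
    rel_slope, rel_intercept = least_squares(external_history, internal_history)
    return rel_intercept + rel_slope * future_external


# Hypothetical monthly figures: an economic index and requests per second.
economic_index = [100, 102, 104, 106]
requests_per_sec = [1000, 1040, 1080, 1120]

print(classify(pearson(economic_index, requests_per_sec)))
print(predict_internal(economic_index, requests_per_sec, periods_ahead=2))
```

If the label comes back "uncorrelated", the second step is not meaningful, so a real pipeline would skip that external stream rather than forecast from it.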

Pulling it together

This is the biggest step. You need to piece things together to form an ops system that incorporates the collection of internal and external data, and the ability to perform predictive analytics on that data in support of ops. In other words, move from being just ops-oriented to looking at outside influences as well.  
 
Here are some of the major advantages of doing all this.

Highly accurate demand planning

Traditionally, demand forecasting has been largely guesswork, even in the cloud. With predictive analytics, however, you can predict the demand that will be placed upon your systems, both in the cloud and on premises. You can then figure out how much money you'll need to spend over the forecast period, and ensure that you provision enough servers to meet the predicted demand.

This allows enterprises to do things such as buy cloud capacity in advance at a discounted rate. 
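
The arithmetic behind that kind of planning can be sketched in a few lines of Python. The function name, the per-instance capacity, and the 25% safety headroom below are hypothetical values chosen for illustration; only the 20% growth figure echoes the usage-spike example earlier in this article.

```python
import math


def instances_needed(current_peak_load, growth_pct, capacity_per_instance, headroom_pct=25):
    """Estimate how many instances cover a predicted peak plus safety headroom.

    Loads and capacity share one unit (for example, requests per second).
    Percentages are whole numbers, so 20 means 20%.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    predicted_peak = current_peak_load * (100 + growth_pct) / 100
    with_headroom = predicted_peak * (100 + headroom_pct) / 100
    # Round up: a fraction of an instance still means one more instance.
    return math.ceil(with_headroom / capacity_per_instance)


# Hypothetical inputs: 10,000 req/s today, 20% predicted growth,
# 150 req/s per instance, and the default 25% headroom.
print(instances_needed(10_000, 20, 150))
```

Comparing the result with the current instance count gives the number of instances that could be reserved in advance rather than bought on demand.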

Consistent system performance for your end users

Most end users are accustomed to outages or slow system performance due to a lack of resources. That doesn't mean your customers are happy about them, though. By leveraging predictive analytics, you can in many cases spot issues and solve problems before end users even know they exist.  

Better security

One of the advantages of leveraging predictive analytics is the ability to incorporate outside data to better protect your systems. For example, your ops system can gather and trend the latest attack data along with other external data to proactively fix a security hole.  

Getting started 

Unfortunately, no specific tool stacks are available for these tasks. An emerging best practice is to select and use ops tools that can store internal system data in a traditional database. In turn, you can leverage other tools to slice and dice that database, as well as mash up its data with external data that can reside in the same database, or in a different database.
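
As one way to picture that mash-up, the sketch below uses Python's built-in sqlite3 module as a stand-in for a traditional database. The table names, column names, and sample rows are invented for this example; an actual ops tool will have its own schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one internal and one external table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ops_metrics (day TEXT, app TEXT, requests INTEGER);
        CREATE TABLE weather (day TEXT PRIMARY KEY, high_temp_c REAL);
        """
    )
    # Internal ops data: hypothetical per-application request counts.
    conn.executemany(
        "INSERT INTO ops_metrics VALUES (?, ?, ?)",
        [
            ("2024-01-01", "storefront", 200),
            ("2024-01-01", "billing", 100),
            ("2024-01-02", "storefront", 250),
            ("2024-01-02", "billing", 120),
        ],
    )
    # External data: hypothetical daily high temperatures.
    conn.executemany(
        "INSERT INTO weather VALUES (?, ?)",
        [("2024-01-01", 5.0), ("2024-01-02", 12.0)],
    )
    conn.commit()
    return conn


def daily_load_with_weather(conn):
    """Join total daily requests with the weather reading for the same day."""
    query = """
        SELECT m.day, SUM(m.requests) AS total_requests, w.high_temp_c
        FROM ops_metrics AS m
        JOIN weather AS w ON w.day = m.day
        GROUP BY m.day, w.high_temp_c
        ORDER BY m.day
    """
    return conn.execute(query).fetchall()


conn = build_demo_db()
for row in daily_load_with_weather(conn):
    print(row)
```

Because both tables are keyed by day, any SQL-capable analytics tool could run the same join, which is the practical argument for keeping ops data in a standard database.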

Avoid tools that are based on proprietary databases.  
 
From there you have two other problems to solve. First, what predictive analytics tools to leverage, including AI-based tools. Second, how to get the outcomes of the predictions back into the ops systems so they can be acted upon. An example would be auto-provisioning compute and storage servers based on demand predictions.  
 
Doing this without human intervention is extremely important. There should be no latency between demand requested and demand met. Thus, system ops become better automated, and continue to move toward complete automation. 
 
Some ops tools provide APIs that allow outside control, such as from predictive analytics systems. The tradeoff is that you have to maintain the integration yourself versus relying on a third-party ops tool provider.  
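
To show the shape of that feedback loop, here is a minimal Python sketch that turns a predicted instance requirement into a scaling action. Everything in it is hypothetical: `ScalingDecision`, `decide`, and `act` are names invented for this example, and `set_instance_count` stands in for whatever call your ops tool's API actually exposes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScalingDecision:
    action: str  # "scale_out", "scale_in", or "hold"
    delta: int   # number of instances to add or remove


def decide(current: int, required: int, tolerance: int = 0) -> ScalingDecision:
    """Compare the predicted requirement with what is running now."""
    diff = required - current
    if diff > tolerance:
        return ScalingDecision("scale_out", diff)
    if diff < -tolerance:
        return ScalingDecision("scale_in", -diff)
    return ScalingDecision("hold", 0)


def act(decision: ScalingDecision, set_instance_count: Callable[[int], None], current: int) -> int:
    """Apply a decision through a caller-supplied provisioning function.

    set_instance_count is a placeholder for a real ops-tool API call.
    Returns the resulting instance count.
    """
    if decision.action == "scale_out":
        target = current + decision.delta
    elif decision.action == "scale_in":
        target = current - decision.delta
    else:
        return current  # nothing to do, so the API is never called
    set_instance_count(target)
    return target


# Dry run: record the requested targets instead of calling a real API.
requested = []
new_count = act(decide(current=100, required=120), requested.append, current=100)
print(new_count, requested)
```

Passing the provisioning call in as a function keeps the decision logic independent of any one vendor's API, which limits how much of the integration you have to rewrite if you change ops tools.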
 
Everything starts with a plan. Make sure you gather the internal data that's needed, and make sure your tool is storing the data so it can be analyzed along with the external data.  

A common mistake is a plan that fails to capture system-related data over time from the outset. You can't go back in time and fix that oversight, so be sure to include most of the data you think you'll need.

System performance implications 

However, you need to keep an eye on the impact of all that data gathering on system performance. In some cases, ops teams have overdone it. You will need to justify all of the data being gathered. In other words, you need to find a sweet spot between the data needed and the potential performance impact of gathering that data.
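
One simple way to reason about that sweet spot is to estimate how many samples a given collection interval produces. The sketch below is back-of-the-envelope arithmetic in Python; the stream count of 100 and the candidate intervals are assumptions for illustration, and sample count is only a proxy for the real load on your systems.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(streams, interval_seconds):
    """Number of data points collected per day across all streams."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return streams * (SECONDS_PER_DAY // interval_seconds)


# Compare hypothetical collection intervals for 100 internal data streams.
for interval in (10, 60, 300):
    print(f"every {interval}s: {samples_per_day(100, interval):,} samples/day")
```

Running the numbers for each candidate interval makes it easier to justify, stream by stream, how often the data really needs to be gathered.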

Predictive analytics is a continuously improving process. There should be an ongoing review of the performance of your predictive analytics system, evaluating its impact on cloud operations.  

It can take three to six months to establish the trends and correlations needed to make accurate predictions of system utilization and problem avoidance. The upside: Once you get really good at this process, the ops teams can finally relax.