10 ways machine learning can optimize DevOps

Successful DevOps practices generate large amounts of data, so it is unsurprising that this data can be used for such things as streamlining workflows and orchestration, monitoring in production, and diagnosis of faults or other issues.

The problem: Too much data. Server logs alone can take up several hundred megabytes a week. If the group is using a monitoring tool, it can generate additional megabytes or even gigabytes of data in a short period of time.

And too much data has a predictable result: Teams don’t look directly at the data, but rather set thresholds whereby a particular level of activity is believed to be problematic. In other words, even mature DevOps teams are looking for exceptions, rather than diving deeply into the data they’ve collected.

That shouldn't be a surprise. Even with modern analytic tools, you have to know what you're looking for before you can start to make sense of it. But interpreting large data sets, including those generated by DevOps, is rather like Potter Stewart’s description of pornography: I'll know it when I see it.

Nor is it surprising that much of the data created in DevOps processes surrounds application deployment. Monitoring an application produces server logs, error messages, and transaction traces, as much and as frequently as you care to collect.

The only reasonable way to analyze this data and come to conclusions in real time is with the help of machine learning. So what can machine learning applications do to help with these practices? A lot of things, as it turns out.

Whether you buy a commercial application or build it yourself, here are 10 ways to apply machine learning to improve your DevOps practices.

World Quality Report 2017-18: The state of QA and testing

1. Stop looking at thresholds and start analyzing your data

Because there is so much data, DevOps teams rarely view and analyze the entire data set. Instead, they set thresholds, such as "X measures above a defined watermark," as a condition for action.

In effect they are throwing out the vast majority of data they collect and focusing on outliers. The problem with that approach is that the outliers may alert, but they don't inform.

Machine learning applications can do more. You can train them on all of the data, and once in production those applications can look at everything that's coming in to reach a conclusion. This will help with predictive analytics.
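To make the contrast concrete, here is a minimal Python sketch. It is not a full machine learning pipeline: a simple mean-and-standard-deviation baseline stands in for a trained model, and the names `fit_baseline`, `is_unusual`, and `STATIC_THRESHOLD`, along with the sample CPU readings, are invented for illustration.

```python
from statistics import mean, stdev

STATIC_THRESHOLD = 90.0  # hypothetical fixed alert level (CPU percent)


def fit_baseline(history):
    """Learn a simple baseline (mean, standard deviation) from all past readings."""
    return mean(history), stdev(history)


def is_unusual(value, baseline, k=3.0):
    """Flag a reading more than k standard deviations away from the learned mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > k


# Hypothetical history: this server normally sits near 20 percent CPU.
history = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0, 21.0]
baseline = fit_baseline(history)

reading = 45.0
print(reading > STATIC_THRESHOLD)     # False: the static rule stays silent
print(is_unusual(reading, baseline))  # True: far outside what the data says is normal
```

A jump from 20 percent to 45 percent never trips the watermark, yet it is exactly the kind of change a model trained on the full data set would notice.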

2. Look for trends rather than faults 

This follows from above. If you train on all of the data, your machine learning system can report more than problems that have already occurred. By looking at data trends below threshold levels, DevOps professionals can identify patterns over time that may be significant.
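One simple way to picture a below-threshold trend is to fit a straight line to recent readings and ask when it would cross the alert level. The sketch below uses ordinary least squares as a stand-in for whatever model you actually train; the function names and the disk-usage numbers are assumptions made for the example.

```python
def fit_line(values):
    """Least-squares slope and intercept for evenly spaced readings (x = 0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two readings to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def steps_until_threshold(values, threshold):
    """Estimate how many more intervals until the fitted line reaches the threshold.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


# Hypothetical daily disk usage in percent, all well below a 90 percent alert level.
disk_usage = [50.0, 52.0, 54.0, 56.0, 58.0]
print(steps_until_threshold(disk_usage, 90.0))  # 16.0 days at the current rate
```

No single reading here would raise an alert, but the trend tells you roughly how long you have before one will.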

3. Analyze and correlate across data sets when appropriate

Much of your data is time-series in nature, and it's easy to look at a single variable over time. But many trends come from the interactions of multiple measures. For example, response time may degrade only when many transactions are doing the same thing at the same time.

These trends are virtually impossible to spot with the naked eye, or with traditional analytics. But properly trained machine learning applications are likely to tease out correlations and trends that you will never find using traditional methods.
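The response-time example can be sketched in a few lines of Python. The data below is made up: response time rises only when high concurrency and identical operations occur together. A plain Pearson correlation (used here in place of a trained model) shows each measure alone as a weak signal, while the combined measure lines up with the slowdown.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical samples: 1 means the condition was present, 0 means absent.
high_concurrency = [0, 0, 1, 1]
same_operation = [0, 1, 0, 1]
response_ms = [100, 100, 100, 500]

# Interaction feature: both conditions present at the same time.
both = [c * s for c, s in zip(high_concurrency, same_operation)]

print(round(pearson(high_concurrency, response_ms), 3))  # 0.577
print(round(pearson(same_operation, response_ms), 3))    # 0.577
print(round(pearson(both, response_ms), 3))              # 1.0
```

In practice you would not know in advance which combination matters, which is where a learning system that searches across many inputs earns its keep.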

4. Look at your development metrics in a new way

In all likelihood, you are collecting data on your delivery velocity, bug find/fix metrics, plus data generated from your continuous integration system. You might be curious, for example, to see if the number of integrations correlates with bugs found. The possibilities for looking at any combination of data are tremendous.

5. Provide a historical context for data

One of the biggest problems with DevOps is that we don’t seem to learn from our mistakes. Even if we have an ongoing feedback strategy, we likely don't have much more than a wiki that describes problems we've encountered, and what we did to investigate them. All too often, the answer is that we rebooted our servers or restarted the application.

Machine learning systems can dissect the data to show clearly what happened over the last day, week, month, or year. They can look at seasonal trends or daily trends, and give us a picture of our application at any given moment.
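A daily profile is the simplest form of this kind of historical context. The following sketch averages a metric by hour of day; the function name `hourly_profile`, the timestamp format, and the sample response times are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_profile(samples):
    """Average a metric by hour of day.

    samples: iterable of (ISO 8601 timestamp string, numeric value) pairs.
    """
    buckets = defaultdict(list)
    for stamp, value in samples:
        buckets[datetime.fromisoformat(stamp).hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical response times (ms) collected on two consecutive days.
samples = [
    ("2017-10-02T09:15:00", 120.0),
    ("2017-10-02T14:30:00", 310.0),
    ("2017-10-03T09:45:00", 140.0),
    ("2017-10-03T14:05:00", 290.0),
]
print(hourly_profile(samples))  # {9: 130.0, 14: 300.0}
```

Once you know what a normal 2 p.m. looks like, a reading can be judged against its own history rather than against a single fixed number.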

6. Get to the root cause

Root cause is the Holy Grail of application quality, letting teams fix an availability or performance issue once and for all. Often teams don't fully investigate failures and other issues because they are focused on getting back online. If a reboot gets them back up, then the root cause gets lost.

7. Correlate across different monitoring tools

If you're beyond the beginner's level in DevOps, you are likely using multiple tools to view and act upon data. Each monitors the application's health and performance in different ways. 

What you lack, however, is the ability to find relationships between this wealth of data from different tools. Learning systems can take all of these disparate data streams as inputs, and produce a more robust picture of application health than is available today.
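Before any learning system can relate data from different tools, the streams have to be lined up in time. Here is a rough sketch of that step, assuming each tool can export (epoch seconds, value) pairs; the tool names and readings are hypothetical.

```python
def align_streams(streams, bucket_seconds=60):
    """Join readings from several tools into one row per time bucket.

    streams: {tool_name: [(epoch_seconds, value), ...]}
    Only buckets in which every tool reported are kept. If a tool reports
    more than once in a bucket, the last reading wins.
    """
    rows = {}
    for tool, readings in streams.items():
        for ts, value in readings:
            bucket = ts - ts % bucket_seconds
            rows.setdefault(bucket, {})[tool] = value
    complete = {b: r for b, r in rows.items() if len(r) == len(streams)}
    return dict(sorted(complete.items()))


# Hypothetical exports from three different monitoring tools.
streams = {
    "apm_latency_ms": [(0, 120.0), (60, 480.0)],
    "infra_cpu_pct": [(5, 35.0), (65, 91.0)],
    "log_errors": [(10, 0), (70, 14), (130, 2)],
}
for bucket, row in align_streams(streams).items():
    print(bucket, row)
```

Each resulting row is one combined observation of the application, which is the shape of input a learning system needs to find relationships across tools.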

8. Determine the efficiency of orchestration

If you have metrics surrounding your orchestration process and tools, you can employ machine learning to determine how efficiently the team is performing. Inefficiencies may be the result of team practices or of poor orchestration, so looking at these characteristics can help with both tools and processes.

9. Predict a fault at a defined point of time

This relates to analyzing trends. If you know that your monitoring systems produce certain readings at the time of a failure, a machine learning application can look for those patterns as a prelude to a specific type of fault. If you understand the root cause of that fault, you can take steps to keep it from happening.
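As a toy illustration of matching readings against a known failure pattern, the sketch below uses a nearest-centroid classifier, one of the simplest learning methods and chosen here only for brevity. The two features (memory percent and queue depth), the labels, and the training rows are invented; real features would normally be scaled before comparing distances.

```python
from math import dist  # available in Python 3.8 and later


def centroid(rows):
    """Column-wise average of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train(pre_failure_rows, normal_rows):
    """Learn one representative point for each class of readings."""
    return {
        "fault_likely": centroid(pre_failure_rows),
        "normal": centroid(normal_rows),
    }


def predict(model, row):
    """Return the label whose centroid is closest to the new reading."""
    return min(model, key=lambda label: dist(model[label], row))


# Hypothetical readings: (memory percent, request queue depth).
pre_failure = [[88.0, 40.0], [92.0, 55.0], [90.0, 48.0]]
normal = [[40.0, 5.0], [45.0, 8.0], [50.0, 6.0]]
model = train(pre_failure, normal)

print(predict(model, [85.0, 35.0]))  # fault_likely
print(predict(model, [48.0, 7.0]))   # normal
```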

10. Help to optimize a specific metric or goal

Looking to maximize uptime? Maintain a standard of performance? Reduce time between deployments? An adaptive machine learning system can help.

Adaptive systems are those without a known answer or result. Instead, their goal is to take input data and optimize a particular characteristic. Airline ticketing systems, for example, attempt to fill planes and optimize revenue by changing ticket prices up to three times a day.

It turns out that you can optimize DevOps processes in a similar way. You train the neural network differently, to maximize (or minimize) a single value, rather than to get to a known result. This enables the system to change its parameters during production use to gradually approximate the best possible result.
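The idea of adjusting parameters to push a single value up or down can be shown without a neural network at all. The sketch below uses plain hill climbing, a much simpler technique than the adaptive systems described above, against a made-up objective: the `throughput` function and the worker-count parameter are assumptions, standing in for a measurement you would take from your own pipeline.

```python
def hill_climb(measure, start, step=1, max_rounds=50):
    """Nudge one integer parameter up or down while the measured value improves."""
    best, best_score = start, measure(start)
    for _ in range(max_rounds):
        # Try one step in each direction and keep the better neighbor.
        score, candidate = max((measure(c), c) for c in (best - step, best + step))
        if score <= best_score:
            break  # neither neighbor is better: a local optimum
        best, best_score = candidate, score
    return best, best_score


def throughput(workers):
    """Hypothetical objective: deployments per hour peaks at 12 workers."""
    return 500 - (workers - 12) ** 2


print(hill_climb(throughput, start=4))  # (12, 500)
```

In production the `measure` call would be a real observation, which is noisy and slow, so an adaptive system changes its parameters gradually rather than jumping straight to an answer.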

The ultimate goal is to measurably improve DevOps practices from conception to deployment to end of life. Machine learning systems can accept and process data in real time and come up with an answer that DevOps teams can apply to improve processes and better understand the behavior of their application.

Learning through iteration

Many machine learning systems use neural networks, which are sets of layered algorithms that accept multiple data streams and process that data through the layers. You train them by inputting past data with a known result. The application then compares its algorithmic results to the known results, and the algorithm coefficients are adjusted to try to model those results.

It may take a while, but if the algorithms and network architecture are chosen well, the machine learning system will start to produce results that closely match the actual ones. In effect, the neural network has "learned," or modeled, a relationship between the data and the results. This model can then be used to evaluate future data in production.
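The compare-and-adjust loop can be sketched with a single linear unit rather than a full layered network. This is a deliberately tiny stand-in: one weight and one bias are adjusted by gradient descent until the outputs match the known results. The training data, learning rate, and epoch count are assumptions chosen so the example converges.

```python
def train(samples, learning_rate=0.05, epochs=2000):
    """Fit output = weight * x + bias to (x, known_result) pairs by gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, known in samples:
            error = (weight * x + bias) - known  # compare the result to the known result
            grad_w += 2 * error * x / n          # gradient of mean squared error
            grad_b += 2 * error / n
        weight -= learning_rate * grad_w         # adjust the coefficients
        bias -= learning_rate * grad_b
    return weight, bias


# Hypothetical past data generated by the relationship known = 3 * x + 2.
samples = [(0.0, 2.0), (1.0, 5.0), (2.0, 8.0), (3.0, 11.0), (4.0, 14.0)]
weight, bias = train(samples)
print(round(weight, 3), round(bias, 3))  # 3.0 2.0

# The learned model can now evaluate data it has not seen before.
print(round(weight * 10.0 + bias, 3))    # 32.0
```

A real neural network repeats the same loop across many coefficients and several layers, which is why training can take a while.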

These learning systems can also be applied to data collected from other parts of the DevOps process. This includes more traditional development metrics such as velocity, burn rate, and defects found, but DevOps involves more measures.

DevOps includes data generated by continuous integration and continuous deployment tools. Metrics such as successful integrations, number of integrations, time between integrations, and defects per integration all have value if they can be properly correlated and evaluated.

For more on machine learning and DevOps, see Barry Snyder's presentation, "The DevOps Smart Road: Integrating AI into DevOps," at the AllDayDevOps 2017 online conference. Snyder is Senior Manager, DevOps Developer Frameworks & Application Quality at Fannie Mae, which is in its third year of enterprise DevOps and Agile adoption. He is using AI to make rapid improvements to the organization's DevOps platform. Admission to this event is free, and you can also watch Snyder's presentation after the event.

Topics: DevOps