Go data-driven with your DevOps: 5 tips for improving speed and scale

Harpreet Singh Founder and co-CEO, Launchable
Kohsuke Kawaguchi Co-CEO, Launchable

One of the side benefits of leading the Jenkins automation server project over the years is that I’ve had the opportunity to interview numerous companies on their workflows and practices. In some cases, I’ve found regular people who achieved extraordinary results through ordinary means.

Many times these people don’t think of themselves as doing anything special. To them, it looks as if they are just using duct tape to smooth over embarrassing problems that shouldn’t exist in “proper” software. And yet these problems are the kinds of things that every software development organization struggles with.

What separates those who overcome and those who don’t? I like to call it a data-driven DevOps mindset. What is data-driven DevOps? I think of it as taking advantage of the wide array of data that your team produces and using that to move faster.

Here are 5 tips for adopting a data-driven mindset and improving speed and scale.

1. Use the data you already have

Years ago, when I started working on Jenkins, the data produced from continuous integration (CI) was nominal. Most of us had only a few projects, each with a straightforward build script, and we were content to run tests every night. Fast forward to today, and many organizations have hundreds of libraries, modules, and services, each with multiple CI scripts running at various parts of the workflow. This generates a tsunami of data, not to mention the output from all of the various linting, security, and code quality tools. What’s interesting about all of this is that even though organizations have more data, few are taking advantage of it.

Here are a couple of practical examples of how you can take advantage of it. Most practitioners tend to treat the output from builds in a binary sense: pass or fail. But there are a ton of ways you can use the data contained in your build logs:

  • What if you correlated the pass/fail data of your tests to identify flaky tests? You could produce a sorted list of tests based on a “flakiness score” and use that to identify prime candidates for repair or rewrite.

  • What if you were able to group failures in your CI based on the error generated? This could join hundreds of null-pointer exceptions in a run into a single group and perhaps surface the other, less frequent errors that would otherwise be buried in a run with lots of failures.
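Both ideas can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the function names, the input shapes, and the scoring rule (how often a test's result flips between consecutive runs) are hypothetical choices, not a standard definition of flakiness.

```python
import re
from collections import defaultdict


def flakiness_scores(history):
    """Rank tests by how often their result flips between consecutive runs.

    `history` (a hypothetical input shape) maps a test name to its results,
    True for pass and False for fail, ordered oldest to newest. The score is
    flips / (runs - 1), so a test that always passes or always fails gets 0.0.
    """
    scores = {}
    for test, results in history.items():
        if len(results) < 2:
            continue  # not enough runs to observe a flip
        flips = sum(a != b for a, b in zip(results, results[1:]))
        scores[test] = flips / (len(results) - 1)
    # Highest score first: these are the prime candidates for repair
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def group_failures(failures):
    """Group (test_name, error_message) pairs by a normalized error signature."""
    groups = defaultdict(list)
    for test, message in failures:
        stripped = message.strip()
        first_line = stripped.splitlines()[0] if stripped else "<no message>"
        # Replace numbers and hex addresses so near-identical errors collapse
        signature = re.sub(r"0x[0-9a-fA-F]+|\d+", "N", first_line)
        groups[signature].append(test)
    return dict(groups)
```

Feeding `flakiness_scores` the last few weeks of CI results gives you the sorted repair list, and `group_failures` turns a wall of red into a handful of distinct error signatures. A real pipeline would likely need a smarter signature than the first line of the message.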

2. Get the right information to the right people

As roles diversify in larger organizations, it can be harder for the people on the ground—those writing new features and fixing bugs—to get the information they need in a timely manner. Where possible, seek to automate information delivery:

  • What if you tweaked build notifications to more smartly identify and message the developer responsible? This could be as simple as running git blame on the line of code that generated an error in a build and notifying the Git author of the failure.

  • Instead of notifying developers at the end of the build, what if they got a notification when the first test failed? If your tests take several hours to run, this could make a huge difference to developers.
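The git blame idea might look something like the sketch below. It assumes the build already knows which file and line produced the error; the helper names are hypothetical, and how you actually deliver the message (email, chat, and so on) is left out.

```python
import subprocess


def parse_blame_author(porcelain_output):
    """Extract (name, email) from `git blame --porcelain` output for one line."""
    name, email = None, None
    for line in porcelain_output.splitlines():
        # Header lines are unindented; the source line itself starts with a tab
        if line.startswith("author "):
            name = line[len("author "):]
        elif line.startswith("author-mail "):
            email = line[len("author-mail "):].strip("<>")
    return name, email


def blame_line(path, line_number, repo_dir="."):
    """Run git blame on a single line and return the (name, email) of its author."""
    result = subprocess.run(
        ["git", "blame", "--porcelain",
         "-L", f"{line_number},{line_number}", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_blame_author(result.stdout)
```

Keeping the parsing separate from the `git` call makes the logic easy to test without a repository. Keep in mind that the last person to touch a line is not always the person who broke it, so treat the result as a good first guess.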

3. Look for low-hanging fruit

At conferences and on the Internet, you sometimes only hear from people doing amazing things at huge companies. But there are many folks who are solving “unsexy problems with duct tape.” These are the true heroes.

For example, I met someone from a CI team who is providing infrastructure for hundreds of engineers. He had this annoying problem: Sometimes tests failed because the CI infrastructure failed (the server was out of disk space, the database was down, etc.). Previously, notifications were set up in such a way that developers were notified, but the problem had nothing to do with them. That was eroding confidence in the system.

The solution was to automatically scan the last 50 lines of the build log for a few keywords. If those matched, the failure notifications were sent to him instead of the developers. This was very easy to do, and very effective.
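I don't have his actual script, but a minimal sketch of the idea could look like this. The keywords and function names here are my own placeholders; you would pick the strings that your infrastructure failures actually print.

```python
# Hypothetical keywords; replace with what your own infrastructure errors print
INFRA_KEYWORDS = (
    "No space left on device",
    "Connection refused",
    "Out of memory",
)


def is_infrastructure_failure(log_text, keywords=INFRA_KEYWORDS, tail_lines=50):
    """Return True if any keyword appears in the last `tail_lines` lines of the log."""
    tail = log_text.splitlines()[-tail_lines:]
    return any(
        keyword.lower() in line.lower()
        for line in tail
        for keyword in keywords
    )


def pick_recipients(log_text, developers, infra_owner):
    """Route infrastructure failures to the CI owner, everything else to developers."""
    if is_infrastructure_failure(log_text):
        return [infra_owner]
    return list(developers)
```

Only the tail of the log is scanned because that is where the fatal error usually shows up, and it keeps a keyword that appeared early in a long build from rerouting the notification.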

Simple solutions like this are often underestimated. Be on the lookout for low-hanging fruit.

4. Consider how DevOps problems are solved at scale

Regardless of your organization's size, it’s worth considering how large companies solve some of these problems at scale. I’ve taken a lot of inspiration from companies such as Facebook and Google. They have written about how they are using machine learning to run only the most important tests for each code change (and have reduced test cycle time as a result). This is called predictive test selection, and it’s what I’m working on at my current company. There are other examples out there as well. Facebook built a tool that automatically finds bugs and submits fixes to engineers to review. Google has also written about how it mitigates flaky tests.

While big companies have more resources to devote to these issues, you can still learn a lot from their approaches. You may even be able to identify shortcuts based on their learning to get much of the value without the effort. Also, with some research, you may discover a web service or open-source project that is trying to address the same issues you face.

5. Use key metrics to track your progress

While this one may seem obvious, it’s worth considering how you can use key metrics to track your progress. I find that there are at least two axes that are worth looking at:

  • Metrics that are helpful to your team to report and celebrate progress internally: Think about these as the key performance indicators that can show off what your team is working to improve. Code coverage, test runtime, the time from code push to build complete—these are all examples of metrics you might want to track.

  • Metrics that are helpful to you as you communicate progress to leadership: I find the DORA metrics especially helpful here. This is the set of metrics originally articulated in the book Accelerate that help measure DevOps performance. They are deployment frequency, lead time for code changes, change failure rate, and time to restore service.
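As a rough illustration, two of the metrics named above, lead time for code changes and change failure rate, can be computed from a simple list of deployment records. The `Deployment` record and its fields are assumptions made for this sketch; your own deployment tooling will expose this data in its own shape.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_failure: bool     # whether it led to a failure in production


def lead_time_for_changes(deployments):
    """Median time from commit to production, as a timedelta."""
    return median(d.deployed_at - d.committed_at for d in deployments)


def change_failure_rate(deployments):
    """Fraction of deployments that caused a failure in production."""
    return sum(d.caused_failure for d in deployments) / len(deployments)
```

I use the median here instead of the mean so that one unusually slow change does not dominate the number you report. Both functions expect at least one deployment.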

Data-driven DevOps is about more than the data

A data-driven mindset is about more than just collecting the vast amount of data that is available to a modern software team. It’s about trying to use that data for the most gain within your organization. There are many places where we can help teams move faster if we just apply a little thought to how we can use data to make improvements.

Maybe you’re already doing this, but don’t think that what you are doing is particularly special. Remember that one person’s duct tape may be another person’s treasure.

I for one would love to hear more about your experiences. Consider writing a blog post about what you are doing or submitting a talk to a conference. The world needs more people talking about how they are using data to drive improvements.
