What to measure—and why

6 proven metrics for DevOps success

Ann Marie Fred DevOps and Security Lead, IBM

Craig Cook DevOps Coach, IBM

Many teams struggle to identify the key performance indicators that will demonstrate to management why DevOps matters and how it affects business results. Not having the right metrics is just one of the many ways that DevOps initiatives fail.

Visualizing objective, automated metrics makes them easier to understand. That's why choosing the right DevOps metrics is key. Here are six that have worked for my team at IBM, why they were chosen, and the improvements seen as a result.

1. Availability and uptime

In the book Accelerate, by Nicole Forsgren, Jez Humble, and Gene Kim, one of the four key software delivery metrics is "time to restore service." Uptime is essentially a way to aggregate this metric over time.

My team calculates availability by taking the total time of all outages reported by our primary production monitor for each service, subtracting that from the total time, and then dividing by the total time.

Our availability dashboard calculates this on demand for any date range. Most of our services have agreed to a service-level objective (SLO) of 99.95% uptime, with a few committing to 99.85% instead.
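
As a rough illustration of that calculation, here is a minimal Python sketch. The function names, the outage tuple format, and the assumption that outages do not overlap are ours for illustration; they are not the actual dashboard code.

```python
from datetime import datetime, timedelta


def availability(outages, start, end):
    """Return availability as a fraction in [0, 1] for the window [start, end).

    outages: list of (outage_start, outage_end) datetimes, as reported by a
    monitor. Assumed (for this sketch) not to overlap each other.
    """
    total = (end - start).total_seconds()
    if total <= 0:
        raise ValueError("end must be after start")
    downtime = 0.0
    for outage_start, outage_end in outages:
        # Clip each outage to the reporting window before counting it.
        clipped_start = max(outage_start, start)
        clipped_end = min(outage_end, end)
        if clipped_end > clipped_start:
            downtime += (clipped_end - clipped_start).total_seconds()
    # (total time - outage time) / total time
    return (total - downtime) / total


def meets_slo(value, slo=0.9995):
    """True when the measured availability meets the objective (99.95% by default)."""
    return value >= slo


if __name__ == "__main__":
    window_start = datetime(2024, 1, 1)
    window_end = window_start + timedelta(days=30)
    one_outage = [(datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 20))]
    measured = availability(one_outage, window_start, window_end)
    print(f"{measured:.4%}", meets_slo(measured))
```

Over a 30-day window, a 99.95% objective leaves room for a little under 22 minutes of downtime, so a single 20-minute outage still meets it while a 30-minute outage does not.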

Our general manager reviews the uptime numbers for each of our production services quarterly. This motivates teams to:

Make sure their production monitors are accurately measuring the true end user experience
Focus on improving the availability of services that aren't meeting their goals
Celebrate when they meet their SLO

2. Work in progress

The Toyota Production System of lean manufacturing taught us that limiting work in progress (batch sizes) helps teams improve overall throughput. Put another way, it's better to finish one project today than to chip away at 10 projects and finish none of them.

With our work-in-progress metric added to our dashboard, my team can simply count the number of open issues of each type (story, defect, task). When the number gets too high, it's time to stop taking on new work and focus on what has already been started. That improves our overall velocity.
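
The counting itself is simple. The sketch below assumes each issue is a dictionary with hypothetical "type" and "state" fields, and the per-type limits are made-up numbers; a real tracker export and a squad's own limits would differ.

```python
from collections import Counter


def wip_counts(issues):
    """Count open issues per type (for example story, defect, task)."""
    return Counter(issue["type"] for issue in issues if issue["state"] == "open")


def types_over_limit(counts, limits):
    """Return the issue types whose open count exceeds the squad's limit.

    Types without a configured limit are never flagged.
    """
    return sorted(
        issue_type
        for issue_type, count in counts.items()
        if count > limits.get(issue_type, float("inf"))
    )


if __name__ == "__main__":
    issues = [
        {"type": "story", "state": "open"},
        {"type": "story", "state": "open"},
        {"type": "story", "state": "open"},
        {"type": "defect", "state": "open"},
        {"type": "task", "state": "closed"},
    ]
    limits = {"story": 2, "defect": 3}  # hypothetical limits
    counts = wip_counts(issues)
    print(dict(counts), types_over_limit(counts, limits))
```

When a type shows up in the over-limit list, that is the signal to stop pulling new work of that type and finish what is already open.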

3. Repository speed

Pull requests represent value waiting to be delivered to production. The repository speed score is based on the time from submission to merge (i.e., review duration) of GitHub pull requests, over the last 30 days. A perfect score is given when the average time per pull request comes in at zero to two weekdays (M-F), decreasing to a score of zero at five weekdays.
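
To make the scoring concrete, here is a hedged Python sketch. It assumes the score runs from 1.0 (perfect) to 0.0, that it falls off linearly between two and five weekdays, and that timestamps are naive datetimes in one time zone; the description above does not spell out those details, and the function names are ours.

```python
from datetime import datetime, timedelta


def weekday_days(start, end):
    """Elapsed time from start to end in days, counting only Monday to Friday."""
    counted = timedelta()
    cursor = start
    while cursor < end:
        # Walk forward one calendar day at a time, stopping at each midnight.
        next_midnight = datetime.combine(cursor.date(), datetime.min.time()) + timedelta(days=1)
        segment_end = min(next_midnight, end)
        if cursor.weekday() < 5:  # 0-4 are Monday-Friday
            counted += segment_end - cursor
        cursor = segment_end
    return counted / timedelta(days=1)


def speed_score(avg_days, full_marks_at=2.0, zero_at=5.0):
    """Map an average review duration (in weekdays) to a score in [0, 1]."""
    if avg_days <= full_marks_at:
        return 1.0
    if avg_days >= zero_at:
        return 0.0
    # Assumed linear decrease between the two thresholds.
    return (zero_at - avg_days) / (zero_at - full_marks_at)


def repository_speed(pull_requests):
    """Score a list of (submitted_at, merged_at) pairs merged in the last 30 days."""
    if not pull_requests:
        return None  # nothing merged, so there is nothing to score
    durations = [weekday_days(opened, merged) for opened, merged in pull_requests]
    return speed_score(sum(durations) / len(durations))
```

For example, a pull request opened at noon on a Friday and merged at noon the following Monday counts as one weekday of review time, because the weekend is skipped.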

Old pull requests can get lost, affecting the repository speed metric. When you're working with more than one repository, it's difficult to keep track of outstanding pull requests.

One new dashboard feature lists all outstanding pull requests for every repository the squad owns and highlights the ones that are old. Some squads review this list in their daily stand-up. By highlighting old pull requests waiting for review, we ensure that our developers focus their code review efforts where they are needed.

Some developers frequently created pull requests before they were ready to deliver to production. Some of these requests stayed open for weeks, and that drove down our repository speed scores.

These long-running pull requests were used as a discussion point with other squads to clarify information that was missing in the specs. It was also a way for developers to communicate changes that would affect other squads.

Finally, this issue also highlighted a tight coupling between squads that my team wanted to avoid. The goal is for our squads to be autonomous but also loosely coupled to each other. The idea is to keep our microservice architecture very flexible, using APIs between the different services/squads. But when more than one squad began working on the same pull request, it highlighted an issue that seemed like an architecture/code ownership problem to us. If your APIs are well designed, you should be able to make changes at will, without affecting other squads.

After discussing best practices to improve this metric, squads were driven to experiment with pair programming, development on branches, API versioning, and contributing code across repositories.

4. Deployment frequency

Squads that deploy more than once per week can fix outages in production faster because they have the automation in place to deploy changes quickly and easily. They're also delivering value to customers more frequently.

Finally, they're more likely to fix newly reported critical security vulnerabilities within a few days because they don't have to wait for the next scheduled deployment window or implement an "emergency change" procedure.

Our deployment frequency score is based on the number of successful deployments over the last 30 days. The tile is color-coded: zero to one deployments is red, two to three is yellow, and four or more is green.
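
Those thresholds translate directly into code. In this sketch the record format, a list of (timestamp, succeeded) pairs, and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def successful_deployments(deployments, now, window_days=30):
    """Count successful deployments in the window_days leading up to now.

    deployments: list of (deployed_at, succeeded) pairs.
    """
    cutoff = now - timedelta(days=window_days)
    return sum(
        1
        for deployed_at, succeeded in deployments
        if succeeded and cutoff <= deployed_at <= now
    )


def frequency_color(count):
    """Map a 30-day count of successful deployments to a tile color."""
    if count < 0:
        raise ValueError("count cannot be negative")
    if count <= 1:
        return "red"
    if count <= 3:
        return "yellow"
    return "green"


if __name__ == "__main__":
    now = datetime(2024, 3, 1)
    history = [
        (datetime(2024, 2, 5), True),
        (datetime(2024, 2, 12), False),  # failed deployments do not count
        (datetime(2024, 2, 19), True),
        (datetime(2024, 1, 15), True),   # outside the 30-day window
    ]
    print(frequency_color(successful_deployments(history, now)))
```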

Initially my team received pushback from teams that deliver code to production only every couple of weeks or, in some cases, only once per month. Deployment frequency is a key metric, and we were able to use the research from Accelerate to back up the importance of frequent deployments.

Promoting more frequent deployments

To further encourage frequent deployments, teams were put on a weekly sprint cadence. And early on, we also had weekly playbacks. It quickly became clear which teams were delivering value every week and which were not.

Eventually, as the organization grew, my team settled into a pattern of biweekly playbacks so they could fit half of the teams each week into a one-hour session.

All of this led to conversations about how these teams might adopt continuous delivery. In some cases, we've been able to sit with a team and get its CI/CD pipeline set up within a few days.

A red deployment frequency tile also shows us where the team needs to pull down the latest code and check for security vulnerabilities, and hopefully automate that process as well.

5. Deployment stability

Deployment stability is the percentage of time when the most recent build for a given repository was successful. On our scale, 0% to 50% is red, 50% to 90% is yellow, and 90% or higher is green. A developer on our team invented this metric after our developers complained that they were spending too much of their time fixing builds.
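
One way to compute such a percentage is sketched below in Python. It is an illustration under stated assumptions, not our developer's implementation: builds arrive as (finished_at, succeeded) pairs, time before the first known build is left out of the calculation, and exactly 50% is treated as yellow because the scale above leaves that boundary open.

```python
from datetime import datetime, timedelta


def deployment_stability(builds, start, end):
    """Fraction of [start, end) during which the most recent build was successful.

    builds: list of (finished_at, succeeded) pairs for one repository.
    Returns None when no build result is known at any point in the window.
    """
    ordered = sorted(builds, key=lambda build: build[0])
    green = timedelta()
    known = timedelta()
    for index, (finished_at, succeeded) in enumerate(ordered):
        # A build's result stays "most recent" until the next build finishes.
        next_finish = ordered[index + 1][0] if index + 1 < len(ordered) else end
        segment_start = max(finished_at, start)
        segment_end = min(next_finish, end)
        if segment_end > segment_start:
            known += segment_end - segment_start
            if succeeded:
                green += segment_end - segment_start
    if not known:
        return None
    return green / known


def stability_color(stability):
    """Map a stability fraction to a tile color (boundary handling assumed)."""
    if stability < 0.5:
        return "red"
    if stability < 0.9:
        return "yellow"
    return "green"
```

With a 10-day window in which the build is green for six days, broken for two, and green again for the last two, this gives 80%, which lands in the yellow band.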

Broken builds can be a good thing, when excellent test automation has stopped a poor-quality change from being delivered to production. Not all broken builds are due to code errors; sometimes there's an infrastructure problem that needs to be fixed. Broken builds become a problem when they persist for a long time and start to interfere with developer productivity.

By visualizing broken builds over time on the dashboard, it's obvious where squads need to spend some time cleaning up technical debt and fixing the deployment process.

6. SonarQube metrics

Most of our repositories have integrated SonarQube reporting into their CI/CD processes, pulling the overall scores for security, code quality, and test coverage into our dashboard. They're displayed at a repository level, squad level, and cross-platform level.

Making these scores visible helps teams ask for more time to improve their scores. My team has seen, for example, that one team improved its SonarQube security score from a D to an A in two days just by implementing the small changes SonarQube recommended.

My team is frequently challenged to defend our goal of 100% unit test coverage. Some teams believe they should get an "A" on the dashboard for a lower test coverage score, like 70%. We have stuck to our convictions here: if a squad is happy with 70% coverage, it can accept a lower grade.

Others point out that it's possible for teams to game the system by telling the code coverage checker to ignore large swathes of code. In reality, there has not been widespread abuse of code coverage "ignore" flags. Our developers and code reviewers should have the integrity to be honest about their code coverage.

Understand your goal—and start slow

Your goal should not be to make every tile on the dashboard turn green. You want to see where your squad is performing well and where there's room to improve. Then you can make an informed decision as to where to invest time.

These six metrics are simple for people to visualize, and they have resulted in positive changes in my organization. But start out small and roll them out gradually as you explore how these metrics can help your own organization.
