Why you need an error budget—and how to make it work

How many times have you seen Google go down? Not many, I bet. You might not even notice if it happened. And if you did, you'd probably blame it on an Internet connection problem.

But Google isn't perfect. As Werner Vogels says, "Everything fails, all the time." If even Google doesn't have 100% uptime, maybe you should accept that you won't either.

Shouldn't you instead focus on how to recover from failure? Systems tend to fail when they change, but your customers probably wouldn't like it too much if you never updated your software. So since you have to change, how do you know when to change?

Maybe you've had good uptime recently, and a small failure won't hurt you too much. But you have to know whether that's the case or not. You need metrics that tell you if it's a good idea to freeze changes for a time or if having some errors is still acceptable.

Here's why you need an error budget, and how to make it work.

Why you need an error budget

If you've compared cloud services, you may have heard about the nines of availability. It's a number that tells you how much of the time systems are going to be up and, by implication, how long they can be down. When someone talks about having four nines (99.99%) of availability for a system, that person is saying the system will be down only 52 minutes and 35 seconds a year.
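To make the arithmetic behind the nines concrete, here is a minimal Python sketch. The function name allowed_downtime is made up for illustration, and the year is assumed to be 365.25 days long, which is the assumption that reproduces the 52 minutes and 35 seconds quoted above for four nines.

```python
from datetime import timedelta

# Assumption: a year of 365.25 days, which matches the four-nines figure in the text.
YEAR = timedelta(days=365.25)


def allowed_downtime(availability_pct: float, period: timedelta = YEAR) -> timedelta:
    """Return how long a system may be down in `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period * (1 - availability_pct / 100)


# Print the yearly downtime allowance for two to five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% availability -> {allowed_downtime(pct)} of downtime per year")
```

Each extra nine cuts the allowance by a factor of ten, which is why every additional nine tends to be so much harder to deliver than the previous one.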

The more nines, the more uptime. And "up" doesn't have to mean only that the server answers. For instance, let's say that you defined a rule specifying that the system has to respond in under 500ms for 99.99% of requests. If latency goes above the 500ms threshold, your system is considered down for those requests, even though it's still responding.
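A latency rule like this can be checked with a few lines of Python. This is only a sketch: the function names and the sample data are hypothetical, and in practice the latencies would come from your monitoring system rather than a hard-coded list.

```python
def latency_sli(latencies_ms, threshold_ms=500.0):
    """Return the fraction of requests that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_target(latencies_ms, target=0.9999, threshold_ms=500.0):
    """True when enough requests were fast enough to satisfy the target."""
    return latency_sli(latencies_ms, threshold_ms) >= target


# Hypothetical sample: 10,000 requests, two of them slower than 500ms.
samples = [120.0] * 9_998 + [750.0] * 2
print(f"share of fast requests: {latency_sli(samples):.4%}")
print(f"meets the 99.99% target: {meets_target(samples)}")
```

With two slow requests out of 10,000, the share of fast requests is 99.98%, so this sample misses a 99.99% target even though the service never stopped responding.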

This number is used to define the service-level agreement (SLA) or service-level objective (SLO). The error budget is how much time you're willing to allow your systems to be down: whatever is left over once the target is met, such as 0.01% of the time for four nines. It depends heavily on the SLA or SLO that you've defined with the product team.
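As a rough sketch of how an error budget could be tracked, the Python below turns a target into minutes of allowed downtime and decides whether to freeze changes. The class name, the 30-day window, and the 20% reserve are all assumptions made for illustration; use whatever window and policy your own SLA or SLO calls for.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class ErrorBudget:
    slo: float             # target availability as a fraction, e.g. 0.999
    window_days: int = 30  # assumed rolling window, not something the SLO dictates

    @property
    def total_minutes(self) -> float:
        """Downtime the target allows over the whole window."""
        return self.window_days * MINUTES_PER_DAY * (1 - self.slo)

    def remaining_minutes(self, downtime_minutes: float) -> float:
        """Budget left after the downtime already spent (negative if overspent)."""
        return self.total_minutes - downtime_minutes

    def should_freeze(self, downtime_minutes: float, reserve: float = 0.2) -> bool:
        """Hypothetical policy: freeze changes once less than `reserve` of the budget is left."""
        return self.remaining_minutes(downtime_minutes) < reserve * self.total_minutes


budget = ErrorBudget(slo=0.999)
print(f"budget for the window: {budget.total_minutes:.1f} minutes")
print(f"freeze after 10 minutes down? {budget.should_freeze(10)}")
print(f"freeze after 40 minutes down? {budget.should_freeze(40)}")
```

At 99.9% over 30 days the budget is about 43 minutes, so 10 minutes of downtime leaves plenty of room to keep shipping, while 40 minutes leaves less than the reserve and triggers a freeze.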

Everyone would like to have systems with 100% uptime, but you need to be realistic. How much availability are you willing to provide, based on how much your customers care? Are your users going to notice that your system is up 100% of the time? What about 99.99%? Or even 99%? They might not.

It's important to have an SLA or SLO that works for you so that, at the moment a deployment fails, you'll think twice about whether to try fixing something in production or to roll back to a stable version. An error budget also gives you grounds to hold back changes that people don't yet trust.

Uptime vs. innovation: Should I pick one?

Chasing high uptime carries a risk beyond financial costs and complexity: It puts you in the position of worrying too much when deploying changes. Some might use error budgets to support their theory that every change affects the stability of the system and conclude that there should be no more changes. But I'd advise against that mindset. It's better to protect stability in other ways.

Operations will always seek to have systems that are highly available by putting in place replication, redundancy, auto-scaling, backups, and everything that makes systems more robust. On the other hand, developers will try to write code that satisfies the requirements that came from the business. That's how the DevOps movement started: People wanted to create a culture where these frictions are minimal.

If you care more about having several nines of availability than releasing new features, then innovation will stop. Sure, it might be better to be conservative than to take the risk. But let's face it: No one will care about reliability if your system doesn't provide any value. Successful systems solve users' problems. There are always trade-offs, but keeping systems static won't keep your customers happy.

How do you keep the budget positive?

It's important to have room in your error budget in case something happens that's external to deployments—something such as Internet connection issues, fires in the data center, cloud providers that go down, and any problem that's not in your control to fix (and that complaining about on Twitter won't solve).

When you push changes gradually, you're more in control of the error budget. If something starts to affect uptime, you can roll back immediately, before it consumes your budget. If you don't release in small batches, a single bad release can blow through the whole budget. Deployment strategies such as blue/green deployments or canary releases are good options to keep numbers positive. Automation becomes your best friend here. Every second counts, especially when you need to do a rollback.
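The idea of a gradual rollout with automatic rollback can be sketched as follows. Everything here is hypothetical: set_canary_weight stands in for whatever your load balancer or deployment tool exposes, observed_error_rate stands in for a query against your metrics, and the traffic steps are arbitrary. A real rollout would also wait between steps so that metrics have time to accumulate.

```python
from typing import Callable, Sequence


def canary_rollout(
    set_canary_weight: Callable[[int], None],   # hypothetical hook: percent of traffic to the new version
    observed_error_rate: Callable[[], float],   # hypothetical hook: current error rate from monitoring
    max_error_rate: float,
    steps: Sequence[int] = (1, 5, 25, 50, 100),
) -> bool:
    """Shift traffic to the new version step by step; roll back on the first bad reading."""
    for weight in steps:
        set_canary_weight(weight)
        if observed_error_rate() > max_error_rate:
            set_canary_weight(0)  # automated rollback: send all traffic to the stable version
            return False
    return True


# Simulated run: the new version starts failing once it receives 25% of traffic.
applied_weights = []
readings = iter([0.0, 0.0, 0.02])
succeeded = canary_rollout(applied_weights.append, lambda: next(readings), max_error_rate=0.001)
print(f"rollout succeeded: {succeeded}, weights applied: {applied_weights}")
```

Because the check runs after every step, the bad release in the simulated run never receives more than a quarter of the traffic before it is pulled.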

You can also start with the code. How does your code respond if the database has problems? Or what about Redis? Problems with dependencies will always happen, so it's better if your application can tolerate them.

Let's say your system is composed of several microservices. If one of those goes down, rather than fail, the client should have a default response or take data from the local cache. For example, Netflix has a Java library called Hystrix that wraps calls to dependencies and falls back to an alternative when they fail. If you're not in the Java world, you can still apply the same principles in your own stack.

Fail, but don't get caught

Failure is an option, but the trick is how you manage it. Netflix practices failure all the time with its Chaos Monkey tooling. It goes to the extreme of repeatedly bringing down entire clusters in production. Now, let's be clear about this. If Netflix is down, users might get mad, but no one will die. The company won't lose money immediately, either, because of its monthly subscription plans. But its reputation could be affected if the system is constantly down, and that would affect revenue months later through lost users.

So should you start bringing servers down? What would be the business leaders' reaction when you tell them? They would probably freak out and respond with a solid no. It will depend on the impact that downtime has on the users. But even if you don't put your systems in failure situations in production, as Netflix does, you should at least practice it in a testing environment. Doing that shows you care about reliability and are prepared for common failure scenarios.
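One low-risk way to practice failure in a testing environment is to inject faults into the calls your code makes to its dependencies. The Python below is a sketch under that assumption: with_faults and InjectedFailure are names invented here, not part of any Netflix tool, and the failure rate and seed are arbitrary.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a failure drill."""


def with_faults(func, failure_rate, rng=None):
    """Return a wrapper around `func` that fails with probability `failure_rate`."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def load_profile(user_id):
    # Stand-in for a call to a real dependency in the test environment.
    return {"id": user_id}


# Failure drill: roughly 30% of calls fail; the seed keeps the run repeatable.
flaky_load = with_faults(load_profile, failure_rate=0.3, rng=random.Random(42))
failures = 0
for i in range(1000):
    try:
        flaky_load(i)
    except InjectedFailure:
        failures += 1
print(f"{failures} of 1000 calls failed on purpose")
```

Running your test suite against the wrapped dependency shows quickly whether the application degrades gracefully or falls over when a dependency misbehaves.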

When AWS went down some years ago, Netflix was one of the few customers that came through unscathed. It failed, but it didn't get caught.

Keep innovating while staying up

With DevOps, some organizations are so focused on delivering fast that they sometimes don't adequately assess risk. And if that's you, developing an error budget can help you be aware of that risk and respond properly. Most businesses prefer to be conservative, so you need to learn how to sell DevOps or even automation to management while accounting for risk. At the same time, you need to keep innovating without affecting reliability.

Having an error budget will force you to have metrics in place to know if you’re meeting expectations or not, and it will help you take action to reduce the chances of being unreliable.

Error budgets give you more than just a number. They'll change your thinking when you're delivering software. You'll want to shift to the left everything that will make your systems more reliable.