5 ways site reliability engineering transforms IT Ops

Traditional IT operations do not work at the speed of modern, cloud-native software delivery. That's why new approaches—particularly site reliability engineering (SRE)—are gaining traction across the industry.

But SRE, pioneered by Google, differs radically from the IT operations of the past: it centers on the error budget and the inter-team relationships that the budget brokers, treats everything as code, and gives SRE teams the ability to push back on bad software.

Over the past 10 months I've worked with three organizations (two large and one small-to-midsize enterprise) to reconsider their approach to IT operations for cloud-native software. To do that, the teams explored the SRE model in some detail: the effect on supplier contracts, the dynamics of SRE as a service, the skills gap, and so on.

Here are five ways your large enterprise can take advantage of SRE, and the effect each has on IT operations for both leaders and hands-on managers.

1. Let software engineers design IT Ops

People on SRE teams are either software developers with strong operations knowledge, or IT operations people with strong software development skills. Either way, software is the approach that SRE teams use to solve problems.

If SREs have to undertake the same manual steps to restore service to an application more than a couple of times, they will write software to automate the task. And because SREs understand and practice modern software development techniques, the software they write to fix the problem will not be just a clunky shell script, but well-written software with test scaffolding running in a continuous integration environment.

This software-first approach to IT operations also extends to the role of the development team at times. If the SRE team that looks after a particular application or service finds that it is spending more than 50% of its time doing manual operational work to address problems in the software, the development team must pick up the slack.

This is done, according to Stephen Thorne, customer reliability engineer at Google, by:

  • Monitoring the amount of operational work being done by SREs and redirecting excess operational work to the product development teams
  • Reassigning bugs and tickets to development managers
  • [Re]integrating developers into on-call pager rotations

This redirection ends when the operational load drops back to 50% or lower.

So if a development team produces software that is too difficult to operate within the 50% balance for the SRE team, the development team must take on the operational tasks and help fix them, learning about operational aspects as needed.
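As a rough illustration, the 50% rule can be expressed as a simple check over logged time. This is a minimal sketch, not Google's tooling: the `WeeklyLog` record, its field names, and the idea of tracking hours per week are assumptions made for the example.

```python
from dataclasses import dataclass

# The 50% ceiling on manual operational work described above.
OPS_LOAD_LIMIT = 0.50


@dataclass
class WeeklyLog:
    """Hypothetical record of how an SRE team spent a week."""
    ops_hours: float          # manual operational work: tickets, pages, restores
    engineering_hours: float  # automation and other project work


def ops_load(log: WeeklyLog) -> float:
    """Fraction of the team's logged time spent on manual operational work."""
    total = log.ops_hours + log.engineering_hours
    return log.ops_hours / total if total else 0.0


def should_redirect_to_dev(log: WeeklyLog) -> bool:
    """True while operational load exceeds the limit; False at 50% or lower."""
    return ops_load(log) > OPS_LOAD_LIMIT


busy_week = WeeklyLog(ops_hours=24, engineering_hours=16)      # 60% ops work
balanced_week = WeeklyLog(ops_hours=20, engineering_hours=20)  # exactly 50%
```

With these figures, `should_redirect_to_dev(busy_week)` is true, so excess tickets and pager duty would go back to the development team, while `balanced_week` sits at the limit and triggers no redirection.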

This is a highly disciplined balance between leaning on the skills of SREs and retaining responsibility for the operability of the software within the development team.

2. Rigorously focus on error budget and SLOs

At the heart of the SRE approach is the SLO for the application or service that is being run by the SRE team. The product manager for the service must choose an appropriate SLO that gives her enough margin of possible downtime to cover unforeseen problems while delivering features and updates at a rate that users expect.

Because any service downtime is measured by "neutral" pervasive tooling, there is no dispute about the figures. The SLO approach also drives the adoption of synthetic transaction monitoring, an excellent practice for customer-facing systems: an automated script exercises whole customer journeys on a regular schedule (usually every 5 to 10 minutes). This in turn brings the service closer to the customer and, by extension, the dev and SRE teams closer to the customer as well.
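To make the idea of a synthetic transaction concrete, here is a minimal sketch of a scripted customer journey. The step functions are hypothetical stand-ins that simply return True or False; a real check would drive the live service over HTTP or a browser and would be triggered by a scheduler every few minutes.

```python
import time
from typing import Callable, Dict, List, Tuple

# A journey is an ordered list of (step name, action) pairs.
JourneyStep = Tuple[str, Callable[[], bool]]


def run_journey(steps: List[JourneyStep]) -> Dict[str, object]:
    """Run each step in order and stop at the first failure, as a customer would."""
    started = time.monotonic()
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False  # an exception counts as a failed step
        if not ok:
            return {"ok": False, "failed_step": name,
                    "seconds": time.monotonic() - started}
    return {"ok": True, "failed_step": None,
            "seconds": time.monotonic() - started}


# Hypothetical stand-ins for real requests against the service.
def open_home_page() -> bool:
    return True


def log_in() -> bool:
    return True


def check_out() -> bool:
    return False  # simulate a broken checkout


checkout_journey = [
    ("home", open_home_page),
    ("login", log_in),
    ("checkout", check_out),
]
result = run_journey(checkout_journey)
```

Here `result` reports the journey as failed at the `checkout` step. Recording which step broke, rather than only that the service is "down," is what ties the measurement to the customer's experience.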

As a product manager working with an SRE team, if you are unhappy with the restrictions on deploying new features because you have used all your error budget, what are your options? You can either lower the SLO's availability target (and therefore accept possibly more downtime) or put more effort into the operational aspects of the software so that it has better operability and fails less often. It's a simple choice!

3. Treat IT Ops as a value center, not a cost center

SRE is a high-skill activity, and SRE experts are in short supply; even Google struggles to hire SREs. The unusual mix of deeply technical skills and customer-focused attention to SLO and error budget means that trying to reduce costs for an SRE team is not a wise move.

Enterprises that adopt SRE therefore need to stop treating IT operations as a line item subject to cost reductions. Instead they must treat IT operations as a value center that can help the company avoid downtime and maximize revenue and service availability.

Instead of hiring large numbers of junior, lower-skilled front-line staff, SRE demands that we select high-skilled, experienced, committed staff who will automate their way out of mundane activities. This is analogous to hiring aircraft pilots who have many years of long-distance flight experience rather than junior ground staff. Modern software systems are complicated, expensive machines; why would we hire low-skilled staff to run them?  

Thankfully, SRE teams are optional. That's right; not every development team at Google uses SRE. "Downscale the SRE support if your project is shrinking in scale, and finally let your development team own the SRE work if the scale doesn't require SRE support," said Jaana B. Dogan, SRE at Google.

So enterprises can retain a small SRE footprint for critical services, but leave the IT operations of smaller and less proven services to development teams, who are well-placed to support the service they are building because they know it well.

4. Let SRE jump-start cloud-native IT Ops 

For enterprises beginning to move to cloud-based platforms and delivery models, the array of options for automation and team responsibilities can be a bit daunting. The range of different ways to do DevOps can be confusing, partly because context makes a huge difference to the effectiveness of these different options. 

The case of Poppulo is typical. Damien Daly, head of engineering at the software company, explained why Poppulo created an SRE group: "As we are getting bigger, concentrating our platform development and reliability expertise [in SRE] will allow us to more effectively develop both. Reliability and our platform are first-class concerns and need to be treated with the respect they deserve."

The SRE model presents a clear, specific set of practices and team dynamics that works for large organizations. If you are in an enterprise that needs to move rapidly to cloud-native IT operations from a more traditional setup, then adopting SRE could work well—though only if you adopt it properly and not just rename existing teams.

You may be able to bypass some of the organizational awkwardness of other delivery models by adopting SRE, but beware of halfhearted implementations that do not set up the required, careful balance of responsibilities.

5. Use managed services to adopt SRE quickly

One way to get the benefits of the SRE discipline quickly, without hiring lots of expensive SRE people, is to use an external provider for SRE. Although SRE was developed and codified at Google using in-house teams, we are beginning to see some emerging SRE-as-a-service offerings from capable outsourced managed service providers. 

The SRE-as-a-service model might seem strange at first for IT organizations familiar with collaborative, in-house DevOps approaches to building and running software systems. But, as with many aspects of SRE, if we respect the delicate dynamics involved, then SRE as a service can work well.

The SLO and well-defined standard operating procedures that are at the heart of the SRE approach lend themselves well to a commercial contract. Keep in mind, though, that the details of the commercial contract need to be quite different from typical outsourced IT operations contracts.

We can see this "contract boundary" between dev and SRE in the well-known DevOps team topologies pattern Type 7. (The "DevOps," shown in green, is the collaboration between dev and SRE, not a separate team):

Diagram of SRE responsibilities in relation to dev teams. Image: CC BY-SA devoptopologies.com

At Google and other large organizations with in-house SRE staff, the "contract" is one of mutual trust around the SLO; for organizations using managed SRE services, the contract will have a commercial element. With a managed SRE service, Russ McKendrick, SRE practice lead at UK-based managed service provider N4Stack, highlights the importance of the SRE team having the authority to say no. "The ability of the SRE team to insist on good operability is a crucial reason for the success of the SRE approach," he said.

This means that a commercially managed SRE contract will include clear terms for the way in which the managed service provider will push back on software that does not work well. In practice, the SRE provider will probably help the dev team improve the operability before releasing to production, possibly through a parallel time-and-materials arrangement.

Another aspect of success with managed SRE is the use of tooling to define and automate the standard operating procedures needed to keep software running in production. Procedures written in Word or PDF documents are not going to work.

As DevOps luminary Damon Edwards, co-founder of Rundeck, stated in a blog post on operations as a service: "Standardizing procedures helps SREs save time, reduces errors (especially under pressure or when a procedure is critical but run infrequently), and makes it easier to spot anomalies (the outcome is different than expected, or log output is unexpected)."

When adopting a managed SRE approach, you should expect to invest time on an ongoing basis to create and evolve standard operating procedures using a software tool shared with the managed SRE partner (and with the procedures probably stored in version control such as Git).
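One way to picture a standard operating procedure kept as code rather than in a document is shown below. This is only a sketch under stated assumptions: the `Step` structure, the step names, and the outcome strings are all hypothetical, and a tool shared with a managed SRE partner would have its own format. The point is that each step declares its expected outcome, so a run can flag anomalies automatically.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a procedure: what to run and what outcome is expected."""
    name: str
    action: Callable[[], str]  # returns the observed outcome
    expected: str


def run_procedure(steps: List[Step]) -> List[str]:
    """Run every step in order and return a description of each anomaly."""
    anomalies = []
    for step in steps:
        observed = step.action()
        if observed != step.expected:
            anomalies.append(
                f"{step.name}: expected {step.expected!r}, got {observed!r}"
            )
    return anomalies


# Hypothetical restart procedure; the lambdas stand in for real commands.
restart_procedure = [
    Step("drain_traffic", lambda: "drained", expected="drained"),
    Step("restart_service", lambda: "running", expected="running"),
    Step("health_check", lambda: "degraded", expected="healthy"),
]
anomalies = run_procedure(restart_procedure)
```

In this run only the `health_check` step is reported, because its outcome differs from what the procedure expects. A file like this can live in version control, so changes to the procedure are reviewed like any other code.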

Engineer-driven

Ben Treynor, vice president of engineering at Google, said SRE is "what happens when you ask a software engineer to design an operations function." This may sound strange to people from an IT operations background, but essentially it means that people on SRE teams have excellent coding skills and—crucially—have a strong drive to automate repetitive ops tasks using code, thereby continually reducing toil.

Google's SRE experts have helpfully written a book, freely available online. In a nutshell, they define SRE as a high-skill operating model for online, high-traffic software services. (Note: The "site" in SRE means website, not geographical or office location.) 

It's fair to say that the SRE model balances several metrics and team dynamics—including the following—in a highly effective but rather delicate equilibrium:

  • Product dev teams begin by running their own services, including being on call for incidents.
  • If and when the service reaches a high-traffic state, the dev team may request support from an SRE team to take on running the service in production, leaning on the SREs' skills in reliability and engineering at scale.
  • The product owner for the service must define a service-level objective (SLO) based on the downtime deemed acceptable. So, 99.9% availability equates to only about 43 minutes of downtime in a 30-day month, whereas 99.99% availability leaves only about 4 minutes.
  • The acceptable and available downtime becomes the error budget for the service, which the dev team can spend as it likes. This includes trying out new features, improving operability, etc. But if the service goes down for more than the budgeted time in a month, no new changes are permitted.
  • To be permitted to deploy again, the dev team must demonstrate increased reliability through automated operational tests.
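The arithmetic behind the error budget is simple enough to sketch. The example below assumes a 30-day month and measures the budget purely in minutes of downtime; the function names are made up for illustration, and real SRE teams may define the budget differently (for example, as a share of failed requests).

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def error_budget_minutes(slo_percent: float,
                         period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Downtime allowed in the period for a given availability SLO."""
    return period_minutes * (100.0 - slo_percent) / 100.0


def changes_permitted(slo_percent: float, downtime_minutes: float) -> bool:
    """New changes are blocked once downtime exceeds the month's budget."""
    return downtime_minutes <= error_budget_minutes(slo_percent)


three_nines = error_budget_minutes(99.9)   # roughly 43 minutes
four_nines = error_budget_minutes(99.99)   # roughly 4 minutes

# A service with a 99.9% SLO that has been down for 50 minutes this month
# has overspent its budget, so further changes are not permitted.
frozen = not changes_permitted(99.9, downtime_minutes=50)
```

Moving from three nines to four nines cuts the budget by a factor of ten, which is why the choice of SLO has such a direct effect on how often a team can ship.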

This creates a very powerful dynamic for addressing operational problems rapidly and keeping product owners honest about both the required SLO and the operability level in their software service. Any services that must be highly available need huge investments in automation and testing to enable a continual flow of user-visible changes.

Get started with SRE

SRE is a specific approach to IT operations for large-scale, cloud-native software systems. The SRE model sets up a healthy and productive interaction between the development and SRE teams using SLOs and error budgets to balance the speed of new features with whatever work is needed to make the software reliable.

SRE therefore needs specialized skills to succeed, along with strong trust between teams. SRE might be suitable for some enterprises looking to adopt cloud-native approaches quickly, possibly by using an SRE-as-a-service offering from an outsourced provider.

The SRE model is one of several team patterns for modern software delivery explored in the forthcoming book Team Topologies, by Matthew Skelton and Manuel Pais. Follow @TeamTopologies on Twitter for more details.
