Muscle up

How to get big results with a small SRE team

Pierre Vincent, Head of Site Reliability Engineering, Glofox

One responsibility of every site reliability engineering team is to help other engineers deliver changes quickly and safely to customers. In many engineering departments, this responsibility belongs to a small number of site reliability engineers (SREs), who must juggle many different priorities, including cloud infrastructure, developer tooling, information security, and incident response.

Luckily, SRE principles were established with an eye to building lasting, scalable value, and are a critical part of DevOps. For example, when an SRE team provides developers with tools to own their service lifecycle, the team becomes a productivity multiplier instead of a bottleneck.

Your SRE team members can achieve this only when they have sufficient time to solve these problems. They can't be spending all their time reacting to alerts and tickets. Finding this time can be a struggle, especially if yours is a small team.

Here's tried and tested advice to help you build a strong and sustainable SRE team regardless of the team's size.

Help others, never block them

A small SRE team can have major influence by providing self-service capabilities to engineering teams, thus removing itself from the critical path to production.

This starts with deployment pipelines that can be autonomously operated by development teams, allowing them to build and release code changes—including configuration, such as environment variables—without handoffs.

This does not remove SREs' responsibility to ensure that these changes are safe. Instead, safety is enforced through constraints built into the pipeline, such as test coverage rules, static analysis quality gates, security vulnerability checks, or conservative deployment strategies.
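As a rough illustration, a pipeline gate of this kind can be sketched in a few lines of Python. The report fields, thresholds, and function name below are hypothetical, not part of any particular CI product; a real pipeline would read these values from its own test, analysis, and scanning tools.

```python
from dataclasses import dataclass


@dataclass
class BuildReport:
    """Results a pipeline run might collect; all fields are hypothetical."""
    test_coverage: float           # fraction of lines covered, 0.0 to 1.0
    static_analysis_errors: int    # blocking findings from static analysis
    critical_vulnerabilities: int  # critical findings from a dependency scan


# Hypothetical thresholds; each organization would tune these for itself.
MIN_COVERAGE = 0.80
MAX_STATIC_ANALYSIS_ERRORS = 0
MAX_CRITICAL_VULNERABILITIES = 0


def failed_gates(report: BuildReport) -> list[str]:
    """Return the names of the gates this build fails (empty list: safe to deploy)."""
    failures = []
    if report.test_coverage < MIN_COVERAGE:
        failures.append("test coverage")
    if report.static_analysis_errors > MAX_STATIC_ANALYSIS_ERRORS:
        failures.append("static analysis")
    if report.critical_vulnerabilities > MAX_CRITICAL_VULNERABILITIES:
        failures.append("security vulnerabilities")
    return failures


if __name__ == "__main__":
    report = BuildReport(test_coverage=0.72,
                         static_analysis_errors=0,
                         critical_vulnerabilities=1)
    blocked_by = failed_gates(report)
    if blocked_by:
        print("Deployment blocked by:", ", ".join(blocked_by))
    else:
        print("All gates passed")
```

Because the rules live in the pipeline, developers get an immediate answer about why a change is blocked, and no SRE has to review each release by hand.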

These deployment strategies can include blue/green (maintaining two separate but identical environments and switching traffic from one to the other) and canary (rolling a change out incrementally, starting with a small share of traffic).
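To make the canary idea concrete, here is a minimal sketch of an incremental rollout loop. The traffic steps and the two callbacks (`set_traffic_percent` and `is_healthy`) are assumptions for illustration; in practice they would wrap whatever load balancer and monitoring system you use, and each step would wait long enough to collect meaningful metrics.

```python
from typing import Callable, Sequence

# Hypothetical traffic steps: percent of traffic sent to the new version.
CANARY_STEPS = (5, 25, 50, 100)


def run_canary(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = CANARY_STEPS,
) -> bool:
    """Shift traffic to the new version one step at a time.

    Returns True if the rollout reached the final step, or False if a
    health check failed and all traffic was sent back to the old version.
    """
    for percent in steps:
        set_traffic_percent(percent)
        # A real pipeline would pause here to let metrics accumulate.
        if not is_healthy():
            set_traffic_percent(0)  # roll back: no traffic on the new version
            return False
    return True


if __name__ == "__main__":
    applied = []
    succeeded = run_canary(applied.append, lambda: True)
    print(succeeded, applied)
```

The automatic rollback is what makes the strategy conservative: a bad change only ever reaches the share of traffic allowed by the step at which it failed.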

Once changes are live, it's also important for the development team to observe the results and react to common issues without needing to involve SRE. This means giving every engineer full access to metrics, logs, traces, and dashboards, as well as the ability to take action, such as rolling back a deployment or flipping the kill switch on a feature.

Beware of rolling your own stack and tools

The last few years have brought an overwhelming choice of infrastructure options. The initial excitement from the capabilities of some tools can unfortunately often turn into a lot of technical debt and maintenance headaches.

Managing your own Kubernetes, building your own Kafka cluster, implementing your own monitoring stack—all of these are almost impossible to get right when you have a very small SRE team. Instead, turn your attention to managed cloud services, which you won't have to maintain from the ground up.

It is also easy for SRE pros to fall into the trap of solving problems by writing their own tools, a trap that engineers of all types tend to fall into.

However, every line of code you own is a liability, not only because of the maintenance burden, but also because any progress on the tool will only happen if you have time to dedicate to it. Unless this homemade tool is a unique differentiator, your best choice is to turn to third-party products that fulfill the requirement.

Avoid technology spread

Similarly, the acquisition of an increasing number of tools, left unchecked, can turn your architecture options into a Swiss Army knife. Technology spread is not only a source of increased maintenance work for SRE, but it also adds to the mental overload of your engineering team.

For example, if your applications are built with five programming languages and four databases, any team building something new has 20 language-and-database combinations to choose from!

Development tooling also fits in this category. Think source control, monitoring, CI/CD pipelines, and so on. Having multiple options in any of these can become a source of confusion and poor consistency. It's okay to migrate from one tool to another. Just make sure you don't stop halfway.

Measure where your team is spending its time and effort

A core principle for a sustainable SRE team is keeping the amount of toil to a minimum. Google's SRE e-book defines "toil" as manual and repetitive tasks that aren't adding value in the long run. Unless it is automated away, toil will increase in proportion to the number of engineers, the size of an application, and the usage of the product, slowly becoming the only thing an SRE team does.

The counterbalance of toil work is engineering work, which yields long-lasting value. Examples include building a self-service deployment pipeline instead of creating SRE tickets to deploy applications, improving alerting rules to reduce the overload of interruptions from false positives, and implementing auto-scaling policies to allow systems to adapt to loads without manual intervention.

In a small SRE team, the balance between toil and engineering can very quickly tip the wrong way. Measuring this work ratio on a regular basis is key to understanding when the situation is about to become unsustainable.

If toil consistently accounts for more than 50% of the team's time, not only does it mean that more than half of SREs' work yields no lasting value, but it can also be a source of demotivation and burnout for team members stuck with manual and repetitive tasks.

Any work-management software makes this balance very easy to report on via tags. Review the biggest components of unplanned work often, and question their actual value and whether automation work should be prioritized to reduce their ongoing operational burden on your team.
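As a sketch of what such a report could look like, the snippet below computes the toil ratio and ranks the largest sources of toil from a list of tagged work items. The `WorkItem` fields and the "toil"/"engineering" tag names are assumptions; in practice the data would come from an export of your own work-management tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str
    hours: float
    tag: str  # hypothetical convention: "toil" or "engineering"


def toil_ratio(items: list[WorkItem]) -> float:
    """Fraction of logged hours that were tagged as toil."""
    total = sum(item.hours for item in items)
    if total == 0:
        return 0.0
    return sum(item.hours for item in items if item.tag == "toil") / total


def biggest_toil_sources(items: list[WorkItem]) -> list[tuple[str, float]]:
    """Toil hours grouped by title, largest first: candidates for automation."""
    hours_by_title = Counter()
    for item in items:
        if item.tag == "toil":
            hours_by_title[item.title] += item.hours
    return hours_by_title.most_common()


if __name__ == "__main__":
    week = [
        WorkItem("Manual deploy", 6, "toil"),
        WorkItem("Build deployment pipeline", 12, "engineering"),
        WorkItem("Manual deploy", 4, "toil"),
        WorkItem("Renew certificate", 2, "toil"),
    ]
    print(f"Toil ratio: {toil_ratio(week):.0%}")
    for title, hours in biggest_toil_sources(week):
        print(f"{title}: {hours} hours")
```

Running a report like this every week or every sprint shows the trend over time, which matters more than any single reading.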

Put structure around unplanned work

Despite trying to get engineering teams to be as independent as possible, they still occasionally require help or guidance. These interruptions can take the form of Slack messages, Jira tickets, or even walking over to the SRE team's offices to ask a question.

Instead of having the full team jump to answer everything, designate one member of your team whose main responsibility is to handle this reactive work. Rotate in a different person every week. In this way, the rest of your team can remain focused on engineering work, without constantly context-switching.
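A weekly rotation like this is simple enough to compute rather than maintain by hand. The sketch below assumes a fixed, hypothetical rotation start date (a Monday) and placeholder team member names, and picks the person on interruption duty for any given day.

```python
from datetime import date

# Hypothetical start of the rotation; January 6, 2020 was a Monday.
ROTATION_START = date(2020, 1, 6)


def on_interruption_duty(team: list[str], day: date) -> str:
    """Return the team member handling reactive work in the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return team[weeks_elapsed % len(team)]


if __name__ == "__main__":
    team = ["Ana", "Ben", "Chloe"]  # placeholder names
    print(on_interruption_duty(team, date.today()))
```

Publishing the result somewhere visible, such as a chat channel topic, tells the rest of engineering whom to ask and lets everyone else on the team stay focused.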

This is also a valuable learning opportunity for recent joiners on the team, since they will be exposed to more areas of the SRE sphere of work. When the SREs on "interruption duty" aren't responding to requests, they can still participate in toil-reducing work by addressing lower-priority technical debt items, such as cleaning up noisy alerts, reviewing postmortems from older incidents to ensure that action items were completed, and so forth.

Spread on-call responsibilities

The best monitoring system won't help with incident response if engineers get alerted only during working hours. It's typical for SRE teams to share off-hours responsibilities by having a weekly rotation. You should have two people on call: a primary (receiving alerts) and a secondary (in case primary is unable to respond in time).

For teams with only two or three engineers, however, that's unsustainable, since individual engineers would need to wear pagers and be on call for several consecutive weeks. This can lead to on-call fatigue and carries a high risk of burning out your engineers. In this situation consider democratizing the on-call work by including developers and test engineers in your rotation.
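The arithmetic behind this is worth spelling out. With two people on call each week, everyone in a pool of n is on call for 2/n of the weeks. The sketch below builds a simple schedule in which each week's secondary becomes the next week's primary (one possible scheme, assumed here for illustration) and counts how often one person is on call. The names are placeholders.

```python
def on_call_schedule(pool: list[str], weeks: int) -> list[tuple[str, str]]:
    """Return (primary, secondary) for each week.

    The secondary of one week becomes the primary of the next.
    """
    if len(pool) < 2:
        raise ValueError("need at least two people to fill primary and secondary")
    return [
        (pool[week % len(pool)], pool[(week + 1) % len(pool)])
        for week in range(weeks)
    ]


def weeks_on_call(schedule: list[tuple[str, str]], person: str) -> int:
    """Number of weeks in which `person` is either primary or secondary."""
    return sum(person in pair for pair in schedule)


if __name__ == "__main__":
    sres = ["sre-1", "sre-2", "sre-3"]
    wider_pool = sres + ["dev-1", "dev-2", "tester-1"]
    for pool in (sres, wider_pool):
        schedule = on_call_schedule(pool, weeks=12)
        print(len(pool), "people:", weeks_on_call(schedule, "sre-1"), "of 12 weeks on call")
```

With three SREs, each person is on call for eight weeks out of twelve; doubling the pool with developers and testers halves that to four.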

Involving developers and testers in on-call rotations has several benefits. You're forced to take SRE "tribal knowledge" about how to deal with incidents, document it, and make the responses repeatable. That means that every alert that triggers a pager must link to a clear runbook to ensure that the response will be consistent.
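One way to enforce the runbook rule is an automated check over your alert definitions. The sketch below assumes alerts are available as dictionaries with hypothetical `name`, `pages`, and `runbook_url` keys; the actual format depends on your monitoring tool, and the URL is a placeholder.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of paging alerts that have no runbook link."""
    return [
        alert["name"]
        for alert in alerts
        if alert.get("pages", False)
        and not (alert.get("runbook_url") or "").strip()
    ]


if __name__ == "__main__":
    alerts = [
        {"name": "API latency high", "pages": True,
         "runbook_url": "https://wiki.example.com/runbooks/api-latency"},
        {"name": "Database disk almost full", "pages": True, "runbook_url": ""},
        {"name": "Nightly batch job slow", "pages": False},
    ]
    print(alerts_missing_runbook(alerts))
```

Run as part of the pipeline that deploys alert definitions, a check like this stops undocumented pages from reaching the people on call.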

On-call developers and testers also have more context about application-related issues. And as Chris Ann O'Dell puts it in her talk, "You build it, you run it: Why developers should be on call," it is also a great driver for prioritizing bug fixes and better testing strategies so you can avoid waking people up at night.

Your small team can yield big results

The essence of SRE is to help build solutions that have lasting value and that will work at scale. A well-thought-out software delivery platform will work for 10 engineers or 200, as long as you keep aiming for zero SRE intervention between the developer committing code and seeing the change have an effect in production.

Unfortunately, such a target often feels out of reach when the most-needed improvements stay in the backlog as you keep reacting to issues that block you from making real, lasting progress.

This is even more pronounced for small teams, which is why every decision you make about adopting new technology, building a custom tool, or "doing the quick thing and fixing it later" needs to take into account the recurring toil it will add to your team's workload.

Unplanned work and toil remain a reality of SRE work, but it should be second nature for your team to want to reduce them. So put structure around this work, track it and understand it, and keep it from becoming everything your team does.

Want to know more? During my Agile + DevOps Virtual conference session, I'll offer additional practical tips on how to maximize the value that very small SRE teams can deliver. The conference runs June 8-11, 2020.
