4 attributes of a great site reliability engineer

Jayne Groll CEO, DevOps Institute

The discipline of site reliability engineering (SRE) continues to spread throughout ops teams and IT organizations worldwide, opening up new career pathways for developers as well as IT Ops and ITSM professionals. According to the Upskilling 2021: Enterprise DevOps Skills Report, some 22% of the 2,000-plus respondents have adopted SRE, compared to 15% in the previous year.

This growth is occurring because SRE, the Google-created practice of taking software engineering concepts and applying them to IT operations, is a much-needed way to build upon automation and reduce toil for teams and engineers. In fact, 47% of the report's respondents agreed that SRE skills are a must-have in the process and framework domains, up from 28% in 2020.

Technology leaders in both dev and ops teams recognize the importance of SRE, as employment demand continues to grow for this role in IT. This is especially true as new work patterns have emerged over the past year and as new IT operating models are creating competitive advantages for both startups and large enterprises.

Sourcing SRE talent and skill sets is a high priority for IT organizations, but what key characteristics should hiring managers look for in a great SRE?

As a preview to the upcoming SRE-focused SKILup Day conference, I asked several DevOps Institute ambassadors and SRE subject-matter experts to weigh in on what makes for a great SRE. Here are the key attributes they suggested.

1. Problem solving 

The first step toward solving any problem is recognizing that there is one. As an established role in IT organizations, a site reliability engineer's primary responsibility is to help solve problems that inhibit value delivery. Lisa Chan, head of software engineering and DevOps at Petronas, said, "If something is holding up the value chain, it doesn't matter if it isn't the SRE’s job. They will volunteer to help solve the problem if they know how. An SRE should be curious, even nosy about asking people how things work, or what they are doing, or freely dispensing advice to colleagues outside their formal sphere of influence."

Because an SRE works directly with developers to create continuous feedback between them and other business units, especially IT Ops teams, the SRE must have a big-picture perspective. Andre Almar, co-founder and technical trainer at DevOps Bootcamp, said, "A great SRE is a good problem solver who must have effective communication skills with the ability to think outside the box."

2. Awareness building

One of an SRE's biggest challenges is to help increase flow and reliability through change management. In SRE, a team is given an error budget representing the gap between 100% service reliability and the agreed-to service-level objectives (SLOs).

While the team is expected to regulate its own workload, there are understood policies and consequences that govern what happens if an error budget is spent or service levels are breached. Since error budgets are meant to be spent, the team can make autonomous decisions to increase flow.
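To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It assumes a request-based SLO and a 30-day window. The function names, the 99.9% target, and the request counts are illustrative assumptions, not figures from the report or from any particular team's policy.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Allowed unreliability: the gap between 100% and the SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Translate the budget into minutes of full outage over the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = error_budget_fraction(slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    slo = 0.999  # hypothetical SLO: 99.9% of requests succeed
    print(f"Allowed downtime over 30 days: {allowed_downtime_minutes(slo):.1f} minutes")
    print(f"Budget remaining: {budget_remaining(slo, 1_000_000, 250):.0%}")
```

With a 99.9% target, the budget works out to roughly 43 minutes of full outage per 30 days. A team that has used a quarter of its allowed failures still has about three quarters of its budget to spend on releases and experiments.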

Part of SRE is determining how to create awareness of the outcomes from the decision-making process and then spread this feedback loop across the organization.

Marc Hornbeek, author of Engineering DevOps, and CEO and principal consultant at Engineering DevOps Consulting, said, "A great SRE has a keen awareness of the big picture of how customers use services and how the services operate over the entire production environment. They like to set clear, measurable goals that the entire team is made aware of and that reflect customer needs. Not only do they combine proficient process and automation skills, but they also are familiar with tools that can help automate production tasks. They are also opinionated about production perspectives, yet open to new ways to get work done so long as the work will benefit production environments that reflect the customer's interests."

3. Collaboration

Any approach to service management is meant to enable the marriage of people, process, and technology. Loyalty to one particular framework or set of guidance can be problematic.

At the end of the day, IT teams need to manage services through a set of operational practices. Developers will need to learn about this, just as IT operations people need to learn how to code. A strong SRE takes input from many different sources to bring the best possible solutions to the table.

Shivagami Gugan, chief technology officer at CX Tech Unicorn, said, "SREs are huge collaborators. They are people who are goal-driven and can work on multiple aspects of product resilience. This makes them multi-skilled, with an extensive growth mindset. They should have the ability to get into details quickly, think on their feet, and be brave as they move toward problem resolution. When any mistakes happen—and they always do—SREs can blamelessly look at the situation, which again makes them very cool-headed and collaborative."

4. Empathy

For SRE teams to work most effectively, they require codes of conduct that build upon psychologically safe environments, blameless post-mortems, and more. There is no room for a blame culture in any self-regulating system such as SRE, especially if the organization needs to be agile enough to meet the demands of the system.

Stephen Walters, solution architect at xMatters, said this is all about empathy. "Having empathy means that a great SRE understands the challenges that the organization faces across all aspects of the enterprise. This means that an SRE can be more predictive, pre-emptive, and proactive to the challenges that an organization faces before they are personally called to action. It also means that a great SRE is more inclined to be inclusive of quality and adaptive to continuous improvement. This assists in ensuring that the organization is a 'safe space' for experimentation and innovation."

Technical and process skills are the key

If you’re trying to find the right person to fill an SRE role within a team or organization, first identify someone who demonstrates the key attributes listed above, with a flair for teamwork. A great SRE excels at the human (soft) skills that embody a team player in every sense, which make them a connector of people and ideas.

They make great support engineers and typically have a unique view of operations to help manage availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning, and more. These technical and process-oriented skills are key aspects of any SRE role, but a great SRE reinforces the need to invest in and update humans with the same enthusiasm as we do technology and automation.

Want to learn more and join the SRE discussion? Connect with DevOps Institute ambassadors and speakers during SKILup Day: Site Reliability Engineering on May 20, 2021. 
