Why you should burn your IT ticketing system

Dave Mangot Principal, Mangoteque
 

Ticketing systems are one of the most common obstacles that hold IT operations teams back from becoming high-performing, which is essential to digital transformation.

Most people who have held an operations-type job are familiar with ticketing systems: They record the requests for the IT Ops team so that everyone involved in the process has visibility into when an item was requested, who requested it, who is performing the task, how long it took to be completed, etc.

These all sound like good things to track, and they tell you all kinds of things about the work that is happening. But when I see this pattern in an organization, here's my advice:

Burn it down 

"Surely," you might think, "Dave's not advocating actually burning the ticketing system to the ground?" Yes, that is precisely what I'm suggesting. But it's important to be perfectly clear about what I mean by "the ticketing system" as opposed to other kinds of work tracking systems.

I’m not talking about Scrum walls or Kanban boards here. In those types of work-tracking systems, the teams are empowered to decide which work is most important, and then help prioritize the work appropriately for the best possible outcome. The systems I’m referring to are the ones where someone makes a request for a Jenkins host, a new AWS EC2 instance, or a host to be added to the load balancer: the type of work that is necessary to operate a production service but that falls to IT operations (or a DevOps engineer, site reliability engineer, production engineer, etc.) to fulfill.

There are several problems with this approach that I’ll dive into during my talk at DevOps Enterprise Summit Las Vegas Virtual. First, this ticketing system requires waiting, and waiting like this should be classified as waste, as it is in lean. Second, the tickets in the system should be viewed as software exceptions being thrown by the value stream. However, it’s the third problem I want to address here: toil. If your organization is designed in such a way that the value stream flows through humans operating on a ticket, then your ticketing system is designed to track toil.

What is toil?

In Chapter 5 of the Google SRE book, “Eliminating Toil,” Vivek Rau writes: “Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.” There is a lot in there to unpack, but here are the key concepts from Rau that demonstrate how a ticketing system is designed to track toil.

Manual, repetitive, automatable

If this work sits and waits until a human comes and picks it up off the queue before it can be executed, then it is by definition manual. Even if some automation executes once the work is picked up, a human is still executing the work. The fact that this work arrives as a ticket, again and again, also shows that it is repetitive. Whether something is truly automatable is a matter of debate, but if it’s something that happens repeatedly, isn’t that worth examining more closely as a candidate for automation?

Scales linearly as the service grows

On this point, I disagree with Rau. I believe that toil scales superlinearly as a service grows. Why? Because of the coordination costs involved in executing many of these tasks: The more of this type of work you try to push through the system, the worse the system will perform. Ultimately, it will collapse under its own weight. There are only so many minutes in a day that you can spend waiting for someone else’s task to finish first, for the change advisory board to approve changes to the environments, or for enough time to pass between changes that, if there is an outage, it is clear which change caused the problem.
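To make the coordination-cost argument concrete, here is a toy model. It is only a sketch under stated assumptions: the idea that every pair of in-flight tasks costs a fixed amount of waiting, the `total_effort` function, and the 0.05-hour figure are all illustrative and do not come from Rau or from any measurement.

```python
def total_effort(tasks: int,
                 hours_per_task: float = 1.0,
                 coordination_hours: float = 0.05) -> float:
    """Return direct work plus time lost to coordination, in hours.

    Assumption (hypothetical): every pair of in-flight tasks costs a fixed
    amount of waiting, e.g. for approvals or for another change to finish.
    """
    direct = tasks * hours_per_task
    pairs = tasks * (tasks - 1) / 2  # number of task pairs that may collide
    return direct + pairs * coordination_hours


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"{n} tasks -> {total_effort(n):.2f} hours")
```

In this model, doubling the number of tasks more than doubles the total hours, because the coordination term grows with the square of the task count while the direct work grows only linearly.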

This is a problem as the organization grows. The more successful the organization is, the more toil the system is likely to need to process. If you’re starting with a responsive, well-liked process and you’d like to keep up the same level of service to your organization, then you will need to hire more humans to execute on this toil. Your hiring will scale linearly with your toil, and the system will still slow down. To what end? The best one can hope for in this case is to constantly tread water and be stuck exactly where you are. It’s no wonder that big companies can have such difficulty executing compared to their smaller, more nimble competitors.

Transformation

Clearly the path of continually processing more toil is unsustainable. But what are the alternatives? How do you start to eliminate toil? As with all transformation initiatives, change must be deliberate. When working with companies on this problem, I follow Dominica DeGrandis’s advice and make work visible. You need to examine the type of work that is being done and characterize it. How you characterize the work may be different for different organizations. Maybe you characterize by time, by value, or by cardinality. Regardless of your scheme, you are looking for work that can be eliminated, either because it’s not needed or because it’s work that you can empower teams to perform for themselves.
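One possible way to start characterizing the work is to export closed tickets and total them by category. The sketch below assumes a hypothetical `Ticket` record with a category and an hours figure; the field names and the sample data are made up for illustration, and a real ticketing system's export will look different.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Ticket(NamedTuple):
    category: str  # hypothetical label, e.g. "new EC2 instance"
    hours: float   # hands-on time plus waiting time


def summarize(tickets: Iterable[Ticket]) -> List[Tuple[str, int, float]]:
    """Return (category, ticket count, total hours), largest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for ticket in tickets:
        totals[ticket.category][0] += 1             # cardinality
        totals[ticket.category][1] += ticket.hours  # time
    rows = [(cat, count, hours) for cat, (count, hours) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up sample data, for illustration only.
SAMPLE = [
    Ticket("new EC2 instance", 2.0),
    Ticket("add host to load balancer", 0.5),
    Ticket("new EC2 instance", 3.0),
    Ticket("add host to load balancer", 0.5),
    Ticket("new Jenkins host", 4.0),
]

if __name__ == "__main__":
    for category, count, hours in summarize(SAMPLE):
        print(f"{category}: {count} tickets, {hours} hours")
```

Categories at the top of the list, by count or by total hours, are the first candidates to eliminate or to hand over to teams as self-service.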

If an autoscaling group needs to be incremented from three nodes to four, perhaps your teams could submit a pull request to be reviewed and applied by IT operations. Even better, if the pull request is deemed safe by a script and has been code-reviewed by another member of the team before the request is submitted, perhaps it could be applied automatically. Even better, maybe there is a way to reexamine the algorithm by which you size your autoscaling groups so that the group will autoscale itself appropriately based on some criteria.
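As a rough illustration of a pull request being "deemed safe by a script," the sketch below checks a requested size change against a couple of guardrails. Everything here is an assumption made for the example: the `ScalingChange` record, the `can_auto_apply` function, and the `MAX_STEP` and `MAX_SIZE` limits are hypothetical, and a real check would parse the actual infrastructure code and apply whatever policy your organization agrees on.

```python
from dataclasses import dataclass

# Hypothetical guardrails; pick values that match your own risk tolerance.
MAX_STEP = 2   # largest size change allowed without a human from IT Ops
MAX_SIZE = 10  # largest group size allowed without a human from IT Ops


@dataclass(frozen=True)
class ScalingChange:
    group: str
    current_size: int
    requested_size: int
    peer_reviewed: bool  # a teammate has already code-reviewed the change


def can_auto_apply(change: ScalingChange) -> bool:
    """Return True if the change is small, in bounds, and peer-reviewed."""
    step = abs(change.requested_size - change.current_size)
    return (
        change.peer_reviewed
        and 0 < step <= MAX_STEP
        and 1 <= change.requested_size <= MAX_SIZE
    )


if __name__ == "__main__":
    change = ScalingChange("web", current_size=3, requested_size=4,
                           peer_reviewed=True)
    print(can_auto_apply(change))
```

A change that fails the check is not rejected outright; it simply falls back to review by IT operations, so the script only removes humans from the cases everyone already agrees are routine.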

You do not, however, want to create or perpetuate a system where the ops person must check out the infrastructure code, create a new branch, make the 1-byte change, and so on, all within the context of a ticketing system. In as many cases as possible, you want IT Ops teams to empower the organization by providing a self-service platform in which those teams can make changes themselves in a safe manner. This is precisely the service that cloud providers offer to their customers.

Now I’m overstaffed

If you’d hired more engineers to keep up with all that toil, does that mean you’re going to experience a massive round of layoffs? Certainly not. I’ve worked with companies on taking staff members who had been working on ticketing toil and having them focus specifically on the empowerment component of the problem. Their job becomes, quite literally, to eliminate toil to the greatest extent possible.

You can also take some of the more junior engineers and place them on your development teams to make truly cross-functional teams that have the capability to move as quickly as the organization will let them, since they will have all the skills necessary and won’t have to wait. Those engineers can also provide valuable feedback to your “toil eliminators” because they understand the problems their teams face at a deep level, as well as operational concerns, and are able to speak that language back to the eliminator teams. This is a great way to produce high-fidelity feedback while giving your engineers valuable experience working closely with development teams to understand their needs.

There is no Nirvana

Are you ever going to be able to eliminate toil entirely? Of course not. Even Google’s esteemed SREs are only tasked with keeping their toil levels below 50%. However, you need to be deliberate about designing your systems to eliminate toil, so that your engineers can quickly and safely deliver high-quality products to your customers. By actively working to eliminate processing toil as a way of completing work, even with the same number of people, you’ll make your ops teams and developers vastly more effective.

Want to know more? In my talk, "My Ops Team Can't Keep Up with My Dev Team: Creating Strategic Differentiation in Ops," at DevOps Enterprise Summit Las Vegas Virtual, I’ll focus on what a high-performing ops team looks like. I'll also talk about obstacles I’ve seen that hold ops teams back from becoming high-performing, including the ticketing system. The conference runs October 13-15, 2020.
