How remote IT teams can flip the process on incident resolution

People working in IT support and incident management right now face unusual difficulties supporting large remote workforces and managing unpredictable workloads. On Reddit, systems admins and other IT professionals are bemoaning the hiccups and hassles of working in isolation during the pandemic while trying to resolve issues and meet demanding service-level agreements.

Among other problems, you can't go grab your indispensable subject-matter expert for troubleshooting, because that person is also home and inundated with messages and alerts.

Networks, VPNs, and servers are being hammered right now due to higher usage of SaaS apps and collaboration tools. User expectations are even higher than normal: IT organizations are expected to make applications and services work flawlessly, and to do so overnight.

Undoubtedly, IT pros are working assiduously to keep the wheels moving for their employers during a time of crisis. Yet working longer hours isn't the answer. You need new ways of working, ones that may prove better than the status quo over the long term. And that involves turning the incident management process on its head.

Here's how to flip the process.

Problem first, metrics second

In many organizations, the process of incident management and resolution begins with people looking at screens full of metrics. You see a spike in something, such as network I/O, and immediately begin to investigate it, assuming the worst.

A worse scenario occurs when users call out problems on Twitter, Reddit, or other popular communities, creating a damaging PR situation that IT and the business must swiftly respond to and manage. This ad hoc, reactive fire drill takes time, and the issue it resolves may not be the most important item of the day.

In software development terms, starting with the metrics is akin to the bad practice of designing a system's architecture first and its user experience second.

With so much at stake, and with stay-at-home mandates likely to continue for the foreseeable future, site reliability engineers (SREs) and IT support technicians need to flip the incident management process.

Rather than beginning with the metrics and working backwards to uncover the problem, if there even is one, start by identifying the most important end-user problems to solve. Then discover the resources and components that support the affected end-user process, and, finally, analyze the most relevant metrics.

Here's an example:

A B-to-C provider of a streaming video service experiences a 200% increase in daytime traffic that results in sluggish load times and glitches.
IT has tagged the components beneath that service, so IT operators know exactly which virtual machines, databases, and load balancers come into play.
Next, the operations team can identify the most relevant metrics to measure, such as CPU idle time, and take steps to correct the problem.
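
The example above can be sketched in a few lines of Python. This is only an illustration of the problem-first order (service, then tagged components, then metrics), assuming a hypothetical tag inventory; the component names, the `video-streaming` tag, and the metric names are all invented here and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    kind: str  # e.g. "vm", "database", "load_balancer"
    tags: set = field(default_factory=set)


# Hypothetical inventory; in practice this would come from your tagging system.
INVENTORY = [
    Component("vm-stream-01", "vm", {"video-streaming"}),
    Component("db-catalog", "database", {"video-streaming", "billing"}),
    Component("lb-edge", "load_balancer", {"video-streaming"}),
    Component("vm-payroll-01", "vm", {"payroll"}),
]

# Assumed mapping from component kind to the metrics worth checking first.
METRICS_BY_KIND = {
    "vm": ["cpu_idle_pct", "memory_used_pct"],
    "database": ["query_latency_ms", "active_connections"],
    "load_balancer": ["requests_per_sec", "backend_errors"],
}


def metrics_for_service(service_tag, inventory=INVENTORY):
    """Start from the affected service, find its tagged components,
    and only then list the metrics to inspect."""
    plan = {}
    for component in inventory:
        if service_tag in component.tags:
            plan[component.name] = METRICS_BY_KIND.get(component.kind, [])
    return plan


plan = metrics_for_service("video-streaming")
for name, metrics in plan.items():
    print(f"{name}: {', '.join(metrics)}")
```

The point of the design is the direction of the lookup: nothing is measured until a service-level problem has picked out which components matter.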

Take this 5-step proactive approach

It sounds so simple—just flip the process—but it's not that easy. IT operators and SREs aren't accustomed to thinking in terms of which problems to solve or prevent. Being proactive or predictive hasn't been the common workflow.

Here is how IT operations and DevOps teams can take a proactive approach to managing the availability and health of enterprise services.

1. Coordinate across departments

Identifying the most critical problems to solve right now is the hardest step and will take the longest. Common problems these days are related to cost and scale: How do I right-size my environment to reduce costs or handle more workloads?

Broaden your field of inputs to narrow down to the specific issue, such as slow page loads on the financial reporting website. Product managers, account managers, and business unit leads who are closest to the customer experience can deliver feedback on the top issues affecting customer/user satisfaction.

Teams should also review recent support tickets to identify common pain points.
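
One lightweight way to surface recurring themes in tickets is a simple keyword count. The sketch below assumes a hypothetical ticket export and a hand-picked keyword list; both are made up for illustration, and a real review would likely need better text matching than substring checks.

```python
from collections import Counter

# Hypothetical ticket export: (ticket id, free-text summary).
TICKETS = [
    (101, "VPN drops every hour"),
    (102, "Slow page load on financial reporting site"),
    (103, "Cannot connect to VPN from home"),
    (104, "Financial reporting page load times out"),
    (105, "VPN client will not start"),
]

# Assumed theme keywords; a real list would come from the teams closest to users.
THEMES = {
    "vpn": ["vpn"],
    "slow_pages": ["slow", "times out", "page load"],
}


def count_themes(tickets, themes=THEMES):
    """Count how many tickets mention each theme at least once."""
    counts = Counter()
    for _ticket_id, summary in tickets:
        text = summary.lower()
        for theme, keywords in themes.items():
            if any(keyword in text for keyword in keywords):
                counts[theme] += 1
    return counts


theme_counts = count_themes(TICKETS)
for theme, count in theme_counts.most_common():
    print(theme, count)
```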

2. Fine-tune your approach to metrics

The Utilization, Saturation, and Errors (USE) Method is one way to approach the problem-first incident management process. As detailed by Brendan Gregg, a senior performance architect at Netflix, this methodology begins by posing questions and then works backwards to the metrics that answer them.

For each resource you want to measure, identify three metrics: one for utilization, one for saturation, and one for errors. "The USE Method has made you aware of what you didn't check: what were once unknown-unknowns are now known-unknowns," Gregg explained.
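
A USE checklist is easy to represent as data. The following is a minimal sketch, not Gregg's own tooling: it lists one utilization, one saturation, and one errors metric per resource, then reports which of them are not yet being collected. The metric names and the `collected_today` set are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseChecklist:
    resource: str
    utilization: str
    saturation: str
    errors: str


# Hypothetical metric names, chosen only to illustrate the three categories.
CHECKLISTS = [
    UseChecklist("cpu", "cpu_busy_pct", "cpu_run_queue_len", "cpu_error_count"),
    UseChecklist("disk", "disk_busy_pct", "disk_wait_queue_len", "disk_error_count"),
    UseChecklist("network", "nic_throughput_pct", "nic_dropped_packets", "nic_error_count"),
]


def unchecked(checklists, collected):
    """Return (resource, category, metric) tuples that nobody is measuring yet."""
    gaps = []
    for item in checklists:
        for category in ("utilization", "saturation", "errors"):
            metric = getattr(item, category)
            if metric not in collected:
                gaps.append((item.resource, category, metric))
    return gaps


# Assume these are the only metrics currently on a dashboard.
collected_today = {"cpu_busy_pct", "disk_busy_pct", "nic_throughput_pct"}
gaps = unchecked(CHECKLISTS, collected_today)
for resource, category, metric in gaps:
    print(f"{resource}: no {category} metric ({metric})")
```

In this made-up case only utilization is tracked, so the report lists the saturation and errors gaps for every resource, which is the "known-unknowns" effect the quote describes.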

3. Standardize workflows

Create a common process for incident management across all your teams. Without the advantage of having almost everyone in the same room to huddle together in an ad hoc fashion when big issues crop up, it's imperative to institute clear steps and roles.

Doing so will prevent the frustrations, confusion, and oversights that needlessly delay resolution. Since most incidents have multiple contributing factors, teams need to adopt a small number of user-friendly tools to document and organize the information.

At our company, we're now using tools such as Miro, an online whiteboard application, to replace our physical whiteboarding sessions. Of course, there are also Zoom, Slack, Jira, and a host of other cloud-based tools already in place at many organizations. Mandate which tools everyone should use, and provide some guidelines on how to use them.

4. Increase automation

In some organizations, scaling requirements in response to demand have increased tenfold. Automation is playing a critical role now; moving routine tasks out of a web GUI and into scripts, for example, scales better and aligns with configuration management tools such as Chef and Puppet.

For instance, user tickets can be auto-generated from emails and linked to code management systems such as GitHub. Modern development and operations teams are also expanding automation in unit testing and provisioning.
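
As a rough illustration of generating a ticket from an email, the sketch below uses Python's standard `email` package to turn a raw message into a ticket record. The ticket fields, the sample message, and the issue URL are assumptions made for this example; a real setup would post the record to whatever ticketing system is in use.

```python
from email import message_from_string
from email.utils import parseaddr

# Made-up sample message for illustration.
RAW_EMAIL = """\
From: Pat Example <pat@example.com>
Subject: Reporting dashboard is slow
Content-Type: text/plain

Pages on the financial reporting site load very slowly.
"""


def ticket_from_email(raw, linked_issue=None):
    """Build a ticket record (hypothetical schema) from a raw plain-text email."""
    message = message_from_string(raw)
    _name, address = parseaddr(message["From"])
    return {
        "reporter": address,
        "title": message["Subject"],
        "description": message.get_payload().strip(),
        "linked_issue": linked_issue,  # e.g. a GitHub issue URL, if one exists
    }


ticket = ticket_from_email(
    RAW_EMAIL,
    linked_issue="https://github.com/example-org/example-repo/issues/1",  # placeholder
)
for key, value in ticket.items():
    print(f"{key}: {value}")
```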

5. Watch for burnout

Whether there is simply more work or people are filling the hours of long quarantine days, many software engineers, testers, and architects are working longer days right now. Yet exhaustion and burnout can lead to errors and oversights, along with low morale.

It's up to managers to make sure that employees are taking breaks, working reasonable days, and having the time and energy to attend to personal needs.

Work it backwards

These days, with so many people working remotely, IT operations teams need new strategies to keep business operations running smoothly in a high-stress situation. That means flipping the incident management process: take a problem-first approach and work backward toward the relevant metrics to quickly uncover root causes.

Creating airtight communications and collaboration is the foundation for all of this. Provide the best tools possible to your teams, and deliver clear and frequent education on best practices to keep everyone on the same page and productive. And IT leaders should pave the way with encouragement, guidance, and moral support.
