Enterprise-scale ChatOps: How to get it right

Anders Wallgren, CTO, Electric Cloud

If you're planning to implement ChatOps across your enterprise, make sure the automation it will depend on gets established first. Make that automation general and shareable, and be sure that your team has implemented it with discipline and security in mind.

Once you've established general-purpose automation as a multidimensional, company-wide resource, you'll see several key benefits. The technical advantages accrue in many areas, including your DevOps pipeline. By implementing ChatOps in this fashion, you'll also help promote company culture across business functions.

Here's how those benefits are possible. 

When you’re fighting fires ...

A 2016 survey from VictorOps focused on "the people who manage the systems and solve the problems," suggesting that when you're fighting fires, log files and a chat platform are your top tools. Tools such as email were at the bottom of the list.

The kind of instant messaging you get with a chat platform is key. It's a way for people to connect when they're not in the same location. Plus it's fast and efficient, and it lends itself to sharing both in the moment and as part of the historical record.

GitHub originated the term ChatOps—"put the tool in the middle of the conversation"—meaning that the chat tool can reach into the conversation with automated messages based on alerts and technical detail. It can tell you what's going on in your systems while allowing a shared conversation around the facts.

In DevOps, communication, openness, and learning from mistakes in a blameless way are all important to success. But after the crisis is over and you've put out the fire, what's interesting about ChatOps is its persistence. You can refer back not only to the conversation, but also to the bot messages captured in the chat. You have a record of what you did and what you tried, what worked and what didn't.

ChatOps helps document and automate best practices

One of the problems organizations, especially larger ones, suffer from is silos. Traditional management techniques have always been hierarchical because, as the saying goes, management likes having one throat to choke.

But culturally, this is not the right way to look at things. The more constructive saying would be to walk a mile in someone else's shoes. That approach is not hierarchical, but flat, as an ideal team structure should be.

But there's a challenge in getting to that ideal approach across an entire company. If I'm the database administrator, you're the network admin, and we're trying to understand why traffic isn’t flowing or why some results are taking a long time to filter through, we can get together and hash things out. But everyone else across the organization who has a stake in those problems—and the resolutions—misses out.

ChatOps fixes that, in several ways. Here's how:

It promotes good behavior

ChatOps tools not only allow for persistence with communications, problem solving, and analysis, but also put everyone involved on their best behavior because the communication is public, at least in the company context. This social aspect of ChatOps helps keep communication civil and constructive.

Also, it's always helpful to understand how people whom you don't necessarily work with every day do their jobs. Sharing knowledge leads to sharing tools that one team didn't know another had, for instance, or sharing some technique that might accomplish a goal faster and more efficiently.

Yes, you can share in lots of ways (over a beer, for instance). But too often in that context you're complaining about a problem. When the conversation happens in the open, you get shared visibility, collaboration, learning, and easier onboarding for new team members. That auditability, so to speak, is what makes ChatOps so valuable.

It makes systems self-healing

You can make ChatOps even more valuable in the enterprise by extending the ChatOps environment from an alerting mechanism to a self-healing one. For example, the first things you try in response to a problem are often the same every time: "Let's just reboot and see what happens."

There’s no reason your automation infrastructure can't take these obvious steps for you and tell you what happened. If that doesn't help, then the automation system can raise a virtual red flag and alert humans to come help. With this approach, you can immediately escalate the issue to people with more specialized skills, knowing that the most common remedies have been tried.
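
As a rough illustration of that try-then-escalate flow, here is a minimal Python sketch. Everything in it is hypothetical: `run_self_healing`, the remedy names, the `is_healthy` check, and the `post` callback (which stands in for whatever sends a message to your chat room) are assumptions for the example, not part of any particular ChatOps product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RemediationResult:
    resolved: bool
    transcript: List[str] = field(default_factory=list)


def run_self_healing(
    remedies: List[Tuple[str, Callable[[], None]]],
    is_healthy: Callable[[], bool],
    post: Callable[[str], None],
) -> RemediationResult:
    """Try each common remedy in order, narrating every step to chat.

    `post` is a hypothetical stand-in for the chat integration. If no
    remedy restores health, the last message asks humans to step in
    and lists what has already been tried.
    """
    transcript: List[str] = []

    def say(message: str) -> None:
        transcript.append(message)
        post(message)

    for name, action in remedies:
        say(f"Trying remedy: {name}")
        try:
            action()
        except Exception as exc:  # a failed remedy should not stop the run
            say(f"Remedy {name} failed: {exc}")
            continue
        if is_healthy():
            say(f"Resolved by: {name}")
            return RemediationResult(True, transcript)
        say(f"Still unhealthy after: {name}")

    tried = ", ".join(name for name, _ in remedies)
    say(f"Escalating to humans; remedies already tried: {tried}")
    return RemediationResult(False, transcript)
```

Because every step goes through `post`, the room ends up with the same record whether the automation fixed the problem or handed it off, so the people who get paged can see what has already been tried.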

It helps with knowledge transfer and onboarding

ChatOps also speeds up the onboarding of new employees, in the sense that ways of work are made clear and specific techniques for handling legacy and quirky systems get documented. If you're new to a team, ChatOps can give you a crash course in how things work. It reveals, and can help reinforce, your company culture. 

How often have you been part of a big deployment when something goes wrong, then someone on the team goes away for a time and then returns saying, "Okay, problem resolved!" And you move on. But no one knows what that person did or what exactly was broken. Did that person do a root-cause analysis to make sure it doesn't break again? Or did he or she just patch something?

By contrast, when you can bring repair expertise inside the chat space, everyone gets to see what others do to fix a problem. If I'm watching you fix something, I'll know more about how to fix it myself next time.

ChatOps can give stakeholders access to the information they need

By using ChatOps, a CIO can make sure that everyone has access to the types of data needed to accomplish work—in other words, ensure that all people who work on problem resolution have access to the systems that are relevant. On a recent two-hour conference call, there were three times when my team asked a person, "Can you look at this log?" and the person said, "No, I don’t have access."

The person had to ask someone where the log resides and then request access to the data, because of a lack of permission to access that particular system. Ugh.

For an executive to make that sort of data available doesn't always mean opening up systems to anybody just to look at log files. With automation, you can collect that data through other means, whether that's through a data flow as part of the deployment or part of some operational system that uses a tool to pull those log files in.

Access to data can be critical when you're troubleshooting, and it's insane when it takes an hour to obtain a piece of data that takes you 10 seconds to act on and fix. You need automation to fetch that data for you.

Machine output can help resolve many issues, without requiring decision makers to rethink all of the governance and security they've developed for regulatory reasons. In many cases, you just need to look through a keyhole to see a specific issue; you don't need the key to the entire room.

Plus, security teams want to be able to say to auditors that no one has access to highly secure systems. They don’t want to say, "Oh, 400 people have access, because we have to allow IT Ops, backup, recovery, and other teams permission in case there's a malfunction of some sort." That doesn't go over well with regulatory agencies.
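
To make the keyhole idea concrete, here is a minimal Python sketch of automation that fetches a narrow slice of a log on someone's behalf. The function name `fetch_log_excerpt`, the allowlist of exposed logs, and the redaction pattern are all assumptions made for illustration; a real setup would follow whatever your security team has approved.

```python
import re
from collections import deque
from pathlib import Path
from typing import Dict, List

# Hypothetical redaction rule: hide values of obviously sensitive keys.
SECRET_PATTERN = re.compile(r"(password|token|secret)=\S+", re.IGNORECASE)


def fetch_log_excerpt(
    allowed_logs: Dict[str, Path],
    log_name: str,
    pattern: str,
    max_lines: int = 20,
) -> List[str]:
    """Return the last `max_lines` lines of an allowlisted log that match `pattern`.

    The caller never gets a login on the system that holds the log;
    only the automation reads the file, and only logs named in
    `allowed_logs` can be requested.
    """
    if log_name not in allowed_logs:
        raise PermissionError(f"log '{log_name}' is not exposed to chat")

    matcher = re.compile(pattern)
    tail = deque(maxlen=max_lines)  # keeps only the most recent matches
    with allowed_logs[log_name].open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if matcher.search(line):
                redacted = SECRET_PATTERN.sub(
                    lambda m: f"{m.group(1)}=[REDACTED]", line.rstrip("\n")
                )
                tail.append(redacted)
    return list(tail)
```

A chat command could call this and paste the result into the room, which gives the troubleshooter the ten seconds of data they need without adding another name to the list of people with access.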

Optimize the signal, reduce the noise

Daniel Perez, services tool engineer at GitHub, pointed out in a recent DevOps Enterprise Summit presentation that you need many rooms for ChatOps, not just one. You have to figure out how to compartmentalize. There's the database, the network, the cloud. Should every message be broadcast to the full ChatOps lobby? If I need to reboot server XYZ, that isn't necessarily a topic everyone cares about.

You don't have to predetermine what all these different rooms will be, but you want the ability to segment at some point in your ChatOps evolution. This will reduce the noise for everybody else—the ones who don't need to get involved. Sure, you might want to start a conversation in the global chat environment, noting that something's broken. But then you want to be able to move to more focused chat rooms where the specific problem gets resolved.
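
One way to picture that segmentation is a small routing rule that decides which rooms an alert lands in. This Python sketch is only an assumption about how you might do it: the room names, the `component` and `severity` fields, and the rule that critical alerts are announced in the lobby are all invented for the example.

```python
from typing import Dict, List, Optional

# Hypothetical room layout; add rooms as your ChatOps practice evolves.
ROOM_BY_COMPONENT: Dict[str, str] = {
    "database": "#ops-database",
    "network": "#ops-network",
    "cloud": "#ops-cloud",
}
LOBBY = "#ops-lobby"


def rooms_for_alert(
    component: str,
    severity: str,
    routes: Optional[Dict[str, str]] = None,
    lobby: str = LOBBY,
) -> List[str]:
    """Pick the chat rooms that should see an alert.

    Alerts for a known component go to its focused room. Critical
    alerts are also announced in the lobby so the conversation can
    start there and then move. Anything unrecognized falls back to
    the lobby so it is never silently dropped.
    """
    routes = ROOM_BY_COMPONENT if routes is None else routes
    focused = routes.get(component)
    if focused is None:
        return [lobby]
    if severity == "critical":
        return [lobby, focused]
    return [focused]
```

With these assumed rules, a routine reboot of a database server shows up only in the database room, while a critical network failure is announced in the lobby and worked in the network room.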

You also don't want to spam your ChatOps environment with things that are already known and acknowledged. Visibility is great. But ubiquitous visibility is the equivalent of everybody talking at the same time, and no one can hear anything.

This is analogous to the loosely coupled architecture of microservices. We're all shipping one thing, but we're breaking into teams, perhaps using different tool chains for different services. But there's a contract for how all the services fit together. Remember Conway's Law: The communications structure of an organization influences the shape of its output.
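
Keeping known and acknowledged issues out of the room can also be automated. The following Python sketch shows one possible approach, assuming each alert carries a stable key and that a five-minute quiet period is acceptable; the class name `AlertSuppressor` and its methods are hypothetical.

```python
from typing import Dict, Set


class AlertSuppressor:
    """Decide whether an alert is worth posting to chat again.

    An alert is skipped if someone has acknowledged it, or if the same
    alert key was posted within the last `quiet_seconds`.
    """

    def __init__(self, quiet_seconds: float = 300.0) -> None:
        self.quiet_seconds = quiet_seconds
        self._last_posted: Dict[str, float] = {}
        self._acknowledged: Set[str] = set()

    def acknowledge(self, key: str) -> None:
        """Mark an alert as known so repeats stay out of the room."""
        self._acknowledged.add(key)

    def resolve(self, key: str) -> None:
        """Forget an alert so a future recurrence is posted again."""
        self._acknowledged.discard(key)
        self._last_posted.pop(key, None)

    def should_post(self, key: str, now: float) -> bool:
        """Return True if the alert should be posted at time `now` (seconds)."""
        if key in self._acknowledged:
            return False
        last = self._last_posted.get(key)
        if last is not None and now - last < self.quiet_seconds:
            return False
        self._last_posted[key] = now
        return True
```

Passing the timestamp in as `now`, rather than reading the clock inside the class, is a deliberate choice here: it keeps the behavior easy to test and to replay against an old chat transcript.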

ChatOps furthers DevOps

A large, core part of DevOps success is based on communication, sharing, visibility, openness, and blameless retrospectives—after mitigation, of course. Problem solving and root-cause analysis help you put the fire out, but you need to know how the fire got started. ChatOps is an awesome tool for providing this retrospective.

You can just look at the transcript. It not only takes much of the grunt work out of a retrospective, but also removes some of the more contentious things from root-cause analysis or post-incident review. You don't have to defend yourself: You can show what you did and tried, and someone else can see the steps and learn what worked versus what didn't. Which means ChatOps is a great vehicle for doing DevOps well.

More context yields a stronger culture

ChatOps is a way to build a record of your organization's cultural evolution. The first place I go for the most up-to-date information is, frankly, never our wiki. It's the chat thread. It will be the same information that I would eventually paste into the wiki, but in the wiki there's less context. The chat tool offers the power of context. The way you did your job should be transparent.

The chat transcript becomes a source of reference (call it "data truth") as to what happened, and how. ChatOps becomes the equivalent of black-box data and voice recorders as we figure out our successes and failures in this experimental world of IT. With it, we see exactly what happened, and we can base our practices and policy decisions on real results.

See Anders Wallgren's presentation at DevOps Enterprise Summit London on June 25-26, where he'll be speaking with representatives from Somos about that company's digital transformation.