How ChatOps can help you avoid a DevOps disaster
The scenario unfolds quickly. The production dashboard, displayed on a monitor in full view of the team, takes on a reddish hue. Graphs are flat-lining, and response times are through the roof. Team members’ smartphones start to buzz with WhatsApp, SMS, and email alerts.
It gets worse: Angry and frustrated customers start updating Twitter and Facebook with posts indicating that your site is down.
It doesn't have to be this way. DevOps not only changes how you develop and deliver software, it’s also changing the way you operate it. Of course, nobody’s perfect, and DevOps teams deliver their share of buggy changes to production. However, well-organized DevOps teams are able to identify and fix problems quickly, often within hours.
What happens, for example, when a DevOps team's worst fear comes true and an application or site goes down? Here's how and why communication with ChatOps, including automated communication with chatbots, is the key to a quick remediation.
Wake up teams before the nightmare begins
No one wants their customers to be the ones who alert them that the site is down. That is why, in the DevOps world, the deployment pipeline and production systems are constantly monitored to ensure that they are working as expected. As soon as problems are detected, automated alerts update the whole team, developers included, so that the right people know that something has gone wrong.
Today, most DevOps teams embrace collaborative messaging platforms, such as Slack, to communicate with each other. And more and more teams are adopting ChatOps by introducing bots into their chatrooms. With ChatOps, the bots provide an interface to systems such as service desks, lifecycle management systems, and production monitoring systems to connect people directly to the continuous delivery pipeline. The communication runs both ways: the bots bring information from the systems into the chatroom, and they execute instructions typed there, such as logging a defect.
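To make the "execute instructions" half of that two-way communication concrete, here is a minimal sketch of how a bot might route a chat message to a handler. Everything in it is an assumption made for illustration: the `ChatBot` class, the `defect` command, and the in-memory `defects` list standing in for a real defect tracker. It does not use any particular chat platform's API.

```python
from typing import Callable, Dict, List


class ChatBot:
    """Hypothetical bot that maps the first word of a message to a handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, command: str, handler: Callable[[str], str]) -> None:
        # Commands are matched case-insensitively.
        self._handlers[command.lower()] = handler

    def handle(self, message: str) -> str:
        # Split "defect Checkout returns 500" into command and argument.
        command, _, argument = message.strip().partition(" ")
        handler = self._handlers.get(command.lower())
        if handler is None:
            return f"Unknown command: {command}"
        return handler(argument.strip())


# Stand-in for a real defect tracker (assumption for this sketch).
defects: List[str] = []


def log_defect(title: str) -> str:
    defects.append(title)
    return f"Logged defect #{len(defects)}: {title}"


bot = ChatBot()
bot.register("defect", log_defect)
reply = bot.handle("defect Checkout returns 500")
```

In a real setup the handler would call the defect tracker's own interface, and the reply would be posted back into the chatroom so the whole team sees it.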
Proactive chatbots go one step further. When a production monitor detects a failure, the bot can automatically create a new chatroom dedicated to the problem, and pre-populate it with information about the problem. The bot invites the relevant team members to the chatroom, where they immediately see a description of the problem, the most recent changes that were deployed to production, and who was involved.
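The proactive flow can be sketched in a few lines. This is only an illustration under stated assumptions: `Deployment`, `IncidentRoom`, and `open_incident_room` are hypothetical names, the room is a plain in-memory object, and a real bot would create the room and post the messages through its chat platform's API instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Deployment:
    change_id: str
    author: str
    summary: str
    deployed_at: datetime


@dataclass
class IncidentRoom:
    name: str
    topic: str
    members: List[str] = field(default_factory=list)
    messages: List[str] = field(default_factory=list)


def open_incident_room(alert_name, description, detected_at,
                       recent_deployments, on_call):
    """Build a room dedicated to one problem, pre-populated with context."""
    name = f"incident-{alert_name}-{detected_at:%Y%m%d-%H%M}"
    room = IncidentRoom(name=name, topic=description)
    # Invite on-call staff first, then the authors of recent changes,
    # keeping the order and skipping duplicates.
    for person in list(on_call) + [d.author for d in recent_deployments]:
        if person not in room.members:
            room.members.append(person)
    # First thing members see: the problem, then what changed and who did it.
    room.messages.append(f"Problem: {description}")
    for d in recent_deployments:
        room.messages.append(
            f"Recent change {d.change_id} by {d.author}: {d.summary}")
    return room


detected = datetime(2024, 1, 2, 3, 4, tzinfo=timezone.utc)
deployments = [Deployment("c-1", "alice", "tune cache settings", detected)]
room = open_incident_room("checkout-latency", "Response times over threshold",
                          detected, deployments, ["bob", "alice"])
```

The sample alert name, people, and change are made up. The point is that everything a new arrival needs is already in the room before anyone types a word.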
Understand the impact of a problem
In a traditional environment, the operations team will do an initial investigation to understand the scope of the problem. For example:
- Is only one user, or just a few users, experiencing problems?
- Maybe this isn’t a problem at all. Has downtime been scheduled?
- Let's get the ops team to gather logs and see what errors appear.
Both DevOps and traditional environments typically employ continuous monitoring to detect problems as soon as possible. In a DevOps scenario, each change can only reach production through the deployment pipeline, which makes it very easy to know what changes were deployed to production shortly before the problem appeared, and who was involved in each change.
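Because every change passes through the pipeline, finding the suspects is a simple query over the deployment record. The sketch below assumes a hypothetical log format (a list of dictionaries with `change_id`, `author`, and `deployed_at` keys) and a 24-hour look-back window; a real pipeline tool would expose this history in its own way.

```python
from datetime import datetime, timedelta


def changes_before(deploy_log, incident_time, window=timedelta(hours=24)):
    """Return deployments made within `window` before the incident,
    most recent first."""
    start = incident_time - window
    recent = [entry for entry in deploy_log
              if start <= entry["deployed_at"] <= incident_time]
    return sorted(recent, key=lambda entry: entry["deployed_at"], reverse=True)


# Made-up sample data for illustration only.
incident = datetime(2024, 1, 2, 12, 0)
log = [
    {"change_id": "c-1", "author": "alice",
     "deployed_at": datetime(2023, 12, 30, 9, 0)},   # too old
    {"change_id": "c-2", "author": "bob",
     "deployed_at": datetime(2024, 1, 2, 8, 0)},
    {"change_id": "c-3", "author": "carol",
     "deployed_at": datetime(2024, 1, 2, 11, 30)},
    {"change_id": "c-4", "author": "dave",
     "deployed_at": datetime(2024, 1, 2, 13, 0)},    # after the incident
]
suspects = changes_before(log, incident)
suspect_ids = [entry["change_id"] for entry in suspects]
```

Sorting most recent first reflects the usual starting guess that the latest change is the likeliest cause, though that is a heuristic and not a guarantee.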
If the team uses chatbots, they can have that information as soon as they go into the chatroom. There’s a chance that one team member knows exactly what the problem could be, and could apply an immediate fix. As long as they deliver the fix via the pipeline and aren’t tempted to manually tweak a production setting "just this once," that fix could be the end of the problem.
Reassure your users
As soon as possible, make sure that you alert your users that you’re aware of the issue, and that you’re working to fix it. Users appreciate being updated and knowing that you’re not ignoring them.
You can update your site’s front page with a notice, or communicate via official channels or social media.
Come up with some hypotheses
If it’s not a simple issue that can be fixed on the spot, think about possible causes:
- It could be a bug in the application’s code.
- It could be a change in configuration.
- It could be a change in infrastructure, such as an upgrade of a critical component or dependency.
- It could be a security breach.
- It could be a Distributed Denial of Service (DDoS) attack.
- An external service might have gone down.
- A license might have expired.
- The mice have chewed through the network cables. Again.
Because, in a DevOps environment, the developers and the operations staff are working closely with each other, it’s easy for them to bounce ideas off one another. They can drill down into their systems to get more information by querying chatbots in their communications channels. Everyone on the team sees the same information, and has all the context they need to understand it.
As the team continues to investigate, they will share their findings with the rest of the team. But why, you might ask, can't you do this with simple email? Here's the problem: Email tends to encourage additional threads and side-discussions, resulting in everyone having a small part of the discussion, but no one having the whole picture. When it’s done through a chatroom, everyone has access to the discussion, and they can guide each other towards the solution. And the activities that used to happen behind each participant’s screen become visible to the whole team as they work to fix the problem.
Now...fix the problem
Once the cause of the problem has been found, it can be fixed. There is always the temptation to apply the fix immediately in the production system, but don’t do that. That approach violates the DevOps principle that any change to production is only allowed to get there via the deployment pipeline. Going through the pipeline ensures that we have a record of the change, who was involved, what tests were run, and why it was introduced.
And don’t forget to inform your users
Once you’re confident that the fix has been applied, don’t forget to let your users know that you’re back in business. Be sure to thank them for bearing with you while you investigated and resolved the issue. Consider telling them what the problem was, why it happened, and how you fixed it.
Users appreciate transparency, and communication from you helps restore confidence after a downtime event. Make sure that you update all the channels used when the problem first appeared; this will maximize the reassurance you're providing to your users.
It’s all about communication
Olivier Jacques, Distinguished Technologist and DevOps strategist at Hewlett Packard Enterprise IT, notes that ChatOps is revolutionizing the way DevOps team members communicate with each other, by focusing their attention on the problem and reducing the overhead associated with email or misused ticketing systems.
DevOps doesn’t make us immune to production problems. But when teams and bots work together, the team has more visibility into the systems involved, and can quickly get the information they need, and the context that goes with it, to solve the problem efficiently. ChatOps can be a very easy way to get Dev and Ops to work together, without a re-org.