Scaling DevOps: The right way to monitor for success

Michael Rodov, DevOps Architect, HP

I started my career in software as a developer and eventually moved on to operations, so I have a pretty good picture of both the "Dev" and "Ops" sides of DevOps. Several years ago, in my early days doing DevOps at a small startup, I fought in the trenches side by side with one developer and one QA tester; we spent our time fixing code, writing features, and supporting production together. The border between development and production was blurred, with bugs investigated and fixed on the fly. Everyone worked on the same product and was completely familiar with the code, or could easily find someone who was.

Today, I'm in a large corporate environment keeping a complex SaaS product up and running, and things are different. My Ops teams manage several functions, including handling bugs and issues reported by customers, helping customers configure their environments, and monitoring and maintaining the different environments. With many customers and even more end users, the environments have grown massively since my startup days, and doing DevOps at scale poses a whole new set of challenges:

  • Supporting five, six, or even eight products at a time, each with its own development and continuous integration/continuous deployment teams, running on different platforms and with completely different purposes.

  • R&D and operations that are in different business units, with separate budgets, goals, and KPIs (key performance indicators).

  • Developers who don't have access to production systems, and operations staff who don't have access to code.

  • Frequent outsourcing of the network operations center (NOC) and infrastructure to external teams (whether to another division or a different company) with different priorities, KPIs, and sets of systems and processes.

To overcome these challenges, I found we needed to adapt our DevOps environment to the needs of the organization's development groups, while remaining in harmony with the corporate ecosystem.


The need for faster turnaround times

In my domain, turnaround time is the time taken to verify or update monitoring scripts once a new release is promoted to staging. Back in the days when DevOps was just "Ops," when a new version was promoted to staging, it took a few hours to verify that the current monitoring scripts would still work. If any of the scripts broke because of updates in the software, we had to send a request to the scripting team to modify them. Between testing and modifying the scripts, turnaround times could be anywhere from three to ten days. These days, with agile development and rapid release cycles, that kind of time frame is completely unacceptable. A new release must be up and running in the staging environment within hours, and in production within one to three days.

My team did not have the manpower to constantly update monitoring scripts for up to eight products at a time, and I needed to come up with a solution. The enterprise environment in which I work has a whole ecosystem that lies outside the boundaries of the development team, including DBAs, infrastructure teams, NOC, and, of course, application operations teams. These are the people running the ecosystem 24/7, and it made sense to me that they should handle monitoring. So I tried to outsource monitoring scripts to these other teams, but this had problems of its own. These teams operated on their own timelines and priorities and could not always deliver the monitoring scripts on time. We ended up having releases go out with minimal or no monitoring.

That was unacceptable.


If outsourcing doesn't work, try insourcing

By "insourcing," I mean encouraging the development teams to write their own monitoring scripts. While this sounds like passing the buck, for us it proved to be a much better solution than outsourcing the monitoring scripts, both for the developers and the DevOps teams. Here's a breakdown of the issues.

 

 

Here is how outsourced and insourced monitoring scripts compare:

Focus

  • Outsourced: the engineer who develops the script has no internal knowledge of the application, so they monitor the first working setup.

  • Insourced: the engineer who develops the script has detailed knowledge of the application, so they can target the weaker parts that need more attention.

Timing

  • Outsourced: you are constrained by the external team's scripting platform, programming practices, and work schedules. As a result, you end up monitoring less functionality; sometimes nothing more than logging in is monitored.

  • Insourced: the internal team's scripting platform, programming practices, and work schedules are coordinated with yours, allowing more functionality to be monitored.

Runbooks

  • Outsourced: runbooks are minimal and often leave you guessing what to do when something goes wrong.

  • Insourced: runbooks are comprehensive and provide better instructions for handling problems with monitoring scripts.

Developer engagement

  • Outsourced: the monitoring scripts are just another task to be completed and forgotten.

  • Insourced: development teams become part of production success, keeping them engaged with the monitoring scripts to ensure the stability and performance of the products they develop.

Adapt and tailor the DevOps suite

Our NOC teams use an enterprise platform designed to manage complex applications across both cloud computing and traditional IT service delivery. We used this platform for monitoring, alerting, and ensuring service-level agreement (SLA) compliance, so changing it wasn't an option. But because we needed our development teams to write the monitoring scripts, the only way to succeed was to adapt to their development practices. Since they all used tools for automated user interface (UI) testing, we opted to monitor through the application UI. This offered two advantages over our traditional request/response-based monitoring using network scripts:

  • Writing and debugging monitoring scripts based on the UI is more straightforward.

  • While request/response-based monitoring only measures the network and server times of HTTP requests, UI monitoring can also measure actual client-side performance, which is a better indicator of the real user experience.
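To make that distinction concrete, here is a minimal sketch (not the author's actual setup) of a request/response-style check. A local stub server stands in for the monitored application's login endpoint, and all names are hypothetical. Notice that the check can time the HTTP round trip (network plus server), but it has no visibility into client-side rendering, which is the gap UI-based monitoring fills.

```python
# Hypothetical request/response-style check; the stub server stands in for
# a real application endpoint so the sketch is self-contained.
import http.server
import threading
import time
import urllib.request


class StubLoginHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for the real application's login endpoint."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"OK")

    def log_message(self, *args):  # keep the demo output quiet
        pass


def check_endpoint(url):
    """Issue one request and report status plus network + server time.

    This is all a request/response check can see: it cannot measure how
    long the browser takes to render the page for a real user.
    """
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        status = resp.status
        body = resp.read()
    return {"status": status, "body": body, "seconds": time.perf_counter() - start}


# Spin up the stub server on an ephemeral port and run one check against it.
server = http.server.HTTPServer(("127.0.0.1", 0), StubLoginHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = check_endpoint(f"http://127.0.0.1:{server.server_port}/login")
server.shutdown()
print(result["status"])
```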

Packaging is everything

We had our development teams write their own monitoring scripts, which was great in theory, but we still faced issues. Some teams were using a proprietary internal tool to write their testing suites, which was fine: Their work easily integrated with our proprietary monitoring platform. But the teams that were developing in Java used Selenium. To make this work with our monitoring platform, they wrapped their Selenium scripts in a set of Java classes and interfaces that gave the monitoring platform access to browser drivers, snapshots, logs, storage, and more. On every build, the UI monitors are built, tested, and packaged so they can be directly and automatically deployed on our enterprise monitoring platform.
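The wrapper idea can be sketched roughly as follows. The teams in the article did this in Java around Selenium; this is a language-agnostic Python sketch of the same shape, with hypothetical names (UiMonitor, LoginMonitor, MonitorResult). The monitoring platform only sees the common interface; each product team plugs its own UI flow in behind it, and a real implementation would drive Selenium's WebDriver where the stub comments sit.

```python
# Hedged sketch of wrapping team-written UI scripts behind a common
# interface a monitoring platform can execute. All names are hypothetical.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
import time


@dataclass
class MonitorResult:
    ok: bool
    elapsed_seconds: float
    logs: list = field(default_factory=list)
    snapshots: list = field(default_factory=list)  # e.g. screenshot paths


class UiMonitor(ABC):
    """Contract between the monitoring platform and a team's UI script."""

    @abstractmethod
    def setup(self) -> None: ...      # real version: start a browser driver

    @abstractmethod
    def run(self) -> MonitorResult: ...  # drive the UI and time the flow

    @abstractmethod
    def teardown(self) -> None: ...   # quit the browser, flush logs


class LoginMonitor(UiMonitor):
    """Stubbed example; a real one would drive Selenium through a login page."""

    def setup(self) -> None:
        self.logs = ["driver started (stub)"]

    def run(self) -> MonitorResult:
        start = time.perf_counter()
        # Stub of: driver.get(url); fill the form; wait for a dashboard element.
        login_succeeded = True
        return MonitorResult(ok=login_succeeded,
                             elapsed_seconds=time.perf_counter() - start,
                             logs=self.logs)

    def teardown(self) -> None:
        self.logs.append("driver stopped (stub)")


def execute(monitor: UiMonitor) -> MonitorResult:
    """What the platform side would do with any packaged monitor."""
    monitor.setup()
    try:
        return monitor.run()
    finally:
        monitor.teardown()


result = execute(LoginMonitor())
print(result.ok)
```

Because every packaged monitor satisfies the same interface, the build step can test and deploy them uniformly, regardless of which team wrote the underlying UI script.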

Adhering to the principles of DevOps in a larger team environment certainly has its challenges. For instance, successfully monitoring several applications is not simply a matter of replicating the monitoring procedures for one. The hardships of doing DevOps at scale include incompatibility between monitoring tools and the applications they need to monitor, between rapid release cycles and slow turnaround times, and between external teams' priorities and your internal KPIs. To resolve these incompatibilities, it's essential to introduce practices and methodologies that adapt to the realities of your organization, and to use tools that enable transparency between its different parts.

 
