Why you should centralize IT Ops—and destroy data islands
The need to get IT services under a common framework in order to make data more widely available for better control and governance has never been greater. But to enable this transformation, IT operations management (ITOM) needs to evolve.
As IT operations teams have become more customer-centric, they have adopted a panoply of tools for specific projects, including performance monitoring, network operations monitoring, UX and user data monitoring, and service management automation. The tools may be there, but an overall management framework is lacking.
This can lead to ad hoc processes and a lack of central IT governance across IT infrastructure, data, and tools. Central IT may not even be involved in the initial evaluation process.
Consider this scenario: A bank wanted to increase adoption of its mobile app from the current 20% to 100%. The CIO had not planned for this, and he didn't know whether he could deliver on the business demand within the requested time frame.
There were too many variables for him to consider, including the user experience of existing mobile applications, how much system capacity he would need to sustain a fivefold increase in usage, application performance across various network conditions, load on the supporting application and infrastructure, and so on.
Here's how to pull all of that together.
Start with a common framework
The data you need to analyze is probably available only in islands, which is why you need to bring IT services under a common framework that makes data more widely available for better control and governance. By using modern technologies and APIs, you can collect this data in a single repository and perform real-time analysis on it using machine learning and model-driven algorithms. In this way, detecting anomalies in the data becomes easier.
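To make the idea concrete, here is a minimal sketch of anomaly detection over consolidated metrics. The CPU samples and the z-score threshold are invented for illustration; a production system would use far richer models than a simple standard-deviation test.

```python
import statistics

def detect_anomalies(samples, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return []
    return [(i, v) for i, v in enumerate(samples)
            if abs(v - mean) / stdev > threshold]

# CPU utilization samples consolidated from several hosts (hypothetical data)
cpu_util = [22, 25, 24, 23, 26, 24, 25, 97, 23, 24]
print(detect_anomalies(cpu_util))  # → [(7, 97)]
```

Once metrics from every island land in one repository, a detector like this can run across all of them at once instead of per-tool.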
There is no single data source that IT can use to make proactive decisions that will help the business deliver a superior experience to users. For example, an operating system has close to 200 parameters that need to be monitored to ensure performance, while applications, databases, middleware, and end-user apps have hundreds more.
Running applications on dynamic infrastructure allows scale-in and scale-out based on load. But even simple configuration mistakes in dynamic infrastructure and applications are hard to detect, and they can lead to costly outages.
Often, application owners know that there is a problem, but they cannot quickly pinpoint the cause. ITOM needs to adopt technologies that use big data analytics in real time to better manage the problem of the unknown.
Include 5 key capabilities in your reference architecture
Given the sheer volume of data that operations management tools generate, IT Ops needs to build five key capabilities into its IT operations management reference architecture. These are the ability to:
- Collect: Consolidate data into a high-speed columnstore repository that can respond to queries within seconds.
- Ingest: Ingest data from structured and unstructured data sources. Examples include metrics, logs, events, tickets, defects, and user experience.
- Analyze: Implement operational analytics that use machine learning to identify anomalies and emerging trends, correlate metrics from various sources against dynamic service models, and link known solutions to the anomalies they resolve.
- Robotize: Automate repetitive tasks using process automation to fix known incidents with known solutions. Replace manual processes with bots.
- Visualize: Simplify visualization that ties the operations data to the business context. Have the ability to personalize information based on personas.
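The five capabilities above can be sketched as stages of one pipeline. This is a toy illustration, not a product API: the event data, the `known_fixes` table, and the function names are all assumptions.

```python
import statistics

# Known incidents mapped to automated remediations (hypothetical table)
known_fixes = {"disk_full": "purge_tmp"}

def ingest(sources):
    """Ingest: flatten structured and unstructured feeds into one event stream."""
    return [event for feed in sources for event in feed]

def analyze(events, threshold=2.5):
    """Analyze: flag metric events whose value deviates strongly from the mean."""
    values = [e["value"] for e in events]
    mean, stdev = statistics.fmean(values), statistics.pstdev(values)
    return [e for e in events if stdev and abs(e["value"] - mean) / stdev > threshold]

def robotize(anomalies):
    """Robotize: fix known incidents with known solutions; otherwise open a ticket."""
    return [(a["name"], known_fixes.get(a["name"], "open_ticket")) for a in anomalies]

sources = [
    [{"name": "cpu", "value": 20}, {"name": "cpu", "value": 22}, {"name": "cpu", "value": 21}],
    [{"name": "cpu", "value": 20}, {"name": "cpu", "value": 22}, {"name": "cpu", "value": 21}],
    [{"name": "cpu", "value": 20}, {"name": "cpu", "value": 22}, {"name": "disk_full", "value": 99}],
]
print(robotize(analyze(ingest(sources))))  # → [('disk_full', 'purge_tmp')]
```

The Collect and Visualize stages (the repository and the dashboards) sit on either end of this flow.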
The consolidation of data has a measurable impact on multiple processes in the organization. At the core of this effort is the detect-to-correct process, which impacts key processes such as incident management, change management, and release management.
In incident management, analytics can provide better visualization and detection of the root causes of issues. Process automation reduces the risk of remediating known issues and generates automated change records for traceability. Problem management that applies analytics to incidents can help prevent repetitive incidents and improve service performance.
Performance indicators can also help to model capacity and forecast future performance. For example, consider capacity management from a metrics perspective. Each CPU has metrics such as system-mode and user-mode utilization, run queues, and many others. These interact with metrics such as I/O wait states and memory buffers, which in turn can affect application sessions in user mode (for example, the number of concurrent connections). Hundreds of metrics like this can affect capacity.
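A crude capacity forecast can come from fitting a linear trend to a utilization metric and projecting when it crosses a limit. The daily peak-CPU samples and the 80% threshold below are assumptions for illustration; real forecasting would account for seasonality and nonlinear saturation.

```python
def linear_fit(ys):
    """Least-squares line y = a + b*x over x = 0..n-1."""
    n = len(ys)
    mean_x, mean_y = (n - 1) / 2, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
         / sum((x - mean_x) ** 2 for x in range(n)))
    return mean_y - b * mean_x, b

def days_until(ys, limit=80.0):
    """Days from the last observation until the trend crosses `limit`.

    Returns None if utilization is flat or falling.
    """
    a, b = linear_fit(ys)
    if b <= 0:
        return None
    return max(0.0, (limit - a) / b - (len(ys) - 1))

peak_cpu = [52, 54, 55, 57, 58, 60, 61]  # daily peak CPU %, hypothetical
print(days_until(peak_cpu))  # ≈ 12.5 days of headroom left
```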
Collect all the metrics
The key is to collect all the metrics and not exclude any, then use analytics to identify the ones with the highest correlation to service performance in real-world production systems.
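One way to find those high-correlation metrics is to rank every collected metric by its Pearson correlation with a service-level signal such as response time. The metric names and sample values below are invented for illustration.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

response_ms = [120, 135, 150, 180, 210]    # service response time (hypothetical)
metrics = {
    "run_queue":   [1, 2, 3, 5, 8],        # rises with response time
    "io_wait_pct": [5, 6, 9, 12, 16],      # also rises with response time
    "mem_free_gb": [12, 12, 11, 12, 12],   # nearly flat: weak signal
}

# Rank metrics by strength of correlation with the service signal
ranked = sorted(metrics, key=lambda m: abs(pearson(metrics[m], response_ms)),
                reverse=True)
```

Here `run_queue` and `io_wait_pct` would rank far above `mem_free_gb`, so they are the metrics worth alerting on first.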
The big issue is that it is hard to detect problems that may not have occurred before. Problems build up over time, and the root cause may be hidden in data that is a week old.
Operations teams often see only the symptoms, which is where analytics and visualization go hand in hand. The ability to go back in time and visualize when the systems started producing anomalies helps guide the operator to troubleshoot the problem faster.
Anomaly detection can help to prioritize defect resolutions and, when combined with business-prioritized key performance indicators (KPIs), prioritize the severity of the impacts and focus the operations team on high-impact defects.
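A minimal sketch of that prioritization: weight each anomaly's score by the business priority of the service it affects, then triage in that order. The services, weights, and scores are hypothetical.

```python
# Business-prioritized KPI weights per service (hypothetical values)
kpi_weight = {"payments": 1.0, "mobile_app": 0.8, "reporting": 0.3}

def severity(anomaly_score, service):
    """Combine anomaly strength with the business weight of the affected service."""
    return anomaly_score * kpi_weight.get(service, 0.5)

# (service, anomaly score) pairs from the detector, hypothetical
defects = [("reporting", 4.0), ("payments", 2.5), ("mobile_app", 2.0)]
triaged = sorted(defects, key=lambda d: severity(d[1], d[0]), reverse=True)
print(triaged)  # payments first, despite its lower raw anomaly score
```

Note how the raw anomaly score alone would put `reporting` first; the KPI weighting is what keeps the team focused on high-impact defects.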
Take a platform approach to IT operations data
You need a platform approach to IT operations data that can ingest data from multiple islands, analyze data in real time, and provide context—for example, for service desk, service monitoring, security risk assessment, or personas such as different categories of business users.
Such a platform must enable collaboration between personas via dashboards, ChatOps bots, and more.
Platforms built on containers and microservices let you add new capabilities, upgrade without downtime, and even configure high availability.
Digital transformation is about speed and agility, driven by the modern consumer experience of enterprise services. By consolidating islands of operational data, your organization gains the ability to continuously improve service performance, availability, and the user experience.
In this way you will have a better handle on the future of IT operations, and you'll be prepared to support the dynamic IT environment on which the business relies.