IT operations teams in enterprises across industries and of all sizes share a similar goal: offsetting the challenges of “too many alarms.” This includes addressing two objectives: reducing alarm noise and improving alarm routing.
During a recent conversation with a large financial services customer, people I spoke to recounted the challenges their teams faced with managing so many alarms. Level-1 teams were getting alarm fatigue. At the same time, understanding and segregating meaningful signals from noise was an ongoing challenge. Worried about missing important, genuine issues, teams would err on the side of over-engaging SMEs (including on false alerts), which created additional work for all involved. The result: without a better approach to alarm management, operational responsiveness and SLA/SLO attainment would suffer.
With increasing adoption of cloud services, containers, and microservices, the IT landscape has become significantly more complex. This customer turned to DX Operational Observability (DX O2) and its Situations capability to address these challenges. Situations can help teams programmatically and consistently address these questions:
By providing answers to these questions, DX O2 can help improve key business metrics like mean time to resolution (MTTR) and mean time to innocence (MTTI), while freeing highly skilled experts to focus on innovative work.
“One of the top priorities for CIOs is staying ahead of emerging technologies and solutions.”
—Deloitte (Based on February 2024 poll of 211 US-based CIOs [Source: CIO.com, “The 10 biggest issues IT faces today”])
Achieving alarm noise reduction and improving alarm routing is made possible by combining rich observability with powerful AIOps. Let me elaborate with two specific examples.
Alarm noise reduction is a powerful capability of DX O2, delivered in part through the Situations feature. With Situations, users can reduce the alarm set by configuring a high-level rule for alarm filtering. Using this rule, DX O2 analyzes alarms based on attributes such as message text, entity, device, and business service to determine relationships between alarms and cluster those that are indeed related. This gives users tremendous flexibility for reducing alarm noise, without risking loss of signal.
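To make the idea of attribute-based clustering concrete, here is a minimal sketch in Python. It is not DX O2's algorithm, which is not described here; it only assumes that two alarms are related when they share a value for at least one chosen attribute. The function name `cluster_alarms`, the field names, and the sample alarms are all hypothetical.

```python
from collections import defaultdict


def cluster_alarms(alarms, keys=("entity", "device", "business_service")):
    """Group alarm IDs whose alarms share a value for any attribute in `keys`.

    Illustrative only: relatedness here means "shares an attribute value",
    applied transitively. A real product would use richer signals.
    """
    parent = list(range(len(alarms)))  # union-find forest over alarm indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (attribute, value) -> index of first alarm carrying it
    for idx, alarm in enumerate(alarms):
        for key in keys:
            value = alarm.get(key)
            if value is None:
                continue  # a missing attribute links nothing
            token = (key, value)
            if token in first_seen:
                union(idx, first_seen[token])
            else:
                first_seen[token] = idx

    clusters = defaultdict(list)
    for idx, alarm in enumerate(alarms):
        clusters[find(idx)].append(alarm["id"])
    return sorted(clusters.values())


# Hypothetical sample alarms
ALARMS = [
    {"id": "a1", "device": "db-01", "business_service": "payments"},
    {"id": "a2", "device": "db-01"},
    {"id": "a3", "device": "web-07", "business_service": "payments"},
    {"id": "a4", "device": "mail-02", "business_service": "email"},
]

print(cluster_alarms(ALARMS))
```

In this sample, `a1` and `a2` are linked by device and `a1` and `a3` by business service, so four alarms collapse into two groups while the unrelated `a4` stays visible on its own.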
With Situations, users can also isolate potential culprits, that is, the likely root causes. This helps expedite routing incidents to the right teams.
Using Situations in DX O2, a large health insurance provider customer achieved noise reduction of 98.6%. This improved both the quality of results and the efficiency of teams across IT. Before adopting DX O2 and Situations, the customer noted, “Looking for the right alarm in the flood of alarms was like looking for important rain drops in a hurricane.” With DX O2, alarm fatigue is reduced, important alarms get proper attention, and teams can focus more on actions instead of sifting through mountains of information.
In broad terms, alarm routing is a matter of triaging the right alarm and getting it to the right team on time. This means that alarms need to have relevant information, that is, sufficient context so they can be routed via the IT service management (ITSM) system to the right teams. Relevant information likely includes details on impacted CIs, the business name of the application, and so on. If the information is not available natively, then the platform should enable users to enrich the CI and the alarms with appropriate attributes. Once the enriched alarm is dispatched, the ITSM system can then route the event based on the additional details.
In another case, a customer using DX O2 wanted to route alarms through the ITSM event module to appropriate teams. The customer enriched the CIs using a key attribute ingested into DX O2 from their CMDB. Enriching tickets with attributes such as the name of the business service being monitored is a golden ticket: it tells IT operations which team owns the service and how to notify that team. For this customer, the business service name is critical for triaging associated events and getting them to the right team. Alarm enrichment using DX O2 allows their teams to enrich alarms with associated CI details so that events sent to ITSM include crucial business context. In addition, this helped ensure alarm routing was accurate and timely.
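The enrich-then-route flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DX O2 or ITSM API: the functions `enrich_alarm` and `route_alarm`, the field names, the CMDB contents, and the team names are all hypothetical.

```python
# Hypothetical CMDB extract: CI identifier -> CI attributes
CMDB = {
    "ci-1001": {"business_service": "online-payments"},
    "ci-2002": {"business_service": "member-portal"},
}

# Hypothetical ownership table an ITSM system might hold
OWNERS = {
    "online-payments": "payments-ops",
    "member-portal": "portal-ops",
}


def enrich_alarm(alarm, cmdb):
    """Return a copy of the alarm with the CI's business service attached."""
    ci = cmdb.get(alarm["ci_id"], {})
    enriched = dict(alarm)  # leave the original alarm untouched
    enriched["business_service"] = ci.get("business_service")
    return enriched


def route_alarm(alarm, owners, default_team="l1-triage"):
    """Pick the owning team from the business service, else a fallback queue."""
    return owners.get(alarm.get("business_service"), default_team)


alarm = {"id": "a17", "ci_id": "ci-1001", "message": "High DB latency"}
enriched = enrich_alarm(alarm, CMDB)
print(enriched["business_service"], "->", route_alarm(enriched, OWNERS))
```

The fallback queue is a design choice in this sketch: an alarm whose CI is missing from the CMDB still lands with a team instead of being dropped, which keeps gaps in enrichment visible.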
“CIOs face immense pressure to deliver successful digital initiatives while navigating budget constraints and increasing demands from senior executives.”
—Gartner, “Priorities CIOs Must Address in 2025, According to Gartner’s CIO Survey”
By reducing alarm noise and efficiently routing alarms to the right teams the first time—and with helpful context—teams can more confidently adopt cloud services, containers, and microservices, while ensuring the business services these environments support are healthy and performing well. When issues arise, they can be addressed quickly and with less effort, creating benefits for all of IT.