As IT infrastructures become increasingly complex to monitor and manage, with compelling new technologies such as virtual machines, software-defined networks, and containers overlaid onto existing technology stacks, IT operations teams face the additional challenge of nearly unmanageable ticket volumes. Ticket prioritization, correlation, redundancy, and the sheer speed of ticket generation become problems in and of themselves.
Addressing these problems takes time, can increase mean time to repair (MTTR), and adds stress for both your most skilled IT Ops practitioners and less experienced L1/L2 staff.
AIOps and Observability from Broadcom, through Custom Situations, offers an elegant solution to tackle these ticketing challenges. Custom Situations provides a programmatic way to cluster and filter ‘noisy’ alarms and help teams focus on issues that really matter. Using custom rules and thresholds established by your teams, Custom Situations assesses every alarm and issue present in the environment. As a result, individual practitioners and the organization overall expend less time and energy as a consistent approach for ticket prioritization is adopted. Over time, the rules and thresholds established by the team can be tuned as the organization learns and optimizes work, or as requirements change.
Getting started with custom situations
DevOps teams and SREs play an important, early role in preparing for Custom Situations. This begins with identifying the IT infrastructure elements and applications that contribute to IT or business services and generate large numbers of tickets. For each IT domain (application, infrastructure, and network), teams identify the specific entities (hosts, devices, databases, application components, etc.) that are logically related to the service. This mapping exercise yields important insights and sets the stage for closer cross-domain collaboration.
With this information in hand, they should then work with business stakeholders to establish service level objectives (SLOs) for their services. SLOs define the acceptable levels of reliability and performance, including thresholds for key metrics. Next, they can configure alarms based on these SLOs to trigger when the performance or reliability of a service deviates from the desired levels.
Click here to learn more about Service Observability and SLI/SLO capabilities within DX Operational Intelligence.
Thinking about alarm filtering & relationships
SLO alarms and filtering are an excellent first level of prioritization for all the underlying infrastructure and application alarms plaguing IT systems. As important as alarm filtering is, teams need more. They need to:
- Analyze relationships between hundreds or thousands of different alarms from different monitoring tools
- Consolidate alarms into a higher-level incident to gain a more meaningful and actionable view of the overall system status
Introducing "Custom Situations" from AIOps and Observability from Broadcom
Custom Situations provides powerful capabilities which allow IT administrators to filter and correlate alarms and take actions against them based on the criteria identified. By using Custom Situations to proactively monitor your infrastructure and automate responses to issues, you can reduce the number of alarms that demand individual attention and prioritization by practitioners while improving your overall incident management results.
With Custom Situations, intuitive rules-based configurations allow you to tune the solution to meet the needs of your team and your specific infrastructure. Rule definitions enable flexible filtering of events and alarms, and establish adaptive clustering criteria suited to the ever-changing nature of modern infrastructure. The situation's lifetime, which controls how long a problem stays open to new alarms, is also managed using rules. With this context, let’s dive into the details of creating Custom Situations. Complete documentation can be found here.
1. Identify common issues
To use Custom Situations to manage infrastructure with fewer tickets, start by identifying common issues that result in tickets being created. These could be anything from disk space running low to CPU usage spiking to network connectivity issues. By identifying these issues, you can create Custom Situations to monitor for them and automate responses when they occur.
2. Set filters for alarms
Based on the issues identified above, set filters for the alarms that represent those issues so that clustering can be performed on them. Most alarm fields are supported for filtering, including fields from other Broadcom solutions such as UIM, DX APM, and DX NetOps.
3. Select clustering criteria
Custom Situations can be created based on clustering criteria such as entity and message (both support a matching percentage) and service hierarchy. For instance, you can define business services that span the layers of the infrastructure, including network, virtual infrastructure (virtual machines, containers), and applications. A problem in any layer impacts the services that depend on that layer. You can use AIOps and Observability from Broadcom to define the service hierarchy and the impacted entities across the monitoring infrastructure.
4. Configure time windows
These three time controls determine the lifetime of the Custom Situations.
- Stabilization Time: Situation stabilizes if no updates are received within the time window.
- Auto Extend: When a new alarm is added to a cluster, the stabilization time would be extended by the specified time value.
- Max Stabilization Time: Situation cannot be extended beyond this time.
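The interplay of the three controls can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the `SituationWindow` class, its field names, the use of seconds, and the assumption that the maximum is measured from the moment the situation opens are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SituationWindow:
    """Hypothetical model of the three time controls (all values in seconds)."""
    opened_at: float
    stabilization_time: float
    auto_extend: float
    max_stabilization_time: float

    def __post_init__(self) -> None:
        # Initial deadline: the situation stabilizes if nothing arrives before it.
        self.deadline = min(self.opened_at + self.stabilization_time, self.hard_limit)

    @property
    def hard_limit(self) -> float:
        # Assumption: Max Stabilization Time is counted from when the situation opened.
        return self.opened_at + self.max_stabilization_time

    def add_alarm(self, now: float) -> bool:
        """Return True if the alarm joins the situation, False if it already stabilized."""
        if now >= self.deadline:
            return False
        # Auto Extend pushes the deadline out, but never past the hard limit.
        self.deadline = min(self.deadline + self.auto_extend, self.hard_limit)
        return True


window = SituationWindow(opened_at=0, stabilization_time=300,
                         auto_extend=120, max_stabilization_time=600)
window.add_alarm(now=100)  # deadline moves from 300 to 420
window.add_alarm(now=400)  # deadline moves to 540
window.add_alarm(now=500)  # deadline is capped at 600
window.add_alarm(now=600)  # rejected: the situation has stabilized
```

In this sketch a steady trickle of alarms keeps the situation open, while the cap guarantees it eventually closes.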
Examples
Below are three common scenarios for using Custom Situations.
1. SQL Server-specific situation
- Alarm Filter: Select SQL Server-related alarms sourced from UIM
- Clustering Criteria: Cluster alarms based on exact (100%) entity name match
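Conceptually, this scenario is a filter followed by a group-by on the entity name. The Python sketch below illustrates the idea only; the alarm records, the field names (`source`, `entity`, `message`), and the message-based test for "SQL Server related" are hypothetical and do not reflect the product's alarm schema or rule syntax.

```python
from collections import defaultdict

# Hypothetical alarm records; field names are illustrative only.
alarms = [
    {"source": "UIM", "entity": "sqlprod01", "message": "SQL Server: log file nearly full"},
    {"source": "UIM", "entity": "sqlprod01", "message": "SQL Server: blocked sessions high"},
    {"source": "UIM", "entity": "sqlprod02", "message": "SQL Server: agent job failed"},
    {"source": "UIM", "entity": "web01", "message": "CPU usage above threshold"},
    {"source": "DX APM", "entity": "sqlprod01", "message": "SQL Server: slow query"},
]


def is_sql_server_alarm_from_uim(alarm):
    """Alarm filter: keep only SQL Server alarms that came from UIM."""
    return alarm["source"] == "UIM" and "sql server" in alarm["message"].lower()


def cluster_by_exact_entity(filtered_alarms):
    """Clustering criterion: a 100% entity match means identical entity names."""
    clusters = defaultdict(list)
    for alarm in filtered_alarms:
        clusters[alarm["entity"]].append(alarm)
    return dict(clusters)


clusters = cluster_by_exact_entity(a for a in alarms if is_sql_server_alarm_from_uim(a))
for entity, members in clusters.items():
    print(entity, len(members))
```

With the sample data, the two UIM alarms on `sqlprod01` form one cluster and the single alarm on `sqlprod02` forms another; the CPU alarm and the DX APM alarm are filtered out.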
2. Message queue-specific situation
Consider the following MQ entities monitored using the APM Infrastructure Agent. These follow a specific hierarchy as presented by the pipe-separated paths.
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue3
Here, alarms can be clustered based on Entity Matching Percentage values. The matching logic considers '|' as a separator between tokens or levels. After ignoring the restricted word 'SuperDomain', the number of tokens in each of the entity paths above is nine. To cluster alarms based on token ‘host1’ so that alarms, irrespective of the queue, are clustered by the host value, the matching percentage should be 55% (5/9). If you increase the percentage to 66% for example, it would match 6 out of 9 tokens and would cluster alarms based on the Queue Manager name. Thus, alarms can be clustered at the correct level based on the percentage of the match on APM entity paths.
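The effect of the two settings can be illustrated by grouping paths on their leading tokens. The sketch below takes the number of tokens to match directly (5 for the host level, 6 for the Queue Manager level, as in the text) because how the product converts a percentage into a token count is not shown here. The helper names and the third path, which adds a second Queue Manager `QM2`, are hypothetical.

```python
from collections import defaultdict

RESTRICTED_TOKENS = {"SuperDomain"}  # the restricted word named in the text


def leading_tokens(path, count):
    """Split on '|', drop restricted tokens, and keep the first `count` tokens."""
    tokens = [t for t in path.split("|") if t not in RESTRICTED_TOKENS]
    return tuple(tokens[:count])


def cluster_by_leading_tokens(paths, count):
    """Group entity paths that share the same first `count` tokens."""
    clusters = defaultdict(list)
    for path in paths:
        clusters[leading_tokens(path, count)].append(path)
    return dict(clusters)


paths = [
    "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1",
    "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2",
    # Hypothetical extra path with a second Queue Manager on the same host.
    "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM2|Queues|QM2_Queue1",
]

by_host = cluster_by_leading_tokens(paths, 5)           # keys end at 'host1'
by_queue_manager = cluster_by_leading_tokens(paths, 6)  # keys end at 'QM1' / 'QM2'
print(len(by_host), len(by_queue_manager))
```

Matching five tokens puts all three paths in one host-level cluster, while matching six splits them into one cluster per Queue Manager.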
3. Clustering on a business or IT service, across its dependent layers
You can filter alarms that pertain to a specific service or that are part of a service hierarchy. For example, refer to the service hierarchy below. Alarms that belong to the services ‘App3’ and ‘App4’ and their child services of ‘Network’ and ‘container’ can be clustered together via a service criterion.
To achieve this:
- Set the filter criteria to Service = App3 and App4, and
- Set the clustering criteria to Service (match 100%), and select the ‘Include child services’ checkbox
This definition includes alarms from all four services in a single cluster. It helps practitioners visualize their services based on dependencies, and it aggregates components across applications, infrastructure, and network to simplify ticketing and triage.
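The ‘Include child services’ behavior amounts to expanding the selected services with their descendants before collecting alarms. The following Python sketch shows that idea under stated assumptions: the parent-to-child mapping (which child belongs to which application), the alarm records, and the function name are hypothetical, since the hierarchy diagram is not reproduced here.

```python
# Hypothetical hierarchy: parent service -> child services.
# Which child belongs to which application is an assumption for illustration.
CHILD_SERVICES = {
    "App3": ["Network"],
    "App4": ["container"],
}


def expand_with_children(services, children):
    """Return the given services plus all of their descendants."""
    selected = set()
    stack = list(services)
    while stack:
        service = stack.pop()
        if service not in selected:
            selected.add(service)
            stack.extend(children.get(service, []))
    return selected


# Hypothetical alarms tagged with the service they belong to.
alarms = [
    {"service": "App3", "message": "Response time degraded"},
    {"service": "Network", "message": "Interface errors rising"},
    {"service": "container", "message": "Pod restarting"},
    {"service": "App7", "message": "Unrelated alarm"},
]

selected = expand_with_children(["App3", "App4"], CHILD_SERVICES)
cluster = [alarm for alarm in alarms if alarm["service"] in selected]
print(sorted(selected), len(cluster))
```

Here the selection grows from two services to four, and the alarm on the unrelated `App7` stays out of the cluster.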
For more information and examples, refer to our documentation here.
Ravindra Puli
Ravindra Puli works on the AIOps and Observability team as a Principal Software Engineer for DX Operational Intelligence. He has nearly 20 years of experience in architecting and developing enterprise solutions for monitoring and observability. Prior to Broadcom, he worked for Amdocs, Motorola and NetCracker.