    September 14, 2023

    Too Many Alarms? Take Advantage of Custom Situations

    As IT infrastructures become increasingly complex to monitor and manage, with compelling new technologies such as virtual machines, software-defined networks and containers overlaid onto existing technology stacks, IT operations teams face the additional challenge of nearly unmanageable ticket volumes. Ticket prioritization, correlation, redundancy and the sheer speed of ticket generation become problems in and of themselves.

    Addressing these problems takes time, can increase mean time to repair (MTTR), and adds stress for both your most skilled IT Ops practitioners and less experienced L1/L2 staff.

    AIOps and Observability from Broadcom, through Custom Situations, offers an elegant solution to tackle these ticketing challenges. Custom Situations provides a programmatic way to cluster and filter ‘noisy’ alarms and help teams focus on issues that really matter. Using custom rules and thresholds established by your teams, Custom Situations assesses every alarm and issue present in the environment. As a result, individual practitioners and the organization overall expend less time and energy as a consistent approach for ticket prioritization is adopted. Over time, the rules and thresholds established by the team can be tuned as the organization learns and optimizes work, or as requirements change.

    Getting started with Custom Situations

    DevOps teams and SREs play an important, early role in preparing for Custom Situations. This begins with identifying the IT infrastructure elements and applications that contribute to IT or business services that generate large numbers of tickets. For each IT domain (application, infrastructure and network), teams identify the specific entities (hosts, devices, databases, application components, etc.) that are logically related to the service. This mapping exercise yields important insights and sets the stage for closer cross-domain collaboration.

    With this information in hand, teams should work with business stakeholders to define service level objectives (SLOs) for their services. SLOs define the acceptable levels of reliability and performance, including thresholds for key metrics. Next, teams can configure alarms based on these SLOs that trigger when the performance or reliability of a service deviates from the desired levels.

    Click here to learn more about Service Observability and SLI/SLO capabilities within DX Operational Intelligence.

    Thinking about alarm filtering & relationships

    SLO alarms and filtering are an excellent first level of prioritization for all the underlying infrastructure and application alarms plaguing IT systems. As important as alarm filtering is, teams need more. They need to:

    • Analyze relationships between hundreds or thousands of different alarms from different monitoring tools
    • Consolidate alarms into a higher-level incident to gain a more meaningful and actionable view of the overall system status

    Introducing "Custom Situations" from AIOps and Observability from Broadcom

    Custom Situations provides powerful capabilities which allow IT administrators to filter and correlate alarms and take actions against them based on the criteria identified. By using Custom Situations to proactively monitor your infrastructure and automate responses to issues, you can reduce the number of alarms that demand individual attention and prioritization by practitioners while improving your overall incident management results.

    With Custom Situations, intuitive rules-based configurations allow you to tune the solution to meet the needs of your team and your specific infrastructure. Rule definitions enable flexible filtering of events and alarms, and establish adaptive clustering criteria suited to the ever-changing nature of modern infrastructure. The situation's lifetime, which helps control the problem's tenure, is also managed using rules. With this context, let’s dive into the details of creating Custom Situations. Complete documentation can be found here.

    1. Identify common issues

    To use Custom Situations to manage infrastructure with fewer tickets, start by identifying common issues that result in tickets being created. These could be anything from disk space running low to CPU usage spiking to network connectivity issues. By identifying these issues, you can create Custom Situations to monitor for them and automate responses when they occur.

    2. Set filters for alarms

    Based on the issues identified above, set filters for the alarms that represent those issues so that clustering can be performed on them. Most alarm fields are supported for filtering, including fields from other Broadcom solutions such as UIM, DX APM, and DX NetOps.

    3. Select clustering criteria

    Custom Situations can be created based on clustering criteria such as entity and message (supports matching percentage), and service hierarchy. For instance, you can define business services that span the layers of the infrastructure, including network, virtual infrastructure (virtual machines, containers), and applications. When a problem occurs in any layer, it impacts the services that depend on the associated infrastructure layers. You can use AIOps and Observability from Broadcom to define the service hierarchy and the impacted entities across the monitoring infrastructure.

    4. Configure time windows

    Three time controls determine the lifetime of a Custom Situation.

    1. Stabilization Time: The situation stabilizes if no updates are received within this time window.

    2. Auto Extend: When a new alarm is added to a cluster, the stabilization time is extended by the specified time value.

    3. Max Stabilization Time: The situation cannot be extended beyond this time.
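
    One way to read these three controls together is sketched below in Python. This is an illustrative model, not the product's implementation: the class SituationWindow, its field names, the use of seconds, and the choice to measure the cap from the situation's creation time are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SituationWindow:
    """Hypothetical model of a situation's lifetime; all times are in seconds."""
    created_at: float
    stabilization_time: float      # quiet period before the situation stabilizes
    auto_extend: float             # extra time granted per newly clustered alarm
    max_stabilization_time: float  # hard cap, measured from created_at (assumption)
    deadline: float = field(init=False)

    def __post_init__(self) -> None:
        cap = self.created_at + self.max_stabilization_time
        self.deadline = min(self.created_at + self.stabilization_time, cap)

    def is_stable(self, now: float) -> bool:
        """The situation is stable once the deadline has passed."""
        return now >= self.deadline

    def add_alarm(self, now: float) -> bool:
        """Add an alarm at time `now`; return False if already stabilized."""
        if self.is_stable(now):
            return False
        cap = self.created_at + self.max_stabilization_time
        # Extend the deadline, but never beyond the maximum.
        self.deadline = min(self.deadline + self.auto_extend, cap)
        return True


window = SituationWindow(created_at=0, stabilization_time=300,
                         auto_extend=120, max_stabilization_time=600)
window.add_alarm(now=100)  # deadline moves from 300 to 420
window.add_alarm(now=400)  # deadline moves to 540
window.add_alarm(now=500)  # 660 would exceed the cap, so the deadline is 600
print(window.deadline)            # 600
print(window.add_alarm(now=601))  # False: the situation has already stabilized
```

    In this reading, a steady trickle of alarms keeps a situation open, while the maximum guarantees it eventually closes.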

    Examples

    Below are three common scenarios for using Custom Situations.

    1. SQL Server-specific situation

    1. Alarm filter: select SQL Server-related alarms sourced from UIM.
       [Figure 1]
    2. Clustering criteria: cluster alarms based on an exact (100%) entity name match.
       [Figure 2]
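
    Conceptually, these two settings amount to a filter followed by a group-by. Below is a minimal Python sketch, assuming alarms are plain dictionaries with hypothetical source, message and entity keys. These are not the product's field names, and the substring test merely stands in for whatever filter expression is configured in the UI.

```python
from collections import defaultdict


def cluster_by_entity(alarms):
    """Keep UIM alarms that mention SQL Server, grouped by exact entity name."""
    clusters = defaultdict(list)
    for alarm in alarms:
        is_uim = alarm["source"] == "UIM"
        is_sql = "sql server" in alarm["message"].lower()
        if is_uim and is_sql:
            # A 100% entity match means identical entity names share a cluster.
            clusters[alarm["entity"]].append(alarm)
    return dict(clusters)


# Made-up sample alarms for illustration only.
alarms = [
    {"source": "UIM", "entity": "sqlhost01", "message": "SQL Server: log file nearly full"},
    {"source": "UIM", "entity": "sqlhost01", "message": "SQL Server: long-running query"},
    {"source": "UIM", "entity": "sqlhost02", "message": "SQL Server: connection failures"},
    {"source": "DX APM", "entity": "sqlhost01", "message": "Slow frontend response"},
]
clusters = cluster_by_entity(alarms)
print({entity: len(group) for entity, group in clusters.items()})
# {'sqlhost01': 2, 'sqlhost02': 1}
```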

    2. Message queue-specific situation

    Consider the following MQ entities monitored using the APM Infrastructure Agent. These follow a specific hierarchy as presented by the pipe-separated paths.

    SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1

    SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2

    SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue3

    Here, alarms can be clustered based on Entity Matching Percentage values. The matching logic considers '|' as a separator between tokens or levels. After ignoring the restricted word 'SuperDomain', the number of tokens in each of the entity paths above is nine. To cluster alarms based on token ‘host1’ so that alarms, irrespective of the queue, are clustered by the host value, the matching percentage should be 55% (5/9). If you increase the percentage to 66% for example, it would match 6 out of 9 tokens and would cluster alarms based on the Queue Manager name. Thus, alarms can be clustered at the correct level based on the percentage of the match on APM entity paths.
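
    The arithmetic can be sketched in Python as follows. The sketch assumes that matching is a left-to-right comparison of leading tokens and that the percentage is a minimum to be reached; the product's actual algorithm is not described here. The token total of nine is taken from the text as given rather than counted from the sample paths, and the QM2 path is a made-up addition to show the difference between the two percentages.

```python
from collections import defaultdict

RESTRICTED = {"SuperDomain"}  # restricted word named in the text


def tokens_required(match_percent, total_tokens):
    """Smallest number of leading tokens that reaches match_percent of the total."""
    return -(-match_percent * total_tokens // 100)  # integer ceiling


def cluster_key(path, depth):
    """First `depth` tokens of a pipe-separated path, skipping restricted words."""
    tokens = [t for t in path.split("|") if t not in RESTRICTED]
    return tuple(tokens[:depth])


def cluster_paths(paths, match_percent, total_tokens):
    """Group paths that share the same leading tokens at the required depth."""
    depth = tokens_required(match_percent, total_tokens)
    clusters = defaultdict(list)
    for path in paths:
        clusters[cluster_key(path, depth)].append(path)
    return dict(clusters)


prefix = "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1"
paths = [
    prefix + "|QM1|Queues|QM1_Queue1",
    prefix + "|QM1|Queues|QM1_Queue2",
    prefix + "|QM1|Queues|QM1_Queue3",
    prefix + "|QM2|Queues|QM2_Queue1",  # hypothetical extra queue manager
]

by_host = cluster_paths(paths, match_percent=55, total_tokens=9)  # 5 tokens
by_qm = cluster_paths(paths, match_percent=66, total_tokens=9)    # 6 tokens
print(len(by_host), len(by_qm))  # 1 2
```

    At 55% all four paths fall into one host-level cluster; at 66% the QM1 and QM2 alarms separate.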

    3. Clustering on a business or IT service, across its dependent layers

    You can filter alarms that pertain to a specific service or that are part of a service hierarchy. For example, refer to the service hierarchy below. Alarms that belong to the services ‘App3’ and ‘App4’ and their child services of ‘Network’ and ‘container’ can be clustered together via a service criterion.

    [Figure 3]

    To achieve this:

    • Set the filter criteria to Service = App3 and App4, and
    • Set the clustering criteria to Service (match 100%), while selecting the ‘Include child services’ checkbox
    [Figure 4]

    This definition includes alarms across these four services in a single cluster. It helps practitioners visualize their services based on dependencies, and it aggregates components across applications, infrastructure, and network to simplify ticketing and triage and yield better insights.
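
    The effect of the child-services option can be illustrated with a short Python sketch. The parent-to-child mapping below is an assumption, since the real hierarchy is whatever the service definition contains; the function name and the alarm dictionaries are likewise hypothetical.

```python
def expand_services(selected, children):
    """Return the selected services plus all of their descendants."""
    seen = set()
    stack = list(selected)
    while stack:
        service = stack.pop()
        if service not in seen:
            seen.add(service)
            stack.extend(children.get(service, []))
    return seen


# Assumed hierarchy: which child hangs under which parent is a guess.
children = {"App3": ["Network"], "App4": ["container"]}

# Made-up alarms; only the service each one belongs to matters here.
alarms = [
    {"id": 1, "service": "App3"},
    {"id": 2, "service": "Network"},
    {"id": 3, "service": "container"},
    {"id": 4, "service": "App4"},
    {"id": 5, "service": "App1"},
]

scope = expand_services(["App3", "App4"], children)
cluster = [alarm["id"] for alarm in alarms if alarm["service"] in scope]
print(cluster)  # [1, 2, 3, 4]
```

    Without the child expansion, only alarms 1 and 4 would be in scope, and the network and container alarms would surface as separate items.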

    For more information and examples, refer to our documentation here.

    Tag(s): AIOps , DX OI

    Ravindra Puli

    Ravindra Puli works on the AIOps and Observability team as a Principal Software Engineer for DX Operational Intelligence. He has nearly 20 years of experience in architecting and developing enterprise solutions for monitoring and observability. Prior to Broadcom, he worked for Amdocs, Motorola and NetCracker.
