As IT infrastructures become increasingly complex to monitor and manage, with compelling new technologies such as virtual machines, software-defined networks, and containers overlaid onto existing technology stacks, IT operations teams face the additional challenge of nearly unmanageable ticket volumes. Ticket prioritization, correlation, redundancy, and the sheer speed of ticket generation become problems in and of themselves.
Addressing these problems takes time, can increase mean time to repair (MTTR), and adds stress for both your most skilled IT Ops practitioners and less experienced L1/L2 staff.
AIOps and Observability from Broadcom, through Custom Situations, offers an elegant solution to these ticketing challenges. Custom Situations provides a programmatic way to cluster and filter ‘noisy’ alarms, helping teams focus on the issues that really matter. Using custom rules and thresholds established by your teams, Custom Situations assesses every alarm and issue present in the environment. As a result, individual practitioners and the organization overall expend less time and energy as a consistent approach to ticket prioritization is adopted. Over time, the rules and thresholds can be tuned as the organization learns and optimizes its work, or as requirements change.
Getting started with Custom Situations
DevOps teams and SREs play an important early role in preparing for Custom Situations. This begins with identifying the IT infrastructure elements and applications that contribute to IT or business services generating large numbers of tickets. For each IT domain (application, infrastructure, and network), teams identify the specific entities (hosts, devices, databases, application components, etc.) that are logically related to the service. This mapping exercise yields important insights and sets the stage for closer cross-domain collaboration.
With this information in hand, teams should then work with business stakeholders to define service level objectives (SLOs) for their services. SLOs define the acceptable levels of reliability and performance, including thresholds for key metrics. Next, teams can configure alarms based on these SLOs that trigger when the performance or reliability of a service deviates from the desired levels.
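As a simple illustration, here is a minimal Python sketch of how an SLO threshold and the resulting alarm condition might be expressed. The service name, the 99.9% objective, and the success-ratio value are hypothetical; in practice, SLOs and their alarms are configured in DX Operational Intelligence rather than in code.

```python
# Minimal sketch of an SLO-driven alarm check. The service name, objective,
# and measured success ratio below are hypothetical sample values.
from dataclasses import dataclass

@dataclass
class SLO:
    service: str
    objective: float        # e.g. 0.999 means 99.9% successful requests
    window_minutes: int     # evaluation window for the SLI

def evaluate(slo: SLO, success_ratio: float) -> str | None:
    """Return an alarm message if the measured SLI breaches the SLO."""
    if success_ratio < slo.objective:
        return (f"SLO breach on {slo.service}: "
                f"{success_ratio:.4%} < {slo.objective:.3%} "
                f"over the last {slo.window_minutes} min")
    return None

print(evaluate(SLO("checkout-api", 0.999, 30), success_ratio=0.9972))
```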
Click here to learn more about Service Observability and SLI/SLO capabilities within DX Operational Intelligence.
Thinking about alarm filtering & relationships
SLO alarms and filtering are an excellent first level of prioritization for the underlying infrastructure and application alarms plaguing IT systems. As important as alarm filtering is, teams need more. They need to:
- Analyze relationships between hundreds or thousands of different alarms from different monitoring tools
- Consolidate alarms into a higher-level incident to gain a more meaningful and actionable view of the overall system status.
Introducing "Custom Situations" from AIOps and Observability from Broadcom
Custom Situations provides powerful capabilities which allow IT administrators to filter and correlate alarms and take actions against them based on the criteria identified. By using Custom Situations to proactively monitor your infrastructure and automate responses to issues, you can reduce the number of alarms that demand individual attention and prioritization by practitioners while improving your overall incident management results.
With Custom Situations, intuitive rules-based configurations allow you to tune the solution to the needs of your team and your specific infrastructure. Rule definitions enable flexible filtering of events and alarms, and establish adaptive clustering criteria suited to the ever-changing nature of modern infrastructure. A situation's lifetime, which controls how long the problem remains open, is also managed through rules. With this context, let’s dive into the details of creating Custom Situations. Complete documentation can be found here.
1. Identify common issues
To use Custom Situations to manage infrastructure with fewer tickets, start by identifying common issues that result in tickets being created. These could be anything from disk space running low to CPU usage spiking to network connectivity issues. By identifying these issues, you can create Custom Situations to monitor for them and automate responses when they occur.
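As a simple illustration of this triage step, one might tally historical tickets by issue type to see which problems recur most often and are therefore worth turning into Custom Situations. The ticket records and category names below are hypothetical sample data.

```python
# Illustrative sketch: tally historical tickets by issue type to spot the
# recurring problems worth turning into Custom Situations.
from collections import Counter

tickets = [
    {"id": 101, "issue": "disk_space_low"},
    {"id": 102, "issue": "cpu_spike"},
    {"id": 103, "issue": "disk_space_low"},
    {"id": 104, "issue": "network_connectivity"},
    {"id": 105, "issue": "disk_space_low"},
]

counts = Counter(t["issue"] for t in tickets)
for issue, n in counts.most_common(3):
    print(f"{issue}: {n} tickets")   # candidates for a Custom Situation
```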
2. Set filters for alarms
Based on the issues identified above, set filters for the alarms that represent those issues so that clustering can be performed on them. Most alarm fields are supported for filtering, including fields from other Broadcom solutions such as UIM, DX APM, and DX NetOps.
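Conceptually, an alarm filter is a predicate over alarm fields. The sketch below expresses one in Python for illustration only; the field names and values are hypothetical, and in the product the filter is defined in the Custom Situation rule configuration, not in code.

```python
# Rough sketch of an alarm filter expressed as a predicate over alarm fields.
# Field names ('source', 'severity', 'message') are illustrative only.
def matches_filter(alarm: dict) -> bool:
    return (
        alarm.get("source") == "UIM"
        and alarm.get("severity") in {"MAJOR", "CRITICAL"}
        and "disk" in alarm.get("message", "").lower()
    )

alarms = [
    {"source": "UIM", "severity": "CRITICAL", "message": "Disk /var 95% full"},
    {"source": "DX APM", "severity": "MINOR", "message": "GC pause elevated"},
]
print([a for a in alarms if matches_filter(a)])   # only the UIM disk alarm passes
```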
3. Select clustering criteria
Custom Situations can be created based on clustering criteria such as entity, message (with support for a matching percentage), and service hierarchy. For instance, you can define business services that span the layers of the infrastructure, including network, virtual infrastructure (virtual machines, containers), and applications. A problem in any layer impacts the services that depend on the associated infrastructure layers. You can use AIOps and Observability from Broadcom to define the service hierarchy and the impacted entities across the monitored infrastructure.
4. Configure time windows
These three time controls determine the lifetime of a Custom Situation (a sketch of how they might interact follows the list).
- Stabilization Time: The situation stabilizes if no updates are received within this time window.
- Auto Extend: When a new alarm is added to a cluster, the stabilization time is extended by the specified time value.
- Max Stabilization Time: The situation cannot be extended beyond this time.
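The sketch below approximates how these three controls might interact. The exact semantics are defined by the product; this only mirrors the descriptions above, and the timestamps and durations are made up.

```python
# Illustrative sketch: each new alarm extends the stabilization window by
# 'auto_extend', but never beyond 'max_stabilization' from the start time.
from datetime import datetime, timedelta

def stabilization_deadline(start: datetime,
                           alarm_times: list[datetime],
                           stabilization: timedelta,
                           auto_extend: timedelta,
                           max_stabilization: timedelta) -> datetime:
    deadline = start + stabilization
    hard_cap = start + max_stabilization
    for t in alarm_times:
        if t <= deadline:                            # alarm arrives while the situation is open
            deadline = min(deadline + auto_extend, hard_cap)
    return deadline

start = datetime(2024, 1, 1, 10, 0)
alarms = [start + timedelta(minutes=m) for m in (5, 12, 20)]
print(stabilization_deadline(start, alarms,
                             stabilization=timedelta(minutes=15),
                             auto_extend=timedelta(minutes=10),
                             max_stabilization=timedelta(minutes=40)))
# -> 2024-01-01 10:40:00, capped by the max stabilization time
```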
Examples
Below are three common scenarios for using Custom Situations.
1. SQL Server-specific situation
- Alarm filter: SQL Server-related alarms sourced from UIM
- Clustering criteria: Cluster alarms based on an exact (100%) entity name match, as shown in the sketch below
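A rough sketch of what exact entity-name clustering amounts to, using hypothetical alarm records and field names; in the product this is a clustering criterion selected in the rule, not code you write.

```python
# Illustrative sketch of exact (100%) entity-name clustering for the
# SQL Server example; alarm records and field names are hypothetical.
from collections import defaultdict

alarms = [
    {"id": 1, "source": "UIM", "entity": "sqlserver-prod-01", "message": "Log file nearly full"},
    {"id": 2, "source": "UIM", "entity": "sqlserver-prod-01", "message": "Blocking sessions detected"},
    {"id": 3, "source": "UIM", "entity": "sqlserver-prod-02", "message": "Log file nearly full"},
]

clusters = defaultdict(list)
for a in alarms:
    if a["source"] == "UIM":              # alarm filter: SQL Server alarms from UIM
        clusters[a["entity"]].append(a["id"])

print(dict(clusters))   # {'sqlserver-prod-01': [1, 2], 'sqlserver-prod-02': [3]}
```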
2. Message queue-specific situation
Consider the following MQ entities monitored using the APM Infrastructure Agent. These follow a specific hierarchy as presented by the pipe-separated paths.
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue3
Here, alarms can be clustered based on Entity Matching Percentage values. The matching logic treats '|' as a separator between tokens or levels. After ignoring the restricted word 'SuperDomain', the number of tokens in each of the entity paths above is nine. To cluster alarms on the ‘host1’ token, so that alarms are grouped by host irrespective of the queue, the matching percentage should be 55% (5/9). Increasing the percentage to 66%, for example, would match 6 out of 9 tokens and cluster alarms based on the Queue Manager name. Thus, alarms can be clustered at the correct level based on the percentage match on APM entity paths.
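The sketch below is a rough approximation of how a matching percentage might translate into a clustering key for these paths; it is not the product's exact algorithm, but it reproduces the 55% and 66% outcomes described above.

```python
# Illustrative approximation only: derive a clustering key from an APM entity
# path and an Entity Matching Percentage. The product's exact token counting
# may differ; this simply mirrors the 55% / 66% example in the text.
RESTRICTED = {"SuperDomain"}   # ignored when building the cluster key

def cluster_key(entity_path: str, match_pct: float) -> str:
    tokens = entity_path.split("|")
    usable = [t for t in tokens if t not in RESTRICTED]
    depth = round(match_pct / 100 * len(tokens))   # number of leading tokens to keep
    return "|".join(usable[:depth])

paths = [
    "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1",
    "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2",
]

# At 55% the key ends at 'host1' (cluster per host); at 66% it ends at 'QM1'
# (cluster per Queue Manager). Both paths share the same key in each case.
for pct in (55, 66):
    print(pct, {cluster_key(p, pct) for p in paths})
```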
3. Clustering on a business or IT service, across its dependent layers
You can filter alarms that pertain to a specific service or that are part of a service hierarchy. For example, refer to the service hierarchy below. Alarms that belong to the services ‘App3’ and ‘App4’ and their child services of ‘Network’ and ‘container’ can be clustered together via a service criterion.
To achieve this:
- Set the filter criteria to Service = App3 and App4 AND
- Set the clustering criteria to Service (match 100%), while selecting the checkbox to ‘Include child services’
This definition will include alarms across these four services in a single cluster. This helps practitioners visualize their services based on dependencies and aggregates components across applications, infrastructure, and network for simplified ticketing, faster triage, and better insights.
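For illustration, the sketch below mimics the effect of the service filter with ‘Include child services’ enabled. The hierarchy, service names, and alarm records are hypothetical; in the product this behavior comes from the filter and clustering criteria, not from code.

```python
# Illustrative sketch of filtering on App3/App4 with child services included,
# then rolling the matching alarms into a single cluster.
hierarchy = {"App3": ["Network"], "App4": ["container"]}   # parent -> children
selected = {"App3", "App4"}

# Expand the filter to the selected services plus their child services.
included = selected | {child for parent in selected for child in hierarchy.get(parent, [])}

alarms = [
    {"id": 1, "service": "App3"},
    {"id": 2, "service": "Network"},    # child of App3
    {"id": 3, "service": "container"},  # child of App4
    {"id": 4, "service": "App4"},
    {"id": 5, "service": "App5"},       # not part of the filter
]

cluster = [a["id"] for a in alarms if a["service"] in included]
print(cluster)   # [1, 2, 3, 4] -> one cluster spanning the four included services
```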
For more information and examples, refer to our documentation here.
Ravindra Puli
Ravindra Puli works on the AIOps and Observability team as a Principal Software Engineer for DX Operational Intelligence. He has nearly 20 years of experience in architecting and developing enterprise solutions for monitoring and observability. Prior to Broadcom, he worked for Amdocs, Motorola and NetCracker.