As IT infrastructures become increasingly complex to monitor and manage, with technologies such as virtual machines, software-defined networks, and containers overlaid onto existing technology stacks, IT operations teams face the additional challenge of nearly unmanageable ticket volumes. Ticket prioritization, correlation, redundancy, and the sheer speed of ticket generation become problems in their own right.
Addressing these problems takes time, can increase mean time to repair (MTTR), and adds stress for both your most skilled IT Ops practitioners and your less experienced L1/L2 staff.
AIOps and Observability from Broadcom, through Custom Situations, offers an elegant way to tackle these ticketing challenges. Custom Situations provides a programmatic way to cluster and filter ‘noisy’ alarms, helping teams focus on the issues that really matter. Using custom rules and thresholds established by your teams, Custom Situations assesses every alarm and issue present in the environment. As a result, individual practitioners and the organization as a whole expend less time and energy because a consistent approach to ticket prioritization is adopted. Over time, the rules and thresholds can be tuned as the organization learns and optimizes its work, or as requirements change.
DevOps teams and SREs play an important early role in preparing for Custom Situations. This begins with identifying the IT infrastructure elements and applications that contribute to the IT or business services generating large numbers of tickets. For each IT domain (application, infrastructure, and network), teams identify the specific entities (hosts, devices, databases, application components, etc.) that are logically related to the service. This mapping exercise yields important insights and sets the stage for closer cross-domain collaboration.
With this information in hand, teams should then work with business stakeholders to define and establish service level objectives (SLOs) for their services. SLOs define the acceptable levels of reliability and performance, including thresholds for key metrics. Next, teams can configure alarms based on these SLOs that trigger when the performance or reliability of a service deviates from the desired levels.
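Conceptually, an SLO-driven alarm amounts to comparing an observed service-level indicator against the agreed objective. The sketch below illustrates the idea only; the service name, metric, and threshold are hypothetical, and in practice this is configured within the product rather than coded by hand:

```python
# Hypothetical sketch: raise an alarm when a service-level indicator (SLI)
# breaches its service-level objective (SLO). Names and thresholds are
# illustrative, not product configuration.
from dataclasses import dataclass

@dataclass
class SLO:
    service: str
    metric: str            # e.g. "availability_pct" or "p95_latency_ms"
    objective: float       # acceptable threshold for the metric
    higher_is_better: bool

def evaluate(slo: SLO, observed: float) -> bool:
    """Return True (and raise an alarm) when the observed value violates the SLO."""
    breached = observed < slo.objective if slo.higher_is_better else observed > slo.objective
    if breached:
        print(f"ALARM: {slo.service} {slo.metric}={observed} violates objective {slo.objective}")
    return breached

# Example: a payment service promising 99.9% availability
evaluate(SLO("payments", "availability_pct", 99.9, True), observed=99.2)
```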
Click here to learn more about Service Observability and SLI/SLO capabilities within DX Operational Intelligence.
SLO alarms and filtering are an excellent first level of prioritization for the underlying infrastructure and application alarms plaguing IT systems. As important as alarm filtering is, though, teams need more: they need to correlate related alarms and act on them efficiently.
Custom Situations provides powerful capabilities which allow IT administrators to filter and correlate alarms and take actions against them based on the criteria identified. By using Custom Situations to proactively monitor your infrastructure and automate responses to issues, you can reduce the number of alarms that demand individual attention and prioritization by practitioners while improving your overall incident management results.
With Custom Situations, intuitive rules-based configurations allow you to tune the solution to meet the needs of your team and your specific infrastructure. Rule definitions enable flexible filtering of events and alarms and establish adaptive clustering criteria suited to the ever-changing nature of modern infrastructure. A situation's lifetime, which controls how long the problem remains open, is also managed using rules. With this context, let’s dive into the details of creating Custom Situations. Complete documentation can be found here.
To use Custom Situations to manage infrastructure with fewer tickets, start by identifying common issues that result in tickets being created. These could be anything from disk space running low to CPU usage spiking to network connectivity issues. By identifying these issues, you can create Custom Situations to monitor for them and automate responses when they occur.
Based on the issues identified above, set filters for the alarms that represent those issues so that clustering can be performed on them. Most alarm fields are supported for filtering, including fields from other Broadcom solutions such as UIM, DX APM, and DX NetOps.
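For illustration only, a filter along the lines of the sketch below selects the alarms that represent those recurring issues so clustering can operate on them; the field names and criteria are hypothetical and will differ from the actual alarm attributes exposed by UIM, DX APM, or DX NetOps:

```python
# Hypothetical sketch of alarm filtering: keep only alarms whose fields match
# the criteria identified earlier (e.g. disk, CPU, and connectivity issues).
# Field names are illustrative, not the actual alarm schema.
alarms = [
    {"source": "host1", "severity": "critical", "message": "Disk usage 95% on /var"},
    {"source": "host2", "severity": "minor",    "message": "CPU usage 65%"},
    {"source": "host3", "severity": "major",    "message": "Network link down on eth0"},
]

# Filter criteria: minimum severity plus keywords that identify ticket-heavy issues
SEVERITIES = {"major", "critical"}
KEYWORDS = ("disk", "cpu", "network")

def matches(alarm: dict) -> bool:
    text = alarm["message"].lower()
    return alarm["severity"] in SEVERITIES and any(k in text for k in KEYWORDS)

candidates = [a for a in alarms if matches(a)]
print(candidates)  # the alarms from host1 and host3 pass the filter
```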
Custom Situations can be created based on clustering criteria such as entity, message (both of which support matching percentages), and service hierarchy. For instance, you can define business services that span the layers of the infrastructure, including the network, virtual infrastructure (virtual machines, containers), and applications. When a problem occurs in any layer, it impacts the services that depend on the affected infrastructure. You can use AIOps and Observability from Broadcom to define the service hierarchy and identify the impacted entities across the monitored infrastructure.
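As a rough illustration of message-based clustering, the sketch below groups alarms whose message text is similar above a chosen percentage, using Python's difflib as a stand-in for the product's own matching algorithm, which may differ:

```python
# Hypothetical sketch: cluster alarms whose messages are at least `threshold`
# percent similar. This is only an approximation of "message matching
# percentage"; the product's own algorithm is not reproduced here.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def cluster_by_message(messages: list[str], threshold: float = 80.0) -> list[list[str]]:
    clusters: list[list[str]] = []
    for msg in messages:
        for cluster in clusters:
            if similarity(msg, cluster[0]) >= threshold:
                cluster.append(msg)   # message is close enough to join this cluster
                break
        else:
            clusters.append([msg])    # otherwise start a new cluster
    return clusters

print(cluster_by_message([
    "Disk usage 91% on /var on host1",
    "Disk usage 93% on /var on host2",
    "CPU usage 97% on host3",
]))
# -> the two disk alarms land in one cluster, the CPU alarm in another
```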
The following three time controls determine the lifetime of a Custom Situation (a small sketch of how they interact follows below).
Stabilization Time: The situation stabilizes if no new alarm updates are received within this time window.
Auto Extend: When a new alarm is added to the cluster, the stabilization time is extended by the specified value.
Max Stabilization Time: The situation cannot be extended beyond this time.
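Here is a minimal sketch of how these three controls could interact, assuming illustrative default values and simplified semantics (the product's actual behavior is governed by its own rules engine):

```python
# Hypothetical sketch of situation lifetime handling. A situation stays open
# while alarms keep arriving: each new alarm extends the stabilization window
# by the Auto Extend value, but never beyond the Max Stabilization Time.
from datetime import datetime, timedelta

class Situation:
    def __init__(self, stabilization=timedelta(minutes=10),
                 auto_extend=timedelta(minutes=5),
                 max_stabilization=timedelta(minutes=60)):
        self.started = datetime.now()
        self.deadline = self.started + stabilization         # Stabilization Time
        self.auto_extend = auto_extend                        # Auto Extend
        self.hard_limit = self.started + max_stabilization    # Max Stabilization Time

    def add_alarm(self) -> None:
        # Auto Extend: push the deadline out, capped by Max Stabilization Time
        self.deadline = min(self.deadline + self.auto_extend, self.hard_limit)

    def is_stabilized(self, now: datetime) -> bool:
        # The situation stabilizes once no update has arrived within the window
        return now >= self.deadline

s = Situation()
s.add_alarm()  # a new alarm joins the cluster, extending the window to 15 minutes
print(s.is_stabilized(datetime.now() + timedelta(minutes=20)))  # True: window elapsed
```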
Below are three common scenarios for using Custom Situations.
Consider the following MQ entities monitored using the APM Infrastructure Agent. These follow a specific hierarchy as presented by the pipe-separated paths.
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2
SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue3
Here, alarms can be clustered based on Entity Matching Percentage values. The matching logic treats '|' as a separator between tokens, or levels, and never counts the restricted word 'SuperDomain' as a match. Each of the entity paths above contains nine tokens. To cluster alarms on the token ‘host1’, so that alarms are clustered by host value irrespective of the queue, five of the nine tokens (apdveapi01 through host1) must match, which is a matching percentage of 55% (5/9). If you increase the percentage to 66%, for example, 6 of the 9 tokens must match, and alarms are clustered by Queue Manager name. Thus, alarms can be clustered at the correct level based on the percentage match on APM entity paths.
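To make the token arithmetic concrete, here is a minimal sketch of this matching logic, assuming a simple position-by-position comparison of tokens (the product's actual implementation may differ):

```python
# Hypothetical sketch of entity-path matching. Paths are split on '|', the
# restricted word 'SuperDomain' is never counted as a match, and the matching
# percentage is the number of matched tokens divided by the total token count.
RESTRICTED = {"SuperDomain"}

def match_percentage(path_a: str, path_b: str) -> float:
    a, b = path_a.split("|"), path_b.split("|")
    matched = sum(1 for x, y in zip(a, b) if x == y and x not in RESTRICTED)
    return matched / max(len(a), len(b)) * 100

p1 = "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue1"
p2 = "SuperDomain|apdveapi01|Infrastructure|Agent|Queue Managers|host1|QM1|Queues|QM1_Queue2"
print(round(match_percentage(p1, p2)))  # 78: 7 of 9 tokens match, so a 55% or 66% threshold clusters these alarms
```

By the same arithmetic, two alarms from different Queue Managers on the same host share only 5 of 9 tokens (about 56%), so they cluster together at a 55% threshold but fall into separate clusters at 66%.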
You can filter alarms that pertain to a specific service or that are part of a service hierarchy. For example, refer to the service hierarchy below. Alarms that belong to the services ‘App3’ and ‘App4’ and their child services of ‘Network’ and ‘container’ can be clustered together via a service criterion.
To achieve this, define a Custom Situation whose service clustering criterion includes the ‘App3’ and ‘App4’ services along with their child services.
This definition includes alarms across these four services in a single cluster. It helps practitioners visualize their services based on dependencies and aggregates components across applications, infrastructure, and network for simpler ticketing, faster triage, and better insights.
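As a rough illustration of the same idea, the sketch below filters alarms by a service scope built from parent services and their children. The parent-to-child mapping, alarm records, and helper function are hypothetical and only mirror the example above:

```python
# Hypothetical sketch: cluster alarms that belong to App3 or App4 or any of
# their child services. The hierarchy below is assumed for illustration;
# it is not an export from the product.
hierarchy = {
    "App3": ["Network"],
    "App4": ["container"],
}

def services_in_scope(parents: list[str]) -> set[str]:
    """Expand the selected parent services to include their child services."""
    scope = set(parents)
    for parent in parents:
        scope.update(hierarchy.get(parent, []))
    return scope

alarms = [
    {"id": 1, "service": "App3"},
    {"id": 2, "service": "container"},
    {"id": 3, "service": "App9"},   # hypothetical unrelated service
]

scope = services_in_scope(["App3", "App4"])
cluster = [a for a in alarms if a["service"] in scope]
print(cluster)  # alarms 1 and 2 fall into one Custom Situation; alarm 3 does not
```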
For more information and examples, refer to our documentation here.