Broadcom Software Academy Blog

Six Tips to Reduce Noise in IT Operations

Written by Adeesh Fulay | Apr 2, 2024 3:01:24 PM
Key Takeaways
  • Prioritize alerts to focus on critical issues, reducing distraction from non-essential notifications.
  • Utilize automated incident management to streamline responses and minimize manual intervention.
  • Implement machine learning for improved anomaly detection, enhancing operational efficiency and accuracy.

“We are drowning in noise all day long! Please help us!”

-Every IT operations team

Rich monitoring data is more important than ever for IT operations teams that manage the range of technology platforms and interconnected systems the business runs on. A natural result is that more signals, and more noise, compete for operator attention.

Every team wants to reduce noise, while avoiding the possibility that important signals will be lost or overlooked. Still, the volume of alarms, events, and notifications generated by various IT systems makes it difficult for IT teams to identify and prioritize critical issues. Excessive noise leads to email/pager fatigue, distracts operators, and increases the probability that a critical signal will go unnoticed.

What causes noise: Why it is challenging to separate noise from signals

Enterprises run a mix of legacy systems and applications alongside modern technologies, such as microservice-based applications running on containers and other cloud-native platforms. Everything is connected and dynamic. Theoretically, with ample time, expertise, and perspective, operators could manually find the signals that matter. In practice, there are many challenges.

Lack of context and limited understanding of dependencies

With interconnected IT systems, an issue in one component is likely to trigger alarms from multiple dependent components. For example, application performance can be impacted by noisy neighbor problems on an overloaded virtual infrastructure. Evaluating each alarm and investigating each suspect requires broad contextual understanding of upstream, downstream, and neighboring dependencies. Without this understanding, simple alarm suppression is likely to suppress important signals.

Alerting policies and thresholds don’t match alerting needs

Some IT systems and applications are more important than others. Monitoring policies and alarm thresholds should be tailored accordingly, whether guided by the critical nature of an application or by type of IT environment (for example, whether a system is running in development, test, performance, stage, or production).

If policies and thresholds are poorly set or not updated dynamically, non-critical internal applications may spawn as many alarms as mission critical customer-facing applications for a given situation. Teams will face the unenviable prospect of choosing what to ignore, or they will struggle as they attempt to attend to everything.

Lack of automation

Sifting through large volumes of alarms to find meaningful signals takes time. While this work is under way, the offending component keeps generating alarms, creating more noise and more signals. Well-tuned systems, supported by trustworthy automation, can help teams find and remediate the root cause more quickly, which breaks the alarm storm cycle. Additionally, manually removing aged alarms or certain groups of alarms can be a chore that never gets done, since new, important issues always take priority.

How do we combat noise?

With advanced algorithms and sophisticated analytics, AIOps can turn the challenge of too much monitoring data into a benefit. Using predictive and interpretive AI, AIOps and observability solutions from Broadcom help to reduce noise in IT operations by identifying and prioritizing issues that require attention. Below are six suggestions to keep noise to a minimum and your IT environments healthy by using DX Operational Intelligence.

1. Reduce alarm backlog by automatically clearing aged alarms

Some enterprises have a threshold review board to ensure that threshold values for different environments are set appropriately. This tends to add bureaucracy, which can be slow and reactive. Consequently, thresholds become out of date, causing many alarms to be ignored. These alarms can linger in monitoring systems for days and months, adding to the chaos.

With DX Operational Intelligence, teams benefit from dynamic thresholding (see tip #4 below). Teams can also configure automated policies to clear alarms based on alarm parameters, such as the age of the alarm. For example, many customers have set policies to automatically close unattended alarms older than 15 days.

If automatically closing alarms seems too drastic, automated policies can be configured to take other actions, such as changing the severity of the alarm, assigning it to another person or team, or sending a notification to a distribution list.

The sample policy captured above will clear alarms that meet the following criteria:

  • Were created more than 15 days ago
  • Are still in the New or Updated state
  • Do not have Critical or Major severity
  • Have not been ticketed
  • Have not been assigned to anyone

With this policy, alarms that meet all of these criteria are classified as unimportant and cleared automatically to cleanse the backlog.
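The five criteria can be read as a single predicate that must hold in full before an alarm is cleared. The following is a minimal Python sketch of that logic only; the `Alarm` fields, the state and severity strings, and the `should_auto_clear` helper are hypothetical names chosen for illustration and are not the DX Operational Intelligence policy format or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    # Hypothetical alarm record; field names are assumptions for this sketch.
    alarm_id: str
    created: datetime
    state: str                      # e.g. "New", "Updated", "Acknowledged"
    severity: str                   # e.g. "Critical", "Major", "Minor", "Warning"
    ticket_id: Optional[str] = None
    assignee: Optional[str] = None


MAX_AGE = timedelta(days=15)


def should_auto_clear(alarm: Alarm, now: datetime) -> bool:
    """Return True only when every criterion in the sample policy holds."""
    return (
        now - alarm.created > MAX_AGE                    # older than 15 days
        and alarm.state in {"New", "Updated"}            # still untouched
        and alarm.severity not in {"Critical", "Major"}  # low severity only
        and alarm.ticket_id is None                      # never ticketed
        and alarm.assignee is None                       # never assigned
    )


now = datetime(2024, 4, 2, 12, 0)
alarms = [
    Alarm("a1", now - timedelta(days=20), "New", "Minor"),
    Alarm("a2", now - timedelta(days=20), "New", "Critical"),
    Alarm("a3", now - timedelta(days=3), "Updated", "Minor"),
    Alarm("a4", now - timedelta(days=30), "Updated", "Warning", ticket_id="INC-1"),
]
to_clear = [a.alarm_id for a in alarms if should_auto_clear(a, now)]
print(to_clear)  # ['a1']
```

Only `a1` is cleared: `a2` is Critical, `a3` is too recent, and `a4` already has a ticket. Requiring every condition keeps the policy conservative, so an alarm that someone has touched in any way is left in place.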

2. Use maintenance windows to suppress unnecessary alarms

Maintenance Windows in DX Operational Intelligence allow system administrators to schedule periods during which maintenance tasks can be performed. During these periods, notifications for affected devices are automatically suppressed, and alarms generated during this period are marked with a maintenance flag.

The Maintenance Window defined above will suppress alarm notifications generated every Saturday from 3:26 pm to 4:26 pm.  
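A weekly window like this one reduces to a weekday and time-of-day check on each alarm's timestamp. The sketch below illustrates that check in Python. The constant names, the dictionary-based alarm shape, and the choice to treat the end time as exclusive are assumptions made for this example, not product behavior.

```python
from datetime import datetime, time

# Hypothetical window definition: every Saturday, 3:26 pm to 4:26 pm.
WINDOW_WEEKDAY = 5            # datetime.weekday(): Monday is 0, Saturday is 5
WINDOW_START = time(15, 26)
WINDOW_END = time(16, 26)     # treated as exclusive in this sketch


def in_maintenance_window(ts: datetime) -> bool:
    """True when the timestamp falls inside the weekly window."""
    return ts.weekday() == WINDOW_WEEKDAY and WINDOW_START <= ts.time() < WINDOW_END


def tag_alarm(alarm: dict) -> dict:
    """Mark the alarm with a maintenance flag and suppress its notification."""
    in_window = in_maintenance_window(alarm["raised_at"])
    return {**alarm, "maintenance": in_window, "notify": not in_window}


during = tag_alarm({"id": "m1", "raised_at": datetime(2024, 4, 6, 15, 30)})  # a Saturday
after = tag_alarm({"id": "m2", "raised_at": datetime(2024, 4, 6, 17, 0)})
print(during["maintenance"], during["notify"])  # True False
print(after["maintenance"], after["notify"])    # False True
```

Note that the alarm is still recorded and flagged rather than dropped, which matches the behavior described above: notifications are suppressed, but the alarm remains visible with its maintenance flag.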

3. Use Alarm Clustering to group related or duplicate alarms

Intelligently grouping related or duplicate alarms reduces alarm fatigue and enables faster incident resolution. By analyzing patterns and correlations within alarm clusters, DX Operational Intelligence can efficiently and accurately provide contextual understanding and actionable insights into the root causes of issues, which can be difficult for operators to deduce.

Here are two scenarios that highlight the power of Alarm Clustering:

  • Situations, a capability of DX Operational Intelligence, lets users apply a fully automated, ML-based clustering mechanism to group similar alarms and reduce the number of tickets that propagate downstream to level 2 or 3 support staff or SRE teams. This capability also supports clustering based on criteria such as message text, device name, severity, and service context for a fully customized experience. Additional information on this topic can be found in our Custom Situations blog.

Here, this Situations view shows overall alarm noise reduction of nearly 85%. Each Situation cluster has a severity assigned to it and is labeled with dominant characteristics to help explain the type of raw alarms it contains.

  • Hotspot Inspector, another capability within DX Operational Intelligence, quickly isolates the suspects associated with a specific problem. This type of correlation provides operators with a broader perspective on the full scope of dependencies that may require attention. A dependency may point the operator to a root cause or may allow the operator to better grasp the impact, which helps properly prioritize work. 

This screen shows the analysis conducted by Hotspot Inspector on an application issue. It shows suspect alarms along with the topology, transactions, metrics, and logs in context.
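To make the criteria-based grouping concrete, here is a minimal Python sketch that clusters alarms on matching field values and reports the resulting noise reduction. It is a simple rule-based illustration, not the ML-based mechanism used by Situations; the field names, the severity ranking, and the helper functions are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical severity ordering used to label each cluster.
SEVERITY_RANK = {"Warning": 0, "Minor": 1, "Major": 2, "Critical": 3}


def cluster_alarms(alarms, keys=("device", "message")):
    """Group alarms whose values match on every field in `keys`."""
    clusters = defaultdict(list)
    for alarm in alarms:
        clusters[tuple(alarm[k] for k in keys)].append(alarm)
    return dict(clusters)


def cluster_severity(members):
    """A cluster takes the highest severity found among its members."""
    return max((a["severity"] for a in members), key=SEVERITY_RANK.__getitem__)


def noise_reduction(raw_count, cluster_count):
    """Fraction of raw alarms an operator no longer has to look at one by one."""
    return 1 - cluster_count / raw_count


raw = [
    {"device": "db01", "message": "disk full", "severity": "Major"},
    {"device": "db01", "message": "disk full", "severity": "Major"},
    {"device": "db01", "message": "disk full", "severity": "Critical"},
    {"device": "web01", "message": "high latency", "severity": "Minor"},
    {"device": "web01", "message": "high latency", "severity": "Minor"},
]
clusters = cluster_alarms(raw)
summary = {key: (len(m), cluster_severity(m)) for key, m in clusters.items()}
reduction = noise_reduction(len(raw), len(clusters))
print(summary)
print(f"{reduction:.0%}")  # 60%
```

Five raw alarms collapse into two clusters, so the operator sees two items instead of five. Changing `keys` (for example, to group by service context instead of device) changes how aggressively alarms are merged.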

4. Use Intelligent Alarming

Anomaly detection to avoid “set-and-forget” thresholds

DX Operational Intelligence correlates and analyzes data from multiple sources to identify issues, such as anomalous behavior that requires attention. Instead of generating alarms for every minor issue, the solution uses machine learning algorithms to produce dynamic baselines and then generates alarms when it detects abnormal behavior. This approach dynamically takes into account seasonality and trending changes, which is not possible with static thresholds. Anomaly Detection can help engineers identify emerging situations, such as unusual resource usage, or more immediate issues, like an unexpected spike in traffic. 

The example above shows a policy that generates an anomalous alarm when a condition exceeds the AI-generated baseline threshold by 80% in five out of six occurrences.
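The "five out of six occurrences" condition can be sketched as a sliding window over recent samples. The Python example below assumes that "exceeds the baseline by 80%" means a value greater than 1.8 times the baseline, and it takes the baseline as an input rather than learning it. The `AnomalyRule` class and its parameters are hypothetical and do not reproduce the product's machine learning baselining.

```python
from collections import deque


class AnomalyRule:
    """Raise an alarm when enough recent samples exceed the baseline."""

    def __init__(self, deviation=0.80, required=5, window=6):
        self.deviation = deviation            # 0.80 means 80% above baseline
        self.required = required              # breaches needed to alarm
        self.breaches = deque(maxlen=window)  # keeps only the last `window` results

    def observe(self, value, baseline):
        """Record one sample and return True if the alarm condition is met."""
        self.breaches.append(value > baseline * (1 + self.deviation))
        return sum(self.breaches) >= self.required


rule = AnomalyRule()
baseline = 100  # assumed constant here; a real baseline would vary over time
samples = [190, 200, 120, 195, 210, 185, 220]
results = [rule.observe(v, baseline) for v in samples]
print(results)  # [False, False, False, False, False, True, True]
```

The single dip to 120 does not reset the rule; the alarm fires on the sixth sample because five of the last six values are above 180. Requiring several breaches in a window is what keeps one-off spikes from generating alarms.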

Predictive monitoring—capacity forecasting

When problems occur, IT operations staff often describe reactive firefighting or the formation of critical response teams. By contrast, predicting issues before they occur helps to reduce cost and stress, and improves overall IT performance.

Since many issues arise from unforeseen resource (capacity) needs, capacity forecasting can help organizations shift from reactive to proactive or preemptive management.

Using historical data and models to identify trends, Capacity Analytics within DX Operational Intelligence helps you determine the required capacity for infrastructure resources, such as CPU, memory, storage, network, and more. This helps ensure the operational continuity of enterprise workloads. Teams benefit by:

  • Predicting capacity for peak seasons
  • Understanding when more resources will be needed, which helps teams plan accordingly
  • Procuring additional resources only when required
  • Efficiently managing infrastructure and networks
  • Making better use of resources by identifying those that are underutilized

The screenshot above shows a list of resources ranked by capacity utilization to help operators focus on those that have or may soon breach critical thresholds. 

This screenshot shows capacity utilization measured by Average Wait Time to illustrate the actual three-month historical data compared to the six-month forecast derived by DX Operational Intelligence. You can switch between different forecast periods: days, months, and even a year.
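The idea of comparing historical data with a forecast can be illustrated with a simple linear trend. The Python sketch below fits a least-squares line to three months of made-up utilization figures and projects it six months ahead. The data, the function names, and the choice of a linear model are assumptions for illustration; the forecasting models used by Capacity Analytics are not described here and may differ.

```python
def fit_linear_trend(values):
    """Ordinary least-squares line through (0, v0), (1, v1), ...; needs 2+ points."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(values, periods):
    """Project the fitted trend `periods` steps past the last observation."""
    slope, intercept = fit_linear_trend(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(periods)]


def first_breach(projected, limit):
    """Number of periods ahead at which the projection reaches `limit`, or None."""
    return next((k + 1 for k, v in enumerate(projected) if v >= limit), None)


history = [50, 55, 60]            # hypothetical monthly utilization, in percent
projected = forecast(history, 6)  # six-month forecast
months_to_limit = first_breach(projected, limit=80)
print(projected)        # [65.0, 70.0, 75.0, 80.0, 85.0, 90.0]
print(months_to_limit)  # 4
```

With this data, utilization is projected to reach the 80% limit four months out, which gives a team a planning horizon instead of a surprise. A straight line ignores seasonality, so it is only a starting point for the peak-season planning mentioned above.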

5. Use Automated Alarm Insights

After initially configuring policies and thresholds, teams sometimes neglect them, only re-assessing them after problems arise or when alarm volumes increase or decrease unexpectedly.

With Automated Insights, DX Operational Intelligence will highlight potential issues with configured policies or threshold values. For example, if there is a spike in certain types of alarms, the solution will flag this policy directly within the Alarm Console. The solution will also highlight defined thresholds that warrant attention based on their age or on monitored data. This encourages operators to keep these settings current.
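Both kinds of insight, an alarm spike for a policy and a threshold that has gone too long without review, can be sketched as simple checks. The Python example below is an illustration only: the spike factor of three, the 180-day review age, and both function names are assumed values and hypothetical helpers, not the rules that Automated Insights applies.

```python
from datetime import date
from statistics import mean


def alarm_spike(daily_counts, factor=3.0):
    """True when the latest day's alarm count far exceeds the prior average."""
    *history, today = daily_counts
    return today > factor * mean(history)


def stale_threshold(last_reviewed, today, max_age_days=180):
    """True when a threshold has not been reviewed within the allowed age."""
    return (today - last_reviewed).days > max_age_days


# Hypothetical alarm counts per day for one policy, oldest first.
print(alarm_spike([10, 12, 11, 9, 40]))                          # True
print(alarm_spike([10, 12, 11, 9, 12]))                          # False
print(stale_threshold(date(2023, 6, 1), today=date(2024, 4, 2)))  # True
print(stale_threshold(date(2024, 3, 1), today=date(2024, 4, 2)))  # False
```

Checks like these surface the policies and thresholds most likely to need attention, so reviews can be targeted rather than periodic and exhaustive.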

6. Add business awareness to IT practices

Adding business awareness refers to the practice of monitoring and analyzing systems and applications with a focus on understanding how their performance and behavior affect business objectives and outcomes. It extends traditional observability practices, which primarily focus on technical metrics and system health, by incorporating business metrics (such as KPIs, SLIs, and SLOs) and context into the monitoring and analysis process.

Alignment with business goals ensures that IT operations and monitoring efforts are prioritized based on business KPIs like revenue, customer satisfaction, user engagement, or time-to-market—not just technical metrics like CPU usage or response times.

To learn more, be sure to review our blog about modeling business services.

Here are examples that you can apply in your own organization:

  • Rather than alarming on all server downtimes, you might prioritize alarms and notification policies associated with critical business functions.
  • If a critical alarm on an important business service is not handled within a specified time, a policy can automatically escalate it to another team, use notification channels to generate tickets or emails, or use webhooks to make calls to another tool.
  • You can create alarm queues for critical business services so alarms associated with the service are triaged at the highest priority.
  • You can use service level indicators (SLIs) and service level objectives (SLOs) that reflect both technical and business requirements. These metrics define acceptable levels of service quality from both technical and business perspectives and serve as a basis for prioritization and measuring performance.
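The prioritization and escalation examples above can be sketched as a sort order plus a timeout check. In the Python example below, the service names, tier numbers, 30-minute escalation delay, and alarm fields are all hypothetical values chosen for illustration; an actual policy would be configured in the monitoring tool rather than written as code.

```python
from datetime import datetime, timedelta

# Hypothetical business tiers: 1 is most critical to the business.
SERVICE_TIER = {"checkout": 1, "reporting": 2, "internal-wiki": 3}
SEVERITY_RANK = {"Warning": 0, "Minor": 1, "Major": 2, "Critical": 3}
UNKNOWN_TIER = 99
ESCALATE_AFTER = timedelta(minutes=30)


def triage_order(alarms):
    """Sort by business tier first, then severity, then oldest first."""
    return sorted(
        alarms,
        key=lambda a: (
            SERVICE_TIER.get(a["service"], UNKNOWN_TIER),
            -SEVERITY_RANK[a["severity"]],
            a["raised_at"],
        ),
    )


def needs_escalation(alarm, now):
    """Unacknowledged critical alarms on tier 1 services escalate after a delay."""
    return (
        SERVICE_TIER.get(alarm["service"], UNKNOWN_TIER) == 1
        and alarm["severity"] == "Critical"
        and not alarm["acknowledged"]
        and now - alarm["raised_at"] > ESCALATE_AFTER
    )


now = datetime(2024, 4, 2, 12, 0)
alarms = [
    {"id": "w1", "service": "internal-wiki", "severity": "Critical",
     "raised_at": now - timedelta(hours=2), "acknowledged": False},
    {"id": "c1", "service": "checkout", "severity": "Major",
     "raised_at": now - timedelta(minutes=50), "acknowledged": False},
    {"id": "c2", "service": "checkout", "severity": "Critical",
     "raised_at": now - timedelta(minutes=45), "acknowledged": False},
    {"id": "c3", "service": "checkout", "severity": "Critical",
     "raised_at": now - timedelta(minutes=5), "acknowledged": False},
]
order = [a["id"] for a in triage_order(alarms)]
escalate = [a["id"] for a in alarms if needs_escalation(a, now)]
print(order)     # ['c2', 'c3', 'c1', 'w1']
print(escalate)  # ['c2']
```

The critical alarm on the internal wiki is the oldest, yet it is triaged last because the service sits in the lowest business tier. That ordering is the practical difference between severity-driven and business-aware triage.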

This dashboard provides an overview of the health of 140 services defined in DX Operational Intelligence. This view helps identify the locations that are affected by poor performance and directs the user to the specific suspects that are exceeding defined thresholds. A services orientation to managing health and performance complements traditional incident and alarm triage.

Summary

Each of these suggestions can be adopted individually. Addressing alarm noise and helping teams focus on signals that warrant attention is an ongoing effort. With DX Operational Intelligence, much of this work can be handled through automation and by managing policies and thresholds programmatically, in ways that benefit large groups of users. Because the sources of alarm noise vary from one organization to the next, consider each of these options separately so you can implement those that make the greatest difference for your teams.

For additional detail on these approaches, visit the following areas for technical documentation:

  1. Auto-close Aged Alarms
  2. Maintenance Windows
  3. Situations—Alarm Clustering
  4. Inspector
  5. Configure Alarms—Set Up Anomaly Alarms
  6. Capacity Analytics
  7. Service Creation