April 2, 2024

Six Tips to Reduce Noise in IT Operations

Key Takeaways

Prioritize alerts to focus on critical issues, reducing distraction from non-essential notifications.
Utilize automated incident management to streamline responses and minimize manual intervention.
Implement machine learning for improved anomaly detection, enhancing operational efficiency and accuracy.

“We are drowning in noise all day long! Please help us!”

-Every IT operations team

Rich monitoring data is more important than ever for IT operations teams that manage the range of technology platforms and interconnected systems the business runs on. A natural result is that more signals, and more noise, compete for operator attention.

Every team wants to reduce noise, while avoiding the possibility that important signals will be lost or overlooked. Still, the volume of alarms, events, and notifications generated by various IT systems makes it difficult for IT teams to identify and prioritize critical issues. Excessive noise leads to email/pager fatigue, distracts operators, and increases the probability that a critical signal will go unnoticed.

[Figure 1]

What causes noise: Why it is challenging to separate noise from signals

Enterprises run a mix of legacy systems and applications alongside modern technologies, such as microservice-based applications running on containers and other cloud-native platforms. Everything is connected and dynamic. Theoretically, with ample time, expertise, and perspective, operators could manually find the signals that matter. In practice, there are many challenges.

Lack of context and limited understanding of dependencies

With interconnected IT systems, an issue in one component is likely to trigger alarms from multiple dependent components. For example, application performance can suffer from noisy-neighbor problems on an overloaded virtual infrastructure. Evaluating each alarm and investigating each suspect requires a broad contextual understanding of upstream, downstream, and neighboring dependencies. Without this, simple alarm suppression is likely to suppress important signals.

Alerting policies and thresholds don’t match alerting needs

Some IT systems and applications are more important than others. Monitoring policies and alarm thresholds should be tailored accordingly, whether guided by the critical nature of an application or by type of IT environment (for example, whether a system is running in development, test, performance, stage, or production).

If policies and thresholds are poorly set or not updated dynamically, non-critical internal applications may spawn as many alarms as mission critical customer-facing applications for a given situation. Teams will face the unenviable prospect of choosing what to ignore, or they will struggle as they attempt to attend to everything.

Lack of automation

Sifting through large volumes of alarms to find meaningful signals takes time. While this work goes on, the offending component keeps generating alarms, creating more noise and more signals. Well-tuned systems, supported by trustworthy automation, can help teams find and remediate the root cause more quickly, which breaks the alarm storm cycle. Additionally, manually removing aged alarms or particular groups of alarms can be a chore that never gets done, since new and important issues always take priority.

How do we combat noise?

With advanced algorithms and sophisticated analytics, AIOps can turn the challenge of too much monitoring data into a benefit. Using predictive and interpretive AI, AIOps and observability solutions from Broadcom help to reduce noise in IT operations by identifying and prioritizing issues that require attention. Below are six suggestions to keep noise to a minimum and your IT environments healthy by using DX Operational Intelligence.

1. Reduce alarm backlog by automatically clearing aged alarms

Some enterprises have a threshold review board to ensure that threshold values for different environments are set appropriately. This tends to add bureaucracy, which can be slow and reactive. Consequently, thresholds become out of date, causing many alarms to be ignored. These alarms can linger in monitoring systems for days or even months, adding to the chaos.

With DX Operational Intelligence, teams benefit from dynamic thresholding (see tip #4 below). Teams can also configure automated policies to clear alarms based on alarm parameters, such as the age of the alarm. For example, many customers have set policies to automatically close unattended alarms older than 15 days.

If automatically closing alarms seems too drastic, automated policies can be configured to take other actions, such as changing the severity of the alarm, assigning it to another person or team, or sending a notification to a distribution list.

[Figure 2]

The sample policy captured above will clear alarms that meet the following criteria:

Created more than 15 days ago
Are still in New or Updated state
Do not have Critical or Major severity
Have not been ticketed
Have not been assigned to anyone

With this policy, alarms that meet all of these criteria are classified as unimportant and cleared automatically to cleanse the backlog.
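The criteria above amount to a single predicate over each alarm. The following is a minimal Python sketch of that logic, not the DX Operational Intelligence policy engine or its API: the `Alarm` class, its field names, the state and severity strings, and the `should_auto_clear` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    # Hypothetical alarm record; field names are illustrative assumptions.
    created: datetime
    state: str                      # e.g. "New", "Updated", "Acknowledged"
    severity: str                   # e.g. "Critical", "Major", "Minor"
    ticket_id: Optional[str] = None
    assignee: Optional[str] = None


MAX_AGE = timedelta(days=15)


def should_auto_clear(alarm: Alarm, now: datetime) -> bool:
    """Return True only when the alarm meets every criterion in the sample policy."""
    return (
        now - alarm.created > MAX_AGE                     # older than 15 days
        and alarm.state in {"New", "Updated"}             # never worked on
        and alarm.severity not in {"Critical", "Major"}   # low severity only
        and alarm.ticket_id is None                       # not ticketed
        and alarm.assignee is None                        # not assigned
    )


now = datetime(2024, 4, 2)
stale = Alarm(created=now - timedelta(days=20), state="New", severity="Minor")
urgent = Alarm(created=now - timedelta(days=20), state="New", severity="Critical")

# Keep only the alarms that the policy would not clear.
backlog = [a for a in (stale, urgent) if not should_auto_clear(a, now)]
```

Because every condition must hold, an old alarm that is Critical, ticketed, or assigned stays in the backlog, which is what makes this kind of cleanup safe to automate.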

2. Use maintenance windows to suppress unnecessary alarms

Maintenance Windows in DX Operational Intelligence allow system administrators to schedule periods during which maintenance tasks can be performed. During these periods, notifications for affected devices are automatically suppressed, and alarms generated during this period are marked with a maintenance flag.

[Figure 3]

The Maintenance Window defined above will suppress alarm notifications generated every Saturday from 3:26 pm to 4:26 pm.
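To make the behavior concrete, here is a minimal Python sketch of a weekly window check. It is an illustration under stated assumptions rather than the product's implementation: the constants, the `in_maintenance_window` and `handle_alarm` functions, and the dictionary keys are hypothetical, and time zones are ignored for brevity.

```python
from datetime import datetime, time

# Hypothetical weekly window: Saturday, 3:26 pm to 4:26 pm (naive local time).
WINDOW_WEEKDAY = 5              # datetime.weekday(): Monday is 0, Saturday is 5
WINDOW_START = time(15, 26)
WINDOW_END = time(16, 26)


def in_maintenance_window(ts: datetime) -> bool:
    """Return True if the timestamp falls inside the weekly window."""
    return ts.weekday() == WINDOW_WEEKDAY and WINDOW_START <= ts.time() < WINDOW_END


def handle_alarm(alarm: dict, ts: datetime) -> dict:
    """Keep the alarm, but flag it and suppress its notification during maintenance."""
    in_window = in_maintenance_window(ts)
    return {**alarm, "maintenance": in_window, "notify": not in_window}


# April 6, 2024 is a Saturday.
during = handle_alarm({"device": "db-01"}, datetime(2024, 4, 6, 15, 30))
outside = handle_alarm({"device": "db-01"}, datetime(2024, 4, 6, 17, 0))
```

Note that the alarm itself is retained and only marked, so the record of what happened during maintenance is still available afterward.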

3. Use Alarm Clustering to group related or duplicate alarms

Intelligently grouping related or duplicate alarms reduces alarm fatigue and enables faster incident resolution. By analyzing patterns and correlations within alarm clusters, DX Operational Intelligence can efficiently and accurately provide contextual understanding and actionable insights into the root causes of issues, which can be difficult for operators to deduce on their own.

Here are two scenarios that highlight the power of Alarm Clustering:

Situations, a capability of DX Operational Intelligence, allows users to leverage a fully automated, ML-based clustering mechanism to group similar alarms and reduce the number of tickets that would propagate downstream to level 2 or 3 support staff or SRE teams. This capability also supports clustering based on such criteria as message text, device name, severity, service context, and so on for a fully customized experience. Additional information on this topic can be found in our Custom Situations blog.

[Figure 4]

Here, this Situations view shows overall alarm noise reduction of nearly 85%. Each Situation cluster has a severity assigned to it and is labeled with dominant characteristics to help explain the type of raw alarms it contains.
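The criteria-based form of clustering can be pictured as grouping alarms that share the same attribute values. The Python sketch below shows only that simple exact-match grouping, not the ML-based mechanism used by Situations; the function names, dictionary keys, and sample alarms are made up for illustration.

```python
from collections import defaultdict


def cluster_alarms(alarms, keys=("device", "message")):
    """Group alarms that share the same values for the chosen attributes."""
    clusters = defaultdict(list)
    for alarm in alarms:
        clusters[tuple(alarm[k] for k in keys)].append(alarm)
    return dict(clusters)


def noise_reduction(alarm_count, cluster_count):
    """Percentage of items an operator no longer has to triage individually."""
    return 100.0 * (1 - cluster_count / alarm_count)


# Made-up sample: 20 raw alarms from three distinct problems.
alarms = (
    [{"device": "web-01", "message": "High CPU", "severity": "Major"}] * 12
    + [{"device": "db-01", "message": "Disk latency", "severity": "Critical"}] * 7
    + [{"device": "lb-01", "message": "Link down", "severity": "Critical"}]
)

clusters = cluster_alarms(alarms)
reduction = noise_reduction(len(alarms), len(clusters))
```

In this sample, 20 raw alarms collapse into 3 clusters. Changing `keys` (for example, to group by severity or service) changes how aggressively alarms are merged, which is the trade-off a custom clustering rule has to balance.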

Hotspot Inspector, another capability within DX Operational Intelligence, quickly isolates the suspects associated with a specific problem. This type of correlation provides operators with a broader perspective on the full scope of dependencies that may require attention. A dependency may point the operator to a root cause or may allow the operator to better grasp the impact, which helps properly prioritize work.

[Figure 5]

This screen shows the analysis conducted by Hotspot Inspector on an application issue. It shows suspect alarms along with the topology, transactions, metrics, and logs in context.

4. Use Intelligent Alarming

Anomaly detection to avoid “set-and-forget” thresholds

DX Operational Intelligence correlates and analyzes data from multiple sources to identify issues, such as anomalous behavior that requires attention. Instead of generating alarms for every minor issue, the solution uses machine learning algorithms to produce dynamic baselines and then generates alarms when it detects abnormal behavior. This approach dynamically takes into account seasonality and trending changes, which is not possible with static thresholds. Anomaly Detection can help engineers identify emerging situations, such as unusual resource usage, or more immediate issues, like an unexpected spike in traffic.

[Figure 6]

The example above shows a policy that generates an anomaly alarm when a condition exceeds the AI-generated baseline threshold by 80% in five out of six occurrences.
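A rule of this shape can be sketched in a few lines of Python. This is an illustration only: a plain average stands in for the machine-learned baseline (which, unlike this stand-in, accounts for seasonality and trend), "exceeds by 80%" is assumed to mean a value greater than 1.8 times the baseline, and the function name, parameters, and sample values are hypothetical.

```python
from statistics import mean


def detect_anomaly(history, recent, margin=0.80, required=5, window=6):
    """Flag an anomaly when at least `required` of the last `window` samples
    exceed the baseline by more than `margin` (0.80 means 80%)."""
    baseline = mean(history)            # simple stand-in for a learned baseline
    limit = baseline * (1 + margin)
    last = list(recent)[-window:]
    breaches = sum(1 for value in last if value > limit)
    # Require a full window so a short series cannot trigger the alarm.
    return len(last) == window and breaches >= required


history = [100, 104, 98, 101, 97, 100]   # made-up samples; baseline is 100
spike = [185, 190, 170, 200, 195, 188]   # five of six samples exceed 180
blip = [185, 120, 110, 105, 190, 100]    # only two samples exceed 180

sustained = detect_anomaly(history, spike)
transient = detect_anomaly(history, blip)
```

Requiring five breaches out of six is what filters out short-lived blips: a single outlier, or even two, does not raise an alarm, while a sustained deviation does.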

Predictive monitoring—capacity forecasting

When problems occur, IT operations staff often describe anecdotes of reactive firefighting or critical response team formation. Alternatively, predicting issues before they occur helps to reduce cost and stress, and improves overall IT performance.

Since many issues arise from unforeseen resource (capacity) needs, capacity forecasting can help organizations shift from reactive to proactive or preemptive management.

Using historical data and models to identify trends, Capacity Analytics within DX Operational Intelligence helps you determine the required capacity for infrastructure resources, such as CPU, memory, storage, network, and more. This helps ensure the operational continuity of enterprise workloads. Teams benefit by:

Predicting capacity for peak seasons
Understanding when more resources will be needed, which helps teams plan accordingly
Procuring additional resources only when required
Efficiently managing infrastructure and networks
Making better use of resources by identifying those that are underutilized

[Figure 7]

The screenshot above shows a list of resources ranked by capacity utilization to help operators focus on those that have breached, or may soon breach, critical thresholds.

[Figure 8]

This screenshot shows capacity utilization measured by Average Wait Time to illustrate the actual three-month historical data compared to the six-month forecast derived by DX Operational Intelligence. You can switch between different forecast periods: days, months, and even a year.
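The general idea of projecting history forward can be illustrated with a straight-line trend. The Python sketch below uses ordinary least squares on made-up monthly CPU utilization; it is not the forecasting model used by Capacity Analytics, and the function names, the sample data, and the 80% threshold are assumptions for illustration.

```python
def fit_trend(values):
    """Ordinary least-squares line through equally spaced samples.
    Returns (slope, intercept), with x = 0 for the first sample."""
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_mean - slope * x_mean


def periods_until(values, threshold):
    """Periods from the last sample until the trend line reaches the threshold.
    Returns None when the trend is flat or falling."""
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))


monthly_cpu = [52.0, 56.0, 60.0]          # made-up utilization (%), three months
slope, intercept = fit_trend(monthly_cpu)
forecast = [intercept + slope * month for month in range(3, 9)]   # next six months
months_left = periods_until(monthly_cpu, threshold=80.0)
```

With this sample, utilization grows by 4 points per month and the trend line reaches 80% five months after the last observation. A linear fit ignores seasonality, so it is best read as a rough planning aid rather than a substitute for a model that learns peak-season patterns.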

5. Use Automated Alarm Insights

After initially configuring policies and thresholds, teams sometimes neglect them, only re-assessing them after problems arise or when alarm volumes increase or decrease unexpectedly.

With Automated Insights, DX Operational Intelligence will highlight potential issues with configured policies or threshold values. For example, if there is a spike in certain types of alarms, the solution will flag this policy directly within the Alarm Console. The solution will also highlight defined thresholds that warrant attention based on their age or on monitored data. This encourages operators to keep these settings current.
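Checks of this kind can be approximated with simple heuristics. The Python sketch below flags a policy whose latest daily alarm count is well above its recent average, and lists thresholds that have not been reviewed for a long time. The heuristics, function names, the factor of 3, the 180-day limit, and the sample data are all assumptions for illustration and do not describe how Automated Insights works internally.

```python
from datetime import date
from statistics import mean


def flag_spiking_policies(daily_counts, factor=3.0):
    """Return policies whose latest daily alarm count exceeds `factor` times
    the average of the preceding days."""
    flagged = []
    for policy, counts in daily_counts.items():
        *previous, latest = counts
        if previous and latest > factor * mean(previous):
            flagged.append(policy)
    return flagged


def stale_thresholds(last_updated, today, max_age_days=180):
    """Return thresholds that have not been reviewed within `max_age_days`."""
    return [
        name
        for name, updated in last_updated.items()
        if (today - updated).days > max_age_days
    ]


# Made-up sample data: alarms per day for each policy, oldest first.
counts = {"disk-usage": [10, 12, 11, 50], "cpu-high": [20, 22, 19, 21]}
updated = {"cpu-high": date(2023, 1, 15), "disk-usage": date(2024, 3, 1)}

spiking = flag_spiking_policies(counts)
stale = stale_thresholds(updated, today=date(2024, 4, 2))
```

Here the `disk-usage` policy is flagged for a spike and the `cpu-high` threshold is flagged for its age, which gives an operator two concrete settings to revisit.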

[Figure 9]

[Figure 10]

6. Add business awareness to IT practices

Adding business awareness refers to the practice of monitoring and analyzing systems and applications with a focus on understanding how their performance and behavior affect business objectives and outcomes. It extends traditional observability practices, which primarily focus on technical metrics and system health, by incorporating business metrics (such as KPIs, SLIs, and SLOs) and context into the monitoring and analysis process.

Alignment with business goals ensures that IT operations and monitoring efforts are prioritized based on business KPIs like revenue, customer satisfaction, user engagement, or time-to-market—not just technical metrics like CPU usage or response times.

To learn more, be sure to review our blog about modeling business services.

Here are examples that you can apply in your own organization:

Rather than alarming on all server downtimes, you might prioritize alarms and notification policies associated with critical business functions.
If a critical alarm on an important business service is not handled within a specified time, a policy can automatically escalate it to another team, use notification channels to generate tickets or emails, or use webhooks to make calls to another tool.
You can create alarm queues for critical business services so alarms associated with the service are triaged at the highest priority.
You can use service level indicators (SLIs) and service level objectives (SLOs) that reflect both technical and business requirements. These metrics define acceptable levels of service quality from both technical and business perspectives and serve as a basis for prioritization and measuring performance.
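Two of the examples above, business-priority triage and time-based escalation, can be sketched in Python as follows. This is a hedged illustration rather than a DX Operational Intelligence configuration: the service names, the priority mapping, the 30-minute limit, the dictionary keys, and both functions are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical mapping of business services to priority (1 is most critical).
SERVICE_PRIORITY = {"online-checkout": 1, "internal-wiki": 3}
DEFAULT_PRIORITY = 99
ESCALATE_AFTER = timedelta(minutes=30)


def triage_order(alarms):
    """Sort alarms so those tied to critical business services come first,
    oldest first within the same priority."""
    return sorted(
        alarms,
        key=lambda a: (SERVICE_PRIORITY.get(a["service"], DEFAULT_PRIORITY), a["raised"]),
    )


def needs_escalation(alarm, now):
    """True when a critical alarm on a top-priority service is still unhandled
    after the escalation time limit."""
    return (
        alarm["severity"] == "Critical"
        and SERVICE_PRIORITY.get(alarm["service"], DEFAULT_PRIORITY) == 1
        and not alarm["acknowledged"]
        and now - alarm["raised"] > ESCALATE_AFTER
    )


now = datetime(2024, 4, 2, 12, 0)
wiki = {"service": "internal-wiki", "severity": "Critical",
        "acknowledged": False, "raised": now - timedelta(hours=2)}
checkout = {"service": "online-checkout", "severity": "Critical",
            "acknowledged": False, "raised": now - timedelta(minutes=45)}

queue = triage_order([wiki, checkout])
to_escalate = [a for a in queue if needs_escalation(a, now)]
```

Although the wiki alarm is older, the checkout alarm is triaged first and is the only one escalated, because priority comes from the business service rather than from alarm age or technical severity alone.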

[Figure 11]

This dashboard provides an overview of the health of 140 services defined in DX Operational Intelligence. This view helps identify the locations that are affected by poor performance and directs the user to the specific suspects that are exceeding defined thresholds. A services orientation to managing health and performance complements traditional incident and alarm triage.

Summary

Each of these suggestions can be adopted individually. Addressing alarm noise and helping teams focus on signals that warrant attention is an ongoing effort. With DX Operational Intelligence, much of this work can be handled through automation and by managing policies and thresholds programmatically, in ways that benefit large groups of users. Because the sources of alarm noise vary from one organization to the next, consider each of these options separately so you can implement those that make the greatest difference for your teams.

For additional detail on these approaches, visit the following areas for technical documentation:

Tag(s): AIOps , DX OI , DX APM

Adeesh Fulay

Adeesh is the Head of Engineering for DX Operational Intelligence & Data Platform at Broadcom.
