December 15, 2021

Anomaly Detection

IT Operations has a wide spectrum of roles and responsibilities. The positions range from level 1 (L1) operators to Site Reliability Engineers (SREs) and everything in between. L1 operators, for example, are often almost exclusively reactive. They feed off the constant stream of incidents reported by clients and events reported by monitoring and alerting systems. SREs work at the other end of the spectrum. Their job is to be proactive and to make sure that development teams are self-sufficient and able to plan for and execute tasks at scale.

Everyone in these operational roles needs to find, diagnose, and resolve issues that occur within the systems and services they support. These systems underpin the company's day-to-day business needs. Detecting an anomaly early does more to reduce the mean time to repair (MTTR) than almost any other activity. Timeliness is vital. The faster an anomaly can be detected, the sooner it can be acted upon.

MTTR is a cornerstone metric used to measure how effectively a team is functioning. The trick isn’t just to detect every anomaly as fast as possible; it’s to find where it originated so that the proper corrective action can be taken. You don’t want to replace the entire house every time a lightbulb is broken.
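
As a rough illustration of the metric, MTTR can be computed as the average time between an issue starting and its resolution. The sketch below is a minimal example, not a standard implementation: the function name, the incident tuples, and the choice to measure from the start of the issue (so that detection delay counts toward the total) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average (resolved_at - started_at) across incidents, as a timedelta."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved_at - started_at for started_at, resolved_at in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents as (started_at, resolved_at) pairs.
incidents = [
    (datetime(2021, 12, 1, 9, 0), datetime(2021, 12, 1, 9, 30)),     # 30 minutes
    (datetime(2021, 12, 3, 14, 0), datetime(2021, 12, 3, 15, 30)),   # 90 minutes
    (datetime(2021, 12, 7, 22, 15), datetime(2021, 12, 7, 23, 15)),  # 60 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Because the clock starts when the issue begins, every minute an anomaly goes undetected is added directly to the repair time in this model.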

Anomaly Detection Is Not a Small Task

Alarms of all types can be, and are, generated at every level of an infrastructure. Examples range from free disk space within the SAN falling below a set threshold, to hitting an EC2 quota in an AWS account when a system is scaling up for unexpected load, to errors that occur within one of the applications being hosted. Under the traditional siloed approach to IT operations, each of these areas had its own set of tools, purpose-built for the task at hand and able to run in isolation.

Thankfully, those silos are much more collaborative than they used to be. Centralized tools that consolidate logs, metrics, and reporting are being introduced. This consolidation, combined with a large increase in the usage of cloud technologies, has led to an exponential increase in the volume of data that is available on these centralized observability platforms. Scaling the number of humans to keep up with the increasing amount of data to be processed is not sustainable. It has proven to be both expensive and inefficient. Data processing can be streamlined in several ways, including using traditional data reporting tools to mine the data for trends and setting up alerts based on specific keywords. Nonetheless, making effective use of the trends and common errors identified this way requires constant attention and maintenance.
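
To make the maintenance burden of keyword-based alerting concrete, here is a minimal sketch. The rule names, patterns, and log lines are hypothetical and not taken from any particular product; the point is only that a rule list catches what someone thought to write a rule for.

```python
import re

# Hypothetical keyword rules. A real list has to be maintained by hand
# as new failure modes and log formats appear.
ALERT_PATTERNS = {
    "disk": re.compile(r"no space left on device", re.IGNORECASE),
    "quota": re.compile(r"quota exceeded", re.IGNORECASE),
    "crash": re.compile(r"\b(fatal|panic)\b", re.IGNORECASE),
}


def match_alerts(log_lines):
    """Return (rule_name, line) pairs for every line matching a keyword rule."""
    hits = []
    for line in log_lines:
        for name, pattern in ALERT_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line))
    return hits


sample_logs = [
    "2021-12-15T10:00:01 app1 INFO request served in 12ms",
    "2021-12-15T10:00:02 san01 ERROR write failed: No space left on device",
    "2021-12-15T10:00:03 scaler WARN instance quota exceeded for region",
    "2021-12-15T10:00:04 app2 ERROR upstream timed out after 30s",
]

for name, line in match_alerts(sample_logs):
    print(name, "->", line)
```

In this sample the disk and quota lines are flagged, but the upstream timeout goes unnoticed because no rule covers it. Closing gaps like that one by one is the ongoing attention the paragraph above describes.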

AIOps Will Supercharge Detection and Remediation

AIOps is gaining momentum as an approach that has many benefits for detecting anomalies. AIOps adds machine learning capabilities to the observability tools that organizations are already using. These AIOps-enabled solutions have built their algorithms around best practices and know how to watch all of the logs, metrics, and alarms that are being streamed into the observability suite. In addition to isolating trends, AIOps solutions automatically adapt to and report on activity that veers from the trends they have discovered. These solutions can also be fine-tuned and enhanced to automatically execute remediation for specific error conditions. This minimizes the number of events that L1 operators need to react to, thus freeing up time for more valuable activities.
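
The idea of learning a trend and flagging activity that veers from it can be sketched with a simple rolling z-score. This is a deliberately basic statistical stand-in, not the algorithm used by any specific AIOps product; the function name, window size, threshold, and latency values are assumptions for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=10, threshold=3.0):
    """Return indices of values that deviate from the trailing window's mean
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        # Only score a point once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical response times in milliseconds, with one spike.
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 250, 99, 100]
print(detect_anomalies(latency_ms))  # [11]
```

Because the baseline is recomputed from recent data, it adapts as normal behavior shifts, with no hand-set threshold per metric. One known weakness of this simple version is that a flagged spike still enters the history and temporarily widens the baseline, which is part of why production systems use more robust models.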

AIOps can be used on all forms of data, not just common log formats. The platforms can be taught how to read new types of data – from SQL databases to JSON files to raw text files (and everything in between). This capability allows the model to correlate the log and metric data across all input sources faster and with more accuracy than humans are able to achieve. For example, it can detect an increased error rate of a specific service, even if it is spread across instances over hundreds of nodes. A human wouldn’t see that pattern without investigation prompted by a major incident.
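
The cross-node example can be illustrated with a small aggregation sketch. The report format, service names, counts, and the 1% baseline below are all hypothetical; the point is that a handful of errors on any single node looks like noise, while the same data summed per service shows a clear pattern.

```python
from collections import defaultdict


def service_error_rates(node_reports):
    """Sum per-node counters into a single error rate per service."""
    totals = defaultdict(lambda: [0, 0])  # service -> [errors, requests]
    for report in node_reports:
        counters = totals[report["service"]]
        counters[0] += report["errors"]
        counters[1] += report["requests"]
    return {
        service: errors / requests
        for service, (errors, requests) in totals.items()
        if requests
    }


# Hypothetical reports from 200 nodes, each running two services.
node_reports = []
for n in range(200):
    node = f"node-{n:03d}"
    node_reports.append(
        {"node": node, "service": "checkout", "requests": 100, "errors": 3}
    )
    node_reports.append(
        {"node": node, "service": "search", "requests": 100,
         "errors": 1 if n % 4 == 0 else 0}
    )

BASELINE_ERROR_RATE = 0.01  # assumed acceptable rate, for illustration only

rates = service_error_rates(node_reports)
flagged = [svc for svc, rate in rates.items() if rate > BASELINE_ERROR_RATE]
print(flagged)  # ['checkout']
```

Three errors on one node would rarely prompt an investigation, but 600 errors across 20,000 checkout requests is three times the assumed baseline.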

The AIOps platform can even be tied into a change management solution. For example, it could retrieve the last known changes to that application and present the data as a package. The appropriate team could then initiate the best course of action, whether it be a rollback or calling support at an upstream provider.

The one caveat to these benefits is that nothing is perfect. As trends are built and patterns are learned by the solution, there is a chance that false positives will be created. These false positives are corrected over time, as the platform is fine-tuned. But as new applications and services are added to the environment, they have to be learned. In the process, they will likely generate false positives. Therefore, false positives may never completely go away.

Summary and Next Steps

Anomaly detection is core to keeping an IT Ops organization’s mean time to repair consistently low. Introducing AIOps into the service management practice moves any tool closer to being a true enterprise-scale observability solution. The critical capabilities are learning trends and pre-processing information as it arrives, providing context, offering potential solutions, and even executing remediation tasks while deciding which team is best placed to handle the anomaly (from an L1 operator to the SRE team or even the security engineering team).

For additional information on AIOps products, as well as research papers from leading advisory companies like Gartner, visit Broadcom’s AIOps landing page. It can help you start down the path to bringing AIOps into your IT operations.

Tag(s): AIOps, DX OI

Vince Power

Vince Power is an Enterprise Architect with a focus on digital transformation built with cloud enabled technologies. He has extensive experience working with Agile development organizations delivering their applications and services using DevOps principles including security controls, identity management, and test...
