Broadcom Software Academy Blog

An Introduction to Anomaly Detection

Written by Abhinav Shroff | Aug 24, 2021 6:00:00 AM

In the early 1900s, Sakichi Toyoda invented a loom that automatically stopped when a thread broke, reducing the need for someone to watch the machine constantly. This approach was later named “Jidoka” and became one of the two pillars of the Toyota Production System (TPS), with just-in-time production as the second pillar.

With modern manufacturing, Jidoka involves self-monitoring devices that automatically detect anomalies and safely stop, allowing technicians to inspect and make adjustments as early as possible. Ultimately, Jidoka limits waste and enhances both quality and efficiency.

In many ways, managing today’s digital processes is not so different from what we are used to with traditional manufacturing. As in manufacturing, unnoticed anomalies can lead to defects, performance glitches, or downtime, which can have significant consequences on business operations. When it comes to detecting anomalies within digital environments, what makes things even more complicated is the growing volume, speed, and complexity of IT infrastructures.

What is an anomaly?

If you look at a common definition of an anomaly, it is about something different, abnormal, peculiar, or not easily classified—a kind of deviation from the norm. In the context of IT operations, where it is relatively easy to measure most aspects of operational performance, an anomaly can be seen as an undesirable change within data patterns that represents a departure from business as usual.

Within IT operations, there are all sorts of anomalies, and it can be helpful to classify them into two main categories. The first category is known anomalies. Good examples include a CPU spike due to end-of-month computations or a peak in website traffic generated by a marketing campaign. The second category, unknown anomalies, is another kind of animal, and because these anomalies are not well understood, it is the most exciting category. Unknown anomalies include phenomena such as a sudden drop in application activity or an unexpected system condition that results from a complex confluence of events.

Traditional rules-based systems are effective at detecting recurring data patterns that signify a known anomaly. Still, they require considerable effort to configure thresholds, and they can produce large numbers of false positives. That’s where analytics and machine learning come in. Detecting unknown anomalies requires dynamic baselining: determining what normal activity looks like under given circumstances, and then detecting behaviors that do not align with that baseline.
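
To make the contrast with a static threshold concrete, here is a minimal Python sketch of a baseline that depends on the circumstances. It is an illustration only, not how any particular product computes its baselines. The function names, the "day" and "night" context labels, the sample CPU values, and the choice of mean plus or minus three standard deviations are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_baseline(history):
    """Group (context, value) samples by context and summarize each group.

    Returns {context: (mean, standard deviation)}. Contexts with fewer than
    two samples are skipped because a spread cannot be estimated for them.
    """
    by_context = defaultdict(list)
    for context, value in history:
        by_context[context].append(value)
    return {
        context: (mean(values), stdev(values))
        for context, values in by_context.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, context, value, k=3.0):
    """Flag a value that is more than k standard deviations from its context mean."""
    if context not in baseline:
        return False  # no history for this circumstance, so nothing to compare against
    center, spread = baseline[context]
    # Guard against a perfectly flat history, where the spread is zero.
    return abs(value - center) > k * max(spread, 1e-9)


# Hypothetical CPU utilization (%) observed during business hours and at night.
history = [
    ("day", 68), ("day", 72), ("day", 70), ("day", 71), ("day", 69),
    ("night", 14), ("night", 16), ("night", 15), ("night", 17), ("night", 13),
]
baseline = build_baseline(history)

day_result = is_anomalous(baseline, "day", 70)      # normal for business hours
night_result = is_anomalous(baseline, "night", 70)  # far from the night-time norm
print(day_result, night_result)
```

In this sketch, 70% CPU is unremarkable during the day but is flagged at night, while a fixed rule such as "alert above 90%" would stay silent in both cases.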

What is anomaly detection?

Anomaly detection is the process of identifying events that deviate from the normal behavior of a dataset, based on historical trends. Events identified as anomalous can point to a critical incident, like a glitch in hardware, an intrusion attack on a system, or unprecedented system usage. Today, with the help of machine learning algorithms, it is possible to achieve continuous dynamic baselining, which enables the identification of anomalous events without human intervention.
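
One simple way to picture continuous baselining is an exponentially weighted moving average (EWMA) that updates itself with every new sample. The Python sketch below is a deliberately basic stand-in for the machine learning algorithms mentioned above, not a description of them. The class name, the parameters (alpha, k, warmup), and the sample response times are assumptions chosen for illustration.

```python
import math


class EwmaBaseline:
    """Streaming baseline: exponentially weighted mean and variance."""

    def __init__(self, alpha=0.1, k=3.0, warmup=10):
        self.alpha = alpha      # weight given to the newest sample
        self.k = k              # width of the normal band, in standard deviations
        self.warmup = warmup    # samples to absorb before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, x):
        """Score x against the current baseline, then fold x into the baseline."""
        if self.mean is None:
            self.mean = float(x)
            self.count = 1
            return False
        diff = x - self.mean
        is_anomaly = (
            self.count >= self.warmup
            and abs(diff) > self.k * math.sqrt(self.var)
        )
        # Incremental update of the weighted mean and variance.
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.count += 1
        return is_anomaly


# Hypothetical response times in milliseconds, with one obvious outlier.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 100, 250, 101]
detector = EwmaBaseline()
flagged = [i for i, value in enumerate(latencies) if detector.update(value)]
print(flagged)
```

Because the baseline keeps adapting, no one has to revisit a threshold when normal behavior drifts. One trade-off in this simple version is that an outlier is also folded into the baseline, which temporarily widens the normal band after a spike.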

Operations teams can leverage anomaly detection for a range of use cases, including the following:

  • Determining the anomalous consumption of resources, such as CPU, memory, network, or storage across the monitored landscape.
  • Identifying the workloads that see a sudden spike in usage and that need team attention to ensure continued availability and performance.
  • Pinpointing an anomalous increase in the response times of critical microservice workloads.

Anomaly Detection for IT Operations: The Challenges

Monitoring applications, infrastructure elements, and networks is a baseline requirement for any enterprise-grade operations team. It is how teams can get a peek into the condition of the digital infrastructure and the workload deployed on it. Through the monitoring of the digital landscape, enterprises end up collecting massive volumes of data in the form of metrics and logs.

While vast amounts of granular data can provide a wealth of information, the volume, variety, and velocity of the data generated make it impractical for meaningful human consumption. Manually sifting through this huge volume of data and determining exactly which events are anomalous can be like finding a needle in a haystack.

How DX Operational Intelligence Helps

DX Operational Intelligence delivers anomaly detection based on machine learning algorithms that consume metrics collected across applications, infrastructure, and network. These metrics can be sourced from DX Application Performance Management, DX Unified Infrastructure Manager, and DX NetOps, as well as any third-party vendor metrics ingested by RESTMon. DX Operational Intelligence offers operations teams an end-to-end solution, from identifying anomalies automatically to acting on anomaly alarms. Below are the capabilities provided:

Detecting Anomalies Based on a Group of Metrics

DX Operational Intelligence features a metric group configuration that enables teams to filter and configure metric groups based on metric source and to activate specific groups for anomaly detection.

Fine-Tuning Anomaly Alarms

As part of the metric group configuration process, teams can adjust the way alarms are related to an anomaly. Alarms can be raised while a metric is above or below the threshold that is derived dynamically by the algorithm. It is also possible to determine how many occurrences of an anomaly need to happen over a specific time frame in order to raise an alarm.
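
The two tuning options described above, alarm direction and number of occurrences within a time frame, can be sketched in Python as follows. This is a hypothetical model written for illustration. The class, its field names, and the sample values are assumptions and do not correspond to actual DX Operational Intelligence settings or APIs. The lower and upper bounds stand in for the thresholds that the algorithm derives dynamically.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AlarmPolicy:
    """Hypothetical alarm rule; names are illustrative, not product settings."""

    direction: str = "above"     # "above", "below", or "both"
    min_occurrences: int = 3     # anomalies required before an alarm is raised
    window_seconds: int = 300    # time frame in which they must occur
    _hits: deque = field(default_factory=deque, init=False, repr=False)

    def breaches(self, value, lower, upper):
        """Check the value against the bounds, honoring the configured direction."""
        if self.direction == "above":
            return value > upper
        if self.direction == "below":
            return value < lower
        return value > upper or value < lower

    def observe(self, timestamp, value, lower, upper):
        """Record one sample and return True when an alarm should be raised."""
        if self.breaches(value, lower, upper):
            self._hits.append(timestamp)
        # Forget breaches that have aged out of the time frame.
        while self._hits and timestamp - self._hits[0] >= self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.min_occurrences


# Hypothetical (timestamp in seconds, metric value) samples.
samples = [(0, 50), (60, 95), (120, 97), (180, 60), (240, 99), (700, 96)]
policy = AlarmPolicy(direction="above", min_occurrences=3, window_seconds=300)
alarms = [policy.observe(t, v, lower=20, upper=90) for t, v in samples]
print(alarms)
```

Requiring several occurrences within the window means a single brief spike does not raise an alarm, which is one way to reduce noise for the operations team.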

Viewing Anomaly Data

Through the performance analytics capabilities in DX Operational Intelligence, teams can perform deeper analysis of a metric exhibiting abnormal behavior. The solution provides intuitive, graphical views to explore discrete time-series values.

Acting on Anomaly Alarms

DX Operational Intelligence provides an integrated, seamless interface to view and act on anomaly alarms, such as acknowledging or assigning an alarm to a colleague.

Conclusion

Now is the time to apply lessons from prior generations alongside machine learning. In the same way Jidoka revolutionized the manufacturing industry more than a century ago, DX Operational Intelligence is revolutionizing IT operations. With the solution, operations teams can employ end-to-end, self-monitoring approaches. Is now the right time to modernize your monitoring approach?

For a deep dive into the anomaly detection capabilities in DX Operational Intelligence, check out this detailed tutorial.