IT Operations has a wide spectrum of roles and responsibilities. The positions range from level 1 (L1) operators to Site Reliability Engineers (SREs) and everything in between. L1 operators, for example, are (often) almost exclusively reactive. They feed off the constant stream of incidents reported by clients and events that are reported by monitoring and alerting systems. This is in contrast to SREs, who work at the other end of the spectrum. Their job is to be proactive and work to make sure that development teams are self-sufficient and able to plan for and execute tasks at scale.
All of these roles need to find, diagnose, and resolve issues that occur within the systems and services they support. These systems underpin the company’s day-to-day business needs. Detecting an anomaly early will do more to reduce the mean time to repair (MTTR) than almost any other activity. Timeliness is vital. The faster an anomaly can be detected, the sooner it can be acted upon.
MTTR is a cornerstone metric used to measure how effectively a team is functioning. The trick isn’t just to detect every anomaly as fast as possible; it’s to find where it originated so that the proper corrective action can be taken. You don’t want to replace the entire house every time a lightbulb is broken.
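To make the link between detection time and MTTR concrete, here is a minimal Python sketch. It assumes MTTR is measured from the start of a fault to the restoration of service, which is one common convention among several. The `Incident` record and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions, not a real schema.
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a client reported it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents):
    """Average delay between a fault starting and being detected."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_repair(incidents):
    """Average delay between a fault starting and service being restored."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)
```

Under this convention the detection delay is included in every incident's repair time, so each minute removed from detection comes directly off the MTTR.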
Anomaly Detection Is Not a Small Task
Alarms of all types can be – and are – generated at all levels of an infrastructure, from events where free disk space within the SAN falls below a set threshold, to hitting an EC2 quota in an AWS account when a system is scaling up for unexpected load, to errors that occur within one of the applications being hosted. As part of the traditional siloed approach to IT operations, each of these areas would have its own set of purpose-built tools that were able to run in isolation.
Thankfully, those silos are much more collaborative than they used to be. Centralized tools that consolidate logs, metrics, and reporting are being introduced. This consolidation, combined with a large increase in the usage of cloud technologies, has led to a steep increase in the volume of data that is available on these centralized observability platforms. Scaling the number of humans to keep up with the increasing amount of data to be processed is not sustainable. It has proven to be both expensive and inefficient. There are several ways to streamline data processing, including using traditional data reporting tools to mine the data for trends and to set up alerts based on specific keywords. Nonetheless, this approach requires constant attention and maintenance to make effective use of any identified trends and common errors pulled from the data.
AIOps Will Supercharge Detection and Remediation
AIOps is gaining momentum as an approach that has many benefits for detecting anomalies. AIOps adds machine learning capabilities to the observability tools that organizations are already using. These AIOps-enabled solutions have built their algorithms around best practices and know how to watch all of the logs, metrics, and alarms that are being streamed into the observability suite. In addition to isolating trends, AIOps solutions automatically adapt to and report on activity that veers from the trends that they have discovered. They can also be fine-tuned and enhanced to automatically execute remediation for specific error conditions. This minimizes the number of events that the L1 operators need to react to, thus freeing up time for more valuable activities.
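The idea of learning a trend and flagging activity that veers from it can be illustrated with a deliberately simple stand-in: a rolling baseline with a z-score test. This is only a sketch, not how any particular AIOps product works, and real platforms are likely to use far richer models. The class name, the window size, the warm-up length, and the threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flags values that deviate strongly from a rolling window of recent values."""

    def __init__(self, window=30, threshold=3.0, warmup=5):
        self.values = deque(maxlen=window)  # old samples drop off automatically
        self.threshold = threshold          # allowed deviation, in standard deviations
        self.warmup = warmup                # samples required before judging

    def observe(self, value):
        """Return True if value is anomalous against the baseline, then learn it."""
        anomalous = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat baseline: any change counts as a deviation.
                anomalous = value != mu
        self.values.append(value)
        return anomalous
```

Because the window keeps only recent samples, the baseline shifts as normal behavior changes, which is the adaptive property described above in miniature.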
AIOps can be used on all forms of data, not just common log formats. The platforms can be taught how to read new types of data – from SQL databases to JSON files to raw text files (and everything in between). This capability allows the model to correlate the log and metric data across all input sources faster and with more accuracy than humans are able to achieve. For example, it can detect an increased error rate of a specific service, even if it is spread across instances over hundreds of nodes. A human wouldn’t see that pattern without investigation prompted by a major incident.
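As a rough illustration of why aggregation matters, the sketch below pools request events from every node before computing each service's error rate. The event tuple layout, the function names, and the `factor` parameter are assumptions made for this example. A fixed baseline comparison stands in for whatever model a real platform would learn.

```python
from collections import defaultdict


def error_rate_by_service(events):
    """Compute per-service error rates from (node, service, is_error) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for _node, service, is_error in events:
        # The node is deliberately ignored: counts are pooled across all nodes.
        totals[service] += 1
        if is_error:
            errors[service] += 1
    return {service: errors[service] / totals[service] for service in totals}


def services_over_baseline(events, baseline, factor=2.0):
    """Return services whose pooled error rate exceeds factor times their baseline.

    A service missing from baseline is treated as having a baseline of 0.0,
    so any error at all will flag it.
    """
    rates = error_rate_by_service(events)
    return sorted(
        service
        for service, rate in rates.items()
        if rate > factor * baseline.get(service, 0.0)
    )
```

A single failed request on any one node looks like noise, but the same failure repeated on hundreds of nodes shows up clearly once the counts are combined.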
The AIOps platform can even be tied into a change management solution. For example, it could retrieve the last known changes to that application and present the data as a package. The appropriate team could then initiate the best course of action, whether it be a rollback or calling support at an upstream provider.
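A minimal sketch of that lookup might look like the following. It assumes change records have already been exported from the change management system into simple objects. The `Change` type, its fields, and the 24-hour lookback are hypothetical and do not reflect any specific product's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical change record; field names are assumptions.
    service: str
    applied_at: datetime
    summary: str


def recent_changes(changes, service, anomaly_time, lookback=timedelta(hours=24)):
    """Return changes to the service applied within lookback before the anomaly.

    Results are ordered newest first, so the most likely culprit leads the list.
    """
    window_start = anomaly_time - lookback
    matches = [
        c for c in changes
        if c.service == service and window_start <= c.applied_at <= anomaly_time
    ]
    return sorted(matches, key=lambda c: c.applied_at, reverse=True)
```

The returned list could then be attached to the alert, so the receiving team sees the anomaly and its candidate causes together.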
The one caveat to these benefits is that nothing is perfect. As trends are built and patterns are learned by the solution, there is a chance that false positives will be created. These false positives are corrected over time, as the platform is fine-tuned. But as new applications and services are added to the environment, they have to be learned. In the process, they will likely generate false positives. Therefore, false positives may never completely go away.
Summary and Next Steps
Anomaly detection is central to keeping an IT Ops organization’s mean time to repair consistently low. Introducing AIOps into the service management practice moves an existing observability tool closer to a true enterprise-scale solution. What makes it critical is the combination of capabilities: learning trends, pre-processing information as it arrives, providing context, offering potential solutions, and even executing remediation tasks while deciding which team is best placed to handle the anomaly (from an L1 operator to the SRE team or even the security engineering team).
For additional information on AIOps products as well as research papers from leading advisory companies like Gartner, visit Broadcom’s AIOps landing page to help you start down the path to bringing AIOps into your IT operations.