Topology for Incident Causation and Machine Learning within AIOps

Written by Jörg Mertin | Aug 27, 2024 2:11:54 PM

Key Takeaways

Harness topology visibility to determine an incident’s root cause and impacted components.
Link topology components to metrics, events, and alarms to gain new opportunities for business observability.
Employ AIOps and Observability solutions to gain an understanding of the ICT infrastructure and autonomously discover changes.

Our thinking about, and use of, topology within AIOps and Observability solutions from Broadcom have advanced significantly in recent years, while building solidly on our innovative domain tools.

We’re providing a blog post series to communicate these innovations, advancements, and benefits for IT operations. This post continues where the previous one left off and explains how topology enables automated incident causation analysis and machine learning within AIOps from Broadcom.

We also present the topology completeness that this causation analysis requires and the demanding qualities it requires of monitoring.

Topology for incident causation and AIOps machine learning

A topology enables incident causation analysis: the determination of an incident’s root cause and impacted components over time.

The analysis proceeds through the following steps:

Detecting culprits. Culprit components, i.e., those components exhibiting abnormal behavior, are identified by continually analyzing incoming metrics, events, and alarms and detecting abnormal data.
Isolating the root cause component. Through a culprit’s relationships and dependencies, the root cause component is sought upstream and towards the bottom layer of the IT stack.
Determining impacted components. Through a culprit’s relationships and dependencies, impacted components are sought downstream and towards the top layer of the IT stack.
Establishing an incident’s path. The path from root cause to impacted components is the incident’s path.
Establishing an incident’s context. The set of components within the incident path, their metrics, properties, alarms, and events, is the incident’s context.
Establishing an incident’s timeframe. The overall, combined timeframe of the incident context’s metrics, alarms, and events is the incident’s timeframe.
Establishing an incident’s pattern. Within an incident’s timeframe, metrics, alerts, events, and topology changes can be ordered chronologically, and the incident’s progression can be visualized, assessed, and analyzed. This progression can also be used as a pattern for future incident analysis and eventually tied to successful and failed resolutions for subsequent assisted or automated resolution.
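The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: the component names, the DEPENDS_ON mapping, and the analyze function are hypothetical, anomalies are given as (first_seen, last_seen) timestamps, and the root cause is simply taken to be the deepest abnormal component below the culprit.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it depends on
# (edges point "upstream", towards the bottom layer of the IT stack).
DEPENDS_ON = {
    "checkout-service": ["payment-api"],
    "payment-api": ["orders-db"],
    "orders-db": ["vm-17"],
    "vm-17": [],
    "search-service": ["vm-22"],
    "vm-22": [],
}


def reachable(graph, start):
    """Breadth-first walk: every node reachable from start, in visit order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def invert(graph):
    """Reverse every edge so that walks go "downstream", towards dependents."""
    inverted = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def analyze(depends_on, culprit, anomalies):
    """anomalies: {component: (first_seen, last_seen)} for abnormal components."""
    # Root cause: the deepest abnormal component upstream, else the culprit itself.
    abnormal_upstream = [c for c in reachable(depends_on, culprit) if c in anomalies]
    root = abnormal_upstream[-1] if abnormal_upstream else culprit
    # Impacted components: everything that depends, directly or not, on the root.
    impacted = reachable(invert(depends_on), root)
    path = [root] + impacted
    # Context: the abnormal data of the components on the incident path.
    context = {c: anomalies[c] for c in path if c in anomalies}
    timeframe = None
    if context:
        timeframe = (min(s for s, _ in context.values()),
                     max(e for _, e in context.values()))
    return {"root_cause": root, "impacted": impacted,
            "path": path, "context": context, "timeframe": timeframe}


result = analyze(DEPENDS_ON, "payment-api",
                 {"payment-api": (100, 160), "orders-db": (90, 170)})
print(result["root_cause"], result["impacted"], result["timeframe"])
```

In this toy data the culprit is payment-api, but the walk upstream finds the abnormal orders-db beneath it, so the database becomes the root cause and the two services above it become the impacted components.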

In this way, topology makes both automatic incident causation analysis and, equally important, machine learning possible. This is foundational to AIOps’ understanding of the observed environment and essential for operations teams’ productivity.

As we will detail in an upcoming blog on services for business observability, AIOps’ understanding of topology is exploited through service analytics and by extending a topology with a defined service.

Topology benefits

Incident contexts, as derived through automatic incident causation, are foundational to AIOps addressing several problems. These include:

Correlating changes in the systems with service statistics.
Performing transaction traces and analysis comprising infrastructure indicators. This provides complete and consistently accurate understanding for IT teams of their ICT infrastructure.
Providing observability for IT practitioners who may lack the time or expertise for deep diagnosis required to understand transaction problems by automatically detecting problems and providing incident contexts.
Meeting the needs of business and IT teams in organizations, who increasingly expect to be able to map business transactions to infrastructure data for more precise triaging and better prioritization.

Improving skills and deepening experience and expertise is imperative, but it cannot happen quickly enough without innovative technology.

Whether at the individual employee, team, or cross-team level, IT estates are too complex and interconnected for organizations to achieve productivity gains without substantial assistance from monitoring technology and observability solutions.

Causality helps users of all experience and skill levels with incident insights, enabling them to take the remediation actions needed.
Causality helps users work efficiently within their domain team and collaborate constructively with teams in other IT domains, agile and expert teams, and line-of-business stakeholders.

Improved IT staff productivity is achieved by utilizing topology for machine learning so that incident paths, contexts, and patterns can be utilized for guided and effective triage and timely collaboration with expert teams.

Topology foundational qualities

Completeness is the single quality that determines if a topology is sufficient for the automatic AIOps capabilities mentioned above.

Component completeness: A topology that lacks components cannot detect all incidents, as it would have “blind spots” that would render automatic analysis incomplete, unreliable, or even impossible.
Flow completeness: A topology that lacks correlations, relationships, or dependencies cannot—or cannot reliably—determine causality.
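As a rough illustration of these two qualities, the sketch below flags obvious gaps in a topology. It is an assumption-laden example, not a product feature: completeness_gaps and its inputs are hypothetical, and passing the check only rules out the simplest gaps. It does not prove that a topology is complete.

```python
def completeness_gaps(topology_nodes, topology_edges, reporting_components):
    """Return (blind_spots, isolated).

    blind_spots: components that send monitoring data but are absent
                 from the topology (a component completeness gap).
    isolated:    topology components with no relationship at all
                 (a flow completeness gap).
    """
    nodes = set(topology_nodes)
    blind_spots = sorted(set(reporting_components) - nodes)
    connected = {node for edge in topology_edges for node in edge}
    isolated = sorted(nodes - connected)
    return blind_spots, isolated


# Hypothetical example data.
nodes = ["web", "api", "db", "cache"]
edges = [("web", "api"), ("api", "db")]
reporting = ["web", "api", "db", "queue"]

blind_spots, isolated = completeness_gaps(nodes, edges, reporting)
print(blind_spots)  # components the topology does not know about
print(isolated)     # components the topology cannot place in any flow
```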

There are only two, albeit highly demanding, requirements for completeness.

A monitoring technology or tool capable of gathering, into transaction traces, the data necessary for synthesizing a holistic, fully connected topology.
An observability platform capable of real-time normalization, unification, and correlation of received data of components, relationships, and context to automatically synthesize and maintain a topology. This topology must be solely derived from trusted transactional trace data.
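To make the idea of synthesizing a topology from trace data concrete, here is a minimal sketch. It assumes a simplified span format with the hypothetical keys trace_id, span_id, parent_id, and component, and it reduces normalization to trimming and lower-casing names. A real platform would handle far more (timing, context, entity resolution across data sources).

```python
from collections import defaultdict


def normalize(name):
    """Unify naming variants such as 'Payment-API ' and 'payment-api'."""
    return name.strip().lower()


def synthesize_topology(spans):
    """spans: list of dicts with the (hypothetical) keys
    'trace_id', 'span_id', 'parent_id' (None for the entry span), 'component'.
    Returns {caller: set of callees}, derived only from the trace data."""
    by_id = {(s["trace_id"], s["span_id"]): s for s in spans}
    topology = defaultdict(set)
    for span in spans:
        callee = normalize(span["component"])
        topology[callee]  # register the component even if it calls nothing
        parent = by_id.get((span["trace_id"], span["parent_id"]))
        if parent is not None:
            caller = normalize(parent["component"])
            if caller != callee:
                topology[caller].add(callee)
    return dict(topology)


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "component": "Web"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "component": "payment-api "},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "component": "orders-db"},
    {"trace_id": "t2", "span_id": "a", "parent_id": None, "component": "web"},
    {"trace_id": "t2", "span_id": "b", "parent_id": "a", "component": "Payment-API"},
]
print(synthesize_topology(spans))
```

Because every relationship comes from an observed caller/callee pair, the resulting graph reflects what the monitored transactions actually did rather than what was documented by hand.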

The latter requirement, with its ability to synthesize current, accurate, and consistent topology, is a bedrock of AIOps and Observability from Broadcom.

Trust across teams, founded on a trusted and complete topology, is one of the most important outcomes of topology synthesis. The prerequisite here is trusted and current monitoring data. This is in contrast to CMDBs, which are usually generated and documented manually. CMDBs, though they provide a different value for IT teams, are static and often difficult to maintain and keep current.

And, with trusted topology, organizations can observe, monitor, and operate systems and services reliably with quality outcomes.

This also removes the struggle for true end-to-end monitoring analysis of even complex systems and digital services, allowing unproblematic collaboration for IT teams and indirectly for business teams.

For example, transaction tracing and analysis comprising infrastructure indicators provides complete and consistently accurate understanding for IT teams of their ICT infrastructure through a topology.

Monitoring foundational qualities

The former requirement, the ability to gather monitoring data, is addressed by technologies built on years of innovation and proven results at Broadcom, including our original byte-code injection.

DX APM dynamically and efficiently discovers all flow and processing, with contextual details, during run-time by intercepting the executions of all transactions.
Discovered entities are added to transaction traces with negligible impact on individual transactions—protecting user experience and minimizing resource consumption.
Transaction traces are captured for every transaction but only externalized selectively when warranted—protecting the integrity of the monitored application.

Transaction traces are captured tier-by-tier, and, as explained above, must be correlated and selectively externalized in their entirety to ensure a complete topology.

DX APM has advanced mechanisms for intercepting transaction execution to propagate correlation data across tiers for this correlation.
A fault within one tier triggers, through repeated reverse propagation (towards callers), the externalization of all previous parts of the spanning transaction trace.
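The reverse propagation described above can be pictured as a walk from the faulty tier back towards its callers. The sketch below illustrates only that idea and is not DX APM's internal mechanism: the spans mapping and the spans_to_externalize function are hypothetical, with each tier's part of the trace recorded as span ID to caller span ID.

```python
def spans_to_externalize(spans, faulty_span_id):
    """spans: {span_id: caller_span_id or None}, one entry per tier's part
    of the trace. Walks from the faulty span towards its callers and
    returns the span IDs on that chain, faulty span first."""
    chain = []
    current = faulty_span_id
    # The membership check guards against malformed data containing a cycle.
    while current is not None and current not in chain:
        chain.append(current)
        current = spans.get(current)
    return chain


# Hypothetical trace: web calls api, api calls both db and cache.
spans = {"web": None, "api": "web", "db": "api", "cache": "api"}
print(spans_to_externalize(spans, "db"))
```

Only the tiers that lie between the entry point and the fault are selected here. The sibling call to cache is not on the faulty chain, so in this simplified model it is left out.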

Metrics received from monitoring tools are continuously analyzed in real-time to identify abnormal behavior.

Abnormal data is detected utilizing advanced statistics to avoid the need to configure thresholds, while still accounting for seasonality—making anomaly detection fully automatic.
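The source does not say which statistics are used, so the following is only one possible approach, swapped in for illustration: a modified z-score based on the median and the median absolute deviation (MAD), computed per seasonal slot (for example, per hour of the day). The function name, the period of 24, and the cutoff of 3.5 are assumptions. The cutoff is a single generic sensitivity setting rather than a per-metric threshold.

```python
from statistics import median


def is_anomalous(history, slot, value, period=24, cutoff=3.5):
    """history: past values sampled at a fixed interval, oldest first,
    where history[i] belongs to seasonal slot i % period.
    Compares value with past values from the same slot only, so a
    regular daily peak is not reported as abnormal."""
    same_slot = [v for i, v in enumerate(history) if i % period == slot]
    if len(same_slot) < 3:
        return False  # not enough baseline data to judge
    center = median(same_slot)
    mad = median([abs(v - center) for v in same_slot])
    if mad == 0:
        return value != center
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    score = 0.6745 * abs(value - center) / mad
    return score > cutoff


# Hypothetical hourly metric over 7 days: about 100 at noon, about 10 otherwise.
history = []
for day in range(7):
    for hour in range(24):
        base = 100 if hour == 12 else 10
        history.append(base + day % 3)

print(is_anomalous(history, slot=12, value=101))  # usual noon level
print(is_anomalous(history, slot=3, value=101))   # same value at 03:00
```

The same reading of 101 is unremarkable at noon and abnormal at 03:00, which is the behavior a fixed threshold cannot express.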

This mechanism enhances incident contexts for root cause analysis and verification:

Monitoring agents are signaled to selectively externalize transaction traces for detected culprits, ensuring the availability of the incident’s context for assisted triage.

Looking ahead

The uses and benefits of topology continue to evolve along with AIOps and observability. Areas of development include:

Incident patterns. The timeline of an incident describes its progression. This information can be used to identify similar incidents or symptoms, and to compare a current incident with prior incidents and their remediation, yielding insights that may speed remediation of developing incidents.
Incident prediction. When anomalies occur, anomaly patterns can be correlated to other incident patterns to help predict situations and to allow teams to take preemptive corrective actions.
Automated healing. When a historical incident pattern matches a current problem or anomaly, actions used to successfully remediate the prior incidents can be quickly identified, staged, or automatically applied to an emerging issue when experience or certainty warrants.
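As a simple illustration of matching a developing incident against prior ones, the sketch below compares chronologically ordered event labels using the standard library's difflib.SequenceMatcher. Everything specific here is an assumption: the event labels, the incident IDs, the remediation notes, the best_match function, and the min_ratio of 0.6. Sequence similarity is only a stand-in for whatever pattern recognition a real solution applies.

```python
from difflib import SequenceMatcher


def best_match(current, history, min_ratio=0.6):
    """current: ordered list of event labels for the developing incident.
    history: {incident_id: (ordered event labels, remediation)}.
    Returns (incident_id, remediation, ratio) for the most similar prior
    incident, or None if nothing reaches min_ratio."""
    best = None
    for incident_id, (events, remediation) in history.items():
        ratio = SequenceMatcher(None, current, events).ratio()
        if ratio >= min_ratio and (best is None or ratio > best[2]):
            best = (incident_id, remediation, ratio)
    return best


# Hypothetical incident history.
history = {
    "INC-1": (["db_latency_high", "api_errors", "checkout_slow"],
              "restart connection pool"),
    "INC-2": (["disk_full", "vm_down"], "extend volume"),
}

match = best_match(["db_latency_high", "api_errors"], history)
print(match)
```

A match found early in an incident's progression, as in this example where the third event has not yet occurred, is what would allow a prior remediation to be staged before the impact spreads.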

As IT estates evolve, the needs and benefits for IT to link topology components to metrics, events, and alarms with critical detail and context, including services, will be increasingly important and yield new opportunities for business observability.

Topology for observability assurance—observe with confidence

The constant change of ICT infrastructures—as supporting technologies are added, removed, and updated to meet evolving business needs and fluid customer expectations—is addressed by the solution’s understanding of the monitored ICT infrastructure and its ability to autonomously discover changes.

This understanding includes the service topology for service observability, which underpins and enhances the collaboration for IT teams and indirectly for business teams:

Changes in the systems are correlated and aggregated to the service level.
Observability helps IT practitioners who may lack the time or expertise for deep diagnosis required to understand transaction problems, by automatically detecting problems and anomalies and furnishing incident contexts.
Business transactions are mapped to infrastructure data for more precise triaging and better prioritization through the topology—as business and IT teams increasingly expect.

The benefits and extensive value from trusted and complete topology include causation analysis and observability assurance.

"In concept, topology is simple. Deriving topology and extracting trusted information from it -in real-time for complex IT estates- is enormously complex and immensely valuable."

Henrik Nissen Ravn
AIOps Technologist | APM Champion

You can find the first blog in this series on topology here.
