Note: This post was co-authored by Adam Frary
Our thinking about and use of topology within AIOps and Observability solutions from Broadcom have advanced significantly in recent years, while building solidly on our innovative domain tools.
We’re looking to communicate these innovations, advancements, and benefits for IT operations. In this blog, we continue where the previous blog left off to explain the boundary blame concept and mechanism to obtain a sufficiently complete topology.
In our previous two blogs on confident observability and incident causation, we explained why topology completeness, in terms of flow components and processing components, is the foundation for AIOps.
In this blog we detail what completeness entails within the context of this all-important principle:
We detail why minimal overhead, although crucial, is not on its own a sufficient principle. Smart mechanisms must be in place to dynamically control and limit resource consumption, while obtaining transactional visibility for topology completeness.
Completeness is considered together with visibility to qualify the trade-off between minimal overhead and topology completeness.
Boundary blame is detailed as a guiding monitoring concept and measure of visibility and completeness versus monitoring impact and overhead.
In its simplest form, monitoring can be done at two levels:
The driver for topology completeness is visibility:
Because topology is solely synthesized from them, transaction traces are a vital data source.
To qualify the trade-off between monitoring overhead and the transaction traces needed to synthesize a complete topology, the requirements are detailed as follows:
The former detail entails that repetitive traces can be omitted. The latter detail entails that the need for a trace can only be determined once the transaction has finished executing. This is a severe and critical requirement for monitoring, because by extension it entails that all transactions must be traced.
To address this requirement, the overhead of obtaining transaction traces can helpfully be split into two distinct parts:
Capture: The collection of components, correlations, relationships, and dependencies with contextual details, by intercepting transaction executions to capture data for transaction traces.
Externalization: The sending of transaction traces into AIOps for topology synthesis and analysis.
By virtue of the lean agent’s efficiency, the capture of a trace has significantly less overhead than the externalization of a trace.
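The split between capture and externalization can be illustrated with a minimal sketch. It is not the lean agent's implementation. The class names, the fields, and the criteria used to decide whether a trace is needed (a component path not seen before, a failure, or a duration above an assumed threshold) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """In-memory record of one transaction; all field names are hypothetical."""
    transaction_id: str
    components: list = field(default_factory=list)  # components in call order
    duration_ms: float = 0.0
    failed: bool = False

    def signature(self):
        # Traces with the same component path reveal the same topology.
        return tuple(self.components)


class TraceBuffer:
    """Captures every transaction cheaply; externalizes only when warranted."""

    def __init__(self, slow_threshold_ms=1000.0):
        self.slow_threshold_ms = slow_threshold_ms  # assumed cutoff
        self.open_traces = {}        # transaction_id -> Trace being captured
        self.seen_signatures = set()
        self.externalized = []       # stand-in for sending traces to analytics

    def capture(self, transaction_id, component):
        # Capture: append to an in-memory structure; nothing is sent yet.
        trace = self.open_traces.setdefault(transaction_id, Trace(transaction_id))
        trace.components.append(component)

    def complete(self, transaction_id, duration_ms, failed=False):
        # Only at completion can the need for this trace be determined.
        trace = self.open_traces.pop(transaction_id)
        trace.duration_ms = duration_ms
        trace.failed = failed
        is_new_path = trace.signature() not in self.seen_signatures
        is_anomalous = failed or duration_ms > self.slow_threshold_ms
        if is_new_path or is_anomalous:
            self.seen_signatures.add(trace.signature())
            self.externalized.append(trace)
            return True
        return False  # repetitive trace: dropped without externalization
```

In this sketch every transaction pays the small cost of `capture`, while only traces that add information pay the larger cost of externalization. A second transaction that follows an already known path and completes normally is discarded in memory.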
AIOps governs the conflicting requirements of trace capture and externalization for transaction visibility and topology completeness versus minimal overhead and resource consumption as follows.
Extremely lean algorithms and data structures are employed within the lean agent to minimize the overhead of trace capture, which allows the lean agent to capture traces for each and every transaction. The traces are kept internally until transactions complete:
The lean agent analyzes captured traces and dynamically adjusts which components are included in tracing, to ensure that each trace contains a frontend and a backend as well as exit points and entry points.
The lean agent also dynamically analyzes if execution times of controllers sufficiently account for overall execution time, to adjust tracing to dynamically include more controllers until that is the case. Further, a user can set the sensitivity of this algorithm to heighten visibility or to lower overhead.
These guarantees, analyses, and mechanisms are important because they ensure incident detection and operational guidance while protecting against excessive overhead.
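The controller analysis described above can be sketched as a simple coverage check. This is an illustration under stated assumptions, not the lean agent's algorithm: the function name, the idea that timings for not-yet-traced controllers are available (for example from sampling), the assumption that controller times do not overlap, and the interpretation of sensitivity as a required fraction of overall execution time are all hypothetical.

```python
def select_controllers(total_ms, candidate_ms, traced, sensitivity=0.8):
    """Return the traced controller set, widened until coverage is sufficient.

    total_ms:     overall execution time of the transaction
    candidate_ms: hypothetical map of controller name -> execution time
                  (assumed non-overlapping)
    traced:       controllers currently included in tracing
    sensitivity:  assumed fraction of total_ms the traced controllers
                  must account for
    """
    traced = set(traced)
    covered = sum(candidate_ms[name] for name in traced)
    # Consider the most expensive untraced controllers first.
    untraced = sorted(
        (name for name in candidate_ms if name not in traced),
        key=lambda name: candidate_ms[name],
        reverse=True,
    )
    for name in untraced:
        if covered >= sensitivity * total_ms:
            break  # traced controllers already explain enough of the time
        traced.add(name)
        covered += candidate_ms[name]
    return traced
```

Raising `sensitivity` includes more controllers and heightens visibility. Lowering it keeps the traced set small and reduces overhead.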
Integral signal and control mechanisms between monitoring agents and AIOps analytics are employed to limit the externalization to those needed for topology and causation analysis:
The rationale of boundary blame is to strike the best possible balance between capturing sufficient monitoring data and minimizing the cost of capture and externalization. The principle of boundary blame is to include in a topology the near-optimal set of components necessary to assign incident blame (cause) to 1) a tier boundary and 2) a component within that tier:
Boundary blame is best understood by considering this simple schematic depiction of the execution paths within a tier (it is simplified as trivial components are omitted):
Then consider the following components (as detailed in our first blog, Topology for Confident Observability):
These components give coverage for all significant components of a tier, and they indirectly cover related components. For example, A covers D, because the execution of A includes the execution of D.
Now, consider the caller-to-callee connectedness these components entail:
The first four component types provide tier visibility into transaction execution flow by including:
The controller type adds visibility into processing by including:
Coverage and connectedness ensure that any incident occurring within any component can be attributed to a tier and to a significant component within the tier. The aim of using these boundary blame component concepts is to qualify adequate topology coverage and connectedness of the ICT infrastructure.
Therefore, these components are collectively referred to as blame components.
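To illustrate how coverage and connectedness make blame assignable, the following minimal sketch attributes an incident to the blame component with the largest self time, then reports that component's tier. It is a hypothetical illustration, not the mechanism in DX APM: the function name, the input structures, and the use of self time as the blame criterion are assumptions.

```python
def assign_blame(inclusive_ms, calls, tier_of):
    """Return (tier, component) for the component with the largest self time.

    inclusive_ms: component -> execution time including its callees
                  (hypothetical measurements)
    calls:        caller -> list of callees (caller-to-callee connectedness)
    tier_of:      component -> name of the tier it belongs to
    """
    self_ms = {}
    for component, total in inclusive_ms.items():
        # Subtract the time spent in callees to isolate the component itself.
        callee_time = sum(inclusive_ms[c] for c in calls.get(component, []))
        self_ms[component] = total - callee_time
    culprit = max(self_ms, key=self_ms.get)
    return tier_of[culprit], culprit
```

Because each caller's time includes its callees, the connectedness between blame components is what allows time to be separated per component. Without those relationships, a slow backend would be indistinguishable from a slow frontend that calls it.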
Boundary blame is a monitoring concept that guides teams. It is also a mechanism (implemented within DX APM) that ensures transactional visibility, which in turn ensures topology completeness. Thus, adhering to boundary blame is a foundational paradigm for AIOps from Broadcom.
Boundary blame ensures balanced monitoring for efficient and effective gathering of necessary and accurate data. Our advanced mechanisms to carefully capture, assemble, and externalize warranted transaction traces prove that. With AIOps, that paradigm doesn’t change.
Boundary blame is a powerful mechanism that DX APM has employed, enhanced, and refined for decades. Its latest incarnation is embedded within the lean agent. It has proven that the impact on monitored components can indeed be minimal and tolerable, most often negligible, while meeting the demands for efficiency and effectiveness in monitoring.
As explained, the lean agent is capable of dynamically ensuring boundary blame.
Without innovative application of these concepts, visibility suffers, the cost of capture becomes untenable, or both. To achieve sufficient observability in modern ICT infrastructures, striking the balance described above is even more important, as it now enables advanced AIOps in complex environments.
Needless to say, businesses need to observe, monitor, and operate systems and services reliably with quality outcomes. Yet, many organizations lack or struggle with true end-to-end monitoring analysis. With complex systems and digital services, this is problematic for IT teams and indirectly for business teams.
ICT infrastructures constantly change as supporting technologies are added, removed, and updated to meet evolving business needs and fluid customer expectations.
AIOps from Broadcom understands the monitored environment and its dynamics, utilizing topology. The solution actively assists in continually improving observability and fostering best practices by autonomously discovering and including ICT infrastructure changes in the topology.
If you missed the prior blogs:
And coming next in our topology series, “Services for Business Observability.”
"In concept, topology is simple. Deriving topology and extracting trusted information from it, in real time for complex IT estates, is enormously complex and immensely valuable."