    August 6, 2024

    Topology for Confident Observability and Digital Resilience

    Note: This post was co-authored by Adam Frary
    Key Takeaways
    • Employ continuous AIOps capabilities to facilitate accurate causal analysis, incident detection, and noise reduction.
    • Automate the capture of topology to keep visibility current and access lower-level details.
    • Establish capabilities for reliably observing, monitoring, and operating systems and services.

    In recent years, we’ve significantly advanced how we think about and use topology within AIOps and Observability solutions from Broadcom, while solidly building on our innovative domain tools.

    We’re eager to share these innovations, advancements, and benefits for IT operations. In this blog post, we level-set on the topic of topology, clarify several important concepts, and discuss the decisive role topology plays in delivering powerful capabilities for AIOps and Observability from Broadcom.

    We also introduce how topology, when used to obtain service observability, can be a foundation for supporting the mandates described in the European Union's Digital Operational Resilience Act (DORA).

    Topology for confident observability

    Topology is foundational in enabling the advanced analytics and machine learning capabilities of AIOps and Observability from Broadcom.

    Consider the vast set of components in your IT estate, including hardware, software, logical, and virtual components. Add all the relationships and dependencies: the ways components interact with and depend on one another. This IT estate is referred to as an information and communications technology (ICT) infrastructure.

    Now consider this infrastructure mapped visually, with all components as vertices and all relationships and dependencies as connecting edges. This is a topology: a fully connected inventory of vertices and edges reflecting an ICT infrastructure, with descriptive attributes that add context. AIOps from Broadcom automatically synthesizes and maintains a topology, solely from monitoring data, as a bedrock element.
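    To make the vertex-and-edge idea concrete, here is a minimal Python sketch of a topology as a property graph: components are vertices carrying descriptive attributes, and relationships or dependencies are typed edges. The class names, edge labels, and attribute keys are hypothetical and are not the data model of AIOps from Broadcom.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A component of the ICT infrastructure (hardware, software, logical, or virtual)."""
    vertex_id: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Edge:
    """A relationship or dependency between two components."""
    source: str
    target: str
    kind: str  # illustrative labels such as "calls" or "depends_on"


@dataclass
class Topology:
    vertices: dict[str, Vertex] = field(default_factory=dict)
    edges: set[Edge] = field(default_factory=set)

    def add_vertex(self, vertex_id: str, **attributes: str) -> Vertex:
        # Merge attributes if the vertex already exists, so repeated discovery enriches it.
        vertex = self.vertices.setdefault(vertex_id, Vertex(vertex_id))
        vertex.attributes.update(attributes)
        return vertex

    def add_edge(self, source: str, target: str, kind: str) -> None:
        # Both endpoints must already be known vertices.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError(f"unknown endpoint in {source!r} -> {target!r}")
        self.edges.add(Edge(source, target, kind))

    def neighbors(self, vertex_id: str) -> list[str]:
        # Components reached by following edges out of this vertex.
        return sorted(e.target for e in self.edges if e.source == vertex_id)


topo = Topology()
topo.add_vertex("web-frontend", layer="application", tier="web")
topo.add_vertex("orders-db", layer="application", tier="database")
topo.add_edge("web-frontend", "orders-db", "calls")
print(topo.neighbors("web-frontend"))  # ['orders-db']
```

    The attributes dictionary is what "augments contextualization": any property (tier, layer, owner, region) can be attached without changing the graph's shape.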

    Topology provides an overview of the whole infrastructure while also letting teams focus on individual parts, making even vast ICT infrastructures manageable and understandable.

    Primarily, however, topology enables AIOps to deliver automatic observability. Topology provides context for every component, which enables holistic causation analysis. In this way, teams can identify which components caused an issue and what its impact is. Importantly, this elevates monitoring of individual components to holistic observability of an entire infrastructure.

    A topology is not a static snapshot of your ICT infrastructure but a constantly tracked, accurate, and augmented reflection of it, held within a registry. Components and their relationships and dependencies are detected, derived, and maintained in real time. Topologies are created by automatically and continuously analyzing, normalizing, unifying, and correlating data gathered from monitoring your ICT infrastructure. This keeps the topology current, accurate, and consistent.
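    The continuous normalize, unify, and correlate cycle can be sketched as a function that folds each new batch of monitoring observations into an existing registry. This is an illustrative sketch only: the observation format and the normalization rule (lowercasing and dropping a DNS domain suffix so that two monitors naming the same host differently are unified) are assumptions, not how AIOps from Broadcom processes its data.

```python
def normalize_id(raw: str) -> str:
    """Assumed rule: lowercase and drop any DNS domain suffix."""
    return raw.strip().lower().split(".")[0]


def apply_observations(
    vertices: dict[str, dict], edges: set[tuple[str, str]], observations: list[dict]
) -> None:
    """Fold a batch of monitoring observations into the registry in place.

    Each observation is assumed to look like
    {"source": ..., "component": ..., "attrs": {...}, "calls": [...]}.
    """
    for obs in observations:
        vid = normalize_id(obs["component"])                   # normalize
        vertex = vertices.setdefault(vid, {"seen_by": set()})
        vertex["seen_by"].add(obs["source"])                   # unify reports from several monitors
        vertex.update(obs.get("attrs", {}))                    # record descriptive attributes
        for callee in obs.get("calls", []):
            cid = normalize_id(callee)
            vertices.setdefault(cid, {"seen_by": set()})
            edges.add((vid, cid))                              # correlate into an edge


vertices: dict[str, dict] = {}
edges: set[tuple[str, str]] = set()
apply_observations(vertices, edges, [
    {"source": "apm", "component": "Checkout-Svc", "calls": ["DB01.example.com"]},
    {"source": "infra", "component": "db01", "attrs": {"os": "linux"}},
])
print(sorted(edges))  # [('checkout-svc', 'db01')]
```

    Because the function updates the registry in place, it can be called again whenever new monitoring data arrives, which is what keeps the topology current instead of a one-off snapshot.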

    During topology synthesis and maintenance, all AIOps platform data, metrics, alarms, and events are interlinked, enabling cross-referencing for analytics, visualization, and navigation.
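    Interlinking can be pictured as an index that keys every metric, alarm, and event by the topology vertex it was observed on, so that any signal can be cross-referenced with the others on the same component. The `Signal` and `SignalIndex` names below are hypothetical, chosen only for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signal:
    kind: str       # "metric", "alarm", or "event"
    vertex_id: str  # topology vertex the signal was observed on
    name: str


class SignalIndex:
    """Hypothetical index linking each signal to its topology vertex."""

    def __init__(self) -> None:
        self._by_vertex: dict[str, list[Signal]] = defaultdict(list)

    def add(self, signal: Signal) -> None:
        self._by_vertex[signal.vertex_id].append(signal)

    def for_component(self, vertex_id: str, kind: Optional[str] = None) -> list[Signal]:
        """All signals on a component, optionally filtered by kind."""
        return [s for s in self._by_vertex.get(vertex_id, []) if kind is None or s.kind == kind]


index = SignalIndex()
index.add(Signal("alarm", "db01", "disk latency high"))
index.add(Signal("metric", "db01", "disk.read_latency_ms"))
index.add(Signal("event", "checkout-svc", "deployment"))

# Starting from the alarm, cross-reference the metrics on the same component.
alarm = index.for_component("db01", "alarm")[0]
related_metrics = index.for_component(alarm.vertex_id, "metric")
print([m.name for m in related_metrics])  # ['disk.read_latency_ms']
```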

    Topology benefits

    Through topology, our AIOps solution can provide significant benefits to users:

    • Visualize the topology to convey augmented observability at any level of status, health, risk, availability, responsiveness, experience, activity, as well as additional IT and business KPIs.

    • Analyze every incident, event, alarm, and change within its topological context, so that relationships and dependencies are included in automatic, holistic analysis and in machine learning for business-aligned aggregation.

    • Prioritize anomalies and problems by user experience and business service impact.

    These benefits address business and IT teams’ increasing expectations to be able to map business transactions and experience to infrastructure data for more precise triaging and better, business-aligned prioritizations and insights.

    This is driven by automatic, continuous AIOps capabilities:

    • Accurate causal analysis. Causal analysis becomes more thorough and complete when it uses an accurate and complete topology, which helps reduce MTTR.
    • Reactive issue detection. Incidents that impact users or business services (that is, problems) can be recognized and prioritized for immediate, reactive attention and resolution.
    • Proactive issue detection. Incidents that are not yet affecting users may be anomalies that are still developing. When potential issues are “brewing” within your infrastructure, they can be recognized and prioritized for proactive attention and resolution by the appropriate team.
    • Issue evidence for assisted root cause analysis. An incident’s topological context is used to present evidence for incident visualization and root cause confirmation.
    • Accurate noise reduction. The topological context of incoming alarms is used to cluster alarms, highlight alarms on root-cause components, and suppress less relevant alarms.
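    One simple way to illustrate topology-based noise reduction is to group alarmed components that are directly linked by caller-to-callee edges. Within each group, alarms on components whose own callees are not alarmed are treated as probable causes, and the rest become candidates for suppression. This is a deliberately naive sketch with hypothetical component names, not the clustering algorithm used by AIOps from Broadcom.

```python
from collections import defaultdict


def cluster_alarms(alarmed: set[str], calls: set[tuple[str, str]]) -> list[set[str]]:
    """Group alarmed components linked by caller->callee edges (direction ignored)."""
    adj: dict[str, set[str]] = defaultdict(set)
    for caller, callee in calls:
        if caller in alarmed and callee in alarmed:
            adj[caller].add(callee)
            adj[callee].add(caller)
    seen: set[str] = set()
    clusters: list[set[str]] = []
    for start in sorted(alarmed):  # sorted() only makes the output order stable
        if start in seen:
            continue
        cluster, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in cluster:
                cluster.add(node)
                stack.extend(adj[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters


def probable_causes(cluster: set[str], calls: set[tuple[str, str]]) -> set[str]:
    """Alarmed components in the cluster that call no other alarmed component.
    Alarms on the remaining components are candidates for suppression."""
    callers = {caller for caller, callee in calls if caller in cluster and callee in cluster}
    return cluster - callers


calls = {("web", "checkout"), ("checkout", "db01"), ("web", "search")}
alarmed = {"web", "checkout", "db01", "cdn"}
clusters = cluster_alarms(alarmed, calls)     # two clusters: {cdn} and {web, checkout, db01}
causes = probable_causes(clusters[1], calls)  # {'db01'}
```

    Four alarms collapse into two clusters, and only the `db01` alarm needs to be highlighted in the larger one. In a topology with call cycles, `probable_causes` can return an empty set, a case a real implementation would have to handle.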

    Topology is beneficial for users, but it’s even more valuable for AIOps solutions. As outlined above, the algorithms that AIOps uses for analytics functions are based on the application of topology.

    In our next blog, “Topology for Incident Causation and Machine Learning within AIOps,” we will detail how the above capabilities are realized through incident contexts that are established automatically by exploiting topology.

    These capabilities allow AIOps from Broadcom to provide observability for IT practitioners, who may lack the time or expertise required for deep diagnosis. The solution helps users understand transaction issues by automatically detecting problems and anomalies. This enables teams to mature monitoring, moving from reactive to proactive approaches.

    Regardless of experience and skill levels, users are assisted with the right insights to take the actions needed. Effectively, these benefits and capabilities accelerate skills improvement and deepen experience and expertise, which most often can’t happen quickly enough. This helps users work more efficiently within their own domain teams and collaborate with teams in other IT domains and with line-of-business representatives.

    These benefits have vast ramifications, not least for the alignment of IT outputs with business outcomes. In other words, this helps close the IT-to-business gap and realize business observability.

    As we cover in our third blog, “Efficiency with Silos: Unify,” AIOps adoption is likely to change how teams communicate and collaborate.

    In our fourth blog, “Services for Business Observability,” we will discuss how causation is a foundational capability for business observability, as offered by service analytics.

    Level-setting topology

    An ICT infrastructure exists to support the execution of business service transactions as cost-effectively as possible. Every transaction traverses the infrastructure's network in multiple steps, executing on the appropriate server tiers.
    The following key concepts have proven effective and essential for synthesizing a topology for AIOps, because the topology must fully encompass all significant components:

    • Tiers. Tiers are the separate execution environments (such as servers, services, pods, containers, messaging, streaming, and databases) where transactions execute. By our convention, execution flows from left to right, as indicated by the white arrows. A component’s upstream components are all those reached by following the arrows; its downstream components are those reached by going against them.

      Figure 1

      An ICT infrastructure is the sum of its tiers. Because tiers are most often monitored individually, transaction traces that span tiers are “stitched” together and correlated to form a holistic, fully connected topology.

    • Layers. A tier is separated into application (software components), infrastructure, and network layers. This segregates components functionally, logically, or technologically for clarity, allowing specialized focus and usage.

      A layer is a flexible concept, and more layers may be introduced for further segregation. For example, a service layer may be added as a new top layer, a cluster layer may be added for the components of a cluster, or a cloud layer may be created for the components of a cloud, which may again differ among vendors.

      Layers offer focused perspectives, or sub-views, ideal for collaboration among technical, operational, and business teams on common ground: a shared view of the topology.

      Figure 2

    • Components. App components receive transactions (or sub-transactions) either for execution within their tier or for passing on to subsequent tiers. Components are contextualized by arbitrary attributes (also known as properties). Components are the vertices in the tier drawing.

    • Front-ends. An app component receiving transactions from another tier is a front-end. Thus, a front-end is a significant flow component. Front-ends are blue in the tier drawing.

      The foremost front-end component is the component nearest to the user and is therefore said to expose the user experience.

    • Back-ends. An app component sending transactions to another tier is a back-end, as is a component that only receives transactions. Back-ends are brown in the tier drawing. Within a tier, execution flows front-end → back-end; between tiers, it flows back-end → front-end. Thus, a back-end is a significant flow component. Back-ends are farthest from the user and are therefore the ultimate sources of incident cause.

    • Controllers. The app components within a tier that do significant transaction processing (synchronous or asynchronous) are its controllers. They are therefore significant and are shown in yellow and green.

    • Relationships. App components within a tier may have dynamic caller-to-callee relationships, established by the call paths (also known as flows) of transaction execution. Relationships are the arrows.

      Because a relationship reflects a caller calling a callee during transaction execution, and impact propagates from callee back to caller, relationships can be followed upstream (in the arrow direction) to trace impact to its origin, the root cause. Conversely, relationships can be followed downstream (against the arrow direction) to determine the impacted front-ends. Relationships are therefore significant.

      These paths, from the root-cause component to the impacted components, are incident paths. The set of components within the incident paths is the incident context.

    • Correlations. As noted, a topology is the sum, or correlation, of its tiers. Correlations are the inter-tier relationships established as transactions traverse the network from tier to tier, and are depicted by purple arrows in the tier drawing. Because they reflect tier-to-tier flows, correlations are crucial for obtaining a fully connected topology that includes all tiers. Often, correlations are not distinguished from relationships, because tiers are only indirectly present in a topology (through the attributes and dependencies of a tier’s components).

    • Dependencies. Components within the same tier are dependent on one another. This can include the server, the NICs, the RAM, the disks, the apps, and so on. Components of higher layers depend on components of lower layers—logically, physically, or virtually.

      Thus, dependencies can be used to determine cause components at the lowest level and impacted components at the highest level; they are therefore significant.

    • End-to-end relationships. Collectively, the relationships form the caller-to-callee flow from the first front-end to the last back-end, connected throughout.

    • Top-to-bottom dependencies. Collectively, the dependencies run from the top layer through the intermediate layers to the bottom layer, connected throughout.

    • Services. A service is a component that enriches a topology by associating the components that participate in delivering an IT or business service under a single named service; these associations are service dependencies. Services allow for referencing and aggregating a well-defined and meaningful subset of an ICT infrastructure. More on services in our upcoming blog, “Services for Business Observability.”
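    The concepts above combine into incident paths. The following Python sketch assumes a small hypothetical topology whose edges all point upstream in the post’s convention (caller to callee for relationships and correlations, higher to lower layer for dependencies). It uses plain breadth-first search and a simple “most-upstream anomalous component” rule to pick root-cause candidates, then walks downstream to the impacted front-ends to collect the incident context. The rule and all names are illustrative assumptions, not the causation analysis of AIOps from Broadcom.

```python
from collections import defaultdict, deque

# Hypothetical topology. Every edge points upstream in the post's convention:
#   ("calls", caller, callee)      relationship within a tier, or correlation across tiers
#   ("depends_on", higher, lower)  dependency from a higher layer to a lower layer
EDGES = [
    ("calls", "web-frontend", "checkout-app"),
    ("calls", "web-frontend", "search-app"),
    ("calls", "checkout-app", "orders-db"),  # correlation: app tier -> database tier
    ("depends_on", "orders-db", "db-host"),
    ("depends_on", "db-host", "db-disk"),
]
FRONT_ENDS = {"web-frontend", "admin-ui"}
ANOMALOUS = {"web-frontend", "checkout-app", "orders-db", "db-disk"}


def build_adjacency(edges):
    """Return (up, down): up follows the arrows, down goes against them."""
    up, down = defaultdict(set), defaultdict(set)
    for _kind, src, dst in edges:
        up[src].add(dst)
        down[dst].add(src)
    return up, down


def reachable(start, adjacency):
    """Breadth-first search: every vertex reachable from start, including start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def root_cause_candidates(impacted, anomalous, up):
    """Anomalous components upstream of an impacted component that have no
    other anomalous component further upstream of themselves."""
    suspects = reachable(impacted, up) & anomalous
    return {s for s in suspects if not (reachable(s, up) - {s}) & anomalous}


def incident_context(root, front_ends, up, down):
    """Front-ends impacted by a root cause, plus every component on the
    incident paths between them (the incident context)."""
    from_root = reachable(root, down)
    impacted = from_root & front_ends
    context = set()
    for fe in impacted:
        context |= reachable(fe, up) & from_root
    return impacted, context


up, down = build_adjacency(EDGES)
roots = root_cause_candidates("web-frontend", ANOMALOUS, up)  # {'db-disk'}
impacted, context = incident_context("db-disk", FRONT_ENDS, up, down)
```

    Here `search-app` is upstream of the front-end but not on any path to the root cause, so it is left out of the incident context, and `admin-ui` is not impacted because no path connects it to `db-disk`. Note how the dependency edges carry the search below the application layer to the disk.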

    The automatic capabilities mentioned above cannot function reliably, or at all, if any significant component is missing from the topology. Such an omission leaves the component disconnected, impairing causation analysis, as explained in the next section.
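    The effect of a missing component or correlation can be shown by checking whether the topology is fully connected when arrow direction is ignored. In this hypothetical example, dropping the single correlation between `checkout-app` and `orders-db` splits the topology into two islands, so a cause in the database tier could no longer be linked to the front-end it impacts. This is an illustrative sketch, not a product feature.

```python
def weak_components(vertices: set[str], edges: set[tuple[str, str]]) -> list[set[str]]:
    """Group vertices that are connected when arrow direction is ignored."""
    neighbors: dict[str, set[str]] = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    remaining, groups = set(vertices), []
    while remaining:
        group, stack = set(), [min(remaining)]  # min() only makes the output order stable
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(neighbors[node] - group)
        remaining -= group
        groups.append(group)
    return groups


vertices = {"web-frontend", "checkout-app", "orders-db", "db-host"}
complete = {
    ("web-frontend", "checkout-app"),
    ("checkout-app", "orders-db"),  # correlation: app tier -> database tier
    ("orders-db", "db-host"),       # dependency: database -> host
}
missing_correlation = complete - {("checkout-app", "orders-db")}

print(len(weak_components(vertices, complete)))             # 1: fully connected
print(len(weak_components(vertices, missing_correlation)))  # 2: disconnected islands
```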

    The topology we build using these concepts offers the intuitive capabilities of automatic visualization, analysis, and prioritization, allowing IT teams to communicate about, and better manage, the increasing complexity and detail of ICT infrastructures. They can obtain and understand the end-to-end overview and the top-to-bottom details of their IT estate.

    Topology for observability assurance—observe with confidence 

    Keeping the topology up to date is a particular challenge in modern deployment architectures, such as cloud and Kubernetes, where the topology can change rapidly.

    Needless to say, businesses need to observe, monitor, and operate systems and services reliably, with quality outcomes. Yet, many organizations struggle to get true end-to-end monitoring analysis. With complex systems and digital services, this is problematic for IT teams and for business teams. 

    ICT infrastructures, especially those with modern architectures, such as cloud environments and clusters, constantly change as supporting technologies are added, removed, and updated. This constant change is vital to meet evolving business needs and fluid customer expectations, but it poses serious challenges for keeping the topology current and even for accessing lower-level topology details. Automatic topology capture resolves this.

    AIOps from Broadcom uses topology to understand the dynamics of monitored environments. The solution actively assists in continually improving observability and fostering best practices by autonomously discovering changes within the ICT infrastructure and including them in the topology.
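    Change discovery can be illustrated by diffing two successive topology snapshots and applying the difference to the registry. The sketch below uses a hypothetical Kubernetes-style case in which a pod behind a service is replaced; the pod names and the `TopologyDiff` type are assumptions for illustration only.

```python
from dataclasses import dataclass

Edge = tuple[str, str]


@dataclass(frozen=True)
class TopologyDiff:
    added_vertices: frozenset[str]
    removed_vertices: frozenset[str]
    added_edges: frozenset[Edge]
    removed_edges: frozenset[Edge]


def diff(old_v: set[str], old_e: set[Edge], new_v: set[str], new_e: set[Edge]) -> TopologyDiff:
    """Compare two snapshots and report what appeared and what disappeared."""
    return TopologyDiff(
        added_vertices=frozenset(new_v - old_v),
        removed_vertices=frozenset(old_v - new_v),
        added_edges=frozenset(new_e - old_e),
        removed_edges=frozenset(old_e - new_e),
    )


def apply_diff(vertices: set[str], edges: set[Edge], d: TopologyDiff) -> tuple[set[str], set[Edge]]:
    """Fold a diff into the registry: drop removed items, include discovered ones."""
    return (vertices - d.removed_vertices) | d.added_vertices, (edges - d.removed_edges) | d.added_edges


# A pod behind a service is replaced during a rolling update (hypothetical names).
old_v = {"svc-checkout", "checkout-7f9c"}
old_e = {("svc-checkout", "checkout-7f9c")}
new_v = {"svc-checkout", "checkout-5d2a"}
new_e = {("svc-checkout", "checkout-5d2a")}

change = diff(old_v, old_e, new_v, new_e)
print(sorted(change.added_vertices), sorted(change.removed_vertices))
# ['checkout-5d2a'] ['checkout-7f9c']
```

    Reporting the diff explicitly, rather than only overwriting the registry, is what lets changes themselves be analyzed alongside incidents.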

    We elaborate on this in our next blog, “Topology for Incident Causation and Machine Learning within AIOps.”


    "Topology is the foundation of AIOps and observability. Topological understanding is the genius behind the solution."

    Henrik Nissen Ravn
    AIOps Technologist | APM Champion


     


    Jörg Mertin

    Jörg Mertin, a Master Solution Engineer on the AIOps and Observability team, is a self-learner and technology enthusiast. A testament to this is his early adopter work to learn and evangelize Linux in the early 1990s. Whether addressing coordinating monitoring approaches for full-fledged cloud deployments or a...
