    August 28, 2024

    Monitoring the Monitor: Achieving High Availability in DX Unified Infrastructure Management

    Key Takeaways
    • Eliminate single points of failure by ensuring every component has a backup and a seamless transition process.
    • Implement comprehensive high availability strategies in DX Unified Infrastructure Management (DX UIM) to mitigate risks and ensure continuous operation.
    • Utilize failover configurations to facilitate continuous monitoring and prevent service interruptions.

    DX Unified Infrastructure Management (DX UIM) from Broadcom is a comprehensive solution for monitoring an organization’s entire IT infrastructure from a single platform. DX UIM provides IT administrators and operations teams with a centralized view of their infrastructure to ensure availability and performance of servers, network devices, storage systems, virtualization environments, applications, and cloud services.

    While most organizations have monitoring in place, the question of redundancy often remains. Monitoring applications are critical; without them, any downtime in IT infrastructure could be catastrophic for business operations. Imagine losing a hub in a regional network—robots would continue collecting data and raising alarms, but with no destination to send them to, the organization would be effectively blind. This scenario, though potentially disastrous, is easily preventable.

    As businesses increasingly rely on complex applications, monitoring their availability and performance becomes crucial. But what about the monitoring system itself?

    Should we monitor the monitor?

    This question has come up a few times over my years in the field, but not nearly as often as I would expect. Much of that has to do with the reliability of monitoring systems like DX UIM and the platforms they run on. However, ensuring the monitoring system is always operational is essential. DX UIM can monitor itself, but redundancy is necessary to safeguard against catastrophic failures.

    High availability in DX UIM

    High availability means designing and implementing systems so that they remain operational and accessible over extended periods, with minimal downtime or disruption. The goal is to eliminate single points of failure by ensuring every component has a backup and a seamless transition process. In this context, DX UIM is a well-established and mature solution that has undergone extensive development and rigorous testing over the years, and its core components consistently demonstrate robust functionality, reliability, and resilience. However, software alone cannot guarantee uninterrupted service, because potential points of failure still exist within the broader system architecture. Implementing comprehensive high availability strategies is therefore essential to mitigate risks and ensure continuous operation.

    An example of a high availability DX UIM architecture:

    Figure 1: High availability DX UIM architecture

    • Database: The foundation of DX UIM's design should include a highly available database system. Technologies like SQL Server Always On availability groups provide synchronous updates between database copies, ideally located at separate geographical sites to ensure performance and redundancy.
    • Primary Hub: The Primary Hub is a critical point of failure in DX UIM. To prevent total outages, a Secondary Hub, identical to the Primary, is established. Equipped with an HA probe, it monitors the Primary Hub's responsiveness and takes over if necessary, ensuring continuity.
    • Robots: Each robot (agent) in DX UIM can be configured with a Secondary Hub for connection, providing a safety net if the primary connection fails. This built-in high availability feature ensures continuous monitoring.
    • Remote Hubs: It is best practice to connect only the Operator Console robot and CABI robot to the Primary Hub, with others assigned to remote hubs configured in pairs. This setup allows up to 2000 robots per hub, with a backup hub ready to take over if needed.
    • Operator Console: The web-based front end, hosted on separate robots for scalability and high availability, should be load-balanced among users. In case of a Primary Hub failure, these robots automatically switch to the Secondary Hub, maintaining service continuity.
    • Alarms: The NAS probe, responsible for event notifications, runs on both the Primary and Secondary Hubs. NAS replication ensures that all events are mirrored in the Secondary NAS, preventing duplication and ensuring readiness for failover scenarios.
    • Monitoring configuration: Stored in the database, configurations are also locally saved on each robot, allowing monitoring to continue even if network connectivity is temporarily lost.
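    The takeover behavior described for the Primary and Secondary Hubs can be illustrated with a small, generic sketch. This is not the DX UIM HA probe or its configuration; the class name `FailoverMonitor`, the `max_missed` threshold, and the automatic fail-back once the primary responds again are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical model of a secondary node watching a primary node."""

    max_missed: int = 3       # assumed threshold, not a DX UIM default
    missed: int = 0           # consecutive heartbeats the primary has missed
    active: str = "primary"   # which node is currently serving

    def record(self, primary_responded: bool) -> str:
        """Record one heartbeat result and return the node that should be active."""
        if primary_responded:
            # Assumption: fail back as soon as the primary answers again.
            self.missed = 0
            self.active = "primary"
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = "secondary"
        return self.active


monitor = FailoverMonitor(max_missed=3)
heartbeats = (True, False, False, False, True)
results = [monitor.record(ok) for ok in heartbeats]
print(results)
```

    Requiring several consecutive misses before switching is a common way to avoid failing over on a single dropped check; the right threshold depends on how quickly an outage must be detected versus how costly a false takeover is.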

    So, we can see that every component of DX UIM can be replicated or duplicated and, in many cases, can fail over automatically, to a degree that we can expect near-100% availability of the monitoring system.
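    To give a feel for why a redundant pair gets close to 100%, here is a rough back-of-the-envelope sketch. It assumes the two copies fail independently and that failover is instantaneous, neither of which is guaranteed in practice, and the 99% single-hub figure is an illustrative number rather than a measured DX UIM value.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


for copies in (1, 2):
    availability = combined_availability(0.99, copies)  # 0.99 is illustrative
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{copies} copy/copies: {availability:.4%} available, "
          f"about {downtime_hours:.1f} hours of downtime per year")
```

    Under these assumptions, a single component at 99% is unavailable for roughly 87.6 hours a year, while a pair is unavailable for under an hour. Real figures will be lower than this ideal because failures are often correlated (shared network, shared site) and takeover is not instant.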

    High availability is crucial for businesses and organizations that rely heavily on their IT infrastructure to deliver services, maintain customer satisfaction, and prevent revenue loss. Industries such as finance, healthcare, e-commerce, and telecommunications, where downtime can have severe consequences, place a strong emphasis on implementing high availability solutions, and that should extend to the monitoring solution itself.

    The trade-off

    In my opinion, there is always a trade-off between risk and cost. While rebuilding a hub in a new virtual machine can be quick, the cost of additional hubs for redundancy is decreasing, making a highly available solution increasingly viable and recommended.
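    To put a rough number on the cost side of that trade-off, the sketch below estimates how many remote hubs a deployment needs with and without paired backups. It uses the 2000 robots per hub figure mentioned earlier and assumes, as a simplification, that each pair consists of one active hub plus one backup; the function name `hubs_required` is hypothetical.

```python
import math

ROBOTS_PER_HUB = 2000  # per-hub figure quoted earlier in this article


def hubs_required(robot_count: int, redundant: bool = True) -> int:
    """Estimate remote hubs needed; with redundancy, every active hub gets a backup."""
    if robot_count < 0:
        raise ValueError("robot_count must not be negative")
    active_hubs = math.ceil(robot_count / ROBOTS_PER_HUB)
    return active_hubs * 2 if redundant else active_hubs


for robots in (1500, 5000):
    print(f"{robots} robots: {hubs_required(robots, redundant=False)} hub(s) "
          f"without backups, {hubs_required(robots)} with paired backups")
```

    In this simple model redundancy doubles the remote hub count, so the question becomes whether that extra cost is smaller than the cost of running blind while a failed hub is rebuilt.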

    Tag(s): AIOps , DX UIM

    Rowan Collis

    Rowan Collis has worked within the AIOps monitoring space for 18 years. With 15 years of UIM/Nimsoft experience, Rowan is the most experienced UIM consultant worldwide and is always willing to help customers maximize their investment in UIM.
