August 28, 2024

Monitoring the Monitor: Achieving High Availability in DX Unified Infrastructure Management

Key Takeaways

Eliminate single points of failure by ensuring every component has a backup and a seamless transition process.
Implement comprehensive high availability strategies in DX Unified Infrastructure Management (DX UIM) to mitigate risks and ensure continuous operation.
Utilize failover configurations to facilitate continuous monitoring and prevent service interruptions.

DX Unified Infrastructure Management (DX UIM) from Broadcom is a comprehensive solution for monitoring an organization’s entire IT infrastructure from a single platform. DX UIM provides IT administrators and operations teams with a centralized view of their infrastructure to ensure availability and performance of servers, network devices, storage systems, virtualization environments, applications, and cloud services.

While most organizations have monitoring in place, the question of redundancy often remains. Monitoring applications are critical; without them, any downtime in IT infrastructure could be catastrophic for business operations. Imagine losing a hub in a regional network—robots would continue collecting data and raising alarms, but with no destination to send them to, the organization would be effectively blind. This scenario, though potentially disastrous, is easily preventable.

As businesses increasingly rely on complex applications, monitoring their availability and performance becomes crucial. But what about the monitoring system itself?

Should we monitor the monitor?

This question has come up a few times over the years, but not nearly as often as I would expect. Much of that is down to the reliability of monitoring systems like DX UIM and the platforms they run on. Even so, the monitoring system must always be operational. DX UIM can monitor itself, but redundancy is needed to safeguard against catastrophic failures.

High availability in DX UIM

High availability means designing and implementing systems so that they remain operational and accessible over extended periods, with minimal downtime or disruption. The goal is to eliminate single points of failure by ensuring every component has a backup and a seamless transition process. In this context, DX UIM is a well-established, mature solution that has undergone extensive development and rigorous testing over the years, and its core components consistently demonstrate robust functionality, reliability, and resilience. However, software alone cannot guarantee uninterrupted service, because potential points of failure still exist within the broader system architecture. Implementing comprehensive high availability strategies is therefore essential to mitigate risks and ensure continuous operation.

An example of a high availability DX UIM architecture:

[Figure 1: High availability DX UIM architecture]

Database: The foundation of DX UIM's design should include a highly available database system. Technologies like SQL Server Always On availability groups provide synchronous updates between database copies, ideally located at separate geographical sites to ensure performance and redundancy.
Primary Hub: The Primary Hub is a single point of failure in DX UIM. To prevent total outages, a Secondary Hub, identical to the Primary, is established. Equipped with an HA probe, the Secondary Hub monitors the Primary Hub's responsiveness and takes over if necessary, ensuring continuity.
Robots: Each robot (agent) in DX UIM can be configured with a Secondary Hub for connection, providing a safety net if the primary connection fails. This built-in high availability feature ensures continuous monitoring.
Remote Hubs: It is best practice to connect only the Operator Console robot and CABI robot to the Primary Hub, with others assigned to remote hubs configured in pairs. This setup allows up to 2000 robots per hub, with a backup hub ready to take over if needed.
Operator Console: The web-based front end is hosted on separate robots for scalability and high availability, with user traffic load-balanced across them. If the Primary Hub fails, these robots automatically switch to the Secondary Hub, maintaining service continuity.
Alarms: The NAS probe, responsible for event notifications, runs on both the Primary and Secondary Hubs. NAS replication ensures that all events are mirrored in the Secondary NAS, preventing duplication and ensuring readiness for failover scenarios.
Monitoring configuration: Stored in the database, configurations are also locally saved on each robot, allowing monitoring to continue even if network connectivity is temporarily lost.
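
The hub failover described above follows a common heartbeat pattern: a standby checks the active node and takes over only after several consecutive missed checks. The following is a minimal Python sketch of that general pattern, not the DX UIM HA probe's actual implementation. The class name `FailoverMonitor`, the `failure_threshold` parameter, the `check_primary` callable, and the automatic fail-back behavior are all hypothetical and chosen for illustration.

```python
from typing import Callable


class FailoverMonitor:
    """Illustrative standby-side heartbeat logic (hypothetical, not DX UIM code)."""

    def __init__(self, check_primary: Callable[[], bool], failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.check_primary = check_primary
        self.failure_threshold = failure_threshold
        self.missed = 0       # consecutive failed heartbeat checks
        self.active = False   # True while the standby has taken over

    def poll(self) -> bool:
        """Run one heartbeat check and return whether the standby is active."""
        try:
            healthy = self.check_primary()
        except Exception:
            # Treat any error while contacting the primary as a missed heartbeat.
            healthy = False

        if healthy:
            # Assumption: hand control back as soon as the primary responds again.
            self.missed = 0
            self.active = False
        else:
            self.missed += 1
            if self.missed >= self.failure_threshold:
                self.active = True
        return self.active
```

Requiring several consecutive misses before taking over is a design choice that avoids failing over on a single dropped check; the right threshold and polling interval depend on the environment.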

So, every component of DX UIM can be replicated or duplicated and, in many cases, can fail over automatically, to the degree that we can expect near 100% availability of the monitoring system.
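
To put a rough number on "near 100%", a standard reliability approximation treats a redundant group as unavailable only when all of its members are down at the same time. The Python sketch below assumes failures are independent and failover is instantaneous, which real deployments only approximate. The function names are hypothetical, and any availability figure fed into them is an illustrative input, not a measured DX UIM value.

```python
HOURS_PER_YEAR = 365 * 24


def redundant_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` independent, redundant components in parallel."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The group is down only if every copy is down simultaneously.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per (non-leap) year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a single component at 99% availability is down roughly 87.6 hours per year, while a redundant pair of such components reaches about 99.99%, or under one hour per year.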

High availability is crucial for businesses and organizations that rely heavily on their IT infrastructure to deliver services, maintain customer satisfaction, and prevent revenue loss. Industries such as finance, healthcare, e-commerce, and telecommunications, where downtime can have severe consequences, place a strong emphasis on implementing high availability solutions, including for the monitoring solution itself.

The trade-off

In my opinion, there is always a trade-off between risk and cost. While rebuilding a hub in a new virtual machine can be quick, the cost of additional hubs for redundancy is decreasing, making a highly available solution increasingly viable and recommended.
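
The cost side of this trade-off depends largely on how many hubs a paired design needs. As a rough sketch, assuming the figure of up to 2000 robots per hub quoted earlier in this article and one standby hub for every active hub, the count can be estimated as follows. The function name and the strict one-to-one pairing are illustrative assumptions, not official sizing guidance.

```python
import math
from typing import Tuple

ROBOTS_PER_HUB = 2000  # per-hub figure quoted in this article


def hubs_required(robot_count: int, robots_per_hub: int = ROBOTS_PER_HUB) -> Tuple[int, int]:
    """Return (active_hubs, total_hubs_including_standbys) for a paired design."""
    if robot_count < 0:
        raise ValueError("robot_count must not be negative")
    if robots_per_hub < 1:
        raise ValueError("robots_per_hub must be at least 1")
    active = math.ceil(robot_count / robots_per_hub)
    # Assumption: every active hub has exactly one standby partner.
    return active, active * 2
```

For example, `hubs_required(5000)` gives three active hubs and six hubs in total once each has a standby.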

Tag(s): AIOps, DX UIM

Rowan Collis

Rowan Collis has worked within the AIOps monitoring space for 18 years. With 15 years of UIM/Nimsoft experience, Rowan is the most experienced UIM consultant worldwide and always willing to help customers maximize their investment in UIM.
