Monitoring the Monitor: Achieving High Availability in DX Unified Infrastructure Management

Written by Rowan Collis | Aug 28, 2024 3:05:04 PM

Key Takeaways

Eliminate single points of failure by ensuring every component has a backup and a seamless transition process.
Implement comprehensive high availability strategies in DX Unified Infrastructure Management (DX UIM) to mitigate risks and ensure continuous operation.
Utilize failover configurations to facilitate continuous monitoring and prevent service interruptions.

DX Unified Infrastructure Management (DX UIM) from Broadcom is a comprehensive solution for monitoring an organization’s entire IT infrastructure from a single platform. DX UIM provides IT administrators and operations teams with a centralized view of their infrastructure to ensure availability and performance of servers, network devices, storage systems, virtualization environments, applications, and cloud services.

While most organizations have monitoring in place, the question of redundancy often remains. Monitoring applications are critical; without them, any downtime in IT infrastructure could be catastrophic for business operations. Imagine losing a hub in a regional network—robots would continue collecting data and raising alarms, but with no destination to send them to, the organization would be effectively blind. This scenario, though potentially disastrous, is easily preventable.

As businesses increasingly rely on complex applications, monitoring their availability and performance becomes crucial. But what about the monitoring system itself?

Should we monitor the monitor?

This question has come up a few times over the years, but not nearly as often as I would expect. Much of that has to do with the reliability of monitoring systems like DX UIM and the platforms they run on. However, ensuring the monitoring system is always operational is essential. DX UIM allows you to monitor itself, but redundancy is necessary to safeguard against catastrophic failures.

High availability in DX UIM

High availability involves designing and implementing systems so that they remain operational and accessible over extended periods, with minimal downtime or disruption. The goal is to eliminate single points of failure by ensuring every component has a backup and a seamless transition process. In this context, DX UIM is a well-established and mature solution that has undergone extensive development and rigorous testing over the years, and its core components consistently demonstrate robust functionality, reliability, and resilience. However, software alone cannot guarantee uninterrupted service, as potential points of failure still exist within the broader system architecture. Therefore, implementing comprehensive high availability strategies is essential to mitigate risks and ensure continuous operation.

An example of a high availability DX UIM architecture:

Database: The foundation of DX UIM's design should include a highly available database system. Technologies like SQL Server Always On availability groups provide synchronous updates between database copies, ideally located at separate geographical sites to ensure performance and redundancy.
Primary Hub: The Primary Hub is a critical point of failure in DX UIM. To prevent total outages, a Secondary Hub, identical to the Primary, is established. Equipped with an HA probe, it monitors the Primary Hub's responsiveness and takes over if necessary, ensuring continuity.
Robots: Each robot (agent) in DX UIM can be configured with a Secondary Hub for connection, providing a safety net if the primary connection fails. This built-in high availability feature ensures continuous monitoring.
Remote Hubs: It is best practice to connect only the Operator Console robot and CABI robot to the Primary Hub, with others assigned to remote hubs configured in pairs. This setup allows up to 2000 robots per hub, with a backup hub ready to take over if needed.
Operator Console: The web-based front end, hosted on separate robots for scalability and high availability, should be load-balanced among users. In case of a Primary Hub failure, these robots automatically switch to the Secondary Hub, maintaining service continuity.
Alarms: The NAS probe, responsible for event notifications, runs on both the Primary and Secondary Hubs. NAS replication ensures that all events are mirrored in the Secondary NAS, preventing duplication and ensuring readiness for failover scenarios.
Monitoring configuration: Stored in the database, configurations are also locally saved on each robot, allowing monitoring to continue even if network connectivity is temporarily lost.
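
The hub failover idea in the list above can be illustrated with a small simulation. This is a minimal sketch of a generic heartbeat-and-takeover pattern, not the DX UIM HA probe or any Broadcom API: the class name, the `max_missed` threshold, and the automatic fail-back on recovery are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverMonitor:
    """Toy model of a standby hub watching its primary (all names hypothetical)."""

    check_primary: Callable[[], bool]  # returns True when the primary answers
    max_missed: int = 3                # consecutive misses before taking over
    missed: int = 0
    active: str = "primary"

    def poll(self) -> str:
        """Run one health check and return which hub is active afterwards."""
        if self.check_primary():
            # Primary answered: reset the counter and hand control back.
            # (Automatic fail-back is an assumption of this sketch.)
            self.missed = 0
            self.active = "primary"
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                # Too many consecutive misses: the standby takes over.
                self.active = "secondary"
        return self.active


def run_failover_demo() -> List[str]:
    """Simulate two good checks, four failed checks, then a recovery."""
    responses = iter([True, True, False, False, False, False, True])
    monitor = FailoverMonitor(check_primary=lambda: next(responses))
    return [monitor.poll() for _ in range(7)]
```

Requiring several consecutive misses before takeover is a common way to avoid failing over on a single dropped check. The right threshold and polling interval depend on the environment.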

So, we can see that every component of DX UIM can be replicated, duplicated, and, in many cases, failed over, to the degree that we can expect near-100% availability of the monitoring system.
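
To put a rough number on "near 100%", the standard reliability formulas for redundant and chained components can be applied. The sketch below assumes independent failures and instantaneous failover, which real deployments only approximate, and the 99% figure in the usage note is purely illustrative, not a measured DX UIM value.

```python
from typing import Iterable


def pair_availability(single: float) -> float:
    """Availability of a redundant pair: the service is down only if both are down."""
    return 1.0 - (1.0 - single) ** 2


def chain_availability(tiers: Iterable[float]) -> float:
    """Availability of tiers in series: every tier must be up at once."""
    result = 1.0
    for tier in tiers:
        result *= tier
    return result


def compare(single: float, tier_count: int) -> tuple:
    """Return (unpaired, paired) availability for a chain of identical tiers."""
    unpaired = chain_availability([single] * tier_count)
    paired = chain_availability([pair_availability(single)] * tier_count)
    return unpaired, paired
```

Under these assumptions, a component that is up 99% of the time becomes 99.99% available when paired, and `compare(0.99, 3)` shows how a three-tier chain (for example database, primary hub, and remote hub) improves from roughly 97.0% to roughly 99.97%.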

High availability is crucial for businesses and organizations that rely heavily on their IT infrastructure to deliver services, maintain customer satisfaction, and prevent revenue loss. Industries such as finance, healthcare, e-commerce, and telecommunications, where downtime can have severe consequences, place a strong emphasis on implementing high availability solutions, and that emphasis should extend to the monitoring solution itself.

The trade-off

In my opinion, there is always a trade-off between risk and cost. While rebuilding a hub in a new virtual machine can be quick, the cost of additional hubs for redundancy is decreasing, making a highly available solution increasingly viable and recommended.
