August 28, 2024
Monitoring the Monitor: Achieving High Availability in DX Unified Infrastructure Management
Written by: Rowan Collis
DX Unified Infrastructure Management (DX UIM) from Broadcom is a comprehensive solution for monitoring an organization’s entire IT infrastructure from a single platform. DX UIM provides IT administrators and operations teams with a centralized view of their infrastructure to ensure availability and performance of servers, network devices, storage systems, virtualization environments, applications, and cloud services.
While most organizations have monitoring in place, the question of redundancy often remains. Monitoring applications are critical; without them, any downtime in IT infrastructure could be catastrophic for business operations. Imagine losing a hub in a regional network—robots would continue collecting data and raising alarms, but with no destination to send them to, the organization would be effectively blind. This scenario, though potentially disastrous, is easily preventable.
As businesses increasingly rely on complex applications, monitoring their availability and performance becomes crucial. But what about the monitoring system itself?
Should we monitor the monitor?
This question has come up a few times over the years, but not nearly as often as I would expect. Much of that is down to the reliability of monitoring systems like DX UIM and the platforms they run on. Even so, keeping the monitoring system operational at all times is essential. DX UIM can monitor itself, but redundancy is still necessary to safeguard against catastrophic failures.
High availability in DX UIM
High availability involves designing and implementing systems so that they remain operational and accessible over extended periods, with minimal downtime or disruption. The goal is to eliminate single points of failure by ensuring every component has a backup and a seamless transition process. In this light, DX UIM is a tried-and-true solution: it is well established and mature, having undergone extensive development and rigorous testing over the years, and its core components consistently demonstrate robust functionality, reliability, and resilience. However, software alone cannot guarantee uninterrupted service, as potential points of failure still exist within the broader system architecture. Implementing comprehensive high availability strategies is therefore essential to mitigate risks and ensure continuous operation.
An example of a high availability DX UIM architecture:
- Database: The foundation of DX UIM's design should include a highly available database system. Technologies like SQL Server Always On availability groups provide synchronous updates between database copies, ideally located at separate geographical sites to ensure performance and redundancy.
- Primary Hub: The Primary Hub is a critical point of failure in DX UIM. To prevent total outages, a Secondary Hub, identical to the Primary, is established. Equipped with an HA probe, it monitors the Primary Hub's responsiveness and takes over if necessary, ensuring continuity.
- Robots: Each robot (agent) in DX UIM can be configured with a Secondary Hub for connection, providing a safety net if the primary connection fails. This built-in high availability feature ensures continuous monitoring.
- Remote Hubs: It is best practice to connect only the Operator Console robot and CABI robot to the Primary Hub, with others assigned to remote hubs configured in pairs. This setup allows up to 2000 robots per hub, with a backup hub ready to take over if needed.
- Operator Console: The web-based front end is hosted on separate robots for scalability and high availability, and users should be load-balanced across those robots. If the Primary Hub fails, these robots automatically switch to the Secondary Hub, maintaining service continuity.
- Alarms: The NAS probe, responsible for event notifications, runs on both the Primary and Secondary Hubs. NAS replication ensures that all events are mirrored in the Secondary NAS, preventing duplication and ensuring readiness for failover scenarios.
- Monitoring configuration: Stored in the database, configurations are also locally saved on each robot, allowing monitoring to continue even if network connectivity is temporarily lost.
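The takeover behaviour described for the Secondary Hub can be pictured with a small heartbeat sketch. This is only an illustration of the general pattern, not the DX UIM HA probe itself: the class name `FailoverMonitor`, the `check_primary` callback, the `max_missed` threshold of 3, and the automatic hand-back when the primary returns are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Hypothetical standby-side watcher; not the real HA probe."""

    check_primary: Callable[[], bool]  # returns True when the primary answers
    max_missed: int = 3                # assumed threshold, not a product default
    missed: int = 0                    # consecutive failed heartbeats so far
    active: bool = False               # True once the standby has taken over

    def tick(self) -> bool:
        """Run one heartbeat check and report whether the standby is active."""
        if self.check_primary():
            self.missed = 0
            self.active = False        # primary is back: hand control back
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = True     # too many misses: standby takes over
        return self.active


if __name__ == "__main__":
    # Simulated heartbeats: one good reply, three misses, then recovery.
    replies = iter([True, False, False, False, True])
    monitor = FailoverMonitor(check_primary=lambda: next(replies))
    for step in range(5):
        print(step, "standby active:", monitor.tick())
```

Requiring several consecutive misses before taking over is a common way to avoid failing over on a single dropped heartbeat; the right threshold depends on how quickly an outage must be detected.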
So, we can see that every component of DX UIM can be replicated or duplicated and, in many cases, can fail over automatically, to the point where we can expect close to 100% availability of the monitoring system.
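To put a rough number on what a redundant pair buys, here is a back-of-the-envelope calculation. It assumes the two nodes fail independently and that failover is instantaneous, which real deployments only approximate, and the 99% single-node figure is purely illustrative rather than a measured DX UIM value. The function names are made up for this example.

```python
def pair_availability(single: float) -> float:
    """Availability of two redundant nodes, assuming independent failures.

    The pair is down only when both nodes are down at the same time.
    """
    return 1.0 - (1.0 - single) ** 2


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 24 * 365


if __name__ == "__main__":
    single = 0.99  # illustrative assumption, not a product figure
    pair = pair_availability(single)
    print(f"single node: {downtime_hours_per_year(single):.1f} h/year down")
    print(f"redundant pair: {downtime_hours_per_year(pair):.2f} h/year down")
```

Under these assumptions, a node that is down for roughly 87.6 hours a year on its own becomes a pair that is down for under an hour a year.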
High availability is crucial for businesses and organizations that rely heavily on their IT infrastructure to deliver services, maintain customer satisfaction, and prevent revenue loss. Industries such as finance, healthcare, e-commerce, and telecommunications, where downtime can have severe consequences, place a strong emphasis on implementing high availability everywhere, including in the monitoring solution itself.
The trade-off
In my opinion, there is always a trade-off between risk and cost. While rebuilding a hub in a new virtual machine can be quick, the cost of additional hubs for redundancy is decreasing, making a highly available solution increasingly viable and recommended.
Rowan Collis
Rowan Collis has worked within the AIOps monitoring space for 18 years. With 15 years of UIM/Nimsoft experience, Rowan is the most experienced UIM consultant worldwide and always willing to help customers maximize their investment in UIM.