August 18, 2022
High-Scale Monitoring: Lessons from Broadcom’s DX UIM Deployment
Written by: Rick Hirst
While many people know us from our semiconductor and infrastructure software solutions, few have visibility into what goes on behind the scenes to support Broadcom’s global business. Within the Broadcom Software division, the Broadcom Global Technology Organization (GTO) is responsible for managing an extensive IT infrastructure, one that spans 18 data centers, 100 sites, and 400 R&D labs.
This team administers tens of thousands of servers and virtual machines. In addition, the GTO is responsible for the infrastructure that supports the company’s more than 40 SaaS offerings, which means ensuring maximum availability, integrity, and security is critical. A few years ago, the GTO deployed DX Unified Infrastructure Management (DX UIM), which yielded a number of significant advantages.
Where we previously had more than 100 tools deployed across various siloed teams, DX UIM enabled cohesive monitoring that provided better insights, faster and more effective troubleshooting, and greater cost efficiency. In the process of running the DX UIM implementation, we’ve learned some important lessons. In the following sections, I’ve highlighted some of our key takeaways.
Harness Multi-Tenancy
Many think of multi-tenancy as a capability for service provider environments that support multiple clients, and it can certainly be employed in these scenarios. It can also be very useful for enterprises like ours, which have many different divisions and groups.
With DX UIM, we gain the efficiencies of a shared infrastructure while still enabling different teams to use the solution. The solution gives a wide range of users clean, tailored views. While many groups may use the system, users who log in to their specific accounts see only the targeted list of views associated with their groups.
With DX UIM, we can use different categories for organizing views:
- Accounts, which is how different teams are defined.
- Groups, which define how devices are monitored, and how information is viewed.
- Origins, which are assigned to specific accounts. Each team will have its own set of origins.
- User tags, which are used to assist in identifying which team a device belongs to, and how it is split up within a team.
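To make the relationship between these categories concrete, here is a minimal sketch in Python. It is not the DX UIM data model or API. The class names, field names, and sample values are all hypothetical, and the roles given to the two user tags are assumptions made for illustration. The sketch only shows the idea that an account sees the devices whose origin has been assigned to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    origin: str       # origin the device reports under
    user_tag_1: str   # assumed here to identify the owning team
    user_tag_2: str   # assumed here to identify a subdivision within the team


@dataclass(frozen=True)
class Account:
    name: str
    origins: frozenset  # origins assigned to this account


def visible_devices(account, devices):
    """Return only the devices whose origin is assigned to the account."""
    return [d for d in devices if d.origin in account.origins]


# Hypothetical sample data for two teams sharing one deployment.
devices = [
    Device("web-01", "origin-saas", "saas", "frontend"),
    Device("db-01", "origin-saas", "saas", "database"),
    Device("lab-07", "origin-rnd", "rnd", "lab"),
]
saas_team = Account("saas-team", frozenset({"origin-saas"}))

print([d.name for d in visible_devices(saas_team, devices)])
```

In this toy model, the SaaS team’s account never sees the R&D lab device, even though both teams share the same underlying infrastructure.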
Minimize Use of Remote Monitoring
Through remote monitoring, teams can monitor a device without having an agent installed on the device. In some circumstances, remote monitoring may be advantageous. For example, agents can’t be installed on some legacy devices, and remote monitoring makes it possible to track these systems.
Remote monitoring can also save a team the time it would take to install probes. However, our team installs agents as part of a device’s build process, so this isn’t a big factor for us either way. On the other hand, remote monitoring presents some limitations:
- Scale. Compared to local monitoring, the number of devices a specific instance can support is limited.
- Permissions. Establishing and maintaining the required permissions for remote access can be complicated, particularly in Windows environments.
- Depth. Remote monitoring approaches offer limited monitoring depth. While this can be acceptable for an operating system, for example, it can be impractical for more complex elements like applications.
Manage Origins Carefully
As outlined above, origins enable each team to specify the source of devices in their area. From the time a device is first discovered and assigned an origin, it is important to manage origins carefully. This is particularly true in multi-tenant environments.
If a device has one origin that is inadvertently associated with multiple groups, it may be difficult for teams to determine who has ownership. To avoid this, teams need to establish a process for managing how hub connections are set up. By default, if a primary hub goes down, DX UIM will publish to a secondary hub. It is important to ensure that, when this failover happens, data isn’t published to a hub that’s in an incorrect group.
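One way to catch this kind of misrouting is a periodic audit of failover pairs. The sketch below is an illustration only and does not call any DX UIM API. The function name, the two dictionaries, and the hub names are hypothetical, and it assumes you can export which group each hub belongs to and which secondary each primary fails over to.

```python
def misrouted_failovers(hub_groups, failover_map):
    """Return (primary, secondary) pairs whose hubs belong to different groups.

    hub_groups:   dict mapping hub name -> owning group (hypothetical export)
    failover_map: dict mapping primary hub -> secondary hub (hypothetical export)
    """
    return [
        (primary, secondary)
        for primary, secondary in sorted(failover_map.items())
        if hub_groups[primary] != hub_groups[secondary]
    ]


# Hypothetical sample data: the R&D hub fails over to a SaaS hub by mistake.
hub_groups = {"hub-saas-a": "saas", "hub-saas-b": "saas", "hub-rnd-a": "rnd"}
failover_map = {"hub-saas-a": "hub-saas-b", "hub-rnd-a": "hub-saas-b"}

for primary, secondary in misrouted_failovers(hub_groups, failover_map):
    print(f"{primary} fails over to {secondary}, which belongs to another group")
```

Running a check like this whenever hub connections change flags the problem before a real outage sends data to the wrong team.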
Streamline Onboarding and Offboarding
At any given time, we have teams that may be managing around 10,000 servers. In environments of this scale, we have devices coming online and offline all the time.
To manage this kind of scale, it is important to have a simple, repeatable process for onboarding and offboarding. At a high level, our onboarding process is composed of three steps:
- Deploy. In this step, the hub and user tag #1 are set, and the item is moved to the “orphans” group.
- Look up the item in the CMDB. This lookup populates user tag #2 and moves the item to the appropriate group.
- Apply monitoring.
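The three steps above can be sketched as a small pipeline. This is a simplified illustration, not our production automation. The dictionary layout, the function names, and the shape of the CMDB record are hypothetical. The sketch also assumes that a device with no CMDB record stays in the “orphans” group and receives no monitoring until the lookup succeeds.

```python
ORPHANS = "orphans"


def deploy(name, hub, team):
    """Step 1: set the hub and user tag #1, and place the item in orphans."""
    return {
        "name": name,
        "hub": hub,
        "user_tag_1": team,
        "user_tag_2": None,
        "group": ORPHANS,
        "monitored": False,
    }


def apply_cmdb_lookup(device, cmdb):
    """Step 2: populate user tag #2 and the group from the CMDB record."""
    record = cmdb.get(device["name"])
    if record is None:
        return device  # assumption: unknown devices stay in orphans
    return {**device, "user_tag_2": record["user_tag_2"], "group": record["group"]}


def apply_monitoring(device):
    """Step 3: enable monitoring once the device has left orphans."""
    if device["group"] == ORPHANS:
        return device
    return {**device, "monitored": True}


def onboard(name, hub, team, cmdb):
    return apply_monitoring(apply_cmdb_lookup(deploy(name, hub, team), cmdb))


# Hypothetical CMDB contents keyed by server name.
cmdb = {"web-01": {"user_tag_2": "frontend", "group": "web-servers"}}

print(onboard("web-01", "hub-saas-a", "saas", cmdb))
print(onboard("mystery-99", "hub-saas-a", "saas", cmdb))
```

Keeping each step as its own function makes the process easy to repeat and makes it obvious where a device stalled if onboarding does not complete.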
It is also important to establish an effective offboarding process for when a device goes offline, for example if it is retired or reimaged. One area to be mindful of is that it can be difficult to ensure a device is cleanly and completely removed.
For example, there can be delays between when a device is removed and when a remote profile is removed. As a result, after a device is removed, the profile pinging the device may still be in place, so the device gets added again. It is important to manage these factors so teams can ensure devices are removed cleanly and won’t mistakenly reappear.
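The reappearing-device problem can be illustrated with a toy simulation. Nothing here reflects DX UIM internals. The inventory set, the profile dictionary, and the function names are hypothetical, and the model simply assumes that any remaining remote profile re-adds its target on the next pass.

```python
def discovery_pass(inventory, remote_profiles):
    """Simulate remaining remote profiles re-adding their targets."""
    return inventory | set(remote_profiles.values())


def offboard(device, inventory, remote_profiles):
    """Remove every remote profile targeting the device, then the device."""
    remaining_profiles = {
        profile: target
        for profile, target in remote_profiles.items()
        if target != device
    }
    return inventory - {device}, remaining_profiles


# Hypothetical starting state: one retired server still has a ping profile.
inventory = {"old-01", "web-01"}
remote_profiles = {"ping-old-01": "old-01"}

# Removing only the device: the leftover profile brings it back.
naive = discovery_pass(inventory - {"old-01"}, remote_profiles)
print(sorted(naive))

# Removing the profile together with the device: it stays gone.
clean_inventory, clean_profiles = offboard("old-01", inventory, remote_profiles)
print(sorted(discovery_pass(clean_inventory, clean_profiles)))
```

The point of the sketch is the coupling: offboarding is only complete once everything that can rediscover the device has been removed along with it.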
Harness Reporting and Validation
While you can get a warm feeling after installing an agent on a server and seeing that it’s started to monitor, there’s a big difference between monitoring and monitoring correctly. (For more info, see our recent blog, Expert Series: Broadcom IT Shares Their View on the Difference Between Monitoring and Monitoring Correctly.)
As part of this, it is essential to employ reporting to validate that monitoring is operating appropriately. Leveraging dashboards and metric views can be an effective way to establish the visibility required. Toward this end, we’ve used APIs and a simple web form in which teams submit details, including server names, and which then populates the required information. As part of validation, it is also important to establish self-monitoring to ensure DX UIM is functioning correctly.
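A validation report of this kind can be as simple as comparing what is expected against what is actually being collected. The sketch below is a hedged illustration rather than our actual tooling. The required metric names, the function name, and the input shapes are hypothetical, and it assumes the collected metrics per server have already been fetched through whatever API is available.

```python
# Hypothetical baseline that every server is expected to report.
REQUIRED_METRICS = {"cpu", "memory", "disk"}


def validate(server_names, collected):
    """Report, per submitted server name, whether monitoring looks complete.

    server_names: names submitted by a team (for example, through a form)
    collected:    dict mapping server name -> set of metrics being collected
    """
    report = {}
    for name in server_names:
        metrics = collected.get(name)
        if metrics is None:
            report[name] = "not monitored"
            continue
        missing = REQUIRED_METRICS - metrics
        if missing:
            report[name] = "missing: " + ", ".join(sorted(missing))
        else:
            report[name] = "ok"
    return report


# Hypothetical sample data.
collected = {
    "web-01": {"cpu", "memory", "disk"},
    "db-01": {"cpu"},
}

for name, status in validate(["web-01", "db-01", "app-03"], collected).items():
    print(f"{name}: {status}")
```

A server that has an agent but reports only some of the expected metrics is exactly the “monitoring, but not monitoring correctly” case that a report like this is meant to surface.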
Employ Monitoring Configuration Service and Groups
It would be impossible to manage an environment like ours without some form of automation and policy-based management. We’ve relied extensively on Monitoring Configuration Service (MCS), which is effectively supported by how we’ve established our group structure. We use groups for monitoring, reporting, and maintenance.
We have host groups, which are a series of app-specific groups that define the role of a server. Specific monitoring can therefore be applied based on the nature of the apps that are installed. Physical location can also be an important way to group devices. For example, if a specific site is being taken down for maintenance, all the devices in that location can be put into maintenance mode so that a flood of false alarms isn’t generated.
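The effect of location-based maintenance can be shown with a short sketch. This is an illustration of the idea only, not how DX UIM implements maintenance mode. The function name, the alarm dictionaries, and the site names are hypothetical.

```python
def alarms_to_raise(alarms, device_site, sites_in_maintenance):
    """Drop alarms from devices whose site group is in maintenance mode."""
    return [
        alarm
        for alarm in alarms
        if device_site[alarm["device"]] not in sites_in_maintenance
    ]


# Hypothetical mapping of devices to location-based groups.
device_site = {"web-01": "site-east", "db-01": "site-east", "web-02": "site-west"}

# Hypothetical alarms generated while site-east is powered down.
alarms = [
    {"device": "web-01", "message": "host unreachable"},
    {"device": "db-01", "message": "host unreachable"},
    {"device": "web-02", "message": "disk usage high"},
]

for alarm in alarms_to_raise(alarms, device_site, {"site-east"}):
    print(alarm["device"], alarm["message"])
```

Because membership is defined once at the group level, a single maintenance action covers every device at the site, and genuine alarms from other locations still come through.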
Conclusion
By implementing DX UIM, we’ve been able to make significant strides across a number of areas, including both operational and cost efficiency. In this way, our teams have been able to more effectively optimize service levels of our large-scale, business-critical IT infrastructures. To learn more, be sure to review the case study, “DX UIM Enables Broadcom to Scale, Optimize Infrastructure Monitoring.”
Rick Hirst
Rick Hirst's background is in monitoring, systems administration, and infrastructure architecture. He has worked with UIM for the last 14 years, partnering with large customers to implement and manage UIM in varied environments.