August 18, 2022

High-Scale Monitoring: Lessons from Broadcom’s DX UIM Deployment

While many people know us from our semiconductor and infrastructure software solutions, few have visibility into what goes on behind the scenes to support Broadcom’s global business. Within the Broadcom Software division, the Broadcom Global Technology Organization (GTO) is responsible for managing an extensive IT infrastructure, one that spans 18 data centers, 100 sites, and 400 R&D labs.

This team administers tens of thousands of servers and virtual machines. In addition, the GTO is responsible for the infrastructure that supports the company’s more than 40 SaaS offerings, which means ensuring maximum availability, integrity, and security is critical. A few years ago, the GTO deployed DX Unified Infrastructure Management (DX UIM), which yielded a number of significant advantages.

Instead of having more than 100 tools deployed across various siloed teams, DX UIM enabled cohesive monitoring that provided better insights, faster, more effective troubleshooting, and greater cost efficiency. In the process of running the DX UIM implementation, we’ve learned some important lessons. In the following sections, I’ve highlighted some of our key takeaways.

Harness Multi-Tenancy

Many think of multi-tenancy as something for service providers that support multiple clients, and it can certainly be employed in those scenarios. It can also be very useful for enterprises like ours, which have many different divisions and groups.

With DX UIM, we can gain the efficiencies of a shared infrastructure while still enabling different teams to use the solution effectively. It gives a wide range of users clean, tailored views. Many groups may use the system, but when users log in to their specific accounts, they see only the views associated with their groups.

With DX UIM, we can use different categories for organizing views:

Accounts, which is how different teams are defined.
Groups, which define how devices are monitored, and how information is viewed.
Origins, which are assigned to specific accounts. Each team will have its own set of origins.
User tags, which are used to assist in identifying which team a device belongs to, and how it is split up within a team.
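To make the relationship between these categories concrete, here is a minimal sketch in plain Python. It is not DX UIM's API: the `Device` and `Account` classes, the origin names, and the convention that user tag #1 holds the team and user tag #2 the subdivision are all assumptions made for illustration. It shows only the idea that an account sees the devices whose origin is assigned to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    origin: str
    user_tag_1: str  # assumed convention: owning team
    user_tag_2: str  # assumed convention: subdivision within the team


@dataclass(frozen=True)
class Account:
    name: str
    origins: frozenset  # origins assigned to this account


def visible_devices(account, devices):
    """Return the devices whose origin is assigned to the account."""
    return [d for d in devices if d.origin in account.origins]


# Hypothetical inventory shared by two teams.
devices = [
    Device("web01", "gto-web", "web", "frontend"),
    Device("db01", "gto-db", "dba", "oracle"),
    Device("web02", "gto-web", "web", "api"),
]
web_account = Account("web-team", frozenset({"gto-web"}))

visible_names = [d.name for d in visible_devices(web_account, devices)]
print(visible_names)
```

In this sketch the web team's account never sees `db01`, even though both teams share the same underlying inventory.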

Minimize Use of Remote Monitoring

Through remote monitoring, teams can monitor a device without having an agent installed on the device. In some circumstances, remote monitoring may be advantageous. For example, agents can’t be installed in some legacy devices, and remote monitoring makes it possible to track these systems.

In addition, remote monitoring can save a team the time it would otherwise spend installing probes. However, our team installs agents as part of a device’s build process, so this isn’t a big factor for us either way. On the other hand, remote monitoring presents some limitations:

Scale. Compared to local monitoring, the number of devices a specific instance can support is limited.
Permissions. Establishing and maintaining the required permissions for remote access can be complicated, particularly in Windows environments.
Depth. Remote monitoring approaches offer limited monitoring depth. This can be acceptable for an operating system, for example, but for more complex elements like applications it can be impractical.

Manage Origins Carefully

As outlined above, origins enable each team to specify the source of devices in their area. From the time a device is first discovered and assigned an origin, it is important to manage origins carefully. This is particularly true in multi-tenant environments.

If a device has one origin that is inadvertently associated with multiple groups, it may be difficult for teams to determine who has ownership. As part of this, teams need to establish a process for managing how hub connections are set up. By default, if a primary hub goes down, DX UIM will publish to a secondary. It is important to ensure that, as part of this, data isn’t published to a hub that’s in an incorrect group.
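One way to guard against this is to audit failover pairs before they are needed. The sketch below is a hypothetical check in plain Python, not a DX UIM feature: the `hub_group` mapping and the hub names are assumptions, and in practice the data would have to come from your own hub inventory. It flags any primary hub whose secondary sits in a different group.

```python
def misrouted_failovers(hub_group, failover_pairs):
    """Return (primary, secondary) pairs whose hubs belong to different groups."""
    return [
        (primary, secondary)
        for primary, secondary in failover_pairs
        if hub_group[primary] != hub_group[secondary]
    ]


# Hypothetical hub-to-group assignments and failover configuration.
hub_group = {"hub-a1": "team-a", "hub-a2": "team-a", "hub-b1": "team-b"}
failover_pairs = [("hub-a1", "hub-a2"), ("hub-b1", "hub-a2")]

problems = misrouted_failovers(hub_group, failover_pairs)
print(problems)
```

Here the second pair is reported because `hub-b1` would fail over to a hub owned by a different team.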

Streamline Onboarding and Offboarding

At any given time, we have teams that may be managing around 10,000 servers. In environments of this scale, we have devices coming online and offline all the time.

To manage this kind of scale, it is important to have a simple, repeatable process for onboarding and offboarding. At a high level, our onboarding process is composed of three steps:

Deploy. As part of this, the hub and user tag #1 are set and the item is moved to the “orphans” group.
Look up the item in the CMDB. This lookup populates user tag #2 and moves the item to the appropriate group.
Apply monitoring.
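The three steps above can be sketched as a small pipeline. This is an illustration in plain Python under stated assumptions, not our actual tooling: the `Device` class, the dictionary standing in for the CMDB, and the rule that a device without a CMDB record stays in the “orphans” group and receives no monitoring are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    name: str
    hub: Optional[str] = None
    user_tag_1: Optional[str] = None
    user_tag_2: Optional[str] = None
    group: Optional[str] = None
    monitored: bool = False


def deploy(device, hub, user_tag_1):
    """Step 1: set the hub and user tag #1, then park the device in 'orphans'."""
    device.hub = hub
    device.user_tag_1 = user_tag_1
    device.group = "orphans"


def cmdb_lookup(device, cmdb):
    """Step 2: populate user tag #2 and move the device to its group, if known."""
    record = cmdb.get(device.name)
    if record is None:
        return False  # assumed behavior: unknown devices stay in 'orphans'
    device.user_tag_2 = record["user_tag_2"]
    device.group = record["group"]
    return True


def apply_monitoring(device):
    """Step 3: enable monitoring only for devices that left 'orphans'."""
    device.monitored = device.group != "orphans"


def onboard(device, hub, user_tag_1, cmdb):
    deploy(device, hub, user_tag_1)
    cmdb_lookup(device, cmdb)
    apply_monitoring(device)
    return device


# Hypothetical CMDB content keyed by server name.
cmdb = {"app01": {"user_tag_2": "payments", "group": "linux-app-servers"}}

known = onboard(Device("app01"), "hub-east", "apps", cmdb)
unknown = onboard(Device("mystery01"), "hub-east", "apps", cmdb)
print(known.group, known.monitored)
print(unknown.group, unknown.monitored)
```

Keeping unmatched devices in a holding group makes gaps in the CMDB visible instead of silently applying the wrong monitoring.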

It is also important to establish an effective offboarding process for when a device goes offline, for example when it is retired or reimaged. One area to be mindful of is that it can be difficult to ensure a device is cleanly and completely removed.

For example, there can be a delay between when a device is removed and when its remote profile is removed. As a result, the profile may still be pinging the device after the removal, so the device gets added again. It is important to manage these factors so teams can ensure devices are removed cleanly and won’t mistakenly reappear.
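The reappearing-device problem is essentially an ordering issue, which a toy simulation makes easy to see. The code below is a sketch in plain Python, not DX UIM behavior as implemented: `discovery_cycle` is a hypothetical stand-in for whatever re-adds a device that a remote profile still targets, and the profile and device names are invented.

```python
def discovery_cycle(inventory, remote_profiles):
    """Simulate rediscovery: any host still targeted by a profile is (re)added."""
    return set(inventory) | set(remote_profiles.values())


def offboard(device, inventory, remote_profiles):
    """Remove the profiles that target the device, then the device itself."""
    remaining_profiles = {
        name: target for name, target in remote_profiles.items() if target != device
    }
    return set(inventory) - {device}, remaining_profiles


# Hypothetical starting state: two devices, each pinged by a remote profile.
inventory = {"app01", "app02"}
remote_profiles = {"ping-app01": "app01", "ping-app02": "app02"}

# Naive removal: the device is deleted but its profile is left behind.
naive_inventory = discovery_cycle(inventory - {"app02"}, remote_profiles)

# Clean removal: the profile goes first, so nothing re-adds the device.
clean_inventory, clean_profiles = offboard("app02", inventory, remote_profiles)
clean_inventory = discovery_cycle(clean_inventory, clean_profiles)

print(sorted(naive_inventory))
print(sorted(clean_inventory))
```

In the naive case `app02` is back after one cycle; in the clean case it stays gone.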

Harness Reporting and Validation

While you can get a warm feeling after installing an agent on a server and seeing that it’s started to monitor, there’s a big difference between monitoring and monitoring correctly. (For more info, see our recent blog, Expert Series: Broadcom IT Shares Their View on the Difference Between Monitoring and Monitoring Correctly.)

As part of this, it is essential to employ reporting to validate that monitoring is operating appropriately. Leveraging dashboards and metric views can be an effective way to establish the visibility required. Toward this end, we’ve used APIs and a simple web form in which teams submit details, such as server names, that populate the required information. As part of validation, it is also important to establish self-monitoring to ensure DX UIM is functioning correctly.
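A validation report of this kind boils down to comparing what should be monitored with what is actually being collected. The following is a minimal sketch in plain Python, assuming the collected metrics have already been fetched into a dictionary; the `coverage_report` function, the metric names, and the server names are hypothetical and do not correspond to a DX UIM API.

```python
def coverage_report(expected_servers, collected_metrics, required_metrics):
    """Map each server with gaps to the sorted list of metrics it is missing."""
    report = {}
    for server in expected_servers:
        collected = collected_metrics.get(server, set())
        missing = sorted(set(required_metrics) - collected)
        if missing:
            report[server] = missing
    return report


# Hypothetical inputs: server names submitted by a team, and the metrics
# currently being collected for each server.
expected_servers = ["app01", "app02", "app03"]
required_metrics = {"cpu", "disk", "memory"}
collected_metrics = {
    "app01": {"cpu", "disk", "memory"},
    "app02": {"memory"},
    # app03 has an agent installed but reports nothing yet
}

report = coverage_report(expected_servers, collected_metrics, required_metrics)
print(report)
```

A server that is absent from the report is monitored correctly by this definition; one that is present is monitored partially or not at all.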

Employ Monitoring Configuration Service and Groups

It would be impossible to manage an environment like ours without some form of automation and policy-based management. We’ve relied extensively on Monitoring Configuration Service (MCS), which is effectively supported by how we’ve established our group structure. We use groups for monitoring, reporting, and maintenance.

We have host groups, which are a series of app-specific groups that define the role of a server. Specific monitoring can therefore be applied based on the nature of the apps that are installed. Physical location can also be an important way to group. For example, if a specific site is being taken down for maintenance, all the devices in that location can be put into maintenance mode to ensure a lot of false alarms aren’t generated.
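The location-based maintenance example can be illustrated with a small filter. This is a sketch in plain Python rather than MCS or DX UIM configuration: the `Alarm` class, the `device_site` mapping, and the site names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    device: str
    message: str


def filter_alarms(alarms, device_site, sites_in_maintenance):
    """Drop alarms raised by devices located at a site that is in maintenance."""
    return [
        alarm
        for alarm in alarms
        if device_site.get(alarm.device) not in sites_in_maintenance
    ]


# Hypothetical location groups and incoming alarms.
device_site = {"app01": "san-jose", "app02": "san-jose", "db01": "austin"}
alarms = [
    Alarm("app01", "host unreachable"),
    Alarm("db01", "disk 95% full"),
    Alarm("app02", "host unreachable"),
]

actionable = filter_alarms(alarms, device_site, {"san-jose"})
print([alarm.device for alarm in actionable])
```

Only the `db01` alarm survives, because both application servers sit at the site that is under maintenance.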

Conclusion

By implementing DX UIM, we’ve been able to make significant strides across a number of areas, including both operational and cost efficiency. In this way, our teams have been able to more effectively optimize service levels of our large-scale, business-critical IT infrastructures. To learn more, be sure to review the case study, “DX UIM Enables Broadcom to Scale, Optimize Infrastructure Monitoring.”

Tag(s): AIOps , DX UIM

Rick Hirst

Rick Hirst's background is in monitoring, systems administration and infrastructure architecture. He has been working with UIM for the last 14 years, working with large customers in implementing and managing UIM in varied environments.
