This post is part of a series featuring customers, partners, and experienced DX Unified Infrastructure Management (DX UIM) practitioners. We’ve asked these expert users to share their knowledge with the broader DX UIM community.
Today, we’re featuring Kathy Solomon, the Unix Systems Administrator for the R&D Support organization within Broadcom IT. Kathy is responsible for monitoring all R&D devices. She met with us to discuss the challenges of the job and how she uses DX UIM, and to share lessons learned.
What are the challenges you face in your job and as an organization?
We have different sets of servers. For some sets of servers, we need to be alerted right away when there is a problem, and for other sets, we just need to be able to check status periodically. We only want tickets to be created for the servers that require prompt action. For the less critical servers, we want to be able to look at status and see issues when we have time.
One of the challenges is the sheer number of servers: my team manages over 10,000 servers, and about 10% of them need prompt action.
Another challenge is sharing the monitoring environment with other tracks within IT. Each team has its own SLA and therefore needs its devices monitored in a particular way; at first, those requirements may appear incompatible.
Which DX UIM capabilities are your favorites? Which capabilities provide you with the biggest benefits?
Groups and user tags are our favorites because they allow us to scale. We use groups to consistently apply monitoring profiles and alarm policies to similar servers. The tags drive group membership.
We also love being able to use API calls to the database to onboard devices. We also use API calls to produce dashboards showing which devices are being monitored and which have valid profile deployments. Like groups, this feature allows us to manage at a larger scale.
How do you provide product feedback to the DX UIM product team?
We talk to the DX UIM support team and my assigned SE.
What tips would you like to share with customers?
One thing that is key in planning a DX UIM implementation is organizing servers into groups that need the same monitoring profiles. Then, you should define profiles for each group so that your devices are consistently monitored and you get predictable results. You can then make changes at the group level for efficiency.
We do all that through Monitoring Configuration Service (MCS). If each server had to be managed individually, monitoring at this scale would not be possible. Once you define the profiles and alarm policies upfront, you save yourself a lot of time down the road.
To automate group membership, leverage user tags. If needed, you can set one user tag to specify the track to which the server belongs. We pull information from the CMDB and populate another user tag. Dynamic grouping is driven by these user tags, and each server is automatically placed into groups based upon its function, domain, and physical location. This process reduces human error and drives consistency in how devices are monitored, because monitoring profiles are deployed automatically based upon group membership. There is a big difference between monitoring and monitoring correctly.
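The tag-driven grouping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DX UIM's API: the `Server` class, the field names, the `function|domain|location` layout of the second tag, and the group naming scheme are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    track_tag: str  # hypothetical: user tag naming the owning track
    cmdb_tag: str   # hypothetical: CMDB-derived tag, "function|domain|location"


def groups_for(server: Server) -> set[str]:
    """Derive group names from the two user tags on a server."""
    function, domain, location = server.cmdb_tag.split("|")
    return {
        f"track:{server.track_tag}",
        f"function:{function}",
        f"domain:{domain}",
        f"location:{location}",
    }


def build_membership(servers: list[Server]) -> dict[str, list[str]]:
    """Map each group name to the sorted names of its member servers."""
    membership: dict[str, list[str]] = {}
    for server in servers:
        for group in groups_for(server):
            membership.setdefault(group, []).append(server.name)
    return {group: sorted(names) for group, names in sorted(membership.items())}


# Illustrative inventory; the names and tag values are made up.
servers = [
    Server("build01", "rnd", "build|eng.example|sjc"),
    Server("db01", "rnd", "database|eng.example|sjc"),
]
membership = build_membership(servers)
```

Because membership is computed from the tags alone, correcting a tag is enough to move a server into the right groups; nobody has to edit group lists by hand.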
You can also define additional groups for status and maintenance that don’t have associated monitoring profiles. For example, we can prevent ticketing for a group of 3,000 devices during a planned maintenance window. Then, as the end of the window nears, we can check which servers need a little more prodding before we hand them back to our customers.
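As a rough sketch of the maintenance-window idea, the check below suppresses ticketing for any server that belongs to a group with an active window. The `MAINTENANCE` schedule, the group name, the dates, and the `should_ticket` function are hypothetical and only illustrate the logic; they do not represent how DX UIM implements maintenance schedules.

```python
from datetime import datetime, timezone

# Hypothetical schedule: maintenance group name -> (start, end), both in UTC.
MAINTENANCE = {
    "maint:lab-refresh": (
        datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc),
        datetime(2024, 3, 2, 8, 0, tzinfo=timezone.utc),
    ),
}


def should_ticket(server_groups, alarm_time, schedule=MAINTENANCE):
    """Return False when any of the server's groups is inside its window."""
    for group in server_groups:
        window = schedule.get(group)
        # The window is half-open: ticketing resumes exactly at the end time.
        if window and window[0] <= alarm_time < window[1]:
            return False
    return True


during = datetime(2024, 3, 2, 3, 0, tzinfo=timezone.utc)
after = datetime(2024, 3, 2, 8, 0, tzinfo=timezone.utc)
in_window = should_ticket({"maint:lab-refresh", "function:build"}, during)
past_window = should_ticket({"maint:lab-refresh", "function:build"}, after)
```

Here `in_window` is `False` and `past_window` is `True`. Alarms raised during the window still exist and can be reviewed near the end of the window; only ticket creation is held back.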
While DX UIM supports agentless monitoring, I highly recommend agent-based monitoring. It gives you deeper insight into, and tighter control of, the systems and configurations you are monitoring. You should include the DX UIM agent in your standard build and set up the robot configuration file as part of that build. Then you are two steps ahead of where you would have been, and monitoring profiles are deployed for you. All that is left is validating that everything is working. You can monitor remotely without an agent in special cases, such as legacy OS versions.
Are there other best practices you would like to share with other DX UIM users?
As your monitoring environment matures, be prepared to tweak your monitoring configuration, always at the group level. For example, you may find that you need to adjust thresholds or add file system types to your filter for disk monitoring. You might even find that there’s a key difference between devices within one of your groups that requires different thresholds; when this happens, create a new group so you can consistently apply profiles.
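To make the group-level approach concrete, here is a minimal sketch in which each group owns one disk profile and a divergent set of servers gets its own group rather than per-server overrides. The `DiskProfile` class, the group names, the threshold values, and the file system types are assumptions chosen for illustration; real profiles are defined in MCS.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DiskProfile:
    warn_pct: int
    crit_pct: int
    fs_types: frozenset[str]  # file system types included by the filter


# Hypothetical starting profile; the values are illustrative only.
base = DiskProfile(warn_pct=80, crit_pct=90, fs_types=frozenset({"ext4", "xfs"}))
profiles = {"function:build": base}

# Tweak at the group level: add one more file system type to the filter.
profiles["function:build"] = replace(base, fs_types=base.fs_types | {"zfs"})

# Servers that need different thresholds get a new group with its own profile.
profiles["function:build-scratch"] = replace(
    profiles["function:build"], warn_pct=90, crit_pct=97
)


def breach(profile, fs_type, used_pct):
    """Return 'critical', 'warning', or None for one file system reading."""
    if fs_type not in profile.fs_types:
        return None  # file system type is filtered out of monitoring
    if used_pct >= profile.crit_pct:
        return "critical"
    if used_pct >= profile.warn_pct:
        return "warning"
    return None
```

With these values, a `zfs` file system at 85% usage is a warning in `function:build` but raises nothing in `function:build-scratch`, and every server in a group is evaluated against the same profile.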
Make the most of groups for identifying error conditions and problem areas.