DX Unified Infrastructure Management (DX UIM) from Broadcom is a comprehensive solution for monitoring an organization's entire IT infrastructure. The product provides IT administrators and operations teams with a centralized view of their infrastructure to ensure the availability and performance of servers, network devices, storage systems, virtualization environments, applications, and cloud services.
Process monitoring involves tracking and analyzing the activity of running programs or processes to identify performance issues, troubleshoot problems, and ensure system stability on any OS platform.
Often a small, focused task, process monitoring can deliver a massive impact, especially in high-stakes environments involving database management and large-scale applications. As one customer recently experienced (described below), not making full use of process monitoring can create "observability gaps" that lead to outsized problems which could have been easily preempted. Setting up proper process monitoring can save hours of troubleshooting and prevent downtime for critical business applications.
IT teams generally know which processes matter most. Informed by recent firefighting incidents, tribal knowledge sharpened by past experience, or direct monitoring requests from internal departments, monitoring teams tend to know which processes they need, or want, to monitor.
Confidence wanes when things change. Could process monitoring gaps occur when important processes are hosted on a newly deployed system? Could processes slowly but surely consume more resources over time without anyone noticing? Could system updates and patches adversely affect running processes?
A large enterprise customer recently contacted Broadcom support. The team had noticed that a single process on a server with over 800 GB of RAM was consuming all available memory. To make matters more urgent, this server was critical to both a core business unit and IT security: it hosted a critical end-user-facing application and an Oracle Virtual Private Database.
To loosely quote the customer: "At first, over the first hour, this process was using 75% of the 800 GB."
The memory exhaustion triggered a kill script on the server that began shutting down all non-root processes. The runaway process itself was a data security protection agent running as root, so it survived while the kill script took down the non-root processes, which in turn took down Oracle.
The network operations center (NOC) became aware of the problem when the entire customer-facing application went offline. To recover, the customer's security team had to temporarily disable the data protection software. They quickly deployed the necessary process monitoring, learned a difficult lesson, and can now prevent a repeat of this type of issue.
This was a stress-inducing experience for all involved. On the positive side, the customer expects that a renewed commitment to process monitoring, across all important processes running in their infrastructure, will give IT teams and business owners confidence that this scenario will not occur again.
In this case, the security team and the business unit that owned the customer-facing application asked their DX UIM systems administrator if they could monitor all processes on the system. While this may appear to be overly cautious, closing this observability gap was extremely important to the business.
The monitoring team went even further. To benefit from automation within DX UIM, the DX UIM systems admin configured auto-discovery and monitoring of all processes on the server. They also manually set a static threshold (Maximum Memory Value > 25 GB) to generate notifications of any impending issues. In DX UIM, baselining is used to understand the typical behavior of IT assets, such as memory usage for specific processes over time. Once a baseline is established, dynamic thresholds can be applied, allowing teams to set proper thresholds and send more accurate alerts when an asset deviates from its expected behavior.
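In DX UIM these thresholds are configured in the probe itself. Purely as a conceptual sketch outside the product, the logic of pairing a static threshold with a baseline-derived dynamic threshold might look like this in Python (the function names, the 3-sigma rule, and the sample values are illustrative assumptions, not DX UIM internals):

```python
from statistics import mean, stdev

STATIC_LIMIT_GB = 25  # static threshold from the case study: alarm above 25 GB

def dynamic_threshold(baseline_samples_gb, k=3.0):
    """Baseline a process's typical memory usage, then derive a dynamic
    threshold: the baseline mean plus k standard deviations."""
    return mean(baseline_samples_gb) + k * stdev(baseline_samples_gb)

def check_process(current_gb, baseline_samples_gb):
    """Return a list of alarm strings for every breached threshold."""
    alarms = []
    if current_gb > STATIC_LIMIT_GB:
        alarms.append(f"static threshold breached: "
                      f"{current_gb:.1f} GB > {STATIC_LIMIT_GB} GB")
    dyn = dynamic_threshold(baseline_samples_gb)
    if current_gb > dyn:
        alarms.append(f"dynamic threshold breached: "
                      f"{current_gb:.1f} GB > baseline-derived {dyn:.1f} GB")
    return alarms

# A process that normally uses ~4 GB suddenly uses 600 GB (75% of 800 GB)
history = [3.8, 4.1, 4.0, 3.9, 4.2]
print(check_process(600.0, history))
```

The point of the sketch is the layering: the static limit catches gross runaways, while the baseline-derived limit catches deviations that are abnormal for this particular process even when well below the static ceiling.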
This customer is now a shining example of the power of process monitoring and is taking advantage of essential monitoring capabilities in DX UIM.
In conjunction with the processes probe, customers can also leverage the CPU, Disk, and Memory Performance Monitoring probe (cdm) for early warnings when top processes consume significant server resources.
This probe monitors performance and resource load on the system with the robot. It generates alarms that are based on configured threshold values and can trigger corrective actions. cdm also generates trending quality of service (QoS) data.
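The trending QoS data is what makes slow resource growth visible before any threshold is breached. As a hedged illustration (not cdm's actual implementation; the function and values are hypothetical), a least-squares slope over a series of memory samples can flag a process whose usage is steadily creeping upward:

```python
def trend_per_sample(samples):
    """Least-squares slope of a QoS time series taken at a fixed interval.
    A sustained positive slope signals steadily rising usage even while
    the absolute values remain under every configured threshold."""
    n = len(samples)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Memory samples (GB): a slow leak growing by roughly 0.5 GB per interval
leaking = [10.0, 10.5, 11.1, 11.4, 12.0, 12.6]
print(f"growth: {trend_per_sample(leaking):.2f} GB per interval")
```

This addresses the earlier question of processes that "slowly but surely consume more resources over time without anyone noticing": trend analysis on QoS history surfaces the drift long before the server runs out of memory.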
Customers use cdm for tasks such as:
The cdm probe and the processes probe are complementary. In addition to using the processes probe to monitor some or all processes that consume a high percentage of available physical memory, teams can use the cdm probe. For example, cdm's "Memory usage" option can include the "Top memory consuming processes in alarm" (up to 10 processes) in the alarm that is generated.
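To illustrate the idea behind that cdm option, here is a minimal, hypothetical sketch (not cdm code; the names, the 90% limit, and the sample processes are assumptions): when overall memory usage crosses a limit, the alarm text names the top consumers so responders immediately know which process to investigate:

```python
def top_memory_alarm(process_mem_gb, total_gb, usage_limit_pct=90, top_n=10):
    """If overall memory usage exceeds the limit, return an alarm string
    that names the top-N memory-consuming processes; otherwise None."""
    used = sum(process_mem_gb.values())
    pct = 100 * used / total_gb
    if pct <= usage_limit_pct:
        return None
    top = sorted(process_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    detail = ", ".join(f"{name}={gb:.0f}GB" for name, gb in top[:top_n])
    return (f"memory usage {pct:.0f}% exceeds {usage_limit_pct}% "
            f"(top consumers: {detail})")

# Hypothetical snapshot on an 800 GB server, echoing the case study
procs = {"dlp_agent": 620.0, "oracle_pmon": 64.0,
         "java_app": 48.0, "sshd": 0.1}
print(top_memory_alarm(procs, total_gb=800))
```

Naming the top consumers in the alarm itself is what turns a generic "memory high" event into an actionable early warning.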
Recent enhancements to the processes probe include:
This customer’s experience highlights the importance of comprehensive monitoring, especially for high-stakes environments hosting critical applications and databases. Here, a single process consumed such a large portion of memory that it led to system instability and an Oracle database crash.
The decision to implement full process monitoring, even beyond the typical application-specific processes, is a prudent one when you simply cannot afford downtime or resource hogging by a rogue process. In this case study, the customer's team recognized the need for a proactive monitoring strategy and is making greater use of the DX UIM processes and cdm probes for better oversight. Setting alarms for the top memory-consuming processes can act as a proactive "early warning system" before critical resources are fully consumed.
For example, teams can take advantage of baselining and dynamic thresholding to set proper thresholds for proactive monitoring and to send early-warning alarms when critical IT assets are not functioning as expected. Stay tuned for more on this topic, including enhancements to the Monitoring Governance report that will add baselining of QoS metrics in a coming release.
For additional details, please refer to the following techdoc and knowledge articles: