Broadcom Software Academy Blog

Observability and IT Monitoring Governance (Part 4 of 4)

Written by Steve Danseglio | Sep 16, 2025 11:34:32 PM
Key Takeaways
  • Learn about a real-world example that reveals why strong monitoring governance is imperative.
  • Discover how baselines, KPIs, and thresholds represent three cornerstones of monitoring governance.
  • See how IT teams can more effectively identify, remediate, and pre-empt exceptions to normal or typical behavior.

Following parts one, two, and three of this blog series, this post offers a short, real-world example that shows why strong monitoring governance is a must-have.

Event monitoring—Real-world example of the need for monitoring governance

A security team requested monitoring of the Windows Security log for a specific event ID. The business justification was clear to all involved; the consequences were not: For this organization, security logs contain millions of events.

Each matching event represents a user account whose password is set to never expire. According to the requesting team, the purpose of the monitoring was to highlight the systems on which the password was set to Never Expire. These events are logged whenever user accounts in Active Directory are modified, so the monitoring must filter down to the events in which Never Expire is actually being set; each event’s “Changed Attributes” section indicates, for example, that “Don't Expire Password” was enabled.
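As an illustration of the kind of filtering this request implies, here is a minimal Python sketch. It assumes the Security log has been exported to a CSV file with EventID, TimeCreated, and Message columns; the file name, column names, and message substring are assumptions, and since the post does not name the event ID, Windows Event ID 4738 (“A user account was changed”) is used for illustration. Filtering this narrowly, as close to the source as possible, is exactly the kind of scoping decision a governance review should enforce.

```python
import csv

# Hypothetical CSV export of the Windows Security log with "EventID",
# "TimeCreated", and "Message" columns (names are illustrative).
# The post does not name the requested event ID; Event ID 4738
# ("A user account was changed") is used here for illustration.
TARGET_EVENT_ID = "4738"
# Exact message text varies by Windows version; this substring is assumed.
NEVER_EXPIRE_SET = "'Don't Expire Password' - Enabled"

def never_expire_changes(path):
    """Yield only the account-change events that actually set Never Expire."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["EventID"] != TARGET_EVENT_ID:
                continue  # discard the vast majority of the log up front
            if NEVER_EXPIRE_SET in row["Message"]:
                yield row

for event in never_expire_changes("security_log_export.csv"):
    print(event["TimeCreated"], event["Message"][:80])
```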

Because of the event frequency, monitoring large logs like this consumes significant memory, which can cause unexpected delays in the receipt of alarms.

Unfortunately, this type of monitoring request is a common occurrence. Consequently, strong monitoring governance is needed to assess resource demands by considering the scope and frequency of the monitored event. Without due diligence, monitoring may generate thousands of alerts that teams learn to ignore. Alerts like these may never be acknowledged or addressed. Over time, they accumulate in the alarm backlog. Then, as the number of events and alarms grows, administrators may be tempted to automatically close alarms, hide them, or even delete them. This raises the risk of blind spots and makes the eventual task of alarm cleanup more challenging.

What would you do in this situation? What was the business goal or purpose? Was there another way to handle this issue?

Baselining, KPIs, and thresholding: What, why, and when?

Even before you’ve established monitoring governance, you can and should begin baselining. Baseline the normal behavior of any key IT asset to know what “healthy” looks like.

Baselining establishes normal levels for the health, performance, and availability of any IT asset or business service. Derived from historical data, a baseline represents the normal behavior of a KPI or key metric over a given time period. It can also account for exceptional periods, such as seasonality.

For example, if a team is tracking web server response time as a KPI, it could establish a baseline by analyzing the response data over the past month or quarter. This enables the team to understand seasonal fluctuations and distinguish normal from high-load application or system performance. This approach can also be applied to groups of IT assets, such as web server farms, virtualized servers, clusters, or any other logical grouping.

A comprehensive understanding of your IT assets requires establishing baseline values through rigorous quantitative methods: collect a statistically meaningful sample of performance and usage metrics for each IT asset, such as four weeks of data, so that its normal, healthy behavior can be determined and recorded. IT assets include hardware, software, applications, transactions, network traffic, queries, end-user activity, scheduled jobs, and virtually anything else that can be monitored and measured over time. Baselining is key to anomaly detection.
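As a minimal sketch of the idea, the following Python computes a baseline from a sample of collected values and flags deviations from it. The numbers are hypothetical and truncated for brevity; a production baseline would use the full four-week sample and bucket values by hour-of-week (or similar) to account for seasonality.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize a metric's history (e.g., four weeks of samples) as a baseline."""
    return {"mean": mean(samples), "stdev": stdev(samples)}

def is_anomaly(value, baseline, k=3.0):
    """Flag values more than k standard deviations from the baseline mean."""
    return abs(value - baseline["mean"]) > k * baseline["stdev"]

# Hypothetical hourly web server response times (ms), truncated for brevity.
history = [120, 131, 118, 125, 122, 140, 119, 127]
baseline = build_baseline(history)
print(is_anomaly(123, baseline))  # False: within normal behavior
print(is_anomaly(480, baseline))  # True: well outside the baseline
```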

This data-driven approach enables monitoring administrators to:

  • Set precise threshold values. Instead of relying on guesswork or assumptions, administrators can define thresholds based on statistically significant deviations from established baselines. This reduces the risk of false positives and ensures that alerts are triggered by actual anomalies.
  • Avoid subjective bias. Using quantitative data, teams eliminate personal judgments and avoid prolonged debates when interpreting asset behavior.
  • Establish expertise on asset behavior. Applying AIOps to monitoring data helps monitoring teams and business managers gain deep insights into the unique characteristics and normal operating patterns of any IT asset. This expertise empowers these groups to anticipate potential issues, optimize performance, and establish a more reliable and predictable IT infrastructure. Leaders and their teams become the ultimate experts!
  • Set thresholds statically or dynamically. DX Unified Infrastructure Management (DX UIM) reduces alarm noise through baseline-driven monitoring, which alerts teams to emerging issues that warrant attention. Alternatively, Time-To and Time-Over Threshold settings provide further options for alarm prediction and control (see the sketch after this list). Note: Smart thresholds can prompt action from IT teams long before an alarm indicating an outage or service degradation is generated.
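The following standalone Python sketch illustrates the Time-Over-Threshold concept with hypothetical numbers; DX UIM provides this capability natively, so this is only a conceptual illustration, not the product’s implementation. The point is that a one-off spike does not alarm, while a sustained breach does.

```python
def time_over_threshold(samples, threshold, required_consecutive):
    """Alarm only when the metric stays over threshold for N consecutive samples,
    suppressing the one-off spikes a plain static threshold would alert on."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= required_consecutive:
            return True
    return False

# Hypothetical one-minute CPU samples (%). A dynamic threshold could instead
# be derived from a baseline, e.g., mean + 3 standard deviations.
cpu = [35, 38, 90, 36, 91, 92, 95, 94]
print(time_over_threshold(cpu, threshold=85, required_consecutive=3))  # True
print(time_over_threshold(cpu, threshold=85, required_consecutive=6))  # False
```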

Monitoring Governance Report: New baseline feature

The DX UIM development team is working to add a baseline column to the Monitoring Governance Report. With the baseline and threshold values side by side, administrators will be able to compare and adjust thresholds to increase precision and reduce alarm noise. This enhancement also lets administrators verify that baseline data is live, continuously updated, and consistently collected.

Working hand-in-hand, baselines, KPIs (covered below), and thresholds are three cornerstones of monitoring governance. This synergistic trio also helps IT shift from reactive, fire-fighting mode to proactive IT management. Also, with baselining, teams get direct line-of-sight to root cause so they are less likely to be overwhelmed by superfluous noise. Instead, teams receive timely, highly precise signals that highlight genuine IT asset incidents.

Connecting baselining, KPIs, and thresholds to observability and monitoring governance

The goal of this work is to help IT identify, remediate, and pre-empt exceptions to normal or typical behavior of IT assets and services in support of business goals.

KPI definition and validation

Monitoring governance should ensure that the right KPIs are selected—not just those that are easy to measure. At a minimum, the list of KPIs should include metrics for:

  • Application availability, response, and performance
  • Critical application availability and response times
  • End user/customer transaction response time
  • Fast/slow-developing increases in CPU, memory, and disk consumption
  • Disk full, disk failures/data loss
  • Identification and closure of all business-critical monitoring gaps
  • Any potential causes of critical system downtime
  • SLA/SLO compliance

Working with application stakeholders, the monitoring governance team validates that all KPIs are meaningful, measurable, and aligned with strategic business objectives.

More broadly, business-specific KPIs may align with a range of critical IT areas: systems, applications, programs, executed scripts, and commands or queries that track the number of transactions per hour, jobs completed per hour or per day, widgets sold, users accessing the main business application, and so on.
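As a hypothetical sketch, a governance team might record validated KPIs in a simple registry like the following; all names, fields, and targets here are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class KPI:
    """A governance record for one KPI: what it measures, who owns it,
    and the target it is validated against."""
    name: str
    owner: str     # accountable team or business unit
    unit: str
    target: float  # agreed business target
    source: str    # where the metric is collected

# Hypothetical entries mixing business and technical KPIs.
registry = [
    KPI("Jobs completed per hour", "Operations", "jobs/hour", 60, "E2E monitor"),
    KPI("Web response time (p95)", "App team", "ms", 500, "DX UIM probe"),
    KPI("Transactions per hour", "Business unit", "tx/hour", 10_000, "App logs"),
]

for kpi in registry:
    print(f"{kpi.name}: target {kpi.target} {kpi.unit} (owner: {kpi.owner})")
```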

KPIs ensure you measure what truly has an impact on business outcomes. Once identified and validated, KPIs of any type can be included in executive dashboards and reports to provide powerful operational observability across the enterprise.

Monitoring governance best practices include periodic reviews to revalidate KPIs, seeking answers to questions such as:

  • Are the original KPIs selected still relevant?
  • Have new critical business applications been deployed recently?
  • Are we achieving our defined targets and monitoring goals with the current set of KPIs?
  • Can we track those KPIs related to negative or positive business impact?

Monitoring governance creates constructive feedback loops that help monitoring teams add value and reduce risk for the business.

Individual technology teams usually require more detailed, technically related KPIs. These KPIs may not easily be found in public online resources. They are usually developed over time, based on operational experience. These often map to or support the higher-level KPIs prioritized by senior management. You can start with three to five KPIs per technology or monitoring category. Your technology domain teams are an excellent source of this information and can set the stage for more collaboration and information sharing with the monitoring governance team. Leverage KPIs, baselines, and thresholds for systems, network, storage, cloud, and application-specific groups:

  • Servers/systems
  • Network devices
  • Processes
  • Server virtualization
  • Databases
  • Networks
  • Storage devices
  • SAN/NAS
  • Cloud/hybrid cloud

Business-IT alignment KPIs work with monitoring governance to bridge the technical and business worlds. Once KPIs and thresholds have been established, DX Infrastructure Observability receives these initial monitoring values and metrics, along with baseline data. Together they provide a frame of reference for evaluating and differentiating symptoms from potential root causes. Coupled with monitoring governance, process of elimination (POE) also serves as a crucial practice for pinpointing the root causes of application or system failures: rather than letting symptoms mislead you, POE guides you to the underlying cause by dismissing symptoms one by one.

A real-world example of POE

Common symptoms of a failing application may include sustained high CPU usage, slow memory leaks, network congestion, too many concurrent database connections, or other factors. On its own, this information is often insufficient for effective monitoring governance because it has not been enriched with business context, and it may not yet have been graphed to visualize any direct, indirect, or even inverse correlation with other business processes or technical events. POE works best when baselines, KPIs, and thresholds are already in place to provide context and correlation.

The graph below shows a real-world example of significant SAN latency in milliseconds (ms) for every read and write across several SAN disk volumes. Another graph, mapped against the SAN latency and taken at the same time, depicted a drop in the number of jobs processed per hour. The jobs were proactively monitored by end-to-end (E2E) transaction monitoring, and the results showed an inverse correlation: as SAN latency rose, the E2E chart dipped.
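Quantifying such a relationship is straightforward once both series are aligned on the same timestamps. The following Python sketch uses Pearson correlation (statistics.correlation, Python 3.10+) with hypothetical numbers shaped like this incident to confirm the inverse relationship.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical hourly series aligned on the same timestamps, shaped like
# the incident described above.
san_latency_ms = [4, 5, 6, 38, 45, 41, 7, 5]      # SAN read/write latency
jobs_per_hour  = [61, 60, 58, 12, 8, 10, 57, 60]  # E2E-monitored throughput

# A coefficient near -1 quantifies the inverse correlation the two charts
# showed: when latency spikes, job throughput collapses.
print(correlation(san_latency_ms, jobs_per_hour))  # close to -1.0
```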

In this case, SAN/NAS response time in ms per read/write exhibited latency. The monitoring team and the monitoring governance team engaged the SAN team to collaborate. Together they reviewed the collected metrics: the increase in SAN/NAS latency per read/write and the drop in transactions completed per hour. Their investigation pointed to one job running at the exact same time each morning.

These charts revealed that the underlying cause of latency for the monitored SAN disk volumes was a daily antivirus scan that proved to be a “top-talker” on the internal network. The scan had an adverse impact on a critical freight transportation job that was also being monitored as part of the overall IT stack model, which showed all supporting infrastructure and application tiers. WebSphere application performance was monitored as well to complete the end-to-end picture of the application environment.

Scheduling the scan at the time of day when business jobs needed to be processed caused throughput to slow dramatically, to a few jobs per hour instead of the normal baseline of 60 jobs completed per hour. The powerful combination of proactive end-user transaction monitoring, baselining, and full-stack observability revealed the root area (the SAN) and the underlying cause (a mis-scheduled AV scan).

Implement monitoring governance: Get started today!

If your monitoring environment has already become difficult to manage and administer (even well before you reach a few thousand alarms!), significant time and effort may be required to implement full monitoring governance. You can still start now by following these steps:

  • Examine current alarms with an eye on unnecessary alarms and noise reduction (a triage sketch follows this list).
  • Start tracking your alarm policies and configuration across different categories, such as critical apps, systems, storage, and so on.
  • Perform background research and collaborate with business units and tech teams on developing, documenting, and tracking KPIs.
  • Enable baselining for the most important metrics.
  • Set a date to configure thresholds against the developed baselines once four weeks of data have been collected.
  • Configure or adjust existing team notifications as needed.
  • Select three to five KPIs per application/system/technology, as mentioned in part two of this blog series.
  • Identify business units, applications, systems, and other areas that badly need proactive monitoring implemented. Use historical incidents or the ticket base when assessing target areas.
  • Start developing and maintaining a cross-tier, infrastructure-to-application view for deep insight.
  • Familiarize yourself with the Monitoring Governance Report and start using it today!
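As a starting point for the alarm triage in the first step above, the following Python sketch ranks alarms by volume from an exported alarm list; the file name and column names are assumptions, not a DX UIM export format. The biggest repeat offenders are the best candidates for tuning or suppression.

```python
import csv
from collections import Counter

def top_noise(path, n=10):
    """Rank alarm source/message pairs by volume; the biggest repeat
    offenders are the best candidates for tuning or suppression."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[(row["source"], row["message"])] += 1
    return counts.most_common(n)

# Hypothetical alarm export with "source" and "message" columns.
for (source, message), count in top_noise("alarm_export.csv"):
    print(f"{count:6d}  {source}: {message}")
```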

For detailed insights on how DX UIM and the Monitoring Governance Report can help you get the best out of your environment, see this Monitoring Governance Report Demo.