DX Operational Observability (DX O2), our next-generation AIOps and observability product, continues to add features and enhancements designed to empower IT operations, DevOps, and SRE teams.
In this post, I introduce five powerful enhancements, outline steps to get started, and describe some of the benefits, which include deeper insights, improved efficiency, and a more unified observability experience. The five enhancements are Alarm Enrichment, launching Triage Inspector from the Service Health chart, new Alarm Actions APIs, Infrastructure Discovery, and default alarm queues for roles.
Alarm Enrichment is a powerful new capability that lets users define rules that enrich alarms after they are created. With additional context attached to alarms, IT teams tasked with remediating issues can prioritize better and respond faster and more effectively.
With configurable rules, teams can use Alarm Enrichment to:
Alarm Enrichment helps you customize your alarm payload to improve both the clarity and relevance of alarms for your teams.
By adding relevant metadata and custom fields to alarms, teams receiving the alarms can quickly understand the nature of the issue, the environment in which it occurred, and the potential impact. This removes guesswork that can occur when interpreting raw alarm data.
Enrichment allows you to fine-tune which alarms should take precedence. You can use custom fields like Geolocation, Environment, ESX Host, and Cluster Name to classify alarms and focus on the most critical ones first.
With enriched alarms, automated workflows and response mechanisms are more precise. The additional context lets you trigger the right notifications and remediation actions, whether that’s notifying the right team or triggering an automated recovery process.
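As a rough illustration of how enriched fields can drive prioritization and routing, the Python sketch below orders alarms by an Environment field and picks a notification target from a Geolocation field. The alarm dictionaries, field values, ranking, and team names are all hypothetical; they are assumptions made for this example and do not represent the DX O2 alarm schema or API.

```python
# Hypothetical enriched alarms. Field names mirror the custom fields
# mentioned above, but the structure is an assumption, not DX O2's schema.
alarms = [
    {"id": "a1", "message": "High CPU", "Environment": "dev", "Geolocation": "EU"},
    {"id": "a2", "message": "Disk full", "Environment": "production", "Geolocation": "US"},
    {"id": "a3", "message": "Latency spike", "Environment": "staging", "Geolocation": "US"},
]

# Assumed ranking: a lower number means the alarm is handled first.
ENVIRONMENT_RANK = {"production": 0, "staging": 1, "dev": 2}

# Assumed routing table keyed on the Geolocation field.
TEAM_BY_GEOLOCATION = {"US": "us-ops", "EU": "eu-ops"}


def prioritize(alarms):
    """Return alarms ordered by environment rank; unknown environments go last."""
    return sorted(
        alarms,
        key=lambda a: ENVIRONMENT_RANK.get(a.get("Environment"), len(ENVIRONMENT_RANK)),
    )


def route(alarm, default_team="noc"):
    """Pick a notification target from the alarm's Geolocation field."""
    return TEAM_BY_GEOLOCATION.get(alarm.get("Geolocation"), default_team)


ordered = prioritize(alarms)
for alarm in ordered:
    print(alarm["id"], alarm["Environment"], "->", route(alarm))
```

An alarm that was never enriched falls back to the last priority position and the default team, which is one way to see why the added context matters.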
Getting started with Alarm Enrichment rules in DX O2 is easy. An enrichment rule defines filtering criteria along with the attributes and values to attach.
Once a rule is in place, new alarms that match its filtering criteria are enriched with the specified attributes and values.
These attributes are also available when you create a policy, for example in Message Templates, so they can be shared via email or Slack messages.
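To make the rule behavior concrete, here is a minimal Python sketch of what an enrichment rule does conceptually: match an alarm against filtering criteria, attach attributes, and reference those attributes in a message. The rule structure, the `enrich` function, the host name, and the placeholder syntax are assumptions for illustration only; they are not the DX O2 rule format or its Message Template syntax.

```python
from string import Template

# Hypothetical rule: if every filter field matches, attach the attributes.
rule = {
    "filter": {"host": "esx-01.example.com"},
    "attributes": {"Environment": "production", "Cluster Name": "cluster-a"},
}


def enrich(alarm, rule):
    """Return a copy of the alarm, with rule attributes added when the filter matches."""
    matches = all(alarm.get(key) == value for key, value in rule["filter"].items())
    if not matches:
        return dict(alarm)
    return {**alarm, **rule["attributes"]}


alarm = {"message": "Datastore latency high", "host": "esx-01.example.com"}
enriched = enrich(alarm, rule)

# Illustrative message template using Python's string.Template placeholders.
template = Template("[$env] $message on $host")
text = template.substitute(
    env=enriched.get("Environment", "unknown"),
    message=enriched["message"],
    host=enriched["host"],
)
print(text)
```

Alarms that do not match the filter pass through unchanged, so a message built from them would show the environment as unknown in this sketch.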
For additional information, refer to the documentation.
A deterioration in the health of a critical service sets alarm bells ringing and puts pressure on IT teams to quickly resolve the issue and restore normalcy. To understand dependencies, suspected causes, and other impacted services, IT teams can use Triage Inspector, which provides these insights at a glance.
Triage Inspector improves how teams manage service health. The ability to launch Triage Inspector directly from the Service Health bar chart makes the troubleshooting process more streamlined than ever before.
For single time periods: If you notice a degradation in service health (depicted as a dip represented by a single bar), simply hover over the bar and select Triage. This will launch Triage Inspector, where you can dive into the details for the timeframe the bar represents.
For multiple time periods: If you need to investigate multiple time periods, select Triage and then drag the crosshair to select the desired range. Triage Inspector will then open and display the context for the entire chosen duration.
Triage Inspector displays a comprehensive view of all relevant alarms. This gives all users a clearer, more focused view of system health so they can address issues more effectively.
To learn more, refer to the documentation.
Alarms APIs provide a comprehensive set of endpoints to manage and interact with alarms programmatically. The latest release of the Alarm Actions APIs enables automation of alarm workflows, supporting efficient incident management and response.
The following new APIs are available to use:
To learn more, refer to the documentation.
Enterprise IT teams often struggle with observability gaps. In dynamic and complex environments, teams may be unaware, or lack visibility, when the inventory of IT resources that should be monitored changes. This creates work for IT and prevents the organization from shifting from reactive to proactive monitoring. Infrastructure Discovery solves this problem by automatically detecting and mapping infrastructure components, such as servers and VMs. This removes the need for manual tracking and ensures monitoring tools have accurate, up-to-date information, which is crucial for managing modern, cloud-native environments that are inherently dynamic.
This new capability streamlines the process of identifying and cataloging assets across your infrastructure. It supports two primary discovery types: host discovery for Windows and Linux machines, and cloud discovery for AWS and Azure environments.
Installation: Download and deploy the Discovery Agent on your target systems. For host discovery, install the agent on your Windows or Linux machines. For cloud discovery, configure it to access your AWS or Azure environments.
Configuration: Set up the necessary credentials and permissions to allow the agent to access and discover resources within your environments.
Integration: Once configured, the Discovery Agent feeds discovered asset information into DX O2, closing observability gaps and making analysis of alarms and of IT system health and performance more accurate.
Monitor: Track active and in-progress discovery jobs, along with historical jobs and their status.
You can learn more about this feature by referring to the respective documentation.
Administrators can now set a default alarm queue for specific roles, reducing noise and giving teams a view that helps them do their jobs faster and more easily. Once this is configured, a user who opens the All-Alarms page sees the alarm queue set as the default for their role.
Tenant Administrators can configure the default queue for any role.
User personalization: If a user prefers a different queue, the user can override the default queue. User settings take precedence over the default configuration defined by the Tenant Administrator.
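The precedence described here can be summarized in a few lines of Python. This is only a sketch of the stated rule (a user's own choice wins over the role default); the function name, parameters, and queue names are hypothetical and not part of DX O2.

```python
def resolve_alarm_queue(user_queue, role_default_queue, fallback=None):
    """Return the queue a user sees on opening the alarms page.

    Precedence (as described above): the user's own setting first,
    then the role default set by the Tenant Administrator, then an
    assumed fallback when neither is configured.
    """
    if user_queue is not None:
        return user_queue
    if role_default_queue is not None:
        return role_default_queue
    return fallback


# A user with no personal preference gets the role default.
print(resolve_alarm_queue(None, "Network Ops Queue"))
# A user who picked a different queue keeps that choice.
print(resolve_alarm_queue("My Critical Alarms", "Network Ops Queue"))
```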
You can learn more about this feature by referring to the documentation.