Broadcom Software Academy Blog

Harness Continuous Observability to Continuously Predict Deployment Risk

Written by Shamim Ahmed | Sep 20, 2022 9:34:11 PM

Introduction

In my previous blog, I discussed how continuous observability can be used to deliver continuous reliability. We also discussed the problem of high change failure rates in most enterprises, and how teams fail to proactively address failure risk before changes go into production.

This is because manual assessment of change risk is both labor intensive and time consuming, and often contributes to deployment and release delays. For example, in a large financial services enterprise, change approval lead time averaged about 3.5 days with more than seven approvers involved.

Traditional processes are not only time consuming but can also be unreliable. That is because they often rely on static algorithmic techniques that are not tailored to the unique characteristics of each application or system, and it is laborious to manually develop a unique algorithm for every system.

Observability solutions are key to addressing this problem. Generally speaking, “observability” solutions in the industry tend to focus on operational data. However, I’d contend that teams need to shift observability left. By bringing observability into all stages of the CI/CD lifecycle, teams can glean insights from pre-production data in conjunction with production data. Pre-production environments are richer than production environments in the variety of data they generate; however, most of this data is never correlated or mined for insights.

For example, it is possible to obtain reliability insights from patterns of code changes in source code management systems. Teams can also gain insights from test results and performance monitoring. This data can be correlated with past failure and reliability history from production.  
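As a small illustration of the first point, here is a minimal sketch, assuming a local Git working copy, that surfaces per-file code churn, one pre-production signal that can later be correlated with failure history. The 90-day window and the use of `git log --numstat` are illustrative choices, not a prescribed approach.

```python
# Minimal sketch: derive per-file code-churn signals from Git history.
# Assumes it is run inside a Git working copy; the 90-day window is illustrative.
import subprocess
from collections import Counter

def churn_by_file(since: str = "90 days ago") -> Counter:
    """Count lines added + deleted per file over a recent window."""
    log = subprocess.run(
        ["git", "log", f"--since={since}", "--numstat", "--format="],
        capture_output=True, text=True, check=True,
    ).stdout
    churn = Counter()
    for line in log.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            added, deleted, path = parts
            churn[path] += int(added) + int(deleted)
    return churn

if __name__ == "__main__":
    # Files with the highest recent churn are candidate risk hot spots
    # once correlated with past failure history.
    for path, lines_changed in churn_by_file().most_common(10):
        print(f"{lines_changed:6d}  {path}")
```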

In this blog, I will discuss how we can better predict the risks associated with changes, and improve both the reliability and the overall quality of software systems, by applying observability to pre-production data.

System of Intelligence for Supporting Continuous Observability in CI/CD

Continuous observability in the CI/CD pipeline requires us to set up “systems of intelligence” (SOI, see figure below). SOIs are where we continuously collect and analyze pre-production data along the CI/CD lifecycle. Pre-production data is then correlated with production data to generate a variety of insights when applications change.

Data can be gathered from a variety of tools that form a part of typical DevOps pipelines. This can include the following:

  • Agile development and requirements management (such as Rally)
  • Unit testing (like JUnit)
  • Code scanning and code quality (like Sonar)
  • Source code management (like Git)
  • Infra-as-code (like Chef and Puppet)
  • Build (like Maven and Gradle)
  • Continuous integration (like Jenkins)
  • Testing (like JMeter, Selenium)
  • Defect management (like Micro Focus ALM)
  • Deployment (like Nolio)
  • Release management (like Continuous Delivery Director)
  • Operational monitoring (like New Relic)
  • Customer experience (like Zendesk)
  • ITSM tools (like ServiceNow)

For purposes of reliability and quality analytics, machine-learning-based models are built by correlating pre-production data with problems that have arisen in past production deployments. This can include system failures, escaped high severity defects, service degradation incidents, deployment failures, customer experience issues, and SLA violations.
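Below is a minimal sketch of how such a model might be trained for one application. The feature names, the change_history.csv layout, and the choice of scikit-learn are assumptions for illustration, not the specific implementation described here.

```python
# Minimal sketch: train a per-application risk model by correlating
# pre-production signals with past production outcomes.
# Feature names and the CSV layout are illustrative.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

FEATURES = [
    "lines_churned", "files_touched", "code_coverage",
    "cyclomatic_complexity", "open_high_severity_defects",
    "failed_regression_tests", "static_scan_findings",
]

# One row per past change; "caused_incident" records whether the change was
# later linked to a production failure, escaped defect, or SLA violation.
history = pd.read_csv("change_history.csv")

X_train, X_test, y_train, y_test = train_test_split(
    history[FEATURES], history["caused_incident"],
    test_size=0.2, stratify=history["caused_incident"], random_state=42,
)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```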

In my experience, I find that a generic machine learning model is not well suited for all applications. It is best to build a model that’s aligned with the specific characteristics of the application.

Note that in some cases, it may not be possible to construct a valid, accurate model that meets the required performance criteria, for example because the data being captured for the application is too sparse or noisy. In these cases, the mechanism outlined above cannot be used to make accurate predictions. Teams must first ensure they are capturing data suitable for effective model generation.
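Continuing the sketch above, one hedge against this situation is to accept a model only if its cross-validated score clears a minimum bar; the five-fold split and the 0.75 ROC AUC threshold below are assumptions, not recommendations.

```python
# Minimal sketch: accept the model only if cross-validated ROC AUC clears
# a minimum bar; the 0.75 threshold and 5-fold split are illustrative.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

MIN_AUC = 0.75

def model_is_usable(X, y) -> bool:
    """Return True only if the data supports a sufficiently accurate model."""
    scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=5, scoring="roc_auc")
    usable = scores.mean() >= MIN_AUC
    print(f"Mean ROC AUC: {scores.mean():.2f} -> {'accept' if usable else 'reject'}")
    return usable

# If this returns False, the captured data cannot yet support reliable
# predictions for this application, and more or better data is needed.
```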

Change and Deployment Risk Prediction Using an Observability System of Intelligence

Once the machine learning model is created with past pre-production data, teams can use this model and data to predict the deployment risk for new changes.

Risks may be predicted at various levels of product or software granularity (a sketch of one possible risk roll-up follows this list):

  1. Code level. This is the lowest level of granularity at which change risk may be predicted. This is done for every pull request (or commit) performed by the developer.

    The data used for this prediction is typically tied to code-level activities, including code change/churn, change impact analysis, unit test and code coverage, code complexity, code scans, and code reviews. The risk prediction also flags the risk factors that need attention, such as untested code change blocks. This prediction is done in near real time, so the developer gets fast feedback to address these risk factors.

    If the commit includes information about associated backlog items (such as a user story or defect), the risk can also be flagged as part of the backlog item. In this case, we can also leverage the business criticality of the backlog item (if available) for the risk score. Note that a single backlog item may have multiple pull requests/commits associated with it. If so, the risk prediction takes into account all of the commits associated with the backlog item.
  2. Application component level. This is computed based on all the code changes that roll up to the specific product component, such as a micro-service.

    In addition to the data used in (1), this level leverages additional data, such as results from build verifications, regression testing, and binary scans. This is useful for product owners and scrum masters who want visibility into risks at the application-component level.

    Teams can extend the relationship between code changes and backlog items, where multiple stories aggregate to a feature in the backlog. (Alternatively, sometimes a “parent story” can include multiple stories.) The corresponding risk is calculated for the affected feature.
  3. Overall application/product level. This is computed based on all component changes that roll up to the associated application.

    In addition to the data used in (2), this level leverages data from such areas as integration and penetration testing. This is useful for product managers and scrum masters who want visibility into overall application-level risks.
  4. System level. This is computed based on all application changes that roll up to a given system (or business service) that includes multiple applications.

    In addition to the data used at the lower levels, this approach leverages data from such areas as system testing. This is useful for release managers and site reliability engineers who want visibility into overall system-level risks.
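Here is the roll-up sketch referenced above. Treating the risk of a component, application, or system as the maximum of the risks beneath it is one possible (conservative) aggregation rule, and the hierarchy names are hypothetical; in practice the leaf-level scores would come from the trained model.

```python
# Minimal sketch: roll per-commit risk scores up the hierarchy described above.
# Taking the maximum at each level is one possible aggregation rule, not a prescribed one.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    commit_risks: list = field(default_factory=list)  # leaf-level scores in [0, 1]

    def risk(self) -> float:
        scores = self.commit_risks + [child.risk() for child in self.children]
        return max(scores, default=0.0)

# Hypothetical hierarchy: system -> application -> components.
payments = Node("payments-service", commit_risks=[0.12, 0.78])
ledger = Node("ledger-service", commit_risks=[0.30])
banking_app = Node("online-banking", children=[payments, ledger])
system = Node("retail-banking", children=[banking_app])

print(f"System-level risk: {system.risk():.2f}")  # driven by the riskiest change
```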

Evolving to Continuous Risk Prediction

In the previous section, I discussed risk prediction at different levels of application software granularity. In this section, I’ll outline how we make the risk prediction process “continuous” and tie it to the different phases of the CI/CD lifecycle. In a continuous risk prediction system, the risk score is updated continuously as the software proceeds from the left to the right of the lifecycle—from development to production.

At every stage, we identify the risk factors and address them continuously, progressively reducing risk until the software reaches production. In each phase, risk prediction provides near-real-time guidance on whether the build should be promoted. This decision can be made manually or automatically, by pushing the data to a decision system (a sketch of such a gate follows below). Even more importantly, risk prediction also provides insights on how to address the risks.
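For instance, an automated gate might compare the predicted risk score against a stage-specific threshold, as in the minimal sketch below. The stage names and thresholds are assumptions; in practice both the score and the flagged risk factors would come from the system of intelligence.

```python
# Minimal sketch: an automated promotion gate that compares the predicted
# risk score against a per-stage threshold. Stage names and thresholds are
# illustrative; a real gate would also surface the top risk factors.
STAGE_THRESHOLDS = {
    "build": 0.80,
    "integration-test": 0.60,
    "system-test": 0.40,
    "pre-production": 0.25,
}

def promotion_decision(stage: str, risk_score: float, risk_factors: list[str]) -> bool:
    threshold = STAGE_THRESHOLDS[stage]
    if risk_score <= threshold:
        print(f"[{stage}] risk {risk_score:.2f} <= {threshold:.2f}: promote")
        return True
    print(f"[{stage}] risk {risk_score:.2f} > {threshold:.2f}: hold for review")
    for factor in risk_factors:
        print(f"  - address: {factor}")
    return False

promotion_decision("system-test", 0.47, ["untested code change blocks", "2 open sev-1 defects"])
```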

This mechanism enhances the continuous feedback process with precise, data-driven insights, which are so important in DevOps.

Let’s look at how this works (see figure below).

  1. Development phase. In this phase, the risk analysis is based on the code changes made by developers, as described for the code level (1) in the previous section.
  2. Build phase. In this phase, the risk analysis is based on all the code changes that are included as part of the build. The risk prediction leverages additional data from build verification and regression tests (especially SLO violations from performance regression tests and flaky tests), and numbers of unaddressed, high-severity defects.
  3. Integration test phase. In this phase, risk analysis is refined based on additional data from integration tests, such as test coverage, defects detected, and so on.
  4. System test phase. In this phase, the risk analysis is refined based on additional data from system tests.
  5. Pre-production phase. In this phase, the risk analysis is refined based on additional data from pre-production tests (such as pre-release tests or canary tests). This data is useful for making final go/no-go production release decisions.
  6. Production phase. In this phase, we collect data on actual high-severity defects and incidents. The prediction model for the affected applications may be updated with new data (both from production and pre-production) after every release event, or may be updated periodically in batch mode, for example, monthly.

The periodic update of the prediction model with the latest data is another key nuance of the continuous approach to risk prediction: it ensures that the model automatically reflects the characteristics of the application as it evolves over time.
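A minimal sketch of such a periodic update is shown below, assuming the same CSV layout as the earlier training sketch. The file names, the batch cadence, and the use of joblib to persist the refreshed model are illustrative assumptions.

```python
# Minimal sketch: a batch job (run monthly, or after each release) that folds
# the latest pre-production signals and production outcomes back into the
# training data and refits the model. File names are illustrative.
import pandas as pd
from joblib import dump
from sklearn.ensemble import GradientBoostingClassifier

FEATURES = [
    "lines_churned", "files_touched", "code_coverage",
    "cyclomatic_complexity", "open_high_severity_defects",
    "failed_regression_tests", "static_scan_findings",
]

def retrain(history_path="change_history.csv", new_outcomes_path="latest_release.csv"):
    history = pd.concat(
        [pd.read_csv(history_path), pd.read_csv(new_outcomes_path)],
        ignore_index=True,
    )
    history.to_csv(history_path, index=False)  # the dataset grows with each release

    model = GradientBoostingClassifier()
    model.fit(history[FEATURES], history["caused_incident"])
    dump(model, "risk_model.joblib")  # replaces the previously deployed model
    return model
```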

The mechanism described above is shown for an individual pipeline, such as a microservice or application component. When multiple pipelines are part of a unified release, such as an Agile Release Train (ART) in SAFe, the risk score for the entire ART is computed from the risk scores of the individual pipelines.

Benefits of Continuous Risk Prediction

The key overall benefits of continuous risk prediction are a lower change failure rate and less manual overhead for change approvals in the CI/CD pipeline.

Continuous risk prediction benefits a variety of personas, as described below.

While there are specific benefits for each of these personas, the key unstated benefit is the data-driven, cross-persona collaboration that continuous risk prediction enables across the lifecycle. Many of these personas often work only within the context of their specific tools, processes, and metrics.

Despite the availability of DevOps dashboard tools (like Hygieia), it is sometimes challenging to collaborate across silos and make decisions. That is because these tools do not enable teams to correlate data across the various sources they draw from. This leads me to the final point I want to make: the difference between merely having data and having data that provides actionable insights.

Beyond DORA Metrics—Continuous Delivery Intelligence for Supporting the “Third Way”

DORA metrics provide great indicators of the performance of an application team. However, such metrics are good for descriptive analytics, that is, they describe “what is happening now.”
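As a quick illustration of what “descriptive” means here, the sketch below computes two DORA-style numbers from a handful of hypothetical deployment records; it can tell you what the change failure rate is, but nothing about why.

```python
# Minimal sketch: DORA-style descriptive metrics computed from a list of
# deployment records. The record layout and values are hypothetical; the
# point is that these numbers describe what happened, not why.
from datetime import date

deployments = [
    {"day": date(2022, 9, 1), "failed": False},
    {"day": date(2022, 9, 8), "failed": True},
    {"day": date(2022, 9, 15), "failed": False},
    {"day": date(2022, 9, 22), "failed": False},
]

window_days = (max(d["day"] for d in deployments) - min(d["day"] for d in deployments)).days or 1
deployment_frequency = len(deployments) / (window_days / 7)          # deploys per week
change_failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

print(f"Deployment frequency: {deployment_frequency:.1f}/week")
print(f"Change failure rate:  {change_failure_rate:.0%}")
```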

A continuous observability solution like the one described in this blog goes beyond descriptive analytics to provide diagnostic analytics: valuable insights into “why did it happen.” For example, this can help teams understand “why do we have such a high change failure rate?”

Further, such solutions provide even more valuable predictive and prescriptive analytics, offering insights into “what changes do I need to make, and what impact will they have?”

The latter are key to driving better continuous improvement in DevOps. The “Third Way of DevOps” is about creating a culture of continual experimentation and learning. Continuous observability solutions provide a structured, data-driven approach that supports this cultural shift.

Up Next: Observability-Driven Continuous Quality

This blog has described how teams can use observability-based, continuous risk prediction to reduce change failure rates.

However, change failure is just one dimension of application and system quality. For example, it does not fully address how we optimize customer experiences, which are influenced by other factors, such as ease of use and value. In order to get true value from DevOps and continuous delivery initiatives, we need to establish practices for predictively attaining quality—we call this “continuous quality.” I will discuss this in my next blog.