April 29, 2021

Best Practices for Defining SLAs

Ideally, a development team could guarantee that its application is available and reliable 100% of the time, but that is rarely a feasible objective. Instead, organizations commit to a slightly lower level of availability and reliability through service-level agreements (SLAs). These SLAs set appropriate customer expectations for a service and detail the repercussions if the organization fails to live up to those expectations.

There are best practices for developing SLAs that are both feasible and indicative of a high level of reliability. But before we can define an SLA, we need to understand how it will be measured. Read on for an overview of system reliability, SLA best practices, and the process for measuring the success of these agreements.

Performance, Availability, and SLAs

Most SLAs between service providers and customers revolve around system performance and availability. They serve two purposes: they assure customers that they can rely on the product, and they instill a sense of accountability within the development organization by specifying consequences that apply when the service provider fails to live up to the terms of the agreement.

If SLAs aren’t measurable, then it’s impossible to evaluate them appropriately. Defining realistic service availability and performance expectations in a clear and measurable manner often begins with the development of service-level objectives (SLOs) that are measurable through the use of service-level indicators (SLIs). Let’s take a look at how this works.

Defining SLOs and SLIs

An SLO is an objective that is meant to serve as a benchmark performance goal for the development team. By monitoring the status of an SLO, development teams can ensure that their system is operating at a level of reliability that’s acceptable to the business and its customers. SLOs sound very similar to SLAs, except they don’t have any repercussions associated with them; they do not represent an agreement with a paying customer. The objectives are measured through the use of SLIs, and these SLIs provide the basis for indicating whether or not the SLO is being met.

Let’s imagine that an organization sets performance SLOs that read as follows:

99.9% of properly formatted requests must complete successfully over the course of a year.

99.9% of requests must complete in under 300 milliseconds over the course of a year.

So how can these be monitored? Well, with the proper data, a DevOps team can measure the success of their objectives. These data points are the SLIs. In the case of the SLOs displayed above, SLIs would likely include the error rate for requests being made to the service and the length of time each request takes to complete. (Large sets of request times can then be used to evaluate the percentage of requests that take longer than 300 milliseconds to complete.)
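As a rough illustration of how the two SLIs above could be computed from raw request data, here is a minimal Python sketch. The `RequestRecord` type, the function names, and the sample numbers are all hypothetical; a real system would typically pull these values from its monitoring tooling rather than from an in-memory list. The sketch also assumes that failed requests still count toward the latency SLI, which is a choice each team would need to make for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One properly formatted request (hypothetical record shape)."""
    succeeded: bool
    duration_ms: float


def success_ratio(records):
    """SLI 1: fraction of requests that completed successfully."""
    if not records:
        raise ValueError("no requests recorded")
    return sum(r.succeeded for r in records) / len(records)


def fast_ratio(records, threshold_ms=300.0):
    """SLI 2: fraction of requests that completed in under threshold_ms."""
    if not records:
        raise ValueError("no requests recorded")
    return sum(r.duration_ms < threshold_ms for r in records) / len(records)


SLO_TARGET = 0.999  # the 99.9% target used by both example SLOs

# Made-up sample: 10,000 requests, of which 5 failed and 20 were slow.
sample = (
    [RequestRecord(True, 120.0)] * 9_975
    + [RequestRecord(True, 450.0)] * 20
    + [RequestRecord(False, 80.0)] * 5
)

print("success SLO met:", success_ratio(sample) >= SLO_TARGET)
print("latency SLO met:", fast_ratio(sample) >= SLO_TARGET)
```

In this made-up sample, 99.95% of requests succeed, so the first objective is met, while only 99.8% finish in under 300 milliseconds, so the second is missed.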

These indicators simplify the process for measuring and monitoring the success of the goals set by the SLOs. They help DevOps teams ensure a high level of system reliability for their end users.

SLOs will evolve over time as a system matures. From these refined SLOs, SLAs can be created that promise the customer specific remedies if the objectives are not met. For example, an SLA built off of one of the SLOs above may read as follows:

If we don’t meet our promise of successfully completing 99.9% of properly formatted requests, we commit to providing two additional months of service free of charge and promise to address the failure to meet our goal to the best of our ability.

Best Practices for Defining SLAs

Now that we understand the basics behind the creation of SLAs, let’s get into the best practices for developing these agreements.

Don’t Overpromise

In software development, it’s natural to overpromise. This is often due to the complex nature of the system being developed: it’s hard to account for every challenge that may (or, more likely, will) arise. The same is true when setting standards for SLAs. You may think to yourself that 99.99% availability over the course of a year is achievable, and it may be. But you’ll likely be better off building in a buffer for the unexpected and dropping the expectation to 99.95%. The customer will likely be happy with that number, and you will have built a little leeway for uncertainty into the process.
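To make the size of that buffer concrete, the short sketch below converts an availability percentage into the downtime it permits over a year. The function name is hypothetical, and the calculation assumes a 365-day year and treats availability purely as uptime in minutes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Downtime permitted by an availability target over the given period."""
    return period_minutes * (1 - availability_pct / 100)


for target in (99.99, 99.95):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.99% leaves roughly 53 minutes of downtime per year, while 99.95% leaves about 4.4 hours. That gap is the leeway gained by promising the lower number.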

Ensure Your Agreements Allow for Innovation

It is also important to ensure that your SLAs allow for innovation. Application changes inherently bring risk along with them. But the solution is not to build a highly reliable service and then never touch it again, just because you are currently meeting your performance and availability goals.

Change and experimentation within your service are necessary to increase value to the end user and to keep a product viable in its marketplace. Instead of avoiding change, the answer is to keep the risks associated with an evolving code base in mind, and to estimate the impact those risks will have so that it can be accounted for when setting SLAs.
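One simple way to estimate that impact is to add the downtime you expect from releases to the downtime you already see, and check what availability is left. The sketch below does this with a deliberately crude model; the function, its parameters, and every input number are hypothetical placeholders that a team would replace with its own incident history.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def projected_availability(baseline_downtime_min, releases_per_year,
                           failure_probability, downtime_per_failure_min):
    """Expected yearly availability once release risk is included."""
    expected_release_downtime = (
        releases_per_year * failure_probability * downtime_per_failure_min
    )
    total_downtime = baseline_downtime_min + expected_release_downtime
    return 1 - total_downtime / MINUTES_PER_YEAR


# Made-up inputs: 60 minutes of unrelated downtime per year, 50 releases,
# a 10% chance that a release causes an incident, 20 minutes per incident.
estimate = projected_availability(60, 50, 0.10, 20)

print(f"projected availability: {estimate:.4%}")
print("leaves room for a 99.95% SLA:", estimate >= 0.9995)
print("leaves room for a 99.99% SLA:", estimate >= 0.9999)
```

With these placeholder inputs, the projected availability sits between the two targets: a 99.95% commitment would still leave room for regular releases, while 99.99% would not.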

Set Internal SLOs that Outpace Your SLAs

Another good practice when defining SLAs is to set internal SLOs that hold service availability and reliability to a higher standard than your customer-facing SLAs. In doing so, the development team takes responsibility for maintaining the availability or reliability level indicated by the internal SLO. Additionally, this objective can be monitored, giving the DevOps team advance notice of a potential problem before the service violates the slightly lower level specified in the SLA.
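The early-warning effect can be sketched as a simple three-way check. The function name, status labels, and thresholds below are hypothetical; the only assumption that matters is that the internal SLO is stricter than the SLA.

```python
def reliability_status(measured, internal_slo=0.9999, sla=0.9995):
    """Classify a measured availability against the internal SLO and the SLA."""
    if internal_slo <= sla:
        raise ValueError("internal SLO should be stricter than the SLA")
    if measured < sla:
        return "sla_violated"   # customer-facing promise is broken
    if measured < internal_slo:
        return "slo_missed"     # early warning: SLA still intact
    return "ok"


for value in (0.99995, 0.9997, 0.9990):
    print(f"{value:.3%} -> {reliability_status(value)}")
```

The middle status is the useful one: it flags a service that has slipped below the internal objective while the customer-facing commitment is still being met, which gives the team time to react.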

Summary

In short, SLAs, SLOs, and SLIs are created in an effort to make system reliability measurable. This serves to instill a sense of accountability for system stability within the development organization, while also making customers feel more comfortable and confident in the product.

While higher levels of availability and reliability are better than lower levels, development teams should be realistic when setting availability and performance objectives for their system so as not to overpromise or deter innovation.

Tag(s): AIOps, DX APM

Scott Fitzpatrick

Scott Fitzpatrick is a Fixate IO Contributor and has 8 years of experience in software development.
