Best Practices for Defining SLAs

While it would be ideal for a development team to be able to guarantee availability and reliability of their application 100% of the time, it is generally not a feasible objective. Instead, organizations commit to a slightly lower level of service availability and reliability through the creation of service-level agreements (SLAs). These SLAs are meant to set appropriate customer expectations for a service, while detailing what the repercussions will be if the organization fails to live up to these expectations.

There are best practices for developing SLAs that are both feasible and indicative of a high level of reliability. But before we can define an SLA, we need to understand how it will be measured. Read on for an overview of system reliability, SLA best practices, and the process for measuring the success of these agreements.

Performance, Availability, and SLAs

Most SLAs between service providers and customers revolve around system performance and availability. They are put in place to assure the customer that they can rely on the product and to instill a sense of accountability within the development organization by providing consequences that are invoked when the service provider fails to live up to the terms of the agreement.

If SLAs aren’t measurable, then it’s impossible to evaluate them appropriately. Defining realistic service availability and performance expectations in a clear and measurable manner often begins with the development of service-level objectives (SLOs) that are measurable through the use of service-level indicators (SLIs). Let’s take a look at how this works.

Defining SLOs and SLIs

An SLO is an objective that is meant to serve as a benchmark performance goal for the development team. By monitoring the status of an SLO, development teams can ensure that their system is operating at a level of reliability that’s acceptable to the business and its customers. SLOs sound very similar to SLAs, except they don’t have any repercussions associated with them; they do not represent an agreement with a paying customer. The objectives are measured through the use of SLIs, and these SLIs provide the basis for indicating whether or not the SLO is being met.

Let’s imagine that an organization sets performance SLOs that read as follows:

99.9% of properly formatted requests must complete successfully over the course of a year.

99.9% of requests must complete in under 300 milliseconds over the course of a year.

So how can these be monitored? Well, with the proper data, a DevOps team can measure the success of their objectives. These data points are the SLIs. In the case of the SLOs displayed above, SLIs would likely include the error rate for requests being made to the service and the length of time each request takes to complete. (Large sets of request times can then be used to evaluate the percentage of requests that take longer than 300 milliseconds to complete.)

These indicators simplify the process for measuring and monitoring the success of the goals set by the SLOs. They help DevOps teams ensure a high level of system reliability for their end users.
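To make this concrete, here is a minimal Python sketch of how the two SLIs described above could be computed from a batch of request records and compared against the 99.9% targets. The Request record, the function names, and the sample data are all hypothetical, and the sketch assumes that improperly formatted requests have already been filtered out upstream.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    succeeded: bool
    latency_ms: float


def success_rate(requests):
    """SLI 1: fraction of requests that completed successfully."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    return sum(r.succeeded for r in requests) / len(requests)


def fast_rate(requests, threshold_ms=300.0):
    """SLI 2: fraction of requests that completed in under threshold_ms."""
    if not requests:
        return 1.0
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli_value, target=0.999):
    """True when the measured SLI is at or above the SLO target."""
    return sli_value >= target


# Invented sample: 998 fast successes, one slow success, one fast failure.
sample = (
    [Request(True, 120.0)] * 998
    + [Request(True, 450.0), Request(False, 90.0)]
)

print("success SLI:", success_rate(sample), meets_slo(success_rate(sample)))
print("latency SLI:", fast_rate(sample), meets_slo(fast_rate(sample)))
```

In practice these numbers would come from a monitoring system rather than an in-memory list, but the calculation is the same: count the good events, divide by the total, and compare against the objective.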

SLOs will evolve over time as a system matures. From these refined SLOs, SLAs can be created that spell out what the customer is promised if the stated targets are not met. For example, an SLA built off of one of the SLOs above may read as follows:

If we don’t meet our promise of successfully completing 99.9% of properly formatted requests over the course of a year, we commit to providing two additional months of service free of charge and promise to address the failure to meet our goal to the best of our ability.

Best Practices for Defining SLAs

Now that we understand the basics behind the creation of SLAs, let’s get into the best practices for developing these agreements.

Don’t Overpromise

In software development, it’s natural to overpromise. This is often due to the complex nature of the system being developed: it’s hard to account for every challenge that may (or, more likely, will) arise. The same is true when setting standards for SLAs. You may think to yourself that 99.99% availability over the course of a year is achievable, and it may be. But you’ll likely be better off building in a buffer for the unexpected and dropping the commitment to 99.95%. The customer will likely be happy with that number, and you will have built in a little leeway for uncertainty in the process.
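It helps to translate those percentages into actual downtime. The short Python sketch below does the arithmetic for a 365-day year (leap years are ignored, and the function name is just illustrative).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes per year a service may be down and still hit the target."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for target in (99.99, 99.95):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99%, the budget is roughly 52.6 minutes per year. At 99.95%, it is roughly 262.8 minutes, or about 4.4 hours. That difference is the leeway the lower commitment buys.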

Ensure Your Agreements Allow for Innovation

It is also important to ensure that your SLAs allow for innovation. Application changes inherently bring risk along with them. But the solution is not to build a highly reliable service and then never touch it again, just because you are currently meeting your performance and availability goals.

Change and experimentation within your service are necessary to increase value to the end user and to keep a product viable in its marketplace. Instead of avoiding change, the answer is to keep the risks associated with an evolving code base in mind, and to estimate the impact those changes will have so that it can be accounted for when setting SLAs.
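One rough way to account for that risk is to estimate how much downtime planned changes are likely to cost and see what availability figure remains. The Python sketch below is an illustration only: the release count, incident rate, recovery time, and baseline downtime are invented inputs, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def expected_change_downtime(releases_per_year, incident_rate, minutes_per_incident):
    """Expected yearly downtime (minutes) caused by releases that go wrong."""
    return releases_per_year * incident_rate * minutes_per_incident


def availability_after_changes(baseline_downtime_minutes, change_downtime_minutes):
    """Estimated yearly availability (percent) once change risk is included."""
    total = baseline_downtime_minutes + change_downtime_minutes
    return 100.0 * (1 - total / MINUTES_PER_YEAR)


# Invented inputs: 50 releases a year, 5% of them cause an incident,
# each incident costs 30 minutes, plus 60 minutes of unrelated downtime.
change_minutes = expected_change_downtime(50, 0.05, 30)
estimate = availability_after_changes(60, change_minutes)
print(f"estimated availability: {estimate:.4f}%")
```

With these made-up inputs, the estimate comes out to roughly 99.97% availability, which would leave room under a 99.95% SLA but not under a 99.99% one.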

Set Internal SLOs that Outpace Your SLAs

Another good practice to implement when defining SLAs is to set internal SLOs that hold service availability and reliability to a higher standard than your customer-facing SLAs. In doing so, the development team will be driven by a sense of responsibility to maintain the availability or reliability levels specified by the internal SLO. Additionally, this objective can be monitored, giving the DevOps team advance notice of a potential problem before the service drops below the slightly lower level specified in the SLA.
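As a minimal sketch of this two-tier idea, the Python function below compares a measured SLI against both thresholds. The 99.95% internal SLO and 99.9% SLA are hypothetical values chosen for illustration, as are the function name and status labels.

```python
def classify(measured, internal_slo=0.9995, sla=0.999):
    """Report where a measured SLI sits relative to the SLO and the SLA."""
    if internal_slo < sla:
        raise ValueError("internal SLO should be at least as strict as the SLA")
    if measured < sla:
        return "sla_violated"  # customer-facing promise has been broken
    if measured < internal_slo:
        return "slo_missed"    # early warning: the SLA is still intact
    return "ok"


for value in (0.9997, 0.9992, 0.9980):
    print(value, classify(value))
```

The useful state is the middle one: the internal objective has been missed, but the customer-facing commitment has not, so the team still has time to react.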

Summary

In short, SLAs, SLOs, and SLIs are created in an effort to make system reliability measurable. This serves to instill a sense of accountability for system stability within the development organization, while also making customers feel more comfortable and confident in the product.

While it is true that higher levels of availability and reliability are better than lower levels, development teams should be realistic in setting availability and performance objectives for their system so as not to overpromise or deter innovation.