    April 29, 2021

    Best Practices for Defining SLAs

    Ideally, a development team could guarantee that its application is available and reliable 100% of the time, but that is generally not a feasible objective. Instead, organizations commit to a slightly lower level of service availability and reliability through service-level agreements (SLAs). These SLAs set appropriate customer expectations for a service and detail the repercussions if the organization fails to live up to those expectations.

    There are best practices for developing SLAs that are both feasible and indicative of a high level of reliability. But before we can define an SLA, we need to understand how it will be measured. Read on for an overview of system reliability, SLA best practices, and the process for measuring the success of these agreements.

    Performance, Availability, and SLAs

    Most SLAs between service providers and customers revolve around system performance and availability. They assure customers that they can rely on the product. They also instill a sense of accountability within the development organization, because they define consequences that apply when the service provider fails to live up to the terms of the agreement.

    If SLAs aren’t measurable, then it’s impossible to evaluate them appropriately. Defining realistic service availability and performance expectations in a clear and measurable manner often begins with the development of service-level objectives (SLOs) that are measurable through the use of service-level indicators (SLIs). Let’s take a look at how this works.

    Defining SLOs and SLIs

    An SLO is an objective that is meant to serve as a benchmark performance goal for the development team. By monitoring the status of an SLO, development teams can ensure that their system is operating at a level of reliability that’s acceptable to the business and its customers. SLOs sound very similar to SLAs, except they don’t have any repercussions associated with them; they do not represent an agreement with a paying customer. The objectives are measured through the use of SLIs, and these SLIs provide the basis for indicating whether or not the SLO is being met.

    Let’s imagine that an organization sets performance SLOs that read as follows:

    99.9% of properly formatted requests must complete successfully over the course of a year.

    99.9% of requests must complete in under 300 milliseconds over the course of a year.

    So how can these be monitored? Well, with the proper data, a DevOps team can measure the success of their objectives. These data points are the SLIs. In the case of the SLOs displayed above, SLIs would likely include the error rate for requests being made to the service and the length of time each request takes to complete. (Large sets of request times can then be used to evaluate the percentage of requests that take longer than 300 milliseconds to complete.)
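
    The two SLIs described above can be sketched in a few lines of Python. This is a minimal illustration rather than a monitoring implementation: the `Request` record, its field names, and the helper functions are hypothetical, and a real system would read these values from its monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request. The field names are hypothetical."""
    succeeded: bool
    duration_ms: float


def success_rate(requests):
    """SLI: fraction of requests that completed successfully."""
    if not requests:
        raise ValueError("no requests observed")
    return sum(r.succeeded for r in requests) / len(requests)


def fast_rate(requests, threshold_ms=300.0):
    """SLI: fraction of requests that completed in under threshold_ms."""
    if not requests:
        raise ValueError("no requests observed")
    return sum(r.duration_ms < threshold_ms for r in requests) / len(requests)


def slo_met(sli_value, target=0.999):
    """An SLO is met when the measured SLI reaches the target (99.9% here)."""
    return sli_value >= target


# Made-up sample: 10,000 requests, 5 of which fail and 20 of which are slow.
sample = (
    [Request(True, 120.0)] * 9975
    + [Request(False, 120.0)] * 5
    + [Request(True, 450.0)] * 20
)
print(success_rate(sample), slo_met(success_rate(sample)))
print(fast_rate(sample), slo_met(fast_rate(sample)))
```

    In this made-up sample, 99.95% of requests succeed, so the first SLO is met, while only 99.8% finish in under 300 milliseconds, so the second is not.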

    These indicators simplify the process for measuring and monitoring the success of the goals set by the SLOs. They help DevOps teams ensure a high level of system reliability for their end users.

    SLOs will evolve over time as a system matures. From these refined SLOs, SLAs can be created that spell out what the customer is promised if the targets are not met. For example, an SLA built on one of the SLOs above may read as follows:

    If we don’t meet our promise of successfully completing 99.9% of properly formatted requests, we commit to providing two additional months of service free of charge and promise to address the failure to meet our goal to the best of our ability.

    Best Practices for Defining SLAs

    Now that we understand the basics behind the creation of SLAs, let’s get into the best practices for developing these agreements.

    Don’t Overpromise

    In software development, it’s natural to overpromise. This is often due to the complex nature of the system being developed: it’s hard to account for every challenge that may (or, more likely, will) arise. The same is true when setting targets for SLAs. You may think to yourself that 99.99% availability over the course of a year is achievable, and it may be. But you’ll likely be better off building in a buffer for the unexpected and lowering the commitment to 99.95%. The customer will likely be happy with that number, and you will have built in a little leeway for uncertainty.
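
    To see how much leeway that change buys, the two availability targets can be converted into allowed downtime. The sketch below assumes a 365-day year and treats any unavailable minute as downtime; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted per period at a given availability target."""
    return period_minutes * (1 - availability_percent / 100)


for target in (99.99, 99.95):
    print(f"{target}% allows {allowed_downtime_minutes(target):.1f} minutes per year")
```

    Under these assumptions, 99.99% allows roughly 53 minutes of downtime per year, while 99.95% allows about 263 minutes, or a little under four and a half hours.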

    Ensure Your Agreements Allow for Innovation

    It is also important to ensure that your SLAs allow for innovation. Application changes inherently bring risk along with them. But the solution is not to build a highly reliable service and then never touch it again, just because you are currently meeting your performance and availability goals.

    Change and experimentation within your service are necessary to increase value to the end user and to keep a product viable in its marketplace. Instead of avoiding change, the answer is to keep the risks associated with an evolving code base in mind, and to estimate the impact those changes will have so that it can be accounted for when setting SLAs.
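
    One rough way to estimate that impact is to treat change-related downtime as an expected value and add it to the downtime the service already experiences. The sketch below is only an illustration: the model, the function names, and every number in it are assumptions rather than figures from a real system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # leap years ignored


def expected_change_downtime(releases_per_year, failure_probability, recovery_minutes):
    """Expected yearly downtime (minutes) caused by releases that go wrong."""
    return releases_per_year * failure_probability * recovery_minutes


def achievable_availability(baseline_downtime_minutes, change_downtime_minutes):
    """Availability percentage left after baseline and change-related downtime."""
    total = baseline_downtime_minutes + change_downtime_minutes
    return 100 * (1 - total / MINUTES_PER_YEAR)


# Hypothetical inputs: 50 releases a year, 5% of them cause an outage,
# 30 minutes to recover, plus 60 minutes of unrelated downtime per year.
change_minutes = expected_change_downtime(50, 0.05, 30)
print(change_minutes, achievable_availability(60, change_minutes))
```

    With these hypothetical inputs, releases account for about 75 minutes of expected downtime per year, and the resulting availability lands between 99.95% and 99.99%. A team in this position could reasonably commit to the lower figure without giving up its release cadence.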

    Set Internal SLOs that Outpace Your SLAs

    Another good practice to implement when defining SLAs is to set internal SLOs that hold service availability and reliability to a higher standard than your customer-facing SLAs. In doing so, the development team will be driven by a sense of responsibility to maintain system availability or reliability levels as indicated by the internal SLO. Additionally, this objective can be monitored appropriately, providing the DevOps team with advance notice of a potential problem prior to violating the slightly lower level specified in the SLA.
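
    The early-warning effect of a stricter internal SLO can be sketched as a simple three-way check. The thresholds below (99.95% internal, 99.9% customer-facing) and the status labels are illustrative assumptions.

```python
def reliability_status(measured, internal_slo=99.95, sla=99.9):
    """Classify a measured availability percentage against both thresholds."""
    if internal_slo <= sla:
        raise ValueError("the internal SLO should be stricter than the SLA")
    if measured >= internal_slo:
        return "ok"
    if measured >= sla:
        return "warning"  # internal SLO missed, SLA still intact
    return "breach"  # SLA violated


for value in (99.97, 99.92, 99.85):
    print(value, reliability_status(value))
```

    The "warning" band between the two thresholds is the advance notice described above: the team has missed its own objective but still has room to act before the customer-facing commitment is broken.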

    Summary

    In short, SLAs, SLOs, and SLIs are created in an effort to make system reliability measurable. This serves to instill a sense of accountability for system stability within the development organization, while also making customers feel more comfortable and confident in the product.

    While it is true that higher levels of availability and reliability are better than lower levels, development teams should be realistic in setting availability and performance objectives for their system so as not to overpromise or deter innovation.

    Tag(s): AIOps, DX APM

    Scott Fitzpatrick

    Scott Fitzpatrick is a Fixate IO Contributor and has 8 years of experience in software development.
