The term Site Reliability Engineer (SRE) first appeared at Google in the early 2000s. In Google’s 2016 SRE Book, Benjamin Treynor Sloss wrote that, generally speaking, “an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” This means that the SRE teams at Google decide how a system should run in production as well as how to make it run that way. More specifically, an SRE is responsible for ensuring maximum change velocity to a production system without violating service level objectives (SLOs). To make sure that a service is running properly, it is essential to be able to make empirical decisions based on data; in other words, reliability depends on observability.
Observability is about the ability to infer a system’s internal state from its external outputs. This means that we try to infer what is going on within our systems or services by measuring what the service produces. For example, a status code in the 200s on an HTTP request indicates a successful call. We don’t see what’s happening internally in this case, but rather, we observe the external output, which is the return code.
There are many other types of output that we can view in order to infer what’s going on inside our services. Our software and hardware provide a plethora of information including metrics and logs, both of which can be monitored and analyzed in order to gain insight into the behavior of services.
Now that we have defined these two concepts, let’s explore how SREs use the information that they’ve gained through observability. There are several different ways in which this data can be useful, but three of the most relevant for SREs are risk management, incident management, and trend analysis.
By managing risks, Site Reliability Engineers help businesses in two key areas: the cost of computing resources and opportunity costs. By determining the desired uptime and the acceptable amount of risk, we can minimize the cost of redundant computing resources. In addition, teams can focus on improving applications, paying down tech debt, or reducing operational costs.
Google suggests managing risk through error budgets, in which a service’s availability target is expressed as a number of nines. For example, if we set a goal of three nines of availability (or 99.9%), this will allow for almost nine hours of downtime per year for the service. As long as we’re above that goal, we can take more risk than we normally would in order to improve the service.
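The arithmetic behind an error budget is simple enough to sketch in a few lines of Python. This is only an illustration: the function name `allowed_downtime_hours` is made up for this example, and it assumes a 365-day year and that all downtime counts equally against the budget.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Compare two, three, and four nines of availability.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows {hours:.2f} hours of downtime per year")
```

Three nines (99.9%) works out to roughly 8.76 hours per year, while two nines (99%) would allow about 87.6 hours.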
With that in mind, we need to figure out how to calculate availability. One common and effective way to do this is by measuring the percentage of successful requests in our service. So, we could consider responses in the 200s, 300s, or even 400s (client errors, which are usually not the service’s fault) to be successful, and we could measure those against the 500s. As you may have guessed, we need proper observability (such as request logs and traces) in order to implement this plan.
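As a rough sketch of that calculation, the Python below treats every response outside the 500s as a success. The function name `availability` and the sample status codes are invented for illustration; in practice the codes would come from your request logs or metrics.

```python
from collections.abc import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Return the percentage of requests that did not end in a 5xx error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    failures = sum(1 for code in codes if 500 <= code <= 599)
    return 100 * (len(codes) - failures) / len(codes)


# Hypothetical sample of ten responses, two of which are server errors.
sample = [200, 200, 301, 404, 500, 200, 503, 200, 200, 200]
print(f"availability: {availability(sample):.1f}%")
```

Whether 400s should count as successes is a judgment call for each service; this sketch simply follows the split described above.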
Often, the first thing that an SRE thinks about is incident management. This is integral to building reliability as well as improving processes and, ultimately, understanding services. It begins with the ability to see what is happening within our services and know when we are having outages. In addition, it enables more effective troubleshooting; for example, when a customer encounters timeouts, we can take a look at the metrics and see that our databases are overwhelmed and not responding.
According to the second law of thermodynamics and the concept of entropy, things will go from an ordered state to a more disordered state over time. For example, when you pour yourself a hot cup of coffee, it will begin to cool to room temperature as soon as you pour it. Well, the same thing happens in software. My perfect program, “hello world,” will become more disordered over time as I add functionality. Even if I leave it alone, it will get more disordered because I wrote it in Python 2, and Python 3 requires parentheses around what you print. Or, maybe another guy named Alex came in and changed “hello world” to “hello World” for some reason, and now it’s weird.
If we apply this idea to production software, we see that we need to know what’s happening right now, but we also need to track trends over time so that we can analyze them and better understand our service. For example, we can see if our database throws more errors around 2PM, or we can see how quickly our user base is increasing, which will help us know when we need to add capacity.
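A simple version of the first kind of trend analysis, spotting the hour of day with the most errors, might look like the following Python sketch. The timestamps are made up, and `errors_by_hour` is a hypothetical helper; a real setup would pull these events from a log or metrics store.

```python
from collections import Counter
from datetime import datetime


def errors_by_hour(timestamps):
    """Count error events per hour of day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


# Hypothetical database error timestamps collected over two days.
error_times = [
    datetime(2024, 1, 8, 9, 15),
    datetime(2024, 1, 8, 14, 2),
    datetime(2024, 1, 8, 14, 40),
    datetime(2024, 1, 9, 14, 5),
    datetime(2024, 1, 9, 20, 30),
]

counts = errors_by_hour(error_times)
busiest_hour, busiest_count = counts.most_common(1)[0]
print(f"most errors occur around {busiest_hour}:00 ({busiest_count} events)")
```

With this sample data, the errors cluster in the 2PM hour, which is the kind of pattern that is only visible once data is tracked over time.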
As we discussed above, SREs are the ones who decide how services should run in production. They have many tools and processes at their disposal, but they can’t make good use of any of them without observability. SREs use metrics, logs, and traces to achieve their goals and tasks including risk management, monitoring outages, and analyzing trends. Regardless of their goals, SREs need to thoroughly understand what is going on inside a service to be truly effective as part of their team and their organization. Broadcom offers valuable services such as application performance and network monitoring tools that can help you on your journey.