    January 10, 2022

    The Importance of Observability for the SRE

    What Is SRE?

    The term Site Reliability Engineer (SRE) first appeared at Google in the early 2000s. In Google’s 2016 SRE Book, Benjamin Treynor Sloss wrote that, generally speaking, “an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” This means that the SRE teams at Google decide how a system should run in production as well as how to make it run that way. More specifically, an SRE is responsible for ensuring maximum change velocity to a production system without violating service level objectives (SLOs). To make sure that a service is running properly, SREs need to make empirical decisions based on data. In other words, their work depends on observability.

    What Is Observability?

    Observability is about the ability to infer a system’s internal state from its external outputs. This means that we try to infer what is going on within our systems or services by measuring what the service produces. For example, a status code in the 200s on an HTTP request indicates a successful call. We don’t see what’s happening internally in this case; instead, we observe the external output, which is the return code.

    There are many other types of output that we can view in order to infer what’s going on inside our services. Our software and hardware provide a plethora of information including metrics and logs, both of which can be monitored and analyzed in order to gain insight into the behavior of services.

    How Is This Helpful?

    Now that we have defined these two concepts, let’s explore how SREs use the information that they’ve gained through observability. There are several different ways in which this data can be useful, but among the most relevant for SREs are:

    • Risk management
    • Monitoring outages
    • Analyzing trends

    Risk Management

    By managing risks, Site Reliability Engineers help businesses in two key areas: the cost of computing resources and opportunity costs. By determining the desired uptime and the acceptable amount of risk, we can minimize the cost of redundant computing resources. That in turn frees teams to focus on improving applications, paying down tech debt, or reducing operational costs.

    Google suggests managing risks through error budgets, wherein service uptime is measured based on a number of nines. For example, if we set a goal of three nines of availability (99.9%), this allows for almost nine hours of downtime per year for the service. As long as we’re above that goal, we still have budget left and can take more risk than we normally would in order to improve the service.
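
    The arithmetic behind that downtime figure can be sketched in a few lines of Python. This is a minimal illustration, not something taken from Google’s tooling: the function name allowed_downtime_hours is hypothetical, and it assumes a 365-day year.

```python
# Hypothetical helper (not from the article): converts an availability
# target into the downtime it permits over a period.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the downtime budget, in hours, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


# Compare two, three, and four nines side by side.
for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability -> {hours:.2f} hours of downtime per year")
```

    Under these assumptions, three nines (99.9%) works out to about 8.76 hours per year, while two nines (99%) allows roughly ten times as much.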

    With that in mind, we need to figure out how to calculate availability. One common and effective way to do this is by measuring the percentage of successful requests in our service. So, we could consider responses in the 200s, 300s, or even 400s to be successful, and we could measure those against the 500s. As you may have guessed, we need proper observability (such as tracing and log observability) in order to implement this plan.
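
    As a rough sketch of that measurement, the snippet below treats any status code under 500 as a success and compares the result with a target. The function name, the sample counts, and the 99.9% target are all assumptions made up for illustration; in practice the status codes would come from your own logs or metrics.

```python
# Illustrative only: the sample data and the SLO value are invented.
def availability(status_codes):
    """Return the percentage of requests whose status code is below 500."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    failures = sum(1 for code in codes if code >= 500)
    return 100 * (len(codes) - failures) / len(codes)


# 10,000 hypothetical requests, 5 of which failed with a 5xx code.
sample = [200] * 9970 + [301] * 10 + [404] * 15 + [500] * 3 + [503] * 2
slo_pct = 99.9  # assumed target: three nines

measured = availability(sample)
status = "within" if measured >= slo_pct else "below"
print(f"Measured availability: {measured:.2f}% ({status} the {slo_pct}% target)")
```

    Whether 400s should count as successes is a judgment call for each service; this sketch simply follows the split described above.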

    Monitoring Outages

    Often, the first thing that an SRE thinks about is incident management. This is integral to building reliability as well as improving processes and, ultimately, understanding services. It begins with the ability to see what is happening within our services and know when we are having outages. In addition, it enables more effective troubleshooting; for example, when a customer encounters timeouts, we can take a look at the metrics and see that our databases are overwhelmed and not responding.

    Analyzing Trends

    According to the second law of thermodynamics and the concept of entropy, things will go from an ordered state to a more disordered state over time. For example, when you pour yourself a hot cup of coffee, it will begin to cool to room temperature as soon as you pour it. Well, the same thing happens in software. My perfect program, “hello world,” will become more disordered over time as I add functionality. Even if I leave it alone, it will get more disordered because I wrote it in Python 2, and Python 3 turned print into a function that requires parentheses. Or, maybe another guy named Alex came in and changed “hello world” to “hello World” for some reason, and now it’s weird.

    If we apply this idea to production software, we see that we need to know what’s happening right now, but we also need to track trends over time so that we can analyze them and better understand our service. For example, we can see if our database throws more errors around 2 PM, or we can see how quickly our user base is growing, which will help us know when we need to add capacity.
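
    To make the 2 PM example concrete, here is a small sketch that groups errors by hour of day. The event format (an ISO timestamp paired with a status code), the function name errors_by_hour, and the sample events are assumptions for illustration; a real analysis would read from whatever your logging or metrics system exports.

```python
from collections import Counter
from datetime import datetime


def errors_by_hour(events):
    """Count 5xx responses per hour of day from (ISO timestamp, status) pairs."""
    counts = Counter()
    for timestamp, status in events:
        if status >= 500:
            counts[datetime.fromisoformat(timestamp).hour] += 1
    return counts


# Invented sample spanning two days.
events = [
    ("2022-01-10T09:15:00", 200),
    ("2022-01-10T14:02:11", 500),
    ("2022-01-10T14:30:45", 503),
    ("2022-01-11T14:05:00", 500),
    ("2022-01-11T20:45:10", 502),
]

hourly = errors_by_hour(events)
busiest_hour, error_count = hourly.most_common(1)[0]
print(f"Most errors occur in the {busiest_hour}:00 hour ({error_count} errors)")
```

    Grouping across several days like this is what turns individual errors into a pattern that we can act on.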

    Summary

    As we discussed above, SREs are the ones who decide how services should run in production. They have many tools and processes at their disposal, but they can’t make good use of any of them without observability. SREs use metrics, logs, and traces to carry out their work, including risk management, monitoring outages, and analyzing trends. Regardless of their goals, SREs need to thoroughly understand what is going on inside a service to be truly effective as part of their team and their organization. Broadcom offers valuable services such as application performance and network monitoring tools that can help you on your journey.

    Read more about how SREs are using AIOps to gain observability and insights.

    Alex Romine

    Alex Romine uses his diverse background to deliver software successfully. His degree in Molecular and Cellular Biology from the University of Illinois, and his subsequent transition into software have shown him the importance of focusing on the right things. Having gotten a late start in tech, Alex focuses on...