    January 10, 2022

    The Importance of Observability for the SRE

    What Is SRE?

    The term Site Reliability Engineer (SRE) first appeared at Google in the early 2000s. In Google’s 2016 SRE Book, Benjamin Treynor Sloss wrote that, generally speaking, “an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” This means that the SRE teams at Google decide how a system should run in production as well as how to make it run that way. More specifically, an SRE is responsible for ensuring maximum change velocity to a production system without violating service level objectives (SLOs). Making sure that a service is running properly requires empirical decisions based on data – in other words, it depends on observability.

    What Is Observability?

    Observability is about the ability to infer a system’s internal state from its external outputs. This means that we try to infer what is going on within our systems or services by measuring what the service produces. For example, a status code in the 200s on an HTTP response indicates a successful call. We don’t see what’s happening internally in this case, but rather, we observe the external output, which is the return code.

    There are many other types of output that we can view in order to infer what’s going on inside our services. Our software and hardware provide a plethora of information including metrics and logs, both of which can be monitored and analyzed in order to gain insight into the behavior of services.

    How Is This Helpful?

    Now that we have defined these two concepts, let’s explore how SREs use the information that they’ve gained through observability. There are several different ways in which this data can be useful, but among the most relevant for SREs are:

    • Risk management
    • Monitoring outages
    • Analyzing trends

    Risk Management

    By managing risks, Site Reliability Engineers help businesses in two key areas: the cost of computing resources and opportunity costs. By determining the desired uptime and the acceptable amount of risk, we can minimize the cost of redundant computing resources. In addition, teams can focus on improving applications, paying down tech debt, or reducing operational costs.

    Google suggests managing risks through error budgets, wherein service uptime is measured based on a number of nines. For example, if we set a goal of three nines of availability (or 99.9%), this will allow for almost nine hours of downtime per year for the service. As long as we’re above that goal, we can take more risk than we normally would in order to improve the service.
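
    The arithmetic behind a "number of nines" can be sketched in a few lines of Python. This is only an illustration: the function name and the 365-day year are assumptions, not something prescribed by Google's error budget guidance. Three nines (99.9%) works out to about 8.76 hours of downtime per year, while two nines (99%) would allow about 87.6 hours.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap, 365-day year


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget, in hours, implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The budget is simply the unavailable fraction of the period.
    return period_hours * (1 - availability_percent / 100)


# Compare two, three, and four nines over one year.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours per year")
```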

    With that in mind, we need to figure out how to calculate availability. One common and effective way to do this is by measuring the percentage of successful requests in our service. So, we could consider responses in the 200s, 300s, or even 400s to be successful, and we could measure those against the 500s. As you may have guessed, we need proper observability (such as tracing and log observability) in order to implement this plan.
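
    As a rough sketch of that measurement, the Python below counts every response outside the 500s as successful and compares the result against a target. The function names, the sample data, and the 99% target are hypothetical; a real service would read status codes from its logs or metrics pipeline rather than from a list.

```python
def availability(status_codes):
    """Fraction of requests that did not end in a 5xx response.

    Returns None when there are no requests to measure.
    """
    codes = list(status_codes)
    if not codes:
        return None
    failures = sum(1 for code in codes if 500 <= code <= 599)
    return (len(codes) - failures) / len(codes)


def error_budget_remaining(status_codes, slo=0.999):
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    measured = availability(status_codes)
    if measured is None:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failure_rate = 1 - slo
    return 1 - (1 - measured) / allowed_failure_rate


# Hypothetical sample: 995 successful responses and 5 server errors.
sample = [200] * 995 + [503] * 5
print(f"availability: {availability(sample):.3f}")
print(f"budget left at a 99% target: {error_budget_remaining(sample, slo=0.99):.0%}")
```

    Treating 400s as successes is a design choice: they usually reflect a client mistake rather than a service failure, but a team may decide otherwise for its own service.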

    Monitoring Outages

    Often, the first thing that an SRE thinks about is incident management. This is integral to building reliability as well as improving processes and, ultimately, understanding services. It begins with the ability to see what is happening within our services and know when we are having outages. In addition, it enables more effective troubleshooting; for example, when a customer encounters timeouts, we can take a look at the metrics and see that our databases are overwhelmed and not responding.

    Analyzing Trends

    According to the second law of thermodynamics and the concept of entropy, things will go from an ordered state to a more disordered state over time. For example, when you pour yourself a hot cup of coffee, it will begin to cool to room temperature as soon as you pour it. Well, the same thing happens in software. My perfect program, “hello world,” will become more disordered over time as I add functionality. Even if I leave it alone, it will get more disordered because I wrote it in Python 2, but now Python 3 requires the use of parentheses. Or, maybe another guy named Alex came in and changed “hello world” to “hello World” for some reason, and now it’s weird.

    If we apply this idea to production software, we see that we need to know what’s happening right now, but we also need to track trends over time so that we can analyze them and better understand our service. For example, we can see if our database throws more errors around 2 PM, or we can see how quickly our user base is growing, which will help us know when we need to add capacity.
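
    A minimal sketch of this kind of trend analysis is to group error events by hour of day. The event format here (an ISO 8601 timestamp paired with an error flag) and the sample data are assumptions made for illustration; in practice the events would come from whatever logging or metrics system the service already uses.

```python
from collections import Counter
from datetime import datetime


def errors_by_hour(events):
    """Count error events per hour of day (0-23).

    `events` is an iterable of (iso_timestamp, is_error) pairs.
    """
    counts = Counter()
    for timestamp, is_error in events:
        if is_error:
            counts[datetime.fromisoformat(timestamp).hour] += 1
    return counts


# Hypothetical events spanning two days.
sample_events = [
    ("2022-01-10T09:15:00", False),
    ("2022-01-10T14:02:00", True),
    ("2022-01-10T14:40:00", True),
    ("2022-01-11T14:05:00", True),
    ("2022-01-11T16:30:00", True),
]

counts = errors_by_hour(sample_events)
peak_hour, peak_count = counts.most_common(1)[0]
print(f"Most errors occur in hour {peak_hour}: {peak_count} errors")
```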

    Summary

    As we discussed above, SREs are the ones who decide how services should run in production. They have many tools and processes at their disposal, but they can’t make good use of any of them without observability. SREs use metrics, logs, and traces to achieve their goals and tasks including risk management, monitoring outages, and analyzing trends. Regardless of their goals, SREs need to thoroughly understand what is going on inside a service to be truly effective as part of their team and their organization. Broadcom offers valuable services such as application performance and network monitoring tools that can help you on your journey.

    Read more about how SREs are using AIOps to gain observability and insights.


    Alex Romine

    Alex Romine uses his diverse background to deliver software successfully. His degree in Molecular and Cellular Biology from the University of Illinois, and his subsequent transition into software have shown him the importance of focusing on the right things. Having gotten a late start in tech, Alex focuses on...
