Many organizations rely on service level objectives (SLOs) to help them gauge the reliability of their products. By setting SLOs that define clear and measurable reliability targets, businesses can ensure they are delivering positive end-user experiences to their customers. Clearly defined SLOs also make it much easier for businesses to understand what tradeoffs they may have to make in order to deliver those specific experiences. For instance, meeting certain SLOs might require more resources, which could drive up costs or add to the overall complexity of your service. Additionally, SLOs can help your business establish a “separation of concerns,” enabling you to create clear boundaries that define what reliability expectations teams can have of one another.
You know SLOs are important, but it can be difficult to know where to get started or how to put them to full use. Similarly, you might already be using SLOs but are unsure whether you are getting the most out of them.
In this post, we’ll walk through key questions to guide you in crafting SLOs that help set useful reliability standards for your organization.
Before you begin crafting SLOs, you first need to consider what they can and cannot do for you. Though SLOs are an essential tool for ensuring reliability, SLOs alone cannot automatically make your system more reliable. Setting SLOs that are beyond your capabilities and expecting engineering teams to adhere to those standards will backfire in the long run. For example, engineers may burn themselves out by attempting to meet unrealistic standards or end up ignoring SLOs altogether because they are impossible to meet. Instead, create realistic SLOs based on your current capabilities, and then focus on work that will help make stricter SLOs achievable.
SLOs can help you determine the severity of an issue. If, for instance, availability drops below your set SLO (e.g., 99.99 percent) for an extended period of time, then an on-call engineer should be alerted so that they can remediate the issue. Additionally, SLOs can help you delineate clear boundaries between each component that makes up your service. This lets teams figure out which parts of your infrastructure they need to maintain themselves and which are under the purview of another team. If an issue is severe, you immediately know who should be notified to resolve it. Finally, SLOs can guide you as you set business priorities. Based on how well you are meeting SLOs, you can assess whether your time would be better spent on building new features or on improving the reliability of your service.
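To make the alerting idea concrete, here is a minimal Python sketch of paging only when availability stays below the target for several consecutive evaluation windows. The `Window` shape, the function names, and the idea of counting fixed windows are all assumptions made for illustration; they are not tied to any particular monitoring product.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one evaluation window (hypothetical shape)."""
    good: int   # requests that succeeded
    total: int  # all requests seen in the window


def availability(window: Window) -> float:
    # A window with no traffic is treated as fully available (an assumption).
    if window.total == 0:
        return 1.0
    return window.good / window.total


def should_page(windows: list[Window], target: float, sustained: int) -> bool:
    """Return True when the most recent `sustained` windows are all below target."""
    if sustained <= 0 or len(windows) < sustained:
        return False
    recent = windows[-sustained:]
    return all(availability(w) < target for w in recent)


if __name__ == "__main__":
    history = [
        Window(good=10_000, total=10_000),
        Window(good=9_990, total=10_000),
        Window(good=9_985, total=10_000),
        Window(good=9_980, total=10_000),
    ]
    # Page only if the last three windows all missed a 99.99 percent target.
    print(should_page(history, target=0.9999, sustained=3))
```

Requiring a sustained breach rather than a single bad window is one way to avoid paging on brief blips; the right window length and count depend on your service.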
As you put together SLOs, it’s important to keep in mind that different stakeholders in your organization will have different priorities and perspectives—and, therefore, different goals. If you are a Customer Success Manager (CSM), for example, you want to ensure that any SLOs set will represent a standard level of service that will keep customers satisfied. By staying focused on the business impact of SLOs, CSMs can help keep teams informed about which user journeys are most critical to customers and prioritize the SLOs that matter most.
From a more technical perspective, engineers will help ensure that SLOs are measurable and realistic. Engineers will also keep you informed if any SLOs conflict with one another and if meeting a proposed SLO would introduce unacceptably high costs. Additionally, collaborating with engineers can help you better understand what tradeoffs you may be making to meet proposed SLOs and determine how those tradeoffs can benefit your organization. For example, after meeting with engineers you may decide to set lower SLOs (i.e., allow for higher error rates up to a certain point) because doing so would allow those engineers to develop and release features at a faster clip.
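One way to put numbers on this tradeoff is to translate an availability target into the amount of downtime it permits. The Python sketch below assumes a 30-day window and a simple time-based availability target; the function name and the default window are illustrative choices, not a standard.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.2f} minutes of downtime")
```

Under these assumptions, relaxing a target from 99.99 percent to 99.9 percent raises the allowance from roughly 4.3 minutes to roughly 43 minutes per 30 days, which is the kind of headroom that can make faster releases practical.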
It’s important to note that different stakeholder perspectives are not mutually exclusive, and that there’s often overlap between them. Effective collaboration and alignment between business and technical teams are critical to realizing the full potential of SLOs. For instance, a CSM can talk with customers to establish their expectations and determine what they care about most, while engineers can help strategize the most realistic path toward meeting those expectations.
Reliability can be boiled down to one fundamental question: is your service working? Whether you are building a web application, an ecommerce site, an ad server, or some other tool, if you can answer this question, then you can determine whether your system is reliable or is in need of troubleshooting.
However, “Is it working?” is a surprisingly complicated question to answer. For starters, when SREs ask whether a system is working, they are really asking whether a system is working well enough—because achieving 100 percent reliability is never going to be a realistic goal at scale. They may work with a CSM to identify which features or components of a service are most critical to your customers and therefore important for classifying your service as working.
Secondly, your definition of “working” will change depending on your context and use case. Beyond whether your service is up or down, you might also be asking whether a database is still keeping its ACID guarantees, whether latency has climbed to an unacceptable level, or whether throughput is unexpectedly low.
SLOs are useful for helping simplify the “Is it working?” question by breaking it down into a collection of measurable signifiers. When creating SLOs, it is critical that you not just ask whether your service is working, but that you clearly define what “working” means in your given context.
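As a rough illustration of breaking “Is it working?” into measurable signifiers, the Python sketch below treats a service as working only when every indicator is within its limit. The `Indicator` class, the indicator names, and the sample values are all hypothetical; your own definition of “working” will use different signals and thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """One measurable signal and the limit it must satisfy (hypothetical shape)."""
    name: str
    value: float
    limit: float
    higher_is_better: bool

    def ok(self) -> bool:
        # Availability-style signals must stay at or above the limit;
        # latency-style signals must stay at or below it.
        if self.higher_is_better:
            return self.value >= self.limit
        return self.value <= self.limit


def failing(indicators: list[Indicator]) -> list[str]:
    """Names of the indicators that are outside their limits."""
    return [i.name for i in indicators if not i.ok()]


def is_working(indicators: list[Indicator]) -> bool:
    return not failing(indicators)


if __name__ == "__main__":
    snapshot = [
        Indicator("availability", 0.9995, 0.999, higher_is_better=True),
        Indicator("p99_latency_ms", 420.0, 300.0, higher_is_better=False),
        Indicator("throughput_rps", 1500.0, 1000.0, higher_is_better=True),
    ]
    print(is_working(snapshot), failing(snapshot))
```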
To further determine whether your service is “working,” ask yourself who your users are and what level of performance they’ll accept from your service before turning to a competitor. The answers to those questions will inform what “working” means to your team and help you effectively set SLOs. For example, if you are hosting a video streaming service, your end users are viewers who expect to be able to watch videos at a consistently high quality with minimal delays. If your streaming service doesn’t meet those expectations, then customers are more likely to turn to alternative services. To prevent that scenario, you should set SLOs around metrics such as buffer times and bitrate drops.
Let’s say you’re hosting an ecommerce site. In this case, your users are shoppers who expect to be able to log into their accounts, view item descriptions, and check out items securely. To meet these expectations, you should set SLOs around page load and payment processing times, as well as availability. Alternatively, if you’re a team that manages a database, you should set SLOs that reflect low latencies, error rates, and connection delays for your developer-oriented users. Setting these SLOs as a database team is also a good opportunity to establish clear expectations around who should be responsible for what aspects of database usage. For example, database administrators may focus more on access control and overall maintenance, while database developers may optimize queries to boost performance.
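For time-based expectations such as page load or payment processing times, one common way to express the objective is as the share of requests that finish within a threshold. The Python sketch below assumes you already have a list of observed latencies in milliseconds; the function names, the 300 ms threshold, and the sample values are illustrative assumptions rather than recommendations.

```python
from __future__ import annotations


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed at or under the threshold."""
    if not latencies_ms:
        # With no traffic there is nothing to violate the objective (an assumption).
        return 1.0
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(latencies_ms: list[float], threshold_ms: float, target: float) -> bool:
    """True when the share of fast requests reaches the target, e.g. 0.95."""
    return latency_sli(latencies_ms, threshold_ms) >= target


if __name__ == "__main__":
    page_loads_ms = [120, 180, 250, 900, 200]
    print(latency_sli(page_loads_ms, threshold_ms=300))
    print(meets_slo(page_loads_ms, threshold_ms=300, target=0.95))
```

Measuring the share of requests under a threshold, rather than an average, keeps a few very slow checkouts from being hidden by many fast ones.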
In this post, we looked at a few key questions you should ask to set effective service level objectives and bolster the reliability of your service. We covered what you should and shouldn’t expect to accomplish with SLOs, and we emphasized that SLOs are not a magic wand but are instead a tool you can use to create realistic reliability standards. To learn even more about how you can get the most value from your SLOs, check out our best practices series on the subject.