Service level objectives, or SLOs, are a key part of the site reliability engineering toolkit. SLOs provide a framework for defining clear targets around application performance, which ultimately help teams provide a consistent customer experience, balance feature development with platform stability, and improve communication with internal and external users.
Datadog enables all the teams within an organization to track, manage, and monitor their SLOs in one place. You can search, sort, and filter all your SLOs in a comprehensive list view, and easily visualize the status of individual SLOs on your application dashboards. Datadog’s features for tracking and visualizing SLOs make it simple to monitor the real-time status of all your SLOs and communicate that status to your teams, executives, or external customers.
SLOs and SLIs
Best practices around SLOs have been pioneered by Google’s Site Reliability Engineering team—the Google SRE book and a recent webinar that we jointly hosted with Google both provide great introductions to service level objectives and service level indicators (SLIs). In short, SLOs set precise targets for your SLIs, which are the metrics that reflect the health and performance of a service. For instance, if you want to ensure that typical user requests are serviced quickly, you might use your service’s median latency as an SLI. You could then define an SLO such as, “the median latency of all user requests (as computed every minute) will be less than 250 milliseconds 99 percent of the time in any calendar month.”
To accurately track how actual performance compares to the objectives you’ve set, you need a way to not only monitor real-time performance (e.g., computing the median latency every 60 seconds and comparing it against the 250-ms threshold) but also to measure how often that threshold has been breached over longer timespans (to ensure that the 99 percent objective is met for every calendar month). Datadog tracks your SLIs in real time and visualizes their status in relation to your established SLOs, so you can see immediately how actual performance compares to your objectives for a given time period.
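As a rough illustration of the two measurements described above, the sketch below checks each minute's median latency against the 250-ms threshold and then computes how often the threshold was met over a longer window. It is not Datadog's implementation; the function names, the data layout (one list of latency samples per minute), and the sample values are all assumptions made for the example.

```python
from statistics import median

def minute_is_good(latencies_ms, threshold_ms=250.0):
    """True if the median latency for one minute is below the threshold."""
    return median(latencies_ms) < threshold_ms

def slo_status(per_minute_latencies, threshold_ms=250.0, target=0.99):
    """Return (fraction_of_good_minutes, slo_met) for a window of minutes."""
    if not per_minute_latencies:
        raise ValueError("the window contains no data")
    good = sum(
        1 for sample in per_minute_latencies
        if minute_is_good(sample, threshold_ms)
    )
    fraction = good / len(per_minute_latencies)
    return fraction, fraction >= target

# Hypothetical window: 99 fast minutes and 1 slow minute.
window = [[120, 180, 240]] * 99 + [[300, 400, 500]]
fraction, met = slo_status(window)
print(f"{fraction:.2%} of minutes within threshold; SLO met: {met}")
```

In practice the window would cover every minute of the calendar month, and the same calculation would be repeated as new data arrives.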
Manage all of your SLOs in one place
If your organization is committed to a variety of SLOs across multiple products and teams, visualizing the status of all of your SLOs in one place can help you set priorities and address issues. Datadog’s new Service Level Objectives view allows you to see the status of all of your SLOs, along with the remaining error budget for each. You can then filter the list by facets to see only the SLOs owned by a specific team or scoped to a service, time window, or any tag. Your SLOs based on Datadog monitors automatically inherit the tags associated with those monitors, and you can apply custom tags as well to make it easier to organize your SLOs by team, environment, or any other dimension.
In the SLO list view, you can start tracking a new SLO by clicking the “New SLO” button. SLOs in Datadog can be based either on existing monitors (e.g., a monitor comparing p95 latency against a target threshold) or on real-time status computed from events. Event-based SLOs are useful for monitoring the percentage of events that meet a certain definition, such as the ratio of non-5xx responses from a pool of backend app servers to the total number of responses.
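The event-based calculation described above can be sketched in a few lines. This is only an illustration of the ratio itself, assuming the responses are available as a plain list of HTTP status codes; the function name and sample data are hypothetical and do not reflect how Datadog computes the value internally.

```python
def event_sli(status_codes):
    """Fraction of responses that are not 5xx server errors."""
    total = len(status_codes)
    if total == 0:
        raise ValueError("no responses in the window")
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return good / total

# Hypothetical sample: 1,000 responses, 10 of them server errors.
responses = [200] * 970 + [404] * 20 + [503] * 10
sli = event_sli(responses)
print(f"Good events: {sli:.2%}")
```

Note that 4xx responses count as good events here, since the example definition only excludes 5xx responses.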
See your error budget at a glance
By default, all your SLOs in Datadog will generate and display an error budget indicating how much time your SLI can be in the red before it breaches your SLO. This is useful for quickly understanding whether you are on track to meet your targets, and whether your development velocity is appropriate for your stated performance and stability goals. Datadog automatically calculates the error budget based on the SLO and time window you specify. For example, a 98 percent SLO for a seven-day period would give you an error budget of roughly 3 hours and 20 minutes of substandard performance over that period (2 percent of 168 hours).
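The arithmetic behind a time-based error budget is straightforward: the budget is the share of the window not covered by the target. The sketch below works through the 98 percent, seven-day example, which comes to about 3 hours and 22 minutes. The function names are hypothetical, and the 45 minutes of consumed budget is an invented figure for illustration.

```python
from datetime import timedelta

def error_budget(target_percent, window):
    """Total time the SLI may be out of bounds within the window."""
    return window * (100 - target_percent) / 100

def budget_remaining(target_percent, window, time_in_breach):
    """Budget left after subtracting the time already spent in breach."""
    return error_budget(target_percent, window) - time_in_breach

budget = error_budget(98, timedelta(days=7))
print(f"Error budget: {budget}")

# Hypothetical: 45 minutes of substandard performance so far this week.
left = budget_remaining(98, timedelta(days=7), timedelta(minutes=45))
print(f"Remaining: {left}")
```

A negative remaining value would mean the budget is exhausted and the SLO has been breached for that window.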
Visualize SLO status on your dashboards
To track the status of your SLOs in context with detailed data about the relevant services or infrastructure components, you can add SLO widgets to your Datadog dashboards. You can then share your dashboards internally or externally to communicate the real-time status of your SLOs to anyone who depends on your service.
You can also visualize how often an SLI's threshold has been breached over common SLO baselines such as the previous week, month, year, or the month to date. You can then set conditional formatting rules to, for instance, display the status in green if the threshold has been met at least 99 percent of the time over the month to date, and change the status to red if it has been met less than 99 percent of the time.
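The green/red rule described above amounts to an ordered list of comparisons against the observed percentage. The following sketch shows that logic in isolation; it is not Datadog's widget configuration format, and the rule structure, function name, fallback color, and minute counts are assumptions made for the example.

```python
import operator

# Each rule: (comparison, value, color). The first matching rule wins.
RULES = [
    (operator.ge, 99.0, "green"),
    (operator.lt, 99.0, "red"),
]

def status_color(percent_met, rules=RULES):
    """Return the display color for a given percentage of time in bounds."""
    for compare, value, color in rules:
        if compare(percent_met, value):
            return color
    return "gray"  # fallback when no rule matches

# Hypothetical month-to-date figures, in minutes.
good_minutes, total_minutes = 14_300, 14_400
percent = 100 * good_minutes / total_minutes
print(f"{percent:.2f}% -> {status_color(percent)}")
```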
Break down SLO status using tags
SLO widgets allow you to visualize the overall status of your SLOs, but they also show you at a glance how different segments of your infrastructure are contributing to performance. For instance, you can see the status of your SLO for an entire service, and break down the status by customer cohort or data center to easily isolate localized issues. In the example above, we’re monitoring the user-facing performance of a web application, broken down by the partition tag, so we can quickly zero in on any issues that arise.
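Conceptually, a tag breakdown groups the same good/total counts by the value of one tag. The sketch below illustrates the idea with a partition tag; the data layout (a tags dictionary paired with a pass/fail result), the function name, and the tag values are hypothetical and are not taken from Datadog's API.

```python
from collections import defaultdict

def status_by_tag(checks, tag_key="partition"):
    """Group (tags, ok) results by one tag and return the good fraction per value."""
    counts = defaultdict(lambda: [0, 0])  # tag value -> [good, total]
    for tags, ok in checks:
        bucket = counts[tags.get(tag_key, "untagged")]
        bucket[0] += int(ok)
        bucket[1] += 1
    return {value: good / total for value, (good, total) in counts.items()}

# Hypothetical results: partition "a" is healthy, partition "b" is degraded.
checks = (
    [({"partition": "a"}, True)] * 100
    + [({"partition": "b"}, True)] * 90
    + [({"partition": "b"}, False)] * 10
)
for value, fraction in sorted(status_by_tag(checks).items()):
    print(f"partition:{value} {fraction:.1%}")
```

The overall figure for this sample would be 95 percent, which hides the fact that all of the failures come from a single partition.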
Display and share service status
Datadog makes it simple to monitor and manage your SLOs in the same place that you already monitor your applications, infrastructure, user experience, and more. Perhaps just as importantly, Datadog enables you to provide transparency to any stakeholders or users who depend on those SLOs being met. If you aren’t yet using Datadog to monitor the health and performance of your services, you can sign up for a free trial account here.