
Best practices for managing your SLOs with Datadog

By Mark Azer and Kai Xin Tai

Published: June 22, 2020

Collaboration and communication are critical to the successful implementation of service level objectives. Development and operational teams need to evaluate the impact of their work against established service reliability targets in order to improve their end user experience. Datadog simplifies cross-team collaboration by enabling everyone in your organization to track, manage, and monitor the status of all of their SLOs and error budgets in one place. Teams can visualize their SLOs alongside relevant services and infrastructure components on dashboards—and share the real-time status of those SLOs with any stakeholders that depend on them.

In this post, we will discuss some best practices for managing your SLOs in Datadog, and show you how to:

  • Choose the best SLO type for each use case
  • Use status corrections to exclude data from SLO calculations
  • Organize your SLOs with names, descriptions, and tags
  • Group your SLOs with tags
  • Enhance your dashboards with SLOs

Choosing the best SLO for each use case

In Datadog, you can create two types of SLOs:

  • A monitor-based SLO, which uses one or more monitors in Datadog to calculate its SLI. The SLI is defined as the proportion of time your service exhibits good behavior (as tracked by the underlying monitor(s) being in a non-alerting state).
  • A metric-based SLO, which uses your metrics in Datadog to calculate its SLI. The SLI is defined as the number of good requests over the total number of valid requests.

Monitor-based SLO

The best SLO type for your specific use case depends on whether you’re using time-based or count-based data to calculate the SLI. If you’re looking to track the latency of requests to your payments endpoint, it might be more appropriate to create a monitor-based SLO that tracks time-based data: the percentage of time the endpoint exhibits good behavior (i.e., responds quickly enough to meet your SLO target). To create this SLO, you could select a Datadog monitor that triggers when the latency of requests to a payments endpoint exceeds a certain threshold.

When you define a monitor-based SLO, you select the monitor (or monitors) to use. In this example, the monitor triggers when the latency of requests to the payments endpoint exceeds 0.5 seconds.

This SLO can be stated verbally as “For 99 percent of the time, requests should be processed faster than 0.5 seconds over a 30-day time window.” Datadog visualizes the historical and current state of your monitor-based SLOs with a status bar, allowing you to easily see how often your SLO has been breached. And the error budget next to this status bar tells you exactly how much time your monitors can spend in alert state before your SLO turns red.
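
If you prefer to manage SLOs programmatically, here is a minimal sketch of what creating this SLO could look like against Datadog's SLO API. The monitor ID, tag values, and environment variable names are placeholders you would replace with your own.

```python
# Minimal sketch: creating the monitor-based latency SLO via Datadog's SLO API.
# The monitor ID, tags, and environment variable names are placeholders.
import os

import requests

payload = {
    "name": "Payments endpoint latency",
    "description": "99% of the time, payment requests complete in under 0.5 s over 30 days",
    "type": "monitor",                                   # SLI comes from monitor uptime
    "monitor_ids": [12345678],                           # hypothetical latency monitor ID
    "thresholds": [{"timeframe": "30d", "target": 99}],  # 99% target, 30-day window
    "tags": ["service:payments", "team:payments", "sli:latency"],
}

response = requests.post(
    "https://api.datadoghq.com/api/v1/slo",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=10,
)
response.raise_for_status()
print(response.json())
```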

Metric-based SLO

On the other hand, if you’re looking to track whether your payments endpoint is successfully processing requests, you could define a metric-based SLO that uses count-based data (i.e., the number of good events compared to the total number of valid events) for its SLI. One way to approach this is to divide the number of HTTP responses with 2xx status codes (which we’ll consider to be the good events) by the total number of HTTP responses with 2xx and 5xx status codes (the total number of valid events).

Another way is to use your trace metrics from APM to track how often requests hit the endpoint and whether they’re successful. But say that in this case, we don’t have a metric that directly corresponds to good events. You can use the Advanced option in the metric query editor to build queries based on the metrics you already have. As shown in the example below, if you only have bad events (trace.rack.request.errors) and total events (trace.rack.request.hits), you can define good events as (total events - bad events).

Calculating the number of good events based on the number of bad events and total events

The resulting availability SLO can be written as “99 percent of all requests to the payments endpoint should be processed successfully over a 30-day time window.” In Datadog, you can visualize the status, good request count, and error budget of each metric-based SLO with a bar graph and table.
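
As a rough illustration, the count-based query behind this SLO could look like the sketch below, expressed as the query object you might submit when defining a metric-based SLO through the API. The service:payments tag is a placeholder, and good events are computed as total minus bad.

```python
# Sketch of a metric-based SLO query that derives good events from total and bad
# events. Metric-based SLO queries are count-based, hence the .as_count() suffix.
# The service:payments tag is a placeholder.
slo_query = {
    # good events = total requests - errored requests
    "numerator": (
        "sum:trace.rack.request.hits{service:payments}.as_count()"
        " - sum:trace.rack.request.errors{service:payments}.as_count()"
    ),
    # valid events = total requests
    "denominator": "sum:trace.rack.request.hits{service:payments}.as_count()",
}
```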

Use status corrections to exclude data from SLO calculations

Your SLOs track the performance of your services, and any disruption of a service’s availability or performance can lead to an SLO breach. But you don’t want planned operations, such as deployments and scheduled maintenance windows, to affect your SLO status.

To ensure that your SLO status information is accurate, you can use SLO status corrections to define blocks of time that should not be included in the SLO’s calculation. The screenshot below shows one status correction for a scheduled maintenance window and two recurring corrections that exclude periods outside business hours.

A screenshot shows a table of status corrections for the Web ELB availability SLO. The corrections listed include one scheduled maintenance, one that repeats daily, and one that repeats weekly.

You can use SLO status corrections with both monitor-based SLOs and metric-based SLOs. If you create a correction for a monitor-based SLO, the status of the monitor is ignored during the correction window. For a metric-based SLO, all events that occur during a correction window are excluded from the calculation of the SLO’s status.
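
Status corrections can also be created programmatically. The sketch below is a rough outline of what a one-off scheduled maintenance correction could look like against the SLO corrections API; the SLO ID, category value, and time window are illustrative, so check the API reference for the exact schema.

```python
# Rough sketch: creating a one-off "Scheduled Maintenance" status correction via
# the SLO corrections API. The SLO ID and time window are placeholders, and start
# and end are Unix timestamps in seconds.
import os
import time

import requests

start = int(time.time())
payload = {
    "data": {
        "type": "correction",
        "attributes": {
            "slo_id": "abc123def456abc123def456abc123de",  # hypothetical SLO ID
            "category": "Scheduled Maintenance",
            "start": start,
            "end": start + 2 * 3600,                       # two-hour maintenance window
            "timezone": "UTC",
            "description": "Database upgrade; exclude from SLO status",
        },
    }
}

response = requests.post(
    "https://api.datadoghq.com/api/v1/slo/correction",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=10,
)
response.raise_for_status()
```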

Names, descriptions, tags, oh my!

SLOs are used by multiple teams across an organization, which means that developing an effective naming and tagging strategy is crucial for streamlining communication and keeping your SLOs organized. First, each SLO should have a short but meaningful name that lets anyone understand what it is measuring at a glance. As you create more SLOs, establishing a clear and consistent naming convention also makes it easier to navigate the SLO list view and pick out relevant SLOs.

In addition, we highly recommend adding a description that explains what the SLO measures, why it’s important, and how it relates to a critical aspect of the end user journey. SLO descriptions in Datadog include support for Markdown, so you can easily link to resources that are relevant to the SLO (e.g., related dashboards, workflow tools, and documentation).

During SLO definition, you should add an appropriate name and description as well as any tags to allow anyone across your organization to easily understand what it is tracking.

Besides names and descriptions, tags help you effectively organize and manage your SLOs. With tags, you can easily pivot from a breached SLO to the metrics, logs, and traces of the relevant services to investigate the root cause of the issue. At a minimum, we recommend that you tag your SLOs with the following (illustrated in the sketch after this list):

  • journey:<JOURNEY_NAME> to state the critical user journey that the SLO is related to
  • team:<TEAM_NAME> or owner:<PERSON_NAME> to indicate the team or individual responsible for the SLO
  • service:<SERVICE_NAME>, env:<ENVIRONMENT_NAME>, or any other system-related tags that indicate the system components the SLO is tracking
  • sli:<SLI_TYPE> to indicate the type of SLI the SLO is based on (e.g., latency, availability)
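
For example, the payments latency SLO from earlier might carry a tag set like the one sketched below. Every value here is hypothetical, so substitute your own journey, team, service, and environment names.

```python
# Illustrative tag set for the payments latency SLO, following the conventions
# above. All values are hypothetical.
slo_tags = [
    "journey:checkout",
    "team:payments",
    "service:payments",
    "env:production",
    "sli:latency",
]
```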

When you create a monitor-based SLO, that SLO will automatically inherit the tags of its underlying monitors, so it’s a good idea to tag your monitors appropriately as well. You can head over to our dedicated post for best practices around tagging your monitors. Tagging your SLOs allows you to take advantage of Saved Views, which help you easily find your most frequently used SLOs. Simply use tags to slice and dice your SLOs and save that query as a view that you can access from the sidebar with just a single click.

Saved Views lets you easily access your most frequently used SLOs with just a click.

Group your SLOs with tags

Grouping your SLOs with tags enables you to track the status of each SLO across individual clusters, availability zones, or data centers in context with the overall status. This lets you quickly zero in on problematic segments of your infrastructure, so you can investigate and resolve the underlying issue before you fall out of compliance with your SLO.

To group your metric-based SLOs, simply add one or more tags to the sum by aggregator in the metric query editor.

You can group your metric-based SLOs by adding one or more tags to the sum by aggregator.
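
If you define the SLO through the API rather than the query editor, the grouping shows up as a by {...} clause in each count-based query, roughly as sketched below; the metric and tag names are carried over from the earlier example and are placeholders.

```python
# Sketch of a metric-based SLO query grouped by availability zone: a
# "by {availability-zone}" clause is added to both the numerator and denominator.
grouped_query = {
    "numerator": (
        "sum:trace.rack.request.hits{service:payments} by {availability-zone}.as_count()"
        " - sum:trace.rack.request.errors{service:payments} by {availability-zone}.as_count()"
    ),
    "denominator": (
        "sum:trace.rack.request.hits{service:payments} by {availability-zone}.as_count()"
    ),
}
```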

For monitor-based SLOs, you will need to first ensure that the monitor you want to use is grouped by one or more tags. Then, when you’re creating an SLO, enable Calculate on selected groups and select up to 20 groups. In the example below, we have broken down the monitor by availability zone, and selected six different groups (e.g., availability-zone:us-east-1a, availability-zone:us-east-1b, availability-zone:us-west-1b) to visualize in the SLO.

You can group your monitor-based SLOs by enabling the 'Calculate on selected groups' option and selecting one or more groups.
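
In API terms, this roughly corresponds to setting a groups field on the monitor-based SLO, as in the hedged sketch below; the monitor ID is hypothetical and the availability zones echo the example above.

```python
# Sketch of the grouping-related fields on a monitor-based SLO: the underlying
# multi-alert monitor is grouped by availability-zone, and "groups" narrows the
# SLO calculation to the selected groups (up to 20).
grouped_monitor_slo = {
    "type": "monitor",
    "monitor_ids": [12345678],  # hypothetical multi-alert monitor ID
    "groups": [
        "availability-zone:us-east-1a",
        "availability-zone:us-east-1b",
        "availability-zone:us-west-1b",
    ],
    "thresholds": [{"timeframe": "30d", "target": 99}],
}
```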

Enhancing your dashboards with SLOs

You’re likely already using dashboards to visualize key performance metrics from your infrastructure and applications. You can enhance these dashboards by adding the SLO summary widget to track the status of your SLOs over time. And to get more context around the status of your SLOs, we recommend also adding graphs of the SLIs that correspond to a metric-based SLO and displaying the status of the monitors that make up your monitor-based SLOs.

Enhance your dashboard by adding SLO widgets

In the example above, visualizing the timeseries graph for checkout errors side by side with the checkout request success SLO helps us easily identify if a spike in errors in a certain availability zone is causing a dip in SLO status. Additionally, in the bottom row, we added the monitor summary widget—which displays the status of all the monitors tracking checkout request latency across availability zones—next to our monitor-based latency SLO. You can then share these dashboards with any internal or external stakeholders that depend on these SLOs.
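
If you manage dashboards through the API, an SLO summary widget is defined in the dashboard JSON roughly as sketched below. The SLO ID is a placeholder and field names may differ slightly, so it’s worth copying the JSON of an existing widget as the authoritative reference.

```python
# Rough sketch of an SLO summary widget definition as it might appear in a
# dashboard's JSON. The SLO ID is a placeholder.
slo_widget = {
    "definition": {
        "type": "slo",
        "title": "Checkout request success SLO",
        "slo_id": "abc123def456abc123def456abc123de",  # hypothetical SLO ID
        "view_type": "detail",
        "time_windows": ["30d"],
        "show_error_budget": True,
        "view_mode": "overall",
    }
}
```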


Ensure service reliability with Datadog SLOs

In this post, we’ve looked at some useful tips that will help you get the most value from your service level objectives in Datadog. Together with your infrastructure metrics, distributed traces, logs, synthetic tests, and network data, SLOs help you ensure that you’re delivering the best possible end user experience.

Datadog’s built-in collaboration features make it easy not only to define SLOs, but also to share insights with stakeholders within and outside the organization. And you can proactively monitor the status of your SLOs by creating SLO alerts that automatically notify you if your service’s performance might result in an SLO breach. Check out our documentation to get started with defining and managing your SLOs. If you aren’t yet using Datadog, you can sign up for a 14-day free trial today.