Track the status of your SLOs with the new monitor uptime and SLO widget
Service level objectives (SLOs) are an important tool for maintaining application performance, ensuring a consistent customer experience, and setting expectations about service performance for both internal and external users. We are pleased to announce a new monitor uptime and SLO widget that makes it simple to monitor the status of your SLOs and communicate that status to your teams, executives, or external customers.
SLOs and SLIs
Best practices around SLOs have been pioneered by Google’s Site Reliability Engineering team—the Google SRE book and this talk from last year’s Dash conference both provide excellent introductions to service level objectives and service level indicators (SLIs). In short, SLOs set precise targets for your SLIs, which are the metrics that reflect the health and performance of a service. For instance, if you want to ensure that typical user requests are serviced quickly, you might use your service’s median latency as an SLI. You could then define an SLO such as, “the median latency of all user requests (as computed every minute) will be less than 250 ms 99 percent of the time in any calendar month.”
To accurately track how actual performance compares to the objectives you’ve set, you need a way to not only monitor real-time performance (e.g., computing the median latency every 60 seconds and comparing it against the 250-ms threshold) but also to measure how often that threshold has been breached over longer timespans (to ensure that the 99 percent objective is met for every calendar month).
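As a rough illustration of those two measurements, the following Python sketch checks each one-minute window against the 250-ms threshold and then computes the fraction of windows that passed. The function names and the sample data are hypothetical and are not part of any Datadog API; the sketch only mirrors the arithmetic described above.

```python
from statistics import median

THRESHOLD_MS = 250.0   # SLI threshold from the example SLO
TARGET = 0.99          # fraction of minutes that must meet the threshold


def minute_is_good(latencies_ms):
    """Return True if the median latency of one minute is under the threshold."""
    return median(latencies_ms) < THRESHOLD_MS


def slo_compliance(per_minute_latencies):
    """Return the fraction of one-minute windows whose median met the threshold."""
    if not per_minute_latencies:
        raise ValueError("need at least one window of latency samples")
    good = sum(1 for window in per_minute_latencies if minute_is_good(window))
    return good / len(per_minute_latencies)


# Hypothetical sample: four one-minute windows of request latencies (ms).
windows = [
    [120, 180, 240],   # median 180 -> good
    [200, 260, 300],   # median 260 -> breach
    [90, 100, 110],    # median 100 -> good
    [240, 245, 400],   # median 245 -> good
]

compliance = slo_compliance(windows)
meets_slo = compliance >= TARGET
```

In practice the list of windows would cover every minute of the calendar month, so a single breached minute out of four, as in this toy sample, falls far short of a 99 percent objective.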
Visualize SLO status on your dashboards
The new monitor uptime and SLO widget enables you to visualize SLOs on your Datadog dashboards, which you can share internally or externally to communicate the real-time status of your SLOs to anyone who depends on your service. Because the widget builds on Datadog’s alerting engine, you can create a Datadog monitor for any service level indicator, for example, one that ensures that median latency remains below 250 ms (as shown above).
You can also set targets for success rates of event-based SLIs. For example, you can track the percentage of requests to your application that result in 2xx responses. Use the dropdown menu to select Event based metrics and then select the metrics you want to use to calculate your success-rate SLO.
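The arithmetic behind an event-based SLI is a simple ratio of good events to total events. The sketch below assumes you already have request counts grouped by HTTP status code; the `success_rate` helper and the counts are made up for illustration and do not reflect how the widget computes its value internally.

```python
def success_rate(good_events, total_events):
    """Return good/total as a fraction, or None when there were no events."""
    if total_events == 0:
        return None
    return good_events / total_events


# Hypothetical request counts keyed by HTTP status code.
status_counts = {200: 9700, 201: 150, 404: 100, 500: 50}

# Count only 2xx responses as successful events.
good = sum(count for status, count in status_counts.items() if 200 <= status < 300)
total = sum(status_counts.values())

rate = success_rate(good, total)
```

Here 9,850 of 10,000 requests returned a 2xx response, so the success rate is 98.5 percent.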
You can also visualize how often that threshold has been breached, over common SLO baselines such as the previous week, month, year, or the month to date. You can then set conditional formatting rules to, for instance, display the status in green if the threshold has been met 99 percent of the time over the month to date, and change the status to red if the threshold has been met less than 99 percent of the time.
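To make the month-to-date calculation and the green/red rule concrete, here is a minimal sketch. It assumes breaches are available as non-overlapping `(start, end)` time pairs; the function names, the color strings, and the sample dates are all hypothetical.

```python
from datetime import datetime


def month_to_date_uptime(now, breaches):
    """Fraction of the current month so far that was NOT spent in breach.

    `breaches` is a list of non-overlapping (start, end) datetime pairs.
    """
    month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    elapsed = (now - month_start).total_seconds()
    if elapsed <= 0:
        return 1.0  # nothing has elapsed yet, so nothing has been breached
    breached = 0.0
    for start, end in breaches:
        # Clip each breach to the month-to-date window.
        clipped_start = max(start, month_start)
        clipped_end = min(end, now)
        if clipped_end > clipped_start:
            breached += (clipped_end - clipped_start).total_seconds()
    return 1.0 - breached / elapsed


def status_color(uptime, target=0.99):
    """Mirror the conditional formatting rule: green at or above target, else red."""
    return "green" if uptime >= target else "red"


# Hypothetical example: ten days into the month (240 hours elapsed).
now = datetime(2024, 3, 11, 0, 0)
three_hour_breach = [(datetime(2024, 3, 5, 12, 0), datetime(2024, 3, 5, 15, 0))]

uptime = month_to_date_uptime(now, three_hour_breach)  # 1 - 3/240
color = status_color(uptime)
```

Three hours of breach in 240 elapsed hours leaves uptime at 98.75 percent, which is below the 99 percent target, so the status would display in red.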
See your error budget at a glance
By default, the monitor uptime and SLO widget calculates and displays the monitor’s error budget, based on the SLO and time window you specify. An error budget indicates how much time your SLI can be in the red before it breaches your SLO, which makes it useful for quickly understanding whether you are on track to meet your targets. For example, a 98 percent SLO for a seven-day period gives you an error budget of roughly 3.4 hours (2 percent of 168 hours) of substandard performance over that period.
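The error budget follows directly from the target and the window: it is the window length multiplied by the fraction of time you are allowed to miss. The helper names below are hypothetical; this is a sketch of the arithmetic, not the widget's implementation.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def error_budget_seconds(target, window_seconds):
    """Time the SLI may spend in breach over the window without missing the SLO."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * window_seconds


def remaining_budget_seconds(target, window_seconds, breached_seconds):
    """Budget left after subtracting time already spent in breach (may go negative)."""
    return error_budget_seconds(target, window_seconds) - breached_seconds


# A 98 percent SLO over seven days: 2 percent of 168 hours.
window = 7 * SECONDS_PER_DAY
budget_hours = error_budget_seconds(0.98, window) / SECONDS_PER_HOUR

# After one hour of substandard performance, this much budget is left.
remaining_hours = remaining_budget_seconds(0.98, window, SECONDS_PER_HOUR) / SECONDS_PER_HOUR
```

For this example the budget works out to 3.36 hours, and one hour of breach leaves 2.36 hours for the rest of the week.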
Break down SLO status using tags
Monitor uptime and SLO widgets allow you to visualize the overall status of your SLOs, but they also show you at a glance how different segments of your infrastructure are contributing to performance. For instance, you can see the status of your uptime SLO for a service, and break down the uptime by host or data center to easily isolate localized issues. In the example above, we’re monitoring the availability of our Consul cluster, along with the availability of the individual nodes in the cluster, so we can quickly zero in on any issues that arise.
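A per-tag breakdown amounts to computing the same uptime figure once per group and then comparing each group against the target. The sketch below assumes each host reports a list of pass/fail check results; the host names and the `uptime_by_tag` helper are hypothetical.

```python
TARGET = 0.99  # uptime objective applied to every host


def uptime_by_tag(checks):
    """Map each tag value (e.g. a host name) to its fraction of passing checks."""
    return {
        tag: sum(results) / len(results)
        for tag, results in checks.items()
        if results  # skip tags with no check results
    }


# Hypothetical check results per node: True means the check passed.
checks = {
    "consul-node-1": [True] * 1000,
    "consul-node-2": [True] * 970 + [False] * 30,
    "consul-node-3": [True] * 999 + [False] * 1,
}

uptimes = uptime_by_tag(checks)

# Nodes whose uptime falls short of the objective.
below_target = sorted(tag for tag, uptime in uptimes.items() if uptime < TARGET)
```

With these sample results only `consul-node-2`, at 97 percent uptime, falls below the objective, which is the kind of localized issue a per-host breakdown is meant to surface.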
Display and share service status
The monitor uptime and SLO widget provides a new level of functionality for monitoring and enforcing your SLOs, as well as providing transparency to any stakeholders or users who depend on those SLOs being met. It is now in public beta, so if you’re a Datadog customer, you can immediately start visualizing your SLOs for metric monitors and synthetics. And if you aren’t yet using Datadog to monitor the health and performance of your services, you can sign up for a free trial account here.