Track the status of your SLOs with the new monitor uptime widget
The beta program for the monitor uptime widget is currently closed to new requests, but you can sign up for the waitlist here.
Service level objectives are an important tool for maintaining application performance, ensuring a consistent customer experience, and setting expectations about service performance for both internal and external users. We are very pleased to announce the availability of a new monitor uptime widget that makes it simple to monitor the status of your SLOs and communicate that status to your teams, executives, or external customers.
SLOs and SLIs
Best practices around SLOs were pioneered by Google’s Site Reliability Engineering team. The Google SRE book and this talk from this year’s Dash conference both provide excellent introductions to service level objectives and service level indicators (SLIs). In short, SLIs are the metrics that reflect the health and performance of a service, and SLOs set precise targets for those SLIs. For instance, if you want to ensure that typical user requests are serviced quickly, you might use your service’s median latency as an SLI. You could then define an SLO such as, “the median latency of all user requests (as computed every minute) will be less than 250 milliseconds 99 percent of the time in any calendar month.”
To accurately track how actual performance compares to the objectives you’ve set, you need a way not only to monitor real-time performance (e.g., computing the median latency every 60 seconds and comparing it against the 250-ms threshold) but also to measure how often that threshold has been breached over longer timespans (to ensure that the 99 percent objective is met for every calendar month).
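To make the arithmetic behind that example concrete, here is a minimal Python sketch of the two steps: checking each one-minute window against the 250-ms threshold, then measuring what fraction of windows met it. The function names and the input shape (one list of request latencies per minute) are assumptions made for illustration, as is the choice to skip minutes with no requests; this is not how Datadog computes uptime internally.

```python
from statistics import median

THRESHOLD_MS = 250.0  # SLI threshold from the example SLO
TARGET = 0.99         # objective: 99 percent of minutes under the threshold


def slo_compliance(per_minute_latencies, threshold_ms=THRESHOLD_MS):
    """Return the fraction of one-minute windows whose median latency
    is below the threshold.

    per_minute_latencies: list of lists, one list of request latencies
    (in ms) per minute. Minutes with no requests are skipped, which is
    an assumption of this sketch.
    """
    windows = [w for w in per_minute_latencies if w]
    if not windows:
        raise ValueError("no non-empty windows to evaluate")
    good = sum(1 for w in windows if median(w) < threshold_ms)
    return good / len(windows)


def slo_met(per_minute_latencies, threshold_ms=THRESHOLD_MS, target=TARGET):
    """True if the share of good minutes meets or exceeds the target."""
    return slo_compliance(per_minute_latencies, threshold_ms) >= target
```

For a real calendar month you would pass in roughly 43,000 one-minute windows, so a 99 percent objective leaves room for a little over seven hours of breached minutes.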
Visualize SLO status on your dashboards
The new monitor uptime widget enables you to visualize SLOs on your Datadog dashboards, which you can share internally or externally to communicate the real-time status of your SLOs to anyone who depends on your service. Building on Datadog’s sophisticated alerting engine, you can create a Datadog monitor for any service level indicator. For the SLO described above, the monitor would check that median latency remains below 250 ms.
The widget then shows how often that threshold has been breached over common SLO baselines such as the previous week, month, year, or the month to date. You can set conditional formatting rules to, for instance, display the status in green if the threshold has been met at least 99 percent of the time over the month to date, and change the status to red if it has been met less than 99 percent of the time.
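The logic of such a formatting rule can be sketched in a few lines of Python. The names below are hypothetical and do not correspond to the widget's actual configuration; the sketch only assumes that you know how many seconds the monitor spent in a breached state during the window.

```python
def uptime_percent(breached_seconds, window_seconds):
    """Percentage of the window during which the threshold was met."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= breached_seconds <= window_seconds:
        raise ValueError("breached_seconds must lie within the window")
    return 100.0 * (1 - breached_seconds / window_seconds)


def status_color(uptime_pct, target_pct=99.0):
    """Green when the objective is met, red otherwise."""
    return "green" if uptime_pct >= target_pct else "red"
```

For example, one hour of breaches in a 30-day window yields an uptime of about 99.86 percent, which this rule would display in green.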
Break down SLO status using tags
The monitor uptime widget allows you to visualize the overall status of your SLOs, but it also shows you at a glance how different segments of your infrastructure are contributing to performance. For instance, you can see the status of your uptime SLO for a service, and break down the uptime by host or data center to easily isolate localized issues. In the example above, we’re monitoring the availability of our Consul cluster, along with the availability of the individual nodes in the cluster, so we can quickly zero in on any issues that arise.
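Conceptually, a tag breakdown groups the same monitor results by a tag value and computes uptime for each group. The Python sketch below illustrates that idea under simple assumptions: each sample is a hypothetical `(tags, ok)` pair, where `tags` is a dict and `ok` records whether the check passed at that evaluation. It is an illustration of the grouping, not the widget's implementation.

```python
from collections import defaultdict


def uptime_by_tag(samples, tag):
    """Return {tag value: uptime percent} for the given tag key.

    samples: iterable of (tags, ok) pairs, where tags is a dict such as
    {"host": "consul-1"} and ok is True when the check passed.
    Samples missing the tag are grouped under "untagged".
    """
    totals = defaultdict(lambda: [0, 0])  # value -> [passed, total]
    for tags, ok in samples:
        key = tags.get(tag, "untagged")
        totals[key][0] += int(ok)
        totals[key][1] += 1
    return {key: 100.0 * passed / total
            for key, (passed, total) in totals.items()}
```

Grouping by `host` on a small cluster would, for instance, make a single node with 75 percent uptime stand out next to peers at 100 percent.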
Display and share service status
The monitor uptime widget provides a new level of functionality for monitoring and enforcing your SLOs, as well as providing transparency to any stakeholders or users who depend on those SLOs being met. If you’re already a Datadog customer, you can sign up here to join the waitlist for beta access. And if you aren’t yet using Datadog to monitor the health and performance of your services, you can sign up for a free trial account here.