Track SLIs and SLOs

Published: April 16, 2019

All right, so hi, everyone. I’m Meghan, and I’m here to talk about SLIs and SLOs.

I’m a product manager at Datadog, focused on our alerting and reporting platform.

Some background on SLOs

So why SLOs?

Service Level Objectives are key components for ensuring customer happiness.

I think that we can agree how important that is, as a lot of you are on call today.

So the SRE and CRE folks at Google have been paving the way for this as well. If you haven’t had a chance to check out their book, I definitely recommend it.

There are a few things to consider when developing your SLOs.

The first is what should you measure to begin with?

What SLIs are you tracking?

The second, what is your target objective and is it even achievable?

And third, how are you sharing, communicating, and making decisions on your SLOs once you actually have them?

How are you making decisions with your stakeholders?

And this is what I’m here to talk about today.

What’s the difference between SLOs, SLAs, and SLIs?

So I just wanna take a step back for a minute just to define a few things.

Service Level Indicators are the measurements, the metrics and the thresholds they need to meet.

Service Level Objectives are the internal target values, for example, the percentage of time that the SLI must be met, and by meeting that objective, you’re meeting your customers’ expectations.

Service Level Agreements are your commitments to users that, when breached, will probably result in some legal or financial obligation.

A look at SLOs, SLAs, and SLIs in action

And to put these into perspective, let’s take a look at ACME Corp.

So ACME Corp’s customers rely on its APIs to run their own businesses, so ACME needs to maintain a monthly SLA of 99.5%.

If they breach that SLA, they’re on the hook financially.

So they’ll probably owe some percentage of service credit back to those customers, and this is something they definitely want to avoid.

So, in order to do that, they defined a few core principles internally, and the first thing is to define an SLI.

And in this case, it’s latency but keep in mind that this could be a metric related to durability, freshness, correctness.

The important thing is that this SLI is the best indicator for the agreement that they have with their customers.

And on top of that, they have their SLO, which they’ve defined as 99.8%. You’ll see that it’s stricter than the agreement, and this is because it’s the sweet spot of customer success.

And I mentioned before that ACME’s customers are relying on them to run their businesses. Well, those customers also have customers relying on them.

So there’s more than just finances on the line.
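To make the gap between the two targets concrete, here is a minimal sketch of the arithmetic behind ACME's 99.5% SLA and 99.8% SLO. It assumes a 30-day month and treats both targets as time-based; the function name `allowed_downtime` is made up for illustration and is not part of any Datadog API.

```python
from datetime import timedelta

def allowed_downtime(target_percent: float, window: timedelta) -> timedelta:
    """Time the SLI may be missed within `window` while still meeting the target."""
    return window * (1 - target_percent / 100)

# Assumption: the "monthly" window is 30 days long.
window = timedelta(days=30)

sla_budget = allowed_downtime(99.5, window)  # external commitment
slo_budget = allowed_downtime(99.8, window)  # stricter internal target

# The difference is the buffer between missing the SLO and breaching the SLA.
buffer = sla_budget - slo_budget

print(f"SLA 99.5% allows {sla_budget} of downtime")
print(f"SLO 99.8% allows {slo_budget} of downtime")
print(f"Buffer before the SLA is breached: {buffer}")
```

Under these assumptions the SLA tolerates roughly 3.6 hours of downtime per month while the SLO tolerates under an hour and a half, so missing the internal objective still leaves room before any service credit is owed.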

What does this mean for Datadog customers?

Let’s talk about this from the perspective of Datadog. So if you’re using Datadog, you probably already have a variety of dashboards with plenty of data filled with custom metrics, metrics from integrations. If you’re using APM, we’re already surfacing a lot of key indicators for each of your services.

Additionally, you probably have defined monitors to track user experience and to notify you when those metrics have fallen below critical thresholds.

And these alerts are the right signals of poor performance, and they’re informing you and your teams where to invest time and energy in product improvements, especially if those monitors are configured for your SLIs, as they’re important for achieving your objectives.

And this is actually already available today in the closed beta of the monitor uptime widget. How many of you have been a part of that beta or at least heard of the monitor uptime widget?

Okay, so it’s time to get everybody on board.

And before we get there, how are you sharing and communicating these objectives once you have them?

So if you are using the widget, you’ve added it to a dashboard, and this is one way to do it.

But you might also be running your own calculations, creating your own reports, and a lot of this might live in spreadsheets.

But this is time-consuming if you’re still identifying the right indicators, defining the objectives, and iterating.

Okay.

And I’m very happy to announce today that the monitor uptime widget is now open for everybody in public beta.

So this means you can visualize and share your SLOs across teams based on monitor uptime.

To make this more actionable, we’ve added an error budget visualization that tells you how much time you can afford to be in the red until your SLO is breached.

So this might not look that scary, but if we take a look at this, this should really tell you something.

So you wanna take a step back, regroup with your team, and say, “Okay we need to re-prioritize, maybe we need to focus more on reliability until we get this back up.”

And this is what’s available for everybody today that you can add on to your dashboards.
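As a rough illustration of what an error budget expresses, the sketch below computes how much time is left "in the red" before a time-based SLO is breached. The 99.8% target, the 30-day window, and the 60 minutes already spent in the red are hypothetical values, and the helper names are invented here; this is not how the widget itself is implemented.

```python
from datetime import timedelta

def error_budget(target_percent: float, window: timedelta) -> timedelta:
    """Total time the SLI may be missed within `window`."""
    return window * (1 - target_percent / 100)

def remaining_budget(target_percent: float, window: timedelta,
                     time_in_red: timedelta) -> timedelta:
    """Budget left after `time_in_red`; a negative value means the SLO is breached."""
    return error_budget(target_percent, window) - time_in_red

window = timedelta(days=30)          # assumption: 30-day window
time_in_red = timedelta(minutes=60)  # hypothetical downtime so far

remaining = remaining_budget(99.8, window, time_in_red)
consumed = time_in_red / error_budget(99.8, window)  # fraction of budget used

print(f"remaining: {remaining}, consumed: {consumed:.0%}")
```

A budget that is mostly consumed early in the window is the kind of signal that would prompt the re-prioritization conversation described above.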

So I’ve been talking a lot about monitor uptime, and you may have also read our blog posts talking about SLIs that are specifically for monitor uptime or based off of monitors.

But these are specifically time-based, where you’d say 99% of the time there are no errors.

But you probably have SLIs that are configured for success rates, where then you’d say 99% of my requests are successful.

And these are SLIs that you don’t necessarily want to define within a monitor because they’re more about the ratio than the evaluated time.
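The difference between the two kinds of SLI shows up when both are computed from the same data. The sketch below uses hypothetical per-minute buckets of `(total_requests, failed_requests)`; the data and function names are assumptions for illustration only.

```python
# Hypothetical per-minute buckets: (total_requests, failed_requests).
# The third minute is a quiet period where half of the few requests fail.
minutes = [(1000, 0), (1000, 0), (10, 5), (1000, 0)]

def time_based_sli(buckets):
    """Percent of minutes in which there were no errors."""
    good_minutes = sum(1 for _, failed in buckets if failed == 0)
    return 100 * good_minutes / len(buckets)

def event_based_sli(buckets):
    """Percent of all requests that succeeded, regardless of when they ran."""
    total = sum(t for t, _ in buckets)
    failed = sum(f for _, f in buckets)
    return 100 * (total - failed) / total

print(f"time-based:  {time_based_sli(minutes):.2f}%")
print(f"event-based: {event_based_sli(minutes):.2f}%")
```

With this data one bad minute out of four drags the time-based SLI down to 75%, while the event-based SLI stays above 99.8% because only 5 of 3,010 requests failed. Which one better reflects user experience depends on the service.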

So it’s also exciting to announce today that you can add success rate SLIs, or what we’re calling event-based SLIs, into the same widget, so we have an SLO widget available for everybody in public beta.

How to get granular insights into SLOs and SLIs with monitor uptime

So to put this into context, here’s the SLO widget on a dashboard. This can be shared with whoever needs to see it, and it can be sliced across service, across team, or across user journey if you’re doing that.

And what this provides is an overview of my top objectives alongside the indicators. This is a combination of my monitor uptime and my event-based SLIs, all together in one place.

And you can see, in the right corner of the dashboard, one objective over three different time windows: over seven days, over thirty days, and over the previous month.

So what this allows me to do is compare over time, how we’ve been doing, are we improving?

Are we getting worse?

Where do we need to allocate time?
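The comparison across windows can be sketched as follows for an event-based SLI. This assumes the seven-day and thirty-day windows are rolling windows ending today and that "previous month" means the previous calendar month; the daily records and failure counts are made up, and none of the names come from Datadog.

```python
from datetime import date, timedelta

def success_rate(records, start, end):
    """Percent of good events on days in [start, end]; None if there were no events."""
    in_window = [(good, total) for day, good, total in records if start <= day <= end]
    total = sum(t for _, t in in_window)
    return 100 * sum(g for g, _ in in_window) / total if total else None

def time_windows(today):
    """Assumed window definitions: rolling 7 and 30 days, plus previous calendar month."""
    last_of_prev_month = today.replace(day=1) - timedelta(days=1)
    return {
        "7d": (today - timedelta(days=6), today),
        "30d": (today - timedelta(days=29), today),
        "previous month": (last_of_prev_month.replace(day=1), last_of_prev_month),
    }

# Hypothetical data: 1,000 requests per day, with failures on two days.
today = date(2019, 4, 16)
failures = {date(2019, 4, 14): 50, date(2019, 3, 10): 20}
records = []
for i in range(60):
    day = today - timedelta(days=i)
    records.append((day, 1000 - failures.get(day, 0), 1000))

results = {name: success_rate(records, start, end)
           for name, (start, end) in time_windows(today).items()}
for name, rate in results.items():
    print(f"{name}: {rate:.2f}%")
```

In this made-up data the seven-day figure is the lowest of the three, which is the "are we getting worse?" signal that a side-by-side view makes easy to spot.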

And then, a lot of the emphasis has been on SLOs, but you can also just do simple monitor uptime and that’s what the widget in the left-hand corner shows. So there’s a lot of flexibility and a lot that you can do for your reporting.

Conclusion and summary

So what does the future look like for SLOs at Datadog?

Well, I’m very excited to show you all today a little sneak preview into what we’re thinking that might look like.

So we want to enable you to centrally create, manage, and share SLIs, and really be able to dig in. This is just a mockup, but I hope that it makes you all excited, because we’re exploring different versions of what might be an SLI summary page.

So we’re surfacing the error budget at the top or surfacing a burn down at the top where you can really dig in and get a lot of information about that SLI, and how that SLO is performing over time, and you can create comprehensive reports with your teams.

So what does that mean for you?

So I’ll be demoing our current product offering, and we can talk about the future of SLOs in Datadog as well.

I’ll be available in Open Spaces, so come and find me and let’s chat.