
Achieving Huge Performance Wins With Datadog


Published: April 16, 2019

Introduction

Thanks, guys.

My name is Alex Landau.

Thank you all for coming.

This is a talk about how we used a unique set of metrics and Datadog to achieve some really big performance and scalability wins with a pretty small site reliability engineering team.

So, as I mentioned, my name is Alex.

I’ve been at Rover for about two years on our SRE team.

Before that, I was at Amazon for a couple of years, Microsoft for about a year before that.

Some background on Rover

If you’re unfamiliar with Rover, you can think of Rover as like Airbnb for dog sitting or Uber for dog walking.

We actually connect pet service providers with owners.

So, it’s not just dogs, but it’s mainly dogs.

I actually don’t have a dog, I only have a cat.

I haven’t told Rover that, so this is my public announcement.

So, we actually just launched in Europe last year.

We’re expanding quickly and I think we have a really strong collaborative engineering culture that encourages experimentation and creative problem-solving and I think that that led to some of the things that we’re gonna talk about today.

So, the core problem that we’re actually trying to solve here is we have a pretty complex web app (we’ll define complexity in a minute) and a pretty small SRE team.

So, how do we focus on the SRE goals of reliability, performance, scalability, availability, all these sorts of things with a pretty limited set of resources?

And the way that we solved it is focusing on a really specific set of metrics, thinking carefully about how to visualize them and applying some creative problem-solving.

And we’re gonna actually share examples from our real web app to illustrate what we’ve done with the metrics.

The nature of the Rover app

So, when I mentioned complexity, I wanted to just frame what we mean by complex.

These numbers are just meant to illustrate the problem.

Our SRE team is about two to three people depending on what people are working on or what the priorities are, and we have a single monolithic Django web app with about 600,000 lines of Python code, which isn’t crazy, but it’s definitely up there for a single web app.

It’s backed by a single MySQL database.

Well, a single-master MySQL database.

We have about 100 developers working on it on any given day.

They’re deploying 15 to 30 changes a day, give or take, depending on again, the day.

And we have thousands of individual endpoints serving web requests, plus asynchronous tasks (we use Celery, but it’s basically any task we execute outside the request-response web cycle), cron jobs running on a schedule, and one-off commands.

Potential business problems

So, we have a lot of places where we’re executing code and a lot of opportunities for problems to happen.

In particular, the problems that we care about are the ones that impact the business and impact our customers.

Our database, a MySQL database, is a shared resource, so when the database is experiencing things like high CPU utilization or high IOPS (input/output operations per second), our customers are gonna feel the pain.

Web requests start to slow down, people are unable to check out, they’re not able to contact their sitter or their owner.

So, it’s really bad for customers and it’s bad for the business.

Even more perniciously, if we have a trend over time that’s particularly bad, that can be especially bad for the future of the business.

For example, when I joined Rover two years ago, our master database CPU utilization was peaking at 60% during the day and we had vertically scaled it a number of times and we only really had one or two vertical scales left.

So, something like that is an existential threat to the business, right?

It has to be addressed.

And the core fundamental thing that underlies all these problems is that we have a complex web app with code executing everywhere, and that code could be doing anything.

It’s interacting with a database but it could be interacting in a really suboptimal way.

And it’s hard to know that a priori; especially in frameworks like Django, it’s easy to introduce performance problems by mistake.

The Rover team’s approach

So, how do we solve this problem of having a complex web app and a limited engineering team to solve it?

Well, the first thing we did is we thought really carefully and identified a core set of granular performance metrics that we would collect in production to tell us how our app was behaving at a pretty fine granularity.

And we would really focus on collecting these metrics in one place in our web app so that developers wouldn’t have to go manually instrument the code.

And we’ll talk about why that’s important in just a little bit.

The next thing we did when we had these metrics was think really carefully about how to visualize them and create carefully crafted dashboards in order to identify the trends in our web app that would make us worried as SREs.

And we did this without relying on any tracing tooling or APM.

And I’m not gonna get into that too much in this presentation, but I’d love to talk to people about that afterwards because probably a lot of the work that we’re gonna cover here would be wrapped up into an APM solution.

But I think it’s still a good illustrative story to understand why APM solutions are really, really useful.

The last thing that we did is we built some developer tools, documented them, and evangelized them in order to make our SRE platform more of a developer platform rather than a reactive “we’re gonna chase down every performance issue” operation.

And this was really important as a small SRE team because we don’t have the time to fix all the performance issues ourselves, but if we empower our developers to do that, like the feature developers, then we can really outsource the work and scale the team with the organization.

We’re not gonna talk a lot about the specific tools we built, but we will revisit the philosophy at the end of the presentation.

How queries affected the Rover database

So, the biggest single contributing factor to the performance of our web app: queries.

Queries impact the shared resource, and in particular, if we have queries that are slow (multiple-second queries) or if we issue more queries than we actually need (lots and lots of additional queries), those are the things that cause the sort of cascading impact we might see when the database is under a lot of load.

Queries also happen everywhere.

We use Django; it’s an ORM framework.

We’ll talk a little bit about what that means when we get to the case studies, but it basically means that most of what your app is doing is interacting with the database.

And we execute code in the four different contexts that I alluded to earlier.

Web requests, async tasks, cron jobs, and one-off commands: all of those are interacting with the database.

And that’s a lot of places that things can go wrong.

So, it makes sense to collect query metrics.

And in particular, we wanna understand how many queries are being issued and how much time is spent querying the database for all the different places that we can execute code.

And those are gonna become tags.

So, we talk about endpoint name, task name, cron jobs, these are gonna be tags for our metrics so that we can break down and aggregate in a useful way.
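As a rough illustration of what that tagging might look like (the metric names, tag names, and helper function here are hypothetical, using the datadogpy DogStatsD client, not Rover’s actual code):

```python
# Hypothetical sketch: emitting per-unit-of-work query metrics tagged by
# execution context, so the same metric can be broken down by endpoint,
# task, or cron job in Datadog. Metric and tag names are illustrative.
from datadog import statsd


def emit_query_metrics(context: str, name: str, query_count: int, query_time_ms: float) -> None:
    tags = [f"context:{context}", f"name:{name}"]
    statsd.histogram("app.db.query_count", query_count, tags=tags)
    statsd.histogram("app.db.query_time_ms", query_time_ms, tags=tags)


# e.g. emit_query_metrics("web_request", "api_inbox", 150, 320.0)
```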

And I alluded to this earlier, but I mentioned collecting them in one place automatically.

This is actually really important, because with this small SRE team, if we went to management and proposed, “Hey, we’re gonna solve these performance problems by having every developer go manually instrument the thousands of different endpoints and tasks that exist,” that would never get buy-in; it’s just not feasible at all.

So, we had to be really careful about how we did this and make it so that we were not disruptive to our feature developers.

So, if we imagine collecting those query metrics, we might end up with a graph that looks something like this.

So, what we’re seeing here is an example of a graph of the number of queries being issued to a particular endpoint, let’s call it the API inbox endpoint.

And on the Y-axis, you see the number of queries being issued.

So, each of these time slices, let’s say they represent a one-minute or five-minute interval or whatever, is saying that within that window there were about 30,000 queries issued to this endpoint.

This is a nice graph because it allows you to compare across endpoints and get an idea of how heavy your endpoints are.

You can imagine a similar graph for query time, and that’s pretty useful, but it actually has a subtle problem that makes this graph a lot less useful than it might seem at first glance.

And the problem is actually a granularity problem.

If you’re only collecting the metrics in aggregate, then you’re actually losing the ability to differentiate between a view that gets a lot of volume and a view that is poorly performing.

You can fuzzily estimate it, right?

If you take the number of queries and divide by request volume, you get a rough per-request figure, but you lose signal.

In particular, you lose the signal to identify the two most pernicious areas of query performance problems, which are an endpoint that issues a couple of very slow queries or an endpoint that issues a large number of queries or a variable number of queries, which is called the N+1 query problem.

And we’ll explain that in more detail when we get to the case studies.

Extra granularity is the solution

So, the solution to this was to not collect metrics on a per-endpoint or per-task basis, but rather on a per-request basis for each endpoint, or a per-execution basis for each task.

So, we go down one level of granularity.

And what this gives us is a distribution of the number of queries, or the amount of time spent querying the database, on any given request or task execution.

We can still look at the totals if we change the aggregation to a sum, but we don’t lose the signal anymore.

And this gives us a graph that looks a little bit like this.

This is subtly different than the previous graph.

What we’re getting is the median number of queries issued on any given request to this particular endpoint for each time slice.

So, in one of these time slices, let’s say it’s a one-minute window, it’s saying that the median number of queries that were issued on all the requests there was about 150.

And this is really nice because it allows you to dig into the performance problems of a particular view or endpoint based on how its requests are actually performing.

And it allows us to do some serious debugging and we’ll see what that got us in just a minute.

So, I wanna quickly speak about the implementation of these query metrics.

If you imagine in your code you have a place where all the queries get issued, like a single path that everything goes through, you could extend it to emit a counter after every query, and you’d end up with that first graph where you just get the aggregates.

If instead you record in memory the total number of queries being issued and the total time spent querying the database, and then at the end of your request you emit a single histogram for each (one histogram for query counts, one histogram for query time), you end up with the second graph.

You actually get the distribution of how queries are performing across requests.

So, it’s a pretty subtle change, but this is actually like the major insight that allowed us to make these metrics really useful to us.
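Here’s a minimal sketch of that accumulate-then-emit pattern, assuming a Django middleware plus connection.execute_wrapper and the datadogpy statsd client; the metric names and the middleware itself are illustrative, not Rover’s actual implementation:

```python
# Illustrative only: accumulate query count and time in memory during a
# request, then emit a single histogram for each at the end, tagged by
# endpoint. Assumes Django >= 2.0 (execute_wrapper) and datadogpy.
import time

from datadog import statsd
from django.db import connection


class QueryMetricsMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        totals = {"queries": 0, "time_ms": 0.0}

        def instrumented(execute, sql, params, many, context):
            # Count and time every query instead of emitting per query.
            start = time.monotonic()
            try:
                return execute(sql, params, many, context)
            finally:
                totals["queries"] += 1
                totals["time_ms"] += (time.monotonic() - start) * 1000

        with connection.execute_wrapper(instrumented):
            response = self.get_response(request)

        # One histogram per request gives the per-request distribution.
        view = request.resolver_match.view_name if request.resolver_match else "unknown"
        tags = [f"endpoint:{view}"]
        statsd.histogram("app.db.query_count", totals["queries"], tags=tags)
        statsd.histogram("app.db.query_time_ms", totals["time_ms"], tags=tags)
        return response
```

With the histogram in place, the same data supports both views: the median per request (the second graph) and, by switching the aggregation to a sum, the totals from the first graph.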

Helpful tips for building and documenting toolsets

Before we dive into the case studies, I wanna speak quickly about some high-level like graphing philosophy.

We wanna make graphs really useful and easy to eyeball and just see a quick visual diff or a trend over time. So even if you don’t know what you’re looking at, you can tell that, “Oh, there’s something going on here that’s bad.”

That’s good because we have a lot of developers looking at these graphs that might not have a lot of familiarity with them.

The other piece to that is that we believe pretty strongly that documentation should live as close to where it’s used as possible.

So, when we first rolled this out, Datadog actually didn’t have this grouping widget, which they now have and it’s really, really useful.

So, we use grouping widgets on our dashboards to group similar charts together and we use the markdown widgets to actually put documentation directly into the dashboards.

So, you pull up the dashboard, it tells you what you’re looking at and what to look for, and it makes everything self-documenting.

And the last thing is every time we use these metrics and dashboards to solve problems, we try to share those examples and that’s what I wanna do right now.

So, we’re gonna look at a few case studies and in each of these we’re gonna focus on what we actually look for in the graph, like what the graph is showing us, what did we see that made us take action, and then what action did we take, and what was the result on the dashboard.

Case study: N+1 query problem

So, I mentioned this earlier.

Django is an ORM framework, which means object relational mapper.

If you’re unfamiliar with this, it basically allows you to interact with Python objects or classes (or whatever language you’re using) instead of writing relational database queries directly.

So, it issues those queries on your behalf.

And in these sorts of frameworks, there’s a pretty common problem where instead of generating a single query to fetch a bunch of rows from your database, it actually generates one query per row that you’re fetching.

So, you get N extra queries, very common in these sorts of frameworks.
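As a generic Django illustration (the models here are hypothetical, not Rover’s), the pattern and its usual fix look roughly like this:

```python
# Hypothetical models, purely to illustrate the N+1 pattern in Django.
from django.db import models


class Owner(models.Model):
    name = models.CharField(max_length=100)


class Conversation(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)


def owner_names_n_plus_one():
    # 1 query for the conversations, then 1 more per row when the related
    # owner is loaded lazily inside the loop: N+1 queries total.
    return [c.owner.name for c in Conversation.objects.all()]


def owner_names_fixed():
    # select_related() joins the owner table up front, so the same data
    # comes back in a single query.
    return [c.owner.name for c in Conversation.objects.select_related("owner")]
```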

So, when we look at a graph like this, this is that median number of queries per request graph.

We looked at this particular endpoint, and we know this endpoint isn’t supposed to be doing very much, and yet it’s issuing about 200 queries at the median on every request, and there’s some variation there.

So, when we started digging with our debugging tools, we found the N+1 problem using our local tooling, and when we fixed that problem, you can see the impact here, right?

Like, the number of queries per request drops to about 50, roughly a 75% reduction, so a big win for us there.

Case study: Full table scan

In a relational database, you don’t want your query time to scale with your data, that’s bad.

That means you’re gonna run into a point where your business is no longer scalable.

So, this is a graph of the median query time per request that we’re seeing here.

And for one of our really heavy endpoints, looking at this several-month period, we saw this increasing trend of query time per request over time for a growing table.

And that usually tells us that there’s a missing index or something like that and, in fact, that was the case here.

We identified the index that was missing on a column and ran a migration to add it.
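A migration for that kind of fix is typically just a few lines in Django; the app, model, field, and index names below are hypothetical:

```python
# Hypothetical Django migration adding a missing index; app, model, field,
# and index names are illustrative only.
from django.db import migrations, models


class Migration(migrations.Migration):
    dependencies = [("bookings", "0042_previous_migration")]

    operations = [
        migrations.AddIndex(
            model_name="booking",
            index=models.Index(fields=["created_at"], name="booking_created_at_idx"),
        ),
    ]
```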

And then you can see the query time per request drop basically instantaneously to almost nothing, and most importantly, it flatlines.

It’s not increasing with time anymore.

So, this was a big scalability win for us.

Case study: One impactful slow query

The last case study I wanna look at is a little bit different.

What we’re seeing here on this graph is basically a counter of slow queries.

So, once we had a single place in our code where all the queries get issued, and we were keeping track of query counts and times there, we started experimenting with other things.

One of the things we did is anytime there was a single query that was over two seconds, we started emitting a metric and some extra data on it.

And that’s what you’re seeing here.

I think that the title of this chart is misleading; it’s not actually broken down by verb.

Each of these little colors is a different endpoint that’s issuing slow queries.

You can see there’s two or three that are dominating this graph.

So, we started digging into it and looking at it, and we found that there was a hot code path that all these things were going through that had one poorly performing query.

And when we fixed that query, we dropped the number of slow queries to almost nothing, and we actually saw a noticeable drop in database utilization and IOPS when this happened.

So, this was a really big win for us as well.
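A rough sketch of that kind of slow-query signal, building on the same query-wrapper idea (the two-second threshold comes from the talk, but the metric name, tags, and code are assumptions, not Rover’s implementation):

```python
# Illustrative only: inside the shared query path, emit a counter (plus a
# log line with the offending SQL) whenever a single query crosses a
# threshold. Assumes the datadogpy statsd client.
import logging
import time

from datadog import statsd

SLOW_QUERY_THRESHOLD_S = 2.0
logger = logging.getLogger(__name__)


def timed_execute(execute, sql, params, many, context, endpoint="unknown"):
    start = time.monotonic()
    try:
        return execute(sql, params, many, context)
    finally:
        elapsed = time.monotonic() - start
        if elapsed >= SLOW_QUERY_THRESHOLD_S:
            statsd.increment("app.db.slow_query", tags=[f"endpoint:{endpoint}"])
            logger.warning("Slow query (%.1fs) at %s: %s", elapsed, endpoint, sql)
```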

By the numbers: results and solutions

So, I wanna wrap up with some real numbers from our web app.

So, Rover grows about 2x year over year; don’t quote me on that, but it’s something in that order of magnitude.

So, we’re growing pretty fast.

And when I joined Rover, I mentioned that our database CPU utilization was something like 60% peaking during the day.

So, we’re at about 20% now on the master DB, and our read IOPS dropped from 2,000 to 600.

Write IOPS dropped a little bit as well, but read is really where most of our traffic is coming from.

And then this fuzzy metric, oh, this is really small, sorry, if you can’t see this.

This fuzzy metric of queries per request, like the average number of queries issued across all our endpoints, isn’t super meaningful because there’s a lot of variation.

But as a back of the envelope, it went from 45 to 27.

So, we’re reducing the number of queries that we need to serve the amount of traffic we have.

The query metrics were not solely responsible for this drop, but what they did do that was really valuable to us is they guided us.

They localized where we should focus our efforts with the limited resources that we had, so that we could have the most impact in the shortest time.

Don’t just build it. Document and evangelize it as well

And that’s why they’re really important to us.

So, I wanna wrap up with some metaphilosophy, like step back for a second.

There’s a saying that if you build it, they will come.

If you build tools, people will just use them if they’re good.

I think that that’s true, but it’s only half the battle. You also have to document and evangelize those tools, both internally and externally.

So, internally, we have an informal setting where we share tech with the wider team.

The example with Datadog is that we had to really understand how aggregations worked.

Aggregations are really critical to building these graphs correctly.

So, that was one thing that we had to really dig into and sharing it with the team helped us have a deeper understanding of that.

So, now whenever we collect metrics that we think would be widely applicable and useful (and I have a couple of examples, if you wanna talk to me after, of things that are maybe less obvious than the query metrics), we go out of our way to emit them in such a way that developers never have to manually instrument their code.

We wanna make observability on by default, because that reduces the friction for future developers to adopt observability tools.

This may be obvious, but it’s always better to have a metric and not need it than to need it and not have it, because you can’t retroactively go measure something that you didn’t realize you needed.

So, whenever we’re faced with the question, “Do we need this metric?”

We almost always answer yes.

And we budgeted a lot for custom metrics because it’s really important to us to have granular metrics describing our system behavior.

SRE as a proactive, not reactive, discipline

And then the last thing is that you can think of SRE as this sort of reactive thing, where we’re responding to incidents and trying to make things work.

Or you can think of it as a little more proactive, where we’re building some tooling to try to prevent developers from shooting themselves in the foot.

We try to take it a step further and make our observability and SRE work a developer platform that is really focused not just on reducing incidents but on empowering developers.

And this was mainly a win in terms of scale.

So, we haven’t scaled the SRE team up very much in two years, but we have scaled our engineering team up like 3X.

In order to keep up with that, we have to make sure we have support from the feature developers, who always have deadlines and are always behind on everything.

So, we wanna make sure that we empower them and are as non-disruptive as possible.

So, when we build these tools, we think about developers as the customers of our SRE work, as opposed to just building tools for ourselves to respond to incidents.

Conclusion

Thanks.

I think there’s a speaker panel later or something like that.

I’ll be around, walking around in between sessions and everything, but yeah.

Thank you very much.