Why does Grubhub use Datadog?

So, to take a step back to why we're here: we're obviously talking about Datadog. We're a big user of Datadog, and we're very happy with it. So why Datadog? I don't know what the mix in the room is, whether people have been using it for a long time or are just starting to evaluate it, but why did we end up using it? We made the choice right when I joined, two and a half years ago. We have multiple data centers. For companies rebuilding their infrastructure, and for new companies operating at a certain scale, that almost seems par for the course now, which is kind of interesting. But we wanted a single pane of glass. It's really hard to run some of the more traditional, Graphite-type systems across multiple data centers with a single pane of glass; you want to look in one place and see everything across every data center without jumping through hoops. We wanted built-in alerting. A lot of these tools don't have that built in, so you end up scraping, etcetera. We wanted easy-to-use, well-documented APIs. We also wanted something StatsD-compatible, because we didn't want to lock ourselves in: if, two years after starting with Datadog, we had decided, "Eh, we don't really like this," we didn't want to have to rewrite all of our libraries and then rewrite them again. And the features people have talked about, like event operators, anomalies, and outliers, are just icing on the cake for us. They've been really helpful. As people have said before, "Moving to microservices is the best thing ever and you're gonna solve all these problems," and then you end up with a lot of other, more complicated problems in a lot of senses. We did the same thing: we moved from a monolith, or a set of them, to lots of services and lots of new teams as we've been growing. So what actually happened?
Monitoring that scales

So we have all these new teams joining, and lots of new services. People join the company, they write these new services, and it's just easy to miss monitoring. It's easy for engineering team A and engineering team B to each implement their monitoring in a way that really doesn't make sense; maybe they haven't had experience monitoring production or high-throughput services, so it's easy to miss things. And if we have multiple application frameworks, metric names are just different. So if I'm responsible for looking into an incident and I think, "Huh, maybe there's something going on with a service downstream or upstream of me," I look into it and realize, "Oh, I don't know what their error-rate metric is, because they named it something ridiculous." With lots of application frameworks you can have many different metric names, which makes things really complicated, and then when you start alerting off of those, there are two problems. The last one is noisy alerting, which everyone has talked about. That is a big problem: people get burned out, they ignore things, alerts aren't actionable. Or worse, they have no alerting at all. It's probably a toss-up which is worse, alerting that's really noisy or none at all. And those alerts often lack any context. I've seen a lot, and we've gone through a lot, that just say, "Someone should look into this," or, "This shouldn't happen." It's like, "Yeah, no shit, it shouldn't happen. That's why you woke me up for it." I need to go to the next step, right? The alert is step one. What's step two? What's step three? What's step four? This monitoring problem isn't solved by any means, but what do we do to kind of shepherd along the process?
Define common metric names

Defining common metric names. If you have a framework that you want to use, and we have a few of them (we're mainly a Java shop, but we use different Java frameworks), just define common metric names at the framework level for everyone. We basically own the monitoring infrastructure, and we also own those frameworks, so we've been able to say, "Hey, when you write an application and you start ticking metrics, you're just going to use the names that we've defined, because we build everything off those names." It takes a little bit away from the developers, in that they don't have to worry about it, and we get common sets of names.
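As a sketch of what framework-level metric naming might look like, here is a minimal example. The metric names, the tag syntax, and the class itself are illustrative assumptions, not Grubhub's actual framework; the point is that the name is fixed by the framework and the service is a tag, so every service emits the same names.

```java
// Hypothetical sketch of framework-level metric names. Every service built
// on the framework emits these exact names, so dashboards and monitors can
// be shared; the service identity lives in a tag, not in the metric name.
public final class StandardMetrics {
    public static final String REQUEST_COUNT   = "service.request.count";
    public static final String REQUEST_LATENCY = "service.request.latency";
    public static final String ERROR_COUNT     = "service.error.count";
    public static final String HTTP_5XX        = "service.http.response.5xx";

    private StandardMetrics() {}

    /** Append a StatsD-style service tag (DogStatsD-like syntax, simplified). */
    public static String tagged(String metric, String service) {
        return metric + "|#service:" + service;
    }
}
```

With names defined once like this, an on-call engineer looking at an unfamiliar downstream service never has to guess what its error-rate metric is called.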
Provide a base set of monitors for all services

And then what we're able to do is provide a base set of monitors for all services. Rather than you having a new service and saying, "Well, I know I'm gonna need this monitor, and this monitor, and this monitor," we just say, "You're going to get all of that" just by the fact that you're running in an environment. We use Eureka, a Netflix project, as our service discovery tool, and if you're in Eureka, in any environment, you just get everything. We run the same monitoring and the same alerting in pre-production as in production; we don't page people on pre-production, but we get to test out our monitors there. And when I say you get everything: we have a few important metrics that we look at, and I'll talk about that a little bit later, but basically all the baseline stuff that we, having some expertise in this area, think is important for knowing that your service is functional and operational, you just get by default.
Collect service-specific metrics

And then it's on you to use the easy hooks that we've provided to create service-specific metrics. If we have, say, a payment service, you should probably not only be using our built-in error metrics, you should probably also be asking, "How long does it take to authorize a credit card?" or something along those lines. This allows everything to be in source control; it's the exact same concept Airbnb talked about. Same exact thing: a pull request. People can look at these things, and we can have a discussion about them, versus someone just saying, "Oh, I want someone to get paged off of this arbitrary thing that no one knows what it is." And because it's in source control, it's not an operations problem or a DevOps problem, whatever the term is. It's everyone's "burden" or "problem," which sounds negative; there's some other word, I'm sure. But because of source control, it's easy for developers to just own this.
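A minimal sketch of what a service-specific metric might look like on top of such framework hooks. The `Metrics` class here is a self-contained stand-in for whatever StatsD-compatible client the framework would expose, and the metric name and payment logic are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of a developer-owned, service-specific metric: timing how long a
// credit-card authorization takes, alongside the framework's built-in metrics.
public final class PaymentService {

    /** Tiny recording client standing in for a StatsD-compatible library. */
    static final class Metrics {
        final Map<String, Long> timersMs = new HashMap<>();
        <T> T time(String metric, Supplier<T> work) {
            long start = System.nanoTime();
            try {
                return work.get();
            } finally {
                timersMs.put(metric, (System.nanoTime() - start) / 1_000_000);
            }
        }
    }

    final Metrics metrics = new Metrics();

    boolean authorize(String cardToken) {
        // Errors and request latency come for free from the framework;
        // this metric is the service team's own question about its domain.
        return metrics.time("payments.card.authorize.ms", () -> {
            // ... call the payment processor with cardToken here ...
            return true;
        });
    }
}
```

Because a metric like this lives in the service's own code, it goes through the same pull-request review as any other change.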
Templated dashboards

The first piece of this, defining common metric names, is super important for visualizations. Getting an alert is step one; step two is, "Okay, now I need a visualization." And what we found is that we make heavy, heavy use of templated dashboards. I was going to have screenshots in here, and then I couldn't find ones that didn't have very business-specific metrics, so just picture what Datadog looks like, a bunch of cool lines, and that's basically what we have. What templating means is that if I get an alert for a service, I can go to our overview dashboard, which is the entry point: here are all the important things to tell me what's going on with the service. I can easily select the service I want to look at from the dropdown and see 200s, 400s, 500s, errors, system load. We have outliers on there, we have anomalies on there, and since most of our services are Java, all of our JVM metrics are on there. It's really easy to say, "I don't need to search for Jeff's super-secret dashboard that really shows what's going on." I just look at the overall health dashboard, and anyone can do that, including any developer. It's our, not prescriptive, but our curated view of service health, a lot like what we do for monitors.
Different dashboards for dev & ops teams

Then we have the next level of visualization, which we've broken out into two types of dashboards. We have operations-focused dashboards, and by that I don't mean for an operations team, but more, "I'm operating this service; I want to know what its overall health is." Those might have some more business-specific metrics on them. And then we have the developer-focused ones, which you might not look at during an incident. Those might be more for, "I was running load testing, I was running performance testing, I want to check that my new metric is working." So we break those out. The operations summary dashboards are in source control; you really shouldn't be adding arbitrary things to those, and they go through a review process. The developer-focused ones, you can add whatever you want to; we might have 100 graphs on some of them. All of these visualizations are just meant to provide context in monitoring. We were very keen on purchasing big TVs to hang around our office because it looks cool, but they're not the thing we're looking at all the time. If something is red on one of those and we didn't get an alert, that's a problem; they should only be there to provide context. Over my desk I have a big dashboard that shows a few key metrics for most of our key services, say 20 of them. If I get an alert, I can glance up at it and see what's red and what's yellow, but I shouldn't be walking by and going, "Oh, why is that all red?" My phone should have exploded long before that. This color is weird. Someone told me this color would be weird, so sorry for burning your eyes for the next few minutes.
Prioritize important metrics

What are the important metrics that we actually look at? These are things we can define at the framework level; it's our curated list of metrics that we monitor for everyone. Errors are an important one. I think this is obvious, but what we do is, if we log something at the error level, we just tick a metric. Actually, we do it for every logging event, so we can see if someone is logging a lot of stuff at DEBUG or TRACE, but we record a metric for every single error. And what we've defined errors as, and I'm just gonna read this, is: exceptional cases reserved for events that need to be looked into. So it's not an error of "this should never happen," and it's not an error of "a user put in bad input." Maybe that is an exceptional case that someone should look into, but in most cases an error means, "Oh, this is a problem." This was a bad experience for a user, this is not what a background job should be doing, this is something someone should look into. For other events, say ones where only a large increase matters, like bad user input that might be a client-side bug, we track those independently. We say: where you would have logged this as an error, log it as a warning, and tick an independent metric for it. Then we can track it on its own, because 1,000 errors a minute is probably really bad, but 1,000 bad user inputs a minute could be a scraper or bad bot traffic. Maybe we should get a warning about it, maybe we should get a ticket opened, but it's not something someone needs to look into immediately.
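The "tick a metric for every logging event" idea can be sketched as a log handler that counts records by level. This uses `java.util.logging` purely so the example is self-contained; the talk's stack would presumably be a Logback-style logger with counters flushed to a StatsD-compatible agent, and the metric names here are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Sketch: every logging event ticks a per-level counter, so an error spike
// pages someone, while a flood of DEBUG/TRACE is also visible on a graph.
public final class MetricCountingHandler extends Handler {

    final Map<String, AtomicLong> counts = new ConcurrentHashMap<>();

    @Override
    public void publish(LogRecord record) {
        String metric = "log.events." + record.getLevel().getName().toLowerCase();
        counts.computeIfAbsent(metric, k -> new AtomicLong()).incrementAndGet();
    }

    @Override public void flush() {}
    @Override public void close() {}

    public static void main(String[] args) {
        Logger log = Logger.getLogger("demo");
        log.setUseParentHandlers(false);
        MetricCountingHandler handler = new MetricCountingHandler();
        log.addHandler(handler);

        log.severe("payment authorization failed"); // error: a human should look
        log.warning("bad user input");              // expected event: warn + own metric
        System.out.println(handler.counts);
    }
}
```

The demotion rule from the text then falls out naturally: an event that only matters in aggregate is logged at warning, so it lands in a separate counter with its own, looser threshold.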
The one thing I will say about this, and I think there's always a balance in what you should log, is that you have to make sure errors are still logged and not silenced. After those first few bullets, people go, "Oh, I'll never log anything at error," and you're like, "No, no, please log things at error, but only if they're important and actionable and someone should actually look into them." Timeouts are an interesting one; this has been the bane of my existence for the last year. By default, every timeout is an error: you time out talking to your database, it's an error. The problem is we're primarily in Amazon, and we'll have 100 milliseconds of a partition to a single database node in our cluster. That's a problem, and if we're doing thousands of transactions a second, 100 milliseconds of a partition will throw lots of errors. But if it recovers and we never actually showed anything bad to the user, because we have retries built throughout our stack, maybe that's not something that needs to throw all of those errors.
Some of the other stuff we look at rounds out the baseline. We have errors, we have 500s, we have things that we proxy back to users. JVM statistics, since any runtime stats are obviously important, and normal system metrics: load, disk space, memory utilization, as well as process monitoring; if something died, we want to know about it. So what we've done, and we're in the process of doing this now, is we're going to catch timeouts, tick independent metrics for those, and just not log them as errors. Then we can alert on errors at a much lower threshold than timeouts, because if we see a small spike in timeouts but nothing bad going back to the user, we're probably okay. And that goes into other metrics: because we have retries at every layer, we mostly care about the 500s proxied back to the user.
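The timeout handling described above can be sketched like this. The class, metric names, and one-retry policy are illustrative assumptions; the technique is what the text describes: catch the timeout, tick an independent metric, let the retry absorb the blip, and only escalate to an error when the timeout persists.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Sketch: a transient timeout is a metric plus a retry, not an error;
// a timeout that survives the retry is counted and rethrown as an error.
public final class TimeoutAwareClient {

    final Map<String, Integer> counters = new HashMap<>();
    private void tick(String metric) { counters.merge(metric, 1, Integer::sum); }

    /** One retry on timeout; only a repeated timeout becomes an error. */
    <T> T call(String dependency, Callable<T> op) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TimeoutException e) {
                tick("timeouts." + dependency);    // independent, high-threshold metric
                if (attempt >= 2) {
                    tick("errors." + dependency);  // sustained: now worth an error
                    throw e;
                }
                // First timeout: would be logged at WARN, not ERROR, then retried.
            }
        }
    }
}
```

With this split, a 100-millisecond partition to one database node shows up as a brief bump on the timeout graph rather than a page, while the error monitor keeps its much lower threshold.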
So we've changed a little bit of how we look at timeouts: we look at them more as, is this systemic, is this sustained? If we see a short spike in something, it's probably nothing to worry about. And 500s and errors usually go hand in hand; if we get an alert for one, we get an alert for the other, but errors can obviously fire without 500s.