Detecting anomalies with Watchdog (Datadog + Square)
Published: July 12, 2018
Monitoring constantly changing systems
Homin: Thanks, Ashley. Ashley showed us how we can visualize the increasing complexity of our systems. But how can we effectively monitor our services as they multiply and their dependencies constantly change?
Datadog provides machine learning algorithms like anomaly detection and forecasts to look out for all the usual ways your systems can fail. These are great, but you yourself have to decide when to use them. How do you know what to monitor? How do you know what to look out for?
Monitors you set up yourself reflect the problems that you’ve encountered in the past. But your systems can fail in so many different ways—any one of which might happen very rarely, but collectively is what keeps your teams up at night. Collecting metrics, traces, and logs might help you figure out what happened after the problem has occurred. But how can you possibly monitor problems you’ve never even thought of?
Now, there are many AI systems for operations out there, but they all require that you tell them where to look. They all require intensive setup. And finally, they require that you turn them off, because you get inundated with false alarms.
So, our challenge was to build a system that would do three things. First, scan your ever-changing environment without any configuration; we’ll be looking out for the dragons for you. Second, only highlight real problems; there’s nothing worse than getting woken up at 3:00 a.m. by a robot. And finally, point you to any possible root causes. We don’t want to just show you that something happened; we want to show you how you can possibly fix it.
And so, we built Watchdog. Watchdog will monitor all your services and notify you of real problems. There’s nothing to set up. And we use several machine learning techniques to ensure that Watchdog is never going to surface anything that someone on your team wouldn’t care about. And it’s constantly monitoring your firehose of data.
Let me show you a quick demo. So, we see here that Watchdog found two things of note in our store-web-app service, as indicated by these binoculars. And then below we see a story showing that the latency for your service has been going up. And that pink box highlights when exactly that occurred. So, now if you go into the details panel, we see that increase in latency. And we also see below it that we’ve been spending a lot more time in Postgres than we used to.
So now, let’s go to the service page. On the service page, we see that same increase in latency and we also see where we’ve been spending more time in Postgres. We see that same Watchdog story below. And then below that we see that there was an increase in errors happening exactly at the same time. And this is happening in the checkout endpoint of that service.
So, let’s go into the details panel there. And here we see the stack trace that points to the actual cause of all the errors.
Monitoring your applications + infrastructure
So, I just demoed for you a story about APM. Watchdog can also tell you stories about your infrastructure. For instance, maybe your host has been misconfigured and it’s not reporting any metrics whatsoever. Or maybe a disk is about to run out of space. Watchdog will notice if your memory usage has been drifting slightly upwards for a long time, pointing to a possible memory leak.
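As a rough illustration of the memory-drift idea, a slow upward trend can be estimated by fitting a straight line to evenly spaced usage samples and looking at its slope. This is only a minimal sketch and not how Watchdog itself works, which the talk does not describe; the function names `drift_slope` and `looks_like_leak` and the `min_slope` threshold are hypothetical.

```python
from statistics import mean


def drift_slope(samples):
    """Return the least-squares slope of evenly spaced samples.

    The result is in "units per sample interval". The samples are assumed
    (hypothetically) to be memory readings taken at a fixed interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2  # mean of the indices 0..n-1
    y_mean = mean(samples)
    numerator = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(samples))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples, min_slope=0.05):
    """Flag a series whose fitted slope is at least min_slope.

    min_slope is an illustrative threshold, not a recommended value.
    """
    return drift_slope(samples) >= min_slope
```

A flat series gives a slope of zero and is not flagged, while a series that creeps up steadily is. A real detector would also need to cope with noise, seasonality, and restarts, which this sketch ignores.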
Watchdog will even be able to tell you when your cloud provider is experiencing network issues. Here we see that two different data centers in us-east have been having network issues just before 9:00 a.m.
So, Watchdog has been vital in helping us monitor our own systems. I’d like to introduce Joe Sadowski, engineering manager at Square and an early beta user, to explain how Watchdog has been useful for them.
Datadog at Square/Caviar
Joe: Thanks, Homin. So, at Square, we primarily use Datadog to monitor Caviar, which is our food-delivery business.
So, for those of you unfamiliar with Caviar, basically it works like this: Diners order food online. We tell the restaurant what to make. We dispatch a courier. A courier goes, picks up the food, brings it to you. And everybody has a great meal.
Caviar’s tech stack
So, the technology behind this is basically an iOS app, an Android app, and a web app for diners to place the order. Couriers have an iOS or Android app that they use to get orders, get to the right restaurant, and get your food to you. And in the restaurant we have an iPad that’s running an iOS app. And then we have a whole bunch of internal tools that are basically helping us manage all of this.
Altogether, there’s 30-something services and thousands and thousands of endpoints. As you can imagine, it’s all pretty hard to monitor. There’s a lot of moving pieces. And that’s why we’re really excited about Watchdog.
Watchdog in action at Caviar
It’s monitoring all of this in the background for us and letting us know when there are problems. So, for example, on Tuesday, June 16th, at around 7:30 a.m., our pager went off. And it showed us that one of our services, the one that takes most of the orders from our diners, was failing. Here it’s, I think, a 1-percent error rate on the overall service.
Zeroing in on the endpoint and error
So, first thing that we did, we popped up Datadog and the first thing we see is this bit about Watchdog at the bottom showing us that, hey there’s this endpoint. It’s the HomeController#index method and that is totally crashing at, like, a 90-something-percent error rate. Probably an issue. So, we click in and sure enough, Watchdog shows us that increased error rate. We’re still getting traffic to the endpoint. But then at the bottom, it’s got this awesome callout of a stack trace that’s common across all of the errors. And here you can see the service region URL name is being passed as “New (space) York”. It’s actually been passed into a Rails generator and it’s not able to find the route. Somebody had accidentally changed it from “New-York” to “New York”. The space was breaking it. Went into the database, fixed it. And then everything is back to normal.
So, pretty cool. Watchdog noticed that there was a problem. It identified exactly where this problem was with the endpoint. And it picked up the common stack trace, allowing us to know why it was broken and fix it quickly. So, to me that’s awesome.
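The incident came down to a space in a name that is used to build a URL. One way to catch that class of mistake earlier is to validate the name before it is saved. The following is a minimal sketch under assumptions: it is not Caviar's code, and the function name `is_url_safe_slug` and the exact rule (letters and digits separated by single hyphens) are hypothetical.

```python
import re

# Hypothetical rule: one or more alphanumeric runs joined by single hyphens.
SLUG_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_url_safe_slug(name):
    """Return True if name contains only letters, digits, and single hyphens."""
    return SLUG_PATTERN.fullmatch(name) is not None
```

With this rule, "New-York" passes and "New York" is rejected, so the bad value would be stopped at write time instead of surfacing later as a routing error.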
Benefits of Watchdog
Basically Watchdog is giving us faster incident response. It’s showing us where the problems are in our system that we wouldn’t have otherwise seen. And it’s showing us where other impacts are throughout the system. Overall, it’s pretty great to use and it’s allowing us to essentially deliver a better level of service to our customers. And with that, I’m gonna hand it back to Homin. Thanks.
Homin: Thank you. Thanks so much, Joe.
To recap, Watchdog will monitor all of your ever-changing infrastructure for you and only point you to real issues. There’s absolutely no configuration needed. It just works. And I’m extremely excited to announce that it’s generally available today starting with all of our APM customers. Let me turn it back to Alexis.