
Full-stack observability with Datadog APM

Author Aaditya Talwai

Published: March 14, 2017

Introduction

Hey, guys. My name is Aaditya. I’m on Datadog’s APM team. I’m an engineer there.

It’s been great talking to everyone so far and getting to finally put faces to names and GitHub avatars.

I’m excited to tell you about the tool we’ve built and, sort of, give you a sense of how it coheres with the Datadog infrastructure product that you know already and hopefully enjoy using.

And I’ll try to tease out what we mean by full-stack observability.

I wanna show you three monitors we have defined in Datadog’s own Datadog account that we use to monitor ourselves.

Full-Stack Monitoring

So, there’s a bit going on in this slide but bear with me for a second.

So, these three monitors are meant to catch failure modes in three different layers of our offering.

The first one, right at the top, is a monitor we came up with, one we had to put together after realizing that we had too many dead tuples in one of our PostgreSQL tables, which was greatly slowing down some of our SELECT queries and table scans and having an adverse effect on various components in our metrics and alerting pipeline.

The second monitor is on one of our Kafka consumers.

So, we use a highly-replicated Kafka queue to essentially store your timeseries metrics.

And we have consumers on the other side of that queue that are working hard to make those metrics available for queries.

This particular monitor, second from the top, is meant to catch lags in that kind of processing.

And the third one right at the bottom is arguably the most critical of these because it’s pointing out an actual, user-facing error, right?

There was a high incidence of errors on dashboard endpoints, and people were having trouble seeing their dashboards refresh.

Commonalities Between Examples

So, a couple of interesting things about these monitors: they all need to exist because each of these components can fail independently, and in subtle ways.

And it requires a human to, like, be there to respond and actually be able to analyze what the impact of one of these failure conditions is.

And the other interesting thing about it is that all of these monitors can trigger and, sort of, result in the same end-state, the same end-state for the user which is missing recent data on graphs.

And this is, obviously, an emergency situation for Datadog that we care a lot about, and we generally need a responder on hand to analyze what might be going on behind this query and actually figure out which components of this integrated service are responsible.

Issues with Full Stack Monitoring

So I’d ask you guys to put yourselves in the headspace of a similar user-facing emergency in one of your production services.

And think through how you’d go about actually getting to the bottom of which components are responsible for this kind of failure.

And this is the kind of thing we built the Datadog APM offering for, because we wanted to plug into your existing monitoring tools and give you full-stack visibility to actually understand what’s going on here.

The endpoint that actually serves a graph like this one is the series batch query endpoint.

That’s what it looks like when it’s working well.

While it may be doing a relatively simple thing on the surface, taking a graph definition and spitting back timeseries into a graph like this, it’s actually talking to a bunch of disparate services in Python and Go.

It’s talking to some more, sort of, traditional data stores like PostgreSQL and Redis, in addition to some homegrown data stores of our own.

And it’s involving itself in a number of different subtasks that each represent a point of failure or a point where something unexpected could happen.

And it’s actually surprisingly hard to intuit what happened during the course of a single request or a single error state, given that these are disparate systems and the communication between them can fail in various adverse ways.

And so maybe this is a situation that resonates with you, where your app has a number of moving parts and is constantly changing and evolving.

This is where I like to recall one of Alexi’s slides from the first presentation of the day, where we have all this complexity to deal with: services being phased in, old services being phased out.

We have expertise being handed off from one team to another.

We constantly have to ask ourselves these questions: why is our app slow, why is it broken, or, worse still, why is it slow or broken for a user sitting on another continent?

And, taken together, these are the questions our APM tool tries to answer.

Distributed Tracing Introduction

And the tool we have chosen to actually expose this information with is tracing, distributed tracing to be exact.

So what do I mean by tracing?

So tracing, in this context, is a specialized form of logging.

I like to think about it as adding another axis to the traditional log file, where you’re tailing a process and observing it over time.

Tracing is that, but blown out to several different processes in several different environments, crossing host boundaries and service boundaries, such that you can follow a request as it jumps from services to databases to caches, and so on.

And in effect, it turns a complex system with several moving parts into a more legible whole that you can understand and figure out in a reasonable time frame.

And so the end state of distributed tracing is that we actually get to see our batch query endpoint, which is serving data into your dashboards and your monitor status widgets.

We actually get to see exactly where it’s spending its time.

And I’m gonna show you a more interactive version of this in a second but, in effect, the flame graph you’re seeing here has time on the X axis, and each of these chunks represents a chunk of computation time.

And so you know, like, “Hey, this amount of time is spent making queries to a replica database. This amount of time is spent in one of our own custom microservices,” and so on and so forth.

How Tracing Works In Datadog

And if I could jump to a quick demo, I can actually walk you through how this looks in a live, traced production environment.

First of all, I’m gonna talk about how we actually get this data into your systems.

And so we have open-source clients in Python, Ruby, and Go that interact with your application in various non-intrusive ways.

The idea is to essentially be sort of like a good citizen in your application environment.

So for Ruby, you would just drop a gem into your Gemfile and then ensure that any requests coming into your Rails application, for example, or any queries you’re making via ActiveRecord, your ORM, are appropriately traced.

And similarly for Flask, we have a way for you to attach middleware and actually start streaming data into Datadog for every single transaction that comes through your infrastructure.
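To make that concrete, here is a minimal sketch of what wiring the Python client into a Flask app might look like; the route and the way the app is launched are illustrative placeholders, not our exact setup.

```python
# A minimal sketch of instrumenting a Flask app with the ddtrace Python
# client; the route below is a hypothetical placeholder.
from ddtrace import patch_all

# Patch supported libraries (Flask, redis, psycopg2, and so on) before they
# are imported, so incoming requests and downstream calls get traced.
patch_all()

from flask import Flask  # noqa: E402  (imported after patching on purpose)

app = Flask(__name__)


@app.route("/status")
def status():
    # Once patched, this handler shows up as a traced resource on the
    # service, with no further code changes.
    return "ok"


if __name__ == "__main__":
    app.run()
```

You can get a similar effect by launching the app under the ddtrace-run wrapper instead of calling patch_all() yourself.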

And now, time for a quick demo.

So what you’re seeing here is Datadog’s staging infrastructure sort of broken out into individual services.

Our tracing clients will do the job of actually splitting up your application into its component web, database, and cache layers.

And we do this deliberately, because it makes a lot of sense to us to be able to view health metrics for these disparate services in their own self-contained views and self-contained dashboards.

Splitting up your application between web and database like this also makes it possible to do a kind of, like, bottom-up analysis in addition to top-down analysis.

And I’m gonna talk a bit more about what exactly that means.

And so what you’re seeing here is a mix of various services that we have in Datadog’s own infrastructure.

Some of them come from, like, out-of-the-box integrations that we have.

Some of them come from custom tracing, for which our clients expose an API you can plug into.

In general, we encourage you to split up your services based on the business domain they’re operating in, so you can actually say, “Hey, this service used for authentication has an error rate of x,” and comprehend your SLAs in a more reasonable way.
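As a rough illustration of what that kind of domain-scoped custom tracing can look like with the Python client, here is a sketch; the “auth-service” name and the check_token function are hypothetical, not taken from our own setup.

```python
# Hypothetical sketch: wrapping a business-critical code path in its own
# span, under a service named for the business domain it belongs to.
from ddtrace import tracer


def check_token(token):
    with tracer.trace("auth.check_token", service="auth-service",
                      resource="check_token"):
        # Stand-in for real token validation; the span records how long this
        # took and whether it errored, scoped to the "auth-service" service.
        return bool(token)
```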

Pylons Dashboard Example

So if I were to jump into a single one of these: what you’re looking at is a pool of Pylons web servers.

And I’m just gonna jump into a single one of these web servers.

And so what you’re looking at here is actually a sort of a high-level overview, a dashboard of some key health metrics for your application.

And this dashboard is available to you out of the box.

So the moment you plug in our tracing client, we expose these health metrics to you, to give you a good sense of where exactly your application’s spending time and what the key metrics you should care about are.

Going clockwise from the top, we have total hits, which is basically requests coming into your application.

We have percentile latencies all the way up to P99 and max.

And so, we allow you to understand the full spectrum of requests coming into your application and what the long tail of requests looks like.

We have error counts, which are obviously important for tracking how many 500s your users are seeing from your web app.

And this is probably what I think is one of the more useful graphs in this page.

It’s a representation of time spent by the server in terms of its downstream components.

And so what we’ve done here is actually split up this web server’s time into the various downstream components it relies on.

So there’s a Redis database, a Cassandra database, and a couple of PostgreSQL databases as well.

And we can actually see over time the evolution of how much time we spent in these downstream dependencies.

And so if you deploy a change that improves your cache usage a little bit, you’re generally gonna see this evolve over time as you track the use of your caching layer versus your database layer in code.

And this makes it a really useful way to catch these issues and catch these regressions.

In addition, we have these exact same metrics scoped to your individual endpoints as well.

And so, you can actually see each of these entries on the list is one of our, sort of like, handler functions in our web server.

So for Rails, you’d see your Rails controller and the method executed on it, and similarly for Django and Flask and other web apps.

Individual Endpoint Analysis

And so if I were to jump into an individual one of these endpoints, these handler functions, I can see…I’m just gonna refresh that really quick.

Cool.

So now, I see these exact same metrics scoped to the individual endpoint.

And this is super useful because it actually allows you to, like, drill into sort of like problematic endpoints.

Like if you have a set of users complaining about latency in a specific subset of your routes, we can actually give you the tools via Datadog APM to like jump just to that subset and be able to understand things and scope your view in a way that’s meaningful to you.

And so what we’ve actually captured here is a bunch of transaction traces.

And I showed you a brief glimpse of this earlier but in essence, we’re consistently capturing a statistically significant group of transactions from your live-running applications.

And so the idea here is that we wanna give you visibility into what the entire spectrum of your application performance looks like.

So we’re not just gonna show you, like, the slowest trace.

So we’re not just gonna show you like the one error trace that happened.

We wanna show you a view of, like, how your application behaves in various different scenarios like in a hot cache scenario or in a cold cache scenario, for example.

And that is super useful to developers and operators as well as they try to understand how their application behaves under particular instances of infrastructure pressure, for example.

So this view can also be scoped a little bit further.

And so what we have here is a distribution of requests into certain time buckets.

And so we can sort of intuit where our P99 lies and what our slow requests look like.

So I can jump into just a certain subsection of requests and actually say, “Hey, okay, these are the things that are P75 and higher.”

So I can jump into an individual one of these and be like, “Hey, this took more than a second,” and I can actually look at exactly the servers that were responsible for…the services, rather, that were responsible for that additional slowness.

Individual Trace Analysis

So now, I’m gonna jump into an individual one of these traces.

So an interesting thing about our traces is they sort of, like, marry the traditional out-of-the-box integrations that you get via Datadog APM with custom tracing as well.

So you can use our tracing clients to, like, wrap a problematic function or like wrap a problematic bit of code that you suspect to be a bottleneck.

And so everything you see in green here is actually one of our custom-traced spans.

And you can see we give you the tools to actually, like, decorate your transaction traces with rich metadata so you can actually see exactly what the state of the world was when this request got executed.

And we encourage you to put as much metadata onto your spans and onto your traces as you consider useful, just because it gives you all the visibility you need in a single place as to how this particular request performed and why it might have done something you did not expect it to.
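Here is a rough sketch of what that kind of custom wrapping and metadata tagging might look like with the Python client; the function, service name, and tags are purely illustrative.

```python
# Hypothetical sketch: wrap a suspected bottleneck in its own span and
# decorate it with metadata describing the state of the world.
from ddtrace import tracer


@tracer.wrap(name="reports.render", service="report-worker")
def render_report(org_id, dashboard_id):
    span = tracer.current_span()
    if span is not None:
        # These tags travel with the trace, so you can see exactly which
        # org and dashboard this particular request was working on.
        span.set_tag("org_id", org_id)
        span.set_tag("dashboard_id", dashboard_id)
    # ...the actual rendering work would go here...
    return {"org": org_id, "dashboard": dashboard_id}
```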

Another interesting thing about this particular trace is that it’s actually spanning a bunch of hosts as well.

And so you can actually see a jump across host boundaries over here.

So Royce is a service that functions as a caching layer for some of our more recent metrics data.

And we communicate with it over GRPC via our Python web application.

And so we’ve been able to tie into this particular framework to actually understand that a GRPC request that crosses a host boundary is part of this same unified transaction.

And being able to propagate your context in this way across a distributed environment is key to constructing a full map of what your transaction looks like, and which hosts and services it touched.
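To give a simplified picture of what that propagation involves, here is an illustrative sketch: the calling service sends its trace and span IDs along with the request, and the downstream service parents its own spans under them. The header names and the plain-HTTP hop are assumptions for the example, not Datadog’s actual wire format.

```python
# Simplified, illustrative sketch of propagating trace context across a
# service boundary; the header names are made up for this example.
import requests


def call_downstream(url, trace_id, parent_span_id):
    headers = {
        "x-trace-id": str(trace_id),              # identifies the whole transaction
        "x-parent-span-id": str(parent_span_id),  # identifies the calling span
    }
    return requests.get(url, headers=headers)


def extract_context(request_headers):
    # The downstream service reads the IDs back out and attaches its own
    # spans to the same trace, so the cross-host hop shows up as one
    # unified transaction.
    trace_id = request_headers.get("x-trace-id")
    parent_span_id = request_headers.get("x-parent-span-id")
    return trace_id, parent_span_id
```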

Another key thing to note is that all of these traces are host-aware, so you can consistently jump back to the actual host or process that executed this transaction, and identify hot spots in a load-balanced pool, for example, or one host misbehaving when it should be behaving exactly like every other host.

And so this general level of detail is super useful when you’re targeting the end-user problem that I mentioned right at the start, when you know that users are experiencing heightened slowness or heightened latency or heightened error rates, and you wanna tease out exactly what sub-service or what downstream component is responsible for that.

APM Tracing Features

We’re also consistently capturing error traces as well.

And so I thought this trace was interesting because it points out one of the failure modes for this batch query endpoint.

So one of the Redis services that we communicate with over the Redis protocol went offline briefly.

And it exposed this strange failure mode for our batch query endpoint, where we didn’t really have a graceful failover for this particular case of a downstream server going offline.

And so these are the general failure modes that we look to expose via APM.

And we give you the tools to propagate your error tracebacks up the stack and make sure that, once the transaction surfaces in the UI, if it was a 500 error, for example, you have most of the context you need to actually identify where that error came from.

And another thing to note is that all these metrics tie directly into your existing dashboards as well.

You can throw them on a dashboard, you can throw them into a monitor.

I’m just gonna show you this, sort of, the latest feature we enabled.

But yeah, you always have the option of alerting on these metrics as well.

And because we’re sort of collecting them for you out of the box, we give you some interesting tools to actually say like, “Hey, this is the sort of standard rate of throughput for one of my web servers.”

We can make intelligent suggestions to actually tell you what your alerts should look like and how they can be configured in an easy, transparent way.
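For example, here is a hedged sketch of defining such an alert programmatically with the datadogpy library; the metric name, service tag, and threshold are assumptions about one possible setup rather than a recommendation.

```python
# Hypothetical sketch: alerting on a trace metric emitted by the APM client,
# using the datadogpy library. Metric name, tags, and threshold are
# illustrative assumptions.
from datadog import initialize, api

initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

api.Monitor.create(
    type="metric alert",
    # Trigger when the average request duration for a hypothetical
    # "my-web-app" service stays above 1 second over the last 5 minutes.
    query="avg(last_5m):avg:trace.flask.request.duration{service:my-web-app} > 1",
    name="High request latency on my-web-app",
    message="Request latency is elevated on my-web-app.",
)
```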

And one of the last things I wanna show you is another cool aspect of the tool that we think is particularly useful and unique.

And so, as we saw in the first section of this demo, we went from the web layer down.

We sort of handpicked a web server that we wanted to investigate.

We handpicked an endpoint that we wanted to investigate in that web server, and we sort of boiled down into exactly where that endpoint is spending time.

Going Up The Stack From The Database

What I wanna show you now is how to go from the shard or the database tier up the stack.

And this is another thing that’s particularly useful, especially if you have, for example, a PostgreSQL database that’s shared between various applications.

And maybe it’s under load, or maybe you wanna understand where that load comes from.

So this is a view of one of our PostgreSQL master databases.

And so you can, sort of, see you have these exact same metrics that you had for the web server but scoped to an individual database.

I can actually jump down and see every single query that was executed from application land onto this database.

And this gives me a really good tool to actually diagnose where expensive queries are coming from.

I can jump into an individual one of these.

I can dive in and say, “Hey, here is this problematic PostgreSQL query in the context of the web request that originated it.”

And being able to go from the database tier up the stack like this is particularly useful, especially in cases where your data tier or your shared tier is under particular infrastructure or network pressure.

And yeah, this is one of the tools that we found particularly useful.

And if you have an environment with tons of different databases shared between tons of different servers, we give you the tools to isolate them and view health metrics for your master database as distinct from your replica databases, and your various shards as distinct from each other.

We wanna give you all the customizability you need to design a map of your infrastructure that actually makes sense, that you can intuit, and that maps to your business cases in a reasonable way.

And yeah, if I could jump into the last one of these things.

Other Tracing Service Features

So our tracing service is entirely real time.

And so, we’re actually fielding traces as they come in.

We have a sophisticated sampling algorithm that works at various layers and makes sure that you consistently have a view of the most statistically significant selection of traces, and that you consistently know exactly what’s happening if you have a problem, if you have a customer on the phone or someone emailing you and saying, “Hey, there’s an issue right now with this particular subset of endpoints.”

You can scope your live traces to an individual service, or look at just the error traces.

And this kinda makes it really easy for you to track exceptions as they flow through your app and through the various layers of your stack.

Yeah, Datadog tracing is sort of a very fast evolving product.

We’re building a more traditional timeline view for asynchronous traces that would remind you of the Chrome timeline view.

We’re making it easier and easier for you to just plug this into your app and make sure that anything that has to happen as far as code changes are concerned is entirely transparent and happens in a very non-intrusive way.

We’re actually making it a lot easier for you to search your traces by more facets.

Right now, we expose a few, sort of, key high-level things.

So, you can search your traces by service, by host, or by process.

But we wanna make it easier for you to, like, scope this by things that have strong business cases like customer ID or like availability region, for example.

And that’s one of the things that we’re working on improving and building into the trace product.

And as I mentioned earlier, this is one thing that’s also very close to landing.

We’re building the ability to have smart suggested monitors, such that we can give you solid guidance on what the baseline level of performance for some of your services is, and how you can monitor for when it becomes anomalous or unusual.

Conclusion

And all of this together makes up our Datadog tracing offering.

We’re gonna have a hands-on session in the afternoon, I think, where you can, sort of, like play around with it a bit more.

We’re happy to help you guys get set up.

And yeah, in the meantime, we’re improving fast.

Most of our client code is open-source.

We welcome you to check us out on GitHub, request integrations, and actually poke through the code to see how we’re integrating with your applications.

And yeah, we’re really easy to get started with, so we encourage you to try us out.

And, happy to take any questions.