Trace Search & Analytics (Datadog + Zendesk)

Published: July 12, 2018


The pillars of observability

Thanks, Renaud. I always wanted to walk on stage to Kanye.

So Renaud took us through logs and what’s happening there.

He walked us through dynamic retention,

he walked us through Live Tail,

and he walked us through the concept of Logging without Limits™.

And when you heard from Olivier earlier today, he talked about the mission of Datadog to break down silos.

So, with that, I’d like to talk about our next pillar, traces.

And we have a big, exciting announcement to make today.

But before I get there, I wanted to talk about a major shift we’re seeing across the industry.

Shift from monoliths to microservices

We see it no matter what vertical you’re in, no matter the size of company,

and all of you in this room today are very familiar.

And that’s the shift from large monolithic applications to smaller microservices.

And in this new paradigm, what we find is that visibility into these new distributed systems is more challenging than ever.

Now, a request could come in,

it could hit an edge proxy.

It could hit a top-level web service, then go down to many other services, out to external API calls, hitting caches as well,

and then fan out to a number of database queries, all before returning to the user.

And if you’re the developer holding the pager, and it goes off, you have to answer one question as quickly as possible:

Where is the performance bottleneck?

And once you answer that, you have a second question:

And what is the root cause?

And in these new distributed systems, we thought that this was too difficult and we wanted to make it easier.

So, one year ago, we introduced Datadog APM and distributed tracing so we could trace through and give you visibility in this new paradigm.

Datadog APM and distributed tracing

And with that, we saw some amazing adoption.

We had Square, which is powering local commerce.

We had Zendesk, one of the largest SaaS companies in the world, building software to power customer success.

And we had Airbnb, the largest home sharing website in the world.

And I’m happy to announce that we have over 1,000 other companies now using Datadog APM and distributed tracing to monitor their applications.

We started with Java, Python, Ruby, and Go.

And in the coming months, we’re going to be releasing Node, .NET, and PHP.

So no matter what your stack, you can come to Datadog and we’ll trace it out of the box.

When we launched the product, we launched the flame graph.

This was the easiest way to understand a single request as it goes across a distributed system,

and visually, you could pinpoint the bottleneck.

We gave you out-of-the-box application health:

your errors, your throughput, your latency—on every service, on every endpoint, and on every database query.

And of course, it was deeply integrated with Datadog dashboards—with tags, with infrastructure, with logs—

so we broke down silos,

and you could see it all in one single pane of glass.

And with that, we had to ask ourselves a question:

How can we make it even easier for developers to pinpoint performance issues?

And when we started talking to you, our customers, we learned a few things.

We learned that you wanted your application metrics to be able to tie back to the underlying infrastructure.

You wanted to find traces that matched certain application criteria.

And, of course, you wanted to combine that with your custom business data.

And you wanted to do queries across any of these data, across any of these tags, any of these values, which meant that we would need to support infinite cardinality.

Trace Search & Analytics

Which is why today I’m very excited to be here. Because today, we unveil the next evolution in Datadog APM and distributed tracing.

And we call it Trace Search & Analytics.

This is the Google search bar for your application data,

and we think it’s going to change the way that we all monitor our applications.

Let’s take a quick walk through.

Up top, we have the search bar.

You can add your tags, as many as you want, with any number of values.

Right below that, we have your metrics, your requests, your errors, and your latency percentiles,

and these will be all scoped to those tags in the search bar.

On the left side of the screen, you see what we call facets, which are all the tags and all the values available,

so you can point and click.

That means no need for a complex query language.

And down at the right, we have our list of traces, which matches all the tags in the search bar as well.
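As a rough illustration of how the search bar and the facets relate, the sketch below filters an in-memory list of traces by tag and counts the values available for one facet. The trace dictionaries, the tag names (`service`, `resource`, `status_code`, `org_id`), and both helper functions are hypothetical; this is not Datadog's API or data model.

```python
from collections import Counter

# Hypothetical trace records; field names are assumptions for illustration.
TRACES = [
    {"service": "web", "resource": "/checkout", "status_code": 200, "org_id": 7},
    {"service": "web", "resource": "/checkout", "status_code": 512, "org_id": 42},
    {"service": "web", "resource": "/search", "status_code": 200, "org_id": 42},
    {"service": "api", "resource": "/checkout", "status_code": 512, "org_id": 7},
]


def search(traces, **tags):
    """Return traces whose tags match every key/value pair given."""
    return [t for t in traces if all(t.get(k) == v for k, v in tags.items())]


def facet(traces, tag):
    """Count how many traces carry each value of one tag."""
    return Counter(t[tag] for t in traces if tag in t)


# Equivalent of adding four tags to the search bar.
matches = search(TRACES, service="web", resource="/checkout",
                 status_code=512, org_id=42)
print(matches)

# Equivalent of listing the values under one facet.
print(facet(TRACES, "status_code"))
```

Each extra tag narrows the result set, which is the same behavior as clicking facet values instead of writing a query.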

Trace Search & Analytics use cases

So with that, I’d like to walk you through how you could use this in the real world.

Now, let’s imagine for a second that a customer writes in.

The customer tells us that they’re getting an error,

then they tell us it’s a 512.

Based on the page they visited, we know the endpoint and, of course, we know the customer as well.

So, let’s see if Trace Search & Analytics can help us to figure out what’s going on.

So with that, I’m going to go in and I’m going to click the service.

I’m going to filter down as well to the endpoint.

Notice how the three graphs up top immediately scope down, as well as the list of traces.

Now, we mentioned the customer told us it was a 512,

so we know that.

So, let’s go ahead and add that status code.

So, we just type status code,

and there we have it,

and notice the auto-complete there.

So I can see the 512,

and I know that’s one of the errors that’s occurring.

And again, we have the customer,

so we’ll add that too.

And our tag for customer in this case is a custom tag and we call it org_id,

so we’re going to go ahead and choose the org_id of that customer.

I click a trace.

Immediately, I click the error,

and right across the screen, that red bar shows us it’s a gevent.timeout.

So, now we know what the problem was.

But the best part about it is that Datadog has also captured the stack trace.

So you don’t need to leave the product,

it’s right there and you can see the execution path before the error as well.

Now, let’s walk through another example.

Let’s take that same endpoint, and now, let’s imagine that we want to look at slow requests.

We want to understand how to optimize the latency on them,

so let’s see how Trace Search can help us through this, too.

So I’m going to go ahead, and I’m going to remove the org_id and the status code and now we get our endpoint.

I’m going to filter to just a successful request and use my duration slider here to move this up to only requests that are 2.5 seconds or over.
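The filter described here can be sketched in a few lines of Python. This is only an illustration, assuming each trace is a dictionary with hypothetical `status_code` and `duration_s` fields and that a 200 status counts as a successful request; it does not show how the Datadog UI is implemented.

```python
# Hypothetical traces for one endpoint; the field names are assumptions.
TRACES = [
    {"trace_id": "a", "status_code": 200, "duration_s": 0.4},
    {"trace_id": "b", "status_code": 200, "duration_s": 2.5},
    {"trace_id": "c", "status_code": 512, "duration_s": 3.1},
    {"trace_id": "d", "status_code": 200, "duration_s": 4.2},
]


def slow_successes(traces, min_seconds=2.5):
    """Keep successful traces that took at least `min_seconds`."""
    return [
        t for t in traces
        if t["status_code"] == 200 and t["duration_s"] >= min_seconds
    ]


for trace in slow_successes(TRACES):
    print(trace["trace_id"], trace["duration_s"])
```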

I can go ahead and I can click one,

and there you have the flame graph.

And at the bottom of the flame graph, you see two areas.

The first is this gRPC call,

and that’s taking almost 30% of the time.

And down here, what we see is this Mindy database query,

that’s a custom database that we have.

And so you can see this is taking 16%.

So immediately, in just a few clicks, I can understand, for these very slow requests, the two areas where I can go optimize performance.
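The percentages read off the flame graph are each span's share of the whole request. A minimal sketch of that arithmetic follows, assuming a flat list of spans with hypothetical `name`, `duration_ms`, and `parent` fields; the span names and durations are made up and only chosen to land near the figures mentioned in the talk.

```python
# Hypothetical spans from one slow trace; names and durations are made up.
SPANS = [
    {"name": "web.request", "duration_ms": 3000, "parent": None},
    {"name": "grpc.call", "duration_ms": 870, "parent": "web.request"},
    {"name": "mindy.query", "duration_ms": 480, "parent": "web.request"},
]


def share_of_trace(spans):
    """Return each child span's duration as a percentage of the root span."""
    root = next(s for s in spans if s["parent"] is None)
    return {
        s["name"]: round(100 * s["duration_ms"] / root["duration_ms"], 1)
        for s in spans
        if s["parent"] is not None
    }


# Print the largest contributors first.
for name, pct in sorted(share_of_trace(SPANS).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}% of the request")
```

This compares each span's full duration to the root span. It does not subtract time spent in nested children, which a real flame graph view may account for differently.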

And it’s just that simple.

Infinite cardinality

But it gets even better because this is just the first part of Trace Search & Analytics.

I mentioned something earlier about infinite cardinality.

And so, I’m going to show you how powerful that is in the product.

Now if we look at this graph here, and I’ll switch it to a line graph,

if you look at the search bar, you see we’re looking at our production environment

and we’re looking at one particular service.

And what we see here is the count,

so this is the number of requests,

this is the throughput of that service.

And let’s say I want to filter down; the service has many endpoints.

I want to know, this throughput, what is it made up of?

Like, what are the top five endpoints?

So, let’s go ahead and do that.

And I just group by resource, which just means endpoint in this case.

And there we have it,

those are the top five endpoints,

it’s that simple.

And let’s say we have great monitoring on this top one,

it’s very dominant in terms of throughput,

so we’re going to go ahead and we’re going to exclude that.

And just like that it’s excluded,

and now let’s drill into the third one, the blue one there.

Great.

And now we have drilled down, from the service level, to many endpoints, to just one.

But we want to go even deeper here.

We have, you know, so many customers, and we want to know, who are the main customers that are causing this throughput, right?

How is it made up?

So just like this, we’re going to add that customer tag, which remember I said was org_id.

So, we go ahead and we add the org_id,

and immediately, we can now see across thousands of customers who the top five are on this endpoint.
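The drill-down just described (group by endpoint, exclude the dominant one, then group one endpoint by customer) can be sketched as a single counting helper. The data, the `top_n` function, and its parameters are hypothetical; only the `resource` and `org_id` tag names come from the talk.

```python
from collections import Counter

# Hypothetical traces for one service; tag names follow the talk,
# the values are made up.
TRACES = (
    [{"resource": "/tickets", "org_id": 1}] * 6
    + [{"resource": "/users", "org_id": 2}] * 3
    + [{"resource": "/users", "org_id": 3}] * 2
    + [{"resource": "/search", "org_id": 2}] * 1
)


def top_n(traces, group_by, n=5, exclude=(), **filters):
    """Count traces per value of `group_by`, after filters and exclusions."""
    counts = Counter(
        t[group_by]
        for t in traces
        if t[group_by] not in exclude
        and all(t.get(k) == v for k, v in filters.items())
    )
    return counts.most_common(n)


print(top_n(TRACES, "resource"))                        # throughput by endpoint
print(top_n(TRACES, "resource", exclude={"/tickets"}))  # drop the dominant one
print(top_n(TRACES, "org_id", resource="/users"))       # customers on one endpoint
```

Because the grouping key is just another tag, the same helper works for a tag with thousands of distinct values, such as a customer ID.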

And I can click one, and then click view traces.

And now notice the time at the very top left and the org_id match where I clicked.

And I can go ahead, I can click a trace,

and with the flame graph, it’s very simple.

I can see it’s this string query,

that’s taking up 40% of the request.

So, I now know where to optimize.

And it’s just that simple.

And this is just a very small subset of what you can do.

But, we can take it even further.

Because as Olivier mentioned, the goal of Datadog is to break down silos.

And so, what we’re going to do is we’re going to export this to a Datadog dashboard.

So, if I click export, I can pick my timeboard, the single pane of glass.

And just like that, this graph is going to appear on the Datadog dashboard alongside my infrastructure, my logs, my custom metrics.

And there we have it.

So now I can go monitor these five customers on this one endpoint over time.

So, that’s Trace Search & Analytics.

And we’ve been working on this for a while.

And we built it with a set of partners,

and one of them is here today, which is Zendesk.

Zendesk has been an amazing partner to us as we built this.

They’re now using it at scale across hundreds of engineers.

And we have a special guest from there today.

So, without further ado, I’d like to welcome to the stage, Hemant Kataria, the Senior Manager of DevOps at Zendesk.

Partnering with Zendesk

Hemant: Hi everyone,

my name is Hemant Kataria, and I’m a DevOps Manager at Zendesk.

What is Zendesk?

In case you haven’t heard of us, Zendesk makes beautifully simple software for customer success at scale.

We build multiple products to make experiences better for agents, admins, customers.

The application monitoring challenge

Zendesk is a global company.

These are all our engineering offices around the world.

We have about 500 engineers deploying hundreds of times a day.

In addition to our internal customers, who are our engineers, we obviously also have external customers, about 125,000 paying customers right now.

And our annual revenue run rate at the end of last year was $430 million,

so downtime for us is very, very expensive.

So, at Zendesk we’ve been using Trace Search for about six to eight months now.

And how do we use it?

In this particular example, and it’s just an example, imagine that the x-axis over here is time, the y-axis could be any metric you care about, like error rates, latencies, or throughput, and the lines are customer IDs.

We have a multi-tenant environment, like most of you, I’m sure,

and in this example we can see one customer being anomalous, as compared to everyone else.

So, the ability to drill into this sort of information is very useful for us,

and something we use Trace Search heavily for.
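One simple way to picture "one customer being anomalous" is to compare each customer's metric against the median across all customers. The sketch below does that for error rates. The threshold factor, the sample numbers, and the function itself are assumptions for illustration, not a description of how Zendesk or Datadog detects outliers.

```python
from statistics import median

# Hypothetical error rate per customer ID over the same time window.
ERROR_RATE_BY_ORG = {1: 0.010, 2: 0.012, 3: 0.009, 4: 0.200}


def anomalous_customers(error_rate_by_org, factor=3.0):
    """Flag customers whose rate exceeds `factor` times the median rate."""
    baseline = median(error_rate_by_org.values())
    return sorted(
        org for org, rate in error_rate_by_org.items()
        if rate > factor * baseline
    )


print(anomalous_customers(ERROR_RATE_BY_ORG))
```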

Zendesk outage response

We also use Trace Search and APM significantly during outage responses,

and we use it particularly to find the needle-in-a-haystack type of search or trace.

In this particular example, you can see us running a search against our production environment for any Zendesk service,

in this case, we call it Zendesk Service 1.

We’re looking for error rates against a particular API endpoint for one customer.

So, the ability for us to drill down into this level of detail, again, very important.

And Trace Search does provide us that ability.

Capacity planning and analysis

We’ve also been using Trace Search over the past six months heavily for capacity planning and analysis.

I’d encourage you, if you have the time, to go watch Dan and Anatoly talk later today at 4:00 p.m. where they’re going to talk about how we use Trace Search at Zendesk to basically discover hidden capacity within our systems.

Excellent talk.

So, Trace Search & Analytics has become critical at Zendesk.

It’s the first place developers go to look when they need to understand their application better.

We use it heavily during outages.

And when we have an outage, after that when we have a postmortem, the first question we’re asking is, “Was Trace Search enabled for this application?

If not, why not?”

We ask teams to go out and instrument it for themselves.

And once a team enables Trace Search, it sort of provides instant value.

It’s almost funny, like, I get pinged at least once a week by a global engineer in one of our offices about how they just enabled Trace Search and how much value they’re finding out of it.

So, we’re pretty excited,

we think it’s a great tool,

we’re excited to continue using it.

Thanks again for having us and congrats on the launch of a great product.

Thanks, guys.

Thanks, Brad.

Trace Search & Analytics now generally available

So, there you have it,

that’s Trace Search & Analytics.

And to recap, it’s the global search across all your traces from one place.

It’s the ability to slice and dice your application data at infinite cardinality.

And then you can take those queries and you can add them to Datadog dashboards alongside your infrastructure and alongside your logs.

And as of this moment, I’m happy to announce, that Trace Search & Analytics is now generally available.

So, if you open up your laptops and go to Datadog under APM, it’ll be there for you to use.

And the best part is we have a great price.

It’s just $1.27 per million events.

And that’s less than a McDonald’s breakfast sandwich.

But it gets even better than this. Because, if you’re using APM with Datadog, we’re going to throw in a million events for every APM host.

So, 100 APM hosts, that’s 100 million events,

that’s a lot.

Because we want you all to be able to access it and to use it because we think it’s going to change the way you monitor your application.
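The pricing quoted above can be worked through with a short calculation. This is a sketch under stated assumptions: the talk gives the price per million events and the per-host allowance, but not the billing period or exactly how the allowance is applied, so the subtraction below is a guess at the simplest reading. The function name is hypothetical.

```python
PRICE_PER_MILLION = 1.27        # dollars per million events, as quoted in the talk
INCLUDED_PER_HOST = 1_000_000   # events included per APM host, as quoted


def estimated_cost(events, apm_hosts):
    """Estimate the charge after subtracting the per-host allowance.

    Assumes the allowance is pooled across hosts and deducted first.
    """
    billable = max(0, events - apm_hosts * INCLUDED_PER_HOST)
    return round(billable / 1_000_000 * PRICE_PER_MILLION, 2)


# 100 hosts cover 100 million events; only the excess is charged.
print(estimated_cost(100_000_000, apm_hosts=100))
print(estimated_cost(150_000_000, apm_hosts=100))
```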