Tracing without Limits™

Published: July 17, 2019

What developers really want

Priyanshi: What do developers really care about?

As you have heard about today, Datadog is moving closer to monitoring the end-user.

And the reason is that we asked you this question, and the answer was clear: you care about delivering great user experiences.

For decades, you have turned to APM solutions to answer the fundamental questions: What’s slow?

What’s broken?

Where are users experiencing high latency, and where are they hitting errors?

The problem

High-volume applications generate tens to hundreds of terabytes of tracing data every day.

And storing all those traces is expensive.

Hence, in monolithic applications, where requests run on a single host, the prevailing strategy became sampling at the host while keeping errors and high-latency requests.

And that’s worked pretty well until today, when applications are moving from monolithic architectures to distributed systems.

With a single request traveling across multiple microservices, containers, and serverless functions, the sampling strategy at the host has started to break down.

In distributed systems, developers care about three things: completeness, errors, and high latency.

In order to achieve completeness, the sampling decision had to be made at the start of a request and propagated downstream.

This gives guaranteed completeness, but no guarantee of capturing errors and latency, as these can and do happen downstream.

To achieve all three, you either retain all the traces, which is prohibitively expensive, or you cleverly find a way to decide retention at the end of a trace.

That is, when it’s known whether an error occurred or the trace was high latency.
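To make the trade-off concrete, here is a minimal Python sketch of the two approaches; it is an illustration, not Datadog's implementation. A head-based sampler decides once at the root and propagates that decision, while a tail-based sampler waits until the whole trace has arrived and can therefore see errors and latency.

```python
import random

# Head-based sampling: the decision is made once at the root of the
# request and propagated downstream, so kept traces are complete, but
# errors and latency that happen later cannot influence the choice.
def head_based_decision(sample_rate: float = 0.1) -> bool:
    return random.random() < sample_rate

# Tail-based sampling: the decision is made after every span of the
# trace has arrived, so errors and latency outliers can always be kept
# while the uninteresting remainder is sampled.
def tail_based_decision(spans, latency_threshold_ms=500.0, sample_rate=0.01):
    has_error = any(span.get("error") for span in spans)
    total_ms = sum(span["duration_ms"] for span in spans)
    return has_error or total_ms > latency_threshold_ms or random.random() < sample_rate

# A slow downstream call makes this whole trace worth keeping.
trace = [
    {"service": "frontend", "duration_ms": 40, "error": False},
    {"service": "payments", "duration_ms": 900, "error": False},
]
print(tail_based_decision(trace))  # True: retained as a latency outlier
```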

Our solution: ingest everything, but only store what’s important

This analysis brought us to one important insight: ingest everything.

Only store what’s important.

In fact, this insight enables us to go beyond just completeness, errors, and latency.

And with this, I’m excited to announce Tracing without Limits™ in APM today.

Okay.

Your live data has no sampling.

Search all your traces with unparalleled visibility in an outage or incident.

Your historical data is retained based on tail-based decisions, guaranteed to keep complete traces, errors, latency outliers, and infrequent code paths.

Going even a step further, you decide what’s important, like keeping all transactions greater than $1,000 on your checkout endpoint.

You got it.
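As an illustration of that kind of user-defined rule, a retention filter can be thought of as a predicate over a finished trace's tags; the field names below are hypothetical and do not represent Datadog configuration.

```python
# Hypothetical user-defined retention rule: keep every completed trace on
# the checkout endpoint whose transaction amount exceeds $1,000, no matter
# what the default sampling would have done. Field names are illustrative.
def keep_high_value_checkout(trace: dict) -> bool:
    tags = trace.get("tags", {})
    return (
        trace.get("resource") == "POST /checkout"
        and tags.get("transaction.amount_usd", 0) > 1_000
    )

example_trace = {
    "resource": "POST /checkout",
    "tags": {"transaction.amount_usd": 2_499, "merchant.tier": "enterprise"},
}
print(keep_high_value_checkout(example_trace))  # True: always retained
```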

The best part, it’s 100% hosted and managed by Datadog.

Since only important traces persist by default, tracing still remains affordable.

And since all the storage and analysis happens on the server side, it’s simple to deploy and manage all this with Tracing without Limits™.

Now that we have all your tracing data, enriched with the tags that you’ve added, can your APM solution quickly tell you the root cause of that error or latency?

Is it a single host, shard, container, product, page, customer, user, or any of those hundred tags that you’ve added?

Well, today we unveil Trace Outliers.


We have taken the raw span data we ingest and correlated it with those tags so that Datadog now automatically tells you the root cause of that error or latency.

And with that, I’m going to hand it over to Andrew to show us the power of Tracing without Limits™ and Trace Outliers in action.

A hypothetical example

Andrew: Imagine it’s Black Friday.

I’m an application developer at an e-commerce site builder.

Our customers are merchants who build their website on our platform.

Their end users are shoppers who make purchases.

So now, I’ve just received an alert for a small increase in the error rate for the checkout endpoint specific to our enterprise tier merchants.

Any change in error rate could represent a loss in revenue.

This is something we need to get to the bottom of and fast.

Using Tracing without Limits™, I can use the APM live search to search through 100% of my traces in real time.

This means I can access every checkout across every endpoint for every end-user here.

In live search, there’s no sampling occurring.

Now, to go back to the alert I received before, let’s say I want to filter down this data specifically to the checkout endpoint, specifically to the enterprise tier as well, and only for errors here.

From here, I can see exactly the traces I need to troubleshoot this issue.

So now, clicking into a request here, what we’ll see specifically is a 503 error.

The payments service reported 503 Service Unavailable.

What this tells me is there’s an issue right now with our third-party payment provider.

However, to get more details about this error here, what we’ll do is click into a related log.

These logs are directly correlated with this request.

They’ve had the trace ID automatically injected here.
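As a rough sketch of how that correlation works in general (generic Python logging, not the Datadog tracer's actual API), the active trace ID is stamped onto every log record so the backend can join log lines to the trace that produced them:

```python
import logging

ACTIVE_TRACE_ID = "1234567890abcdef"  # would come from the tracer's context

class TraceIdFilter(logging.Filter):
    """Stamp the active trace ID onto every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = ACTIVE_TRACE_ID
        return True

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceIdFilter())
logger.setLevel(logging.INFO)

# The emitted line carries the trace ID, so the logging backend can join
# it to the trace that produced it.
logger.info("payment rejected: request limit exceeded")
```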

And now, in this log message, I can get more information.

Specifically, what I see here is that the payment was rejected because the number of requests exceeded 1,000.

So what does that mean?

That means that in a one-minute period, if we make, say, 1,005 requests, only 5 of those have this error.

This error is extremely rare, infrequent, and exactly the kind of trace we would need here in the APM live search.
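Spelling out that arithmetic with the illustrative numbers above:

```python
# Illustrative numbers from the example above: a 1,000-request-per-minute
# limit at the payment provider and 1,005 checkout requests in one minute.
limit_per_minute = 1_000
requests_in_minute = 1_005

rejected = max(0, requests_in_minute - limit_per_minute)  # 5 requests
error_rate = rejected / requests_in_minute                # ~0.5%
print(f"{rejected} rejected, error rate {error_rate:.2%}")
```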

So now, because it’s Black Friday, I wanna know if there are any other issues in my infrastructure here.

I wanna be sure that, you know, there are no errors and no slowness, so let’s take a look further.

So using Trace Outliers, we can now automatically analyze all incoming traffic to give you root cause analysis for errors and slowness.

We’ll correlate them to tags.

What we notice here on the checkout endpoint is that the coupon code Black Friday tag currently has a latency issue.

Let’s click into a trace to see what the issue may be.

So now, when we click into a trace here, what we’ll notice is that when we hit our shopping cart and apply this coupon code, we reach out to a serverless Lambda function.

This serverless Lambda function calls out to a database of coupons to validate the request, and we can see the latency of this function in that load campaign database span.

Now, in the metadata as well, what we’ll notice is that when we load this campaign database, it’s taking up a lot of memory.

Essentially, what we’ll have to do here is add more memory to the serverless function to remove this latency outlier.
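For reference, raising a Lambda function's memory setting (which also raises its CPU allocation) can be done with a call like the following; this is a sketch using boto3 with a hypothetical function name, not the exact function from the demo.

```python
import boto3

# Hypothetical function name. Raising MemorySize on a Lambda function also
# raises its CPU allocation, which is what removes this latency outlier.
lambda_client = boto3.client("lambda")
lambda_client.update_function_configuration(
    FunctionName="validate-coupon-code",
    MemorySize=1024,  # MB; up from a smaller previous setting
)
```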

Tracing without Limits™ is critical for any company, and especially for some of the largest companies in the world with the highest-volume applications.

That’s why we are so lucky here today to have Justin Wright, vice president of Architecture and Development at Thomson Reuters, to talk about how they use Tracing without Limits™ today.

How Thomson Reuters uses Tracing without Limits™

Justin: Thanks, Andrew.

Great demo.

So I’m really excited to be here today and share with you a little bit more about Thomson Reuters and how we’re leveraging Datadog to help with some of our monitoring challenges.

Thomson Reuters is a pretty big company.

We’ve got about 25,000 employees worldwide, and we have many thousands of technologists working on all of our products.

We provide news and information services to legal professionals, tax professionals, corporations, and government entities.

Last year, we had about $5.5 billion in revenue.

And in our news business alone, we have about 1 billion unique readers every single day.

Over our whole product portfolio, we have about 460,000 customers.

So we know a little bit about monitoring at scale, and we have had a few challenges in that area that I’m gonna talk to you about today.

So we have roughly 350 unique products.

Each of those products is highly distributed across many thousands of services: microservices, serverless functions, and containers.

We run on tens of thousands of hosts, and we generate over 5 billion unique traces each and every day.

Now that’s on average.

So we also have unexpected or unpredictable peaks that we have to deal with as well.

So the first use case I’m gonna talk to you about today is a new product we just launched at the end of February, called Thomson Reuters Panoramic.

Now, Panoramic is an entirely new market for us.

It’s all about connecting the practice of law with the business of law.

We spent over a year doing customer research and development on this product.

We had no idea how much usage we would get, what a simulated user would do, or how that activity would be represented in the product, and we wanted to make sure that we maintained a good experience for our customers.

So what we were able to do with Live Tail is keep an eye on exactly what’s going on, have visibility into what our customers are doing in a brand new product, and identify any issues before they become a severe incident.

Now, on the exact opposite side of the spectrum, we have many products that have been around for a long time and have quite significant user bases.

We scale for tax season every year, every quarter.

And we have unpredictable peaks in our Reuters news business when there’s a breaking news story.

So what we wanna do is be able to selectively keep all traces for some of our products so that we can have all of the information and details around for later analysis and troubleshooting.

And lastly, I mentioned 460,000 customers.

So that’s a lot of data to wade through.

When something’s going on, we wanna be able to really quickly identify: is this a systemic issue, where we’ve got a problem in our code base, a recent deploy, or a configuration change?

Or is one of our users or one of our customers, perhaps, doing something that’s a little out of the ordinary?

So by leveraging the tags we have, we were able to quickly get anomaly detection from Datadog, where we can see if one of our customers is causing an issue or if they’re having a problem with their environment.

This has been extremely helpful, getting us to root cause analysis much, much faster, which provides us efficiencies and also a better experience for our customers.

So, as a quick summary: from brand-new products, where you don’t have any idea what to expect in terms of utilization or scale, to existing products handling many millions or billions of requests, where you’re trying to make sure every customer has a great experience, to automatic root cause analysis that saves time on remediating an issue, Datadog has been extremely helpful for us.

Thank you very much.

Priyanshi: Thank you, Justin.

We have enjoyed partnering with you and Thomson Reuters on this.

In summary

To recap, Tracing without Limits™ is in beta today.

Search all your traces with no sampling.

Retain the traces that you care about: complete traces, errors, and latency outliers.

You are in control to decide what traces are important to retain.

And the best part: it is completely affordable, cost-controllable, and 100% SaaS.

Thank you.