APM and Distributed Tracing | Datadog
APM and Distributed Tracing

Published: April 16, 2019

Well, I’m here to talk about application performance monitoring.

The origins of Datadog APM

Since its launch in 2017, Datadog APM has built a strong, reliable, and scalable foundation, with developers across the globe using our application to monitor theirs.

At Airbnb, Zendesk, Square, and Peloton, and at your organization, developers are building applications to drive customer success at scale.

And with Datadog, we want to scale as smoothly and efficiently with you as we can, building a single pane of glass for monitoring all your applications.

So looking back at our journey of product evolution, we started in February 2017 with Python, Ruby, and Go tracers.

Java and Node closely followed.

In July 2018, Trace Search and Analytics was launched, which allowed you to run fully customizable tag-based queries, with tags ranging from your business IDs to resource endpoints for each specific service.
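A tag-based query like that can be thought of as filtering spans on arbitrary key/value tags. Here is a minimal Python sketch of the idea; the span schema and field names are made up for illustration and are not Datadog's actual data model:

```python
# Illustrative sketch of a tag-based trace query; the span schema here is
# invented for the example, not Datadog's real internal representation.
spans = [
    {"service": "coffee-house", "resource": "GET /order", "duration_ms": 9120,
     "tags": {"customer.id": "2855", "http.status_code": "503"}},
    {"service": "coffee-house", "resource": "GET /order", "duration_ms": 45,
     "tags": {"customer.id": "1019", "http.status_code": "200"}},
]

def trace_search(spans, **wanted):
    """Return spans whose tags match every requested key/value pair."""
    return [s for s in spans
            if all(s["tags"].get(k) == v for k, v in wanted.items())]

matches = trace_search(spans, **{"customer.id": "2855"})
```

The product does this at scale over every ingested span, but the mental model is the same: any tag you attach becomes a query dimension.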

Then in August, Service Map was launched, which allows you to view an overall architecture of your microservices distributed system.

And recently, in January, we launched Span Summary, which allows you to do a bottleneck analysis to find that n+1 query.
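The n+1 pattern Span Summary helps surface is the same parameterized query repeated once per parent row inside a single trace. A rough sketch of the detection idea in Python (purely illustrative, not Datadog's implementation):

```python
from collections import Counter

# Hypothetical child-span resources of one trace: an n+1 pattern shows up as
# the same parameterized query repeated once per parent row.
query_resources = ["SELECT * FROM users WHERE id = ?"] * 25 + ["SELECT * FROM orders"]

def n_plus_one_candidates(resources, threshold=10):
    """Flag query resources repeated more than `threshold` times in one trace."""
    return [r for r, n in Counter(resources).items() if n > threshold]

candidates = n_plus_one_candidates(query_resources)
```

Grouping spans by resource and counting repeats per trace is the core of the analysis; the fix is usually batching those repeated queries into one.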

APM Tracing with .NET and PHP

However, this wasn’t enough.

Our customers had an extensive, complex stack of languages that had yet to be traced.

So today, I’m happy to announce APM Tracing with .NET and PHP.

With this, these two languages join our extensive family of tracers, making APM a really comprehensive and strong product for all your monitoring needs.

So what does that mean for you?

Well, more stacks to be traced.

Anything from a cURL, Laravel, or Zend request in PHP to WebForms, ASP.NET, and ADO.NET can now be instrumented automatically, out of the box, with APM.

Moreover, it’s also OpenTracing compatible.
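OpenTracing compatibility means your instrumentation code depends only on a vendor-neutral tracer interface, so the concrete backend can be swapped without rewriting the application. A minimal Python sketch of that pattern; the classes here are illustrative stand-ins, not the real opentracing package or Datadog's tracer:

```python
# Sketch of the vendor-neutral instrumentation pattern behind OpenTracing.
# NoopTracer/NoopSpan are illustrative stand-ins: in production you would hand
# in a real OpenTracing-compatible tracer (such as Datadog's) instead.
class NoopSpan:
    def __init__(self, operation):
        self.operation, self.tags = operation, {}

    def set_tag(self, key, value):
        self.tags[key] = value

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        pass  # a real span would record its finish time here

class NoopTracer:
    def start_span(self, operation):
        return NoopSpan(operation)

def charge_customer(tracer, customer_id):
    # Instrumented once against the neutral interface; works with any backend.
    with tracer.start_span("payment.charge") as span:
        span.set_tag("customer.id", customer_id)
        return "charged"
```

Because `charge_customer` never imports a vendor SDK, the same instrumentation runs against a no-op tracer in tests and a real tracer in production.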

So what does that really mean for you?

Let’s try to understand this with a story.

Troubleshooting a simulated on-call workflow

I’ve been a developer for all my life, and I know what an outage at 3:00 a.m. can cause.

So a while ago, I got an email from this customer, ID 2855, saying that all my services are quite slow, they're getting timeouts on their payments, and nothing works.

Basically, my application sucks.

Sound familiar?

Well, what do we do?

I quickly go to my Service Map right here.

So all the services that require my attention have already been colored in red.

So if I quickly go to the service, Cassandra, over here, I can see that it has a latency of one and a half milliseconds, which is still fine.

So this one is fine.

MongoDB, this has zero errors and decent latency.

But the one over here, coffee-house, is actually getting 800-millisecond latency.

Let’s inspect this one.

I can see my coffee-house service is making requests to my .NET coffeehouse service, which is my payment service, making requests to my MongoDB over here, and serving my PHP frontend site.

However, the .NET payment service looks to be fine with that green color.

It’s the coffee-house that has some issues going on.

Maybe I can scope down to the exact traces that the customer asked me about and look for the exact request.

I can do that using Trace Search and Analytics.

So I click here.

This is my Trace Search and Analytics panel, and as you can see, my request has already been scoped out by my service and my environment.

Now I am only interested in customer ID 2855.

So, here we go.

And I got that email like an hour ago, so maybe scope down to that.

As I can see, most of my requests are, like, 200 OK, and I'm only interested in the slow requests.

So anything above four seconds is basically unacceptable.

Look at that one.

It’s, like, taking more than nine seconds.

If I click on this and see what’s going on here, I see my Flame Graph.

This is basically a trace view of all the requests and their interaction with each other.

If I quickly look at my coffee-house get-order service, it's taking most of my time and making a request to the user-fetch .NET service, which is decent.

But this Laravel PHP request is taking 78.1% of my time, making some asynchronous calls over there and then calling this external web service, which takes two-thirds of my time.

There are three total errors which have been listed here.

And if I specifically go to the PHP frontend site, it’s a 503.

Okay, maybe I have a stack trace.

Yeah, right here.

So I have my stack trace listing the exact issue: 503, Service Unavailable.

I can go to this line of code, release a new version, and go back to sleep.

But wait, what if there are other customers who are facing the same issue but never got back to me?

Maybe I can look for all the customers who were affected by this.

I can do that using my Trace Analytics Panel.

So if I go back, I have this request already scoped out.

I'll quickly go to my Trace Analytics, remove this customer ID, and sort by duration to find the slowest customers.

Maybe a top list will help.

So as I can see, the customer 2855 reached back to me, but these other two customers who did not reach back are facing the same issue, only worse.

So what I can do is resolve this issue, get back to these customers, and then go back to sleep.

So you saw how easy it was with Datadog to troubleshoot: not only faster, but also finding the exact customers who were facing this issue.

We want to let you troubleshoot fast, and we want to provide you a deep correlation of your traces, logs, and metrics.

Steve just talked about the logs integration with the events panel, and Gabriel walked you through the Synthetics integration with events.

We also have host data available in line with our Trace ID in the trace view panel.

However, is that enough?

Host data alone cannot be used to investigate all the infrastructure problems that your application might be facing.

Announcing runtime metrics

So in order to resolve that, I’m happy to announce that we have runtime metrics in APM today.

With this, you'll be able to check whether there's excessive CPU usage, garbage-collection pressure, or a memory leak in your container.

What if there’s a class load delay or a start-up delay because of multiple classes being loaded?

Things like that.
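To make that concrete, here is an illustrative, pure-Python peek at the kind of signals runtime metrics cover (garbage-collection activity, memory pressure). The real tracer collects these automatically and correlates them with the trace ID; this sketch only shows what "runtime metrics" means, and note that the `resource` module is Unix-only.

```python
import gc
import resource

# Illustrative look at the kind of signals runtime metrics surface; the actual
# Datadog tracer gathers these automatically alongside the trace ID.
def runtime_snapshot():
    gen0, gen1, gen2 = gc.get_count()  # tracked objects per GC generation
    # Peak resident set size for this process (units are platform-dependent).
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {"gc.gen0": gen0, "gc.gen1": gen1, "gc.gen2": gen2,
            "mem.peak_rss": peak}

snapshot = runtime_snapshot()
```

A spike in these numbers next to a latency spike on the same timeline is exactly the correlation the feature is built for.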

Let’s see how it actually looks in the product.

So if I go back to the same service, I have my coffee-house service right here, which is listing all the JVM metrics.

The heap usage, non-heap usage, and garbage collection size, in line with my Trace ID, so I'm not losing any context and can spot a spike if there's an issue.

I can also see the related processes, hosts, and logs over there.

And not just that.

Along with my Trace ID, I can also see the JVM metrics or these runtime metrics for my service.

So if I go to my service page and scroll down, I have my JVM metrics available in line, correlated with your runtime ID, so you don't lose context and can check these metrics.

As of now, these metrics are available for Java, Python, Go and Ruby, and we will be rolling them out for other languages soon.

Conclusion

So to recap, we have some pretty cool announcements for application performance monitoring.

APM is now available in .NET and PHP and we have runtime metrics for Java, Python, Ruby and Go, and we’ll be soon rolling them out for the other languages as well.

While I’ve already met some of you in the workshop downstairs, I’ll be there for an open space at 4:15 p.m. and I’ll be happy to chitchat about your monitoring needs or any specific questions you have about APM.

Thank you for your time.