Distribution Metrics | Datadog

Distribution Metrics

Published: April 16, 2019

Good morning, everybody.

I’m a product manager here at Datadog as well and I’m here to talk about exciting new features of our distribution metric type, and to invite this room to participate in early access to those changes.

Our customers might have started by monitoring metrics that we collected about their infrastructure itself: CPU, network, clouds.

But more and more, you think up the stack in terms of what your applications are actually doing.

What are distribution metrics?

But how do you measure these things at arbitrarily high scale?

What is the aggregate user behavior of all of your customers?

To this end, you have been using distribution metrics.

Distributions capture every transaction made and allow you to aggregate however you want at query time.

As transactions come in, the agent captures each value in a statistical distribution.
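As a rough illustration of what "capturing each value" looks like on the wire, here is a minimal sketch that formats one sample as a DogStatsD datagram with the distribution type marker `d`. The metric name and tags are hypothetical, and the helper `format_distribution` is invented for this example; it only builds the string and does not send anything to an agent.

```python
def format_distribution(name, value, tags=None):
    """Build a DogStatsD-style datagram for a single distribution sample.

    The "d" type marker identifies the sample as a distribution value.
    Tags, if any, follow after "|#" as a comma-separated list.
    """
    datagram = f"{name}:{value}|d"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


# Hypothetical metric name and tags, chosen only for illustration.
line = format_distribution(
    "store.items_per_order",
    3,
    tags=["service:web-store", "availability_zone:us-east-1a"],
)
print(line)
```

In practice a client library would send one such sample per transaction, which is what lets the aggregation happen later on the server side.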

These distributions can be aggregated server side and until now, we have focused on the ability this gave us to provide accurate global percentiles for any tag (which I’ll talk about in a moment).

With the changes I’m here to present, global percentiles are still a feature of distributions, but we can take advantage of having the data server side for other functionality and easier onboarding.

How to aggregate and isolate distribution metrics

So now, we have this distribution that represents all of the values from all of your hosts, tagged to allow you to slice and dice that data in any way.

What can we do with it?

First, this metric will appear in the distribution metrics UI.

Here, you can query average, minimum, maximum, sum, and count along any tag, unlike counts, which naturally aggregate by sum, or gauges, which aggregate by last.
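To make the idea of querying several aggregations along a tag concrete, here is a small sketch that groups raw samples by one tag and computes average, minimum, maximum, sum, and count for each group. The samples, tag names, and the `aggregate_by` helper are all assumptions made for this example; it says nothing about how the aggregation is implemented server side.

```python
from collections import defaultdict

# Hypothetical raw samples: (value, tags) pairs as they might arrive from hosts.
samples = [
    (120, {"host": "a", "service": "web"}),
    (80, {"host": "b", "service": "web"}),
    (300, {"host": "a", "service": "api"}),
    (100, {"host": "c", "service": "api"}),
]


def aggregate_by(samples, tag_key):
    """Group values by one tag and compute the five basic aggregations."""
    groups = defaultdict(list)
    for value, tags in samples:
        groups[tags[tag_key]].append(value)
    return {
        tag_value: {
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "sum": sum(values),
            "count": len(values),
        }
        for tag_value, values in groups.items()
    }


by_service = aggregate_by(samples, "service")
for service, stats in by_service.items():
    print(service, stats)
```

Because every value is retained until aggregation, the same samples can be regrouped by `host` instead of `service` without resubmitting anything.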

Because no aggregation has been applied to these metrics, you now have the option to control tagging.

For this service-level metric, if you don’t care which host sent the metric, you can restrict aggregation and querying to just those tags that do describe your users’ behavior: what web service served the request, or what availability zone their query is routed to.
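The effect of restricting tags can be sketched as follows: dropping the `host` tag merges samples that differ only by host, so fewer distinct tag combinations remain to aggregate and query. The sample data and the `restrict_tags` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical samples tagged by host, service, and availability zone.
samples = [
    (12, {"host": "h1", "service": "web", "availability_zone": "a"}),
    (15, {"host": "h2", "service": "web", "availability_zone": "a"}),
    (9, {"host": "h3", "service": "web", "availability_zone": "b"}),
    (30, {"host": "h4", "service": "api", "availability_zone": "b"}),
]


def restrict_tags(samples, allowed_keys):
    """Regroup values, keeping only the tag keys listed in allowed_keys."""
    groups = {}
    for value, tags in samples:
        key = tuple(sorted((k, v) for k, v in tags.items() if k in allowed_keys))
        groups.setdefault(key, []).append(value)
    return groups


all_tags = restrict_tags(samples, {"host", "service", "availability_zone"})
no_host = restrict_tags(samples, {"service", "availability_zone"})

print(len(all_tags), "tag combinations with host")
print(len(no_host), "tag combinations without host")
```

The two `web` samples in zone `a` collapse into one group once `host` is dropped, while the values themselves are all still present.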

You can control all of this in the Datadog app without touching your code.

As with Logging without Limits, this will effectively decouple ingestion from querying for these metrics.

How to segment your data with tags and percentiles

The most significant feature of distributions is the ability to calculate global percentiles for any tag combination.

This borrows from technology built for our APM product, where latency percentiles are calculated for tags like service or resource, but solves the issue for the generic case and gives you total freedom in what you measure and by what tags you measure it.

Unlike other distribution, histogram, or timer implementations, there is no need to set boundaries on the expected data set, which we know to be impossible in most cases.

Our implementation of distributions allows you to send any data and aggregate accurately on the fly no matter what.
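A short sketch can show why percentiles have to be computed over the combined data rather than per host. It keeps raw values in memory, which a production system would not do, and uses a simple nearest-rank percentile; the host data and the `percentile` helper are assumptions for illustration only.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the data at or below it. p is an integer between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, -(-p * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical latencies: host_a serves most of the traffic and is fast,
# host_b serves little traffic and is slow.
host_a = [10] * 99 + [500]
host_b = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850]

# Naive approach: compute p90 on each host, then average the results.
averaged_p90 = (percentile(host_a, 90) + percentile(host_b, 90)) / 2

# Global approach: compute p90 over every value from every host.
global_p90 = percentile(host_a + host_b, 90)

print("average of per-host p90:", averaged_p90)
print("global p90:", global_p90)
```

Averaging the per-host results weights the low-traffic host as heavily as the high-traffic one, so the two numbers disagree badly here. Only the calculation over the combined values reflects what 90 percent of requests actually experienced.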

Here, I’ll take my sample distribution and calculate percentiles for p50, p75, p90, p95, and p99.

Here again, we can provide you with control over the way you segment your metric.

I’m going to aggregate this along data center and site.

Percentiles help me tell the whole story.

With an arithmetic average, outliers have an outsized effect on aggregation.
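The outsized effect of an outlier on the average is easy to see with a handful of numbers. The latencies below are made up for this example.

```python
import statistics

# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 100, 5000]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)

print("mean:", mean_ms)
print("median (p50):", median_ms)
```

Nine of the ten requests took about 100 ms, yet the mean comes out at 590 ms because of the single 5000 ms request. The median stays at 100 ms, and higher percentiles such as p99 are where the slow request shows up.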

So I’m gonna open a new notebook.

I’m gonna query my metric and now, I’m gonna query it three times for different percentiles.

You can see here that the aggregation behavior is coded in this aggregate and I’m gonna query p50, p75, and p90.

Here are two distributions.

One is for organically generated traffic and the other is for users who found my store through a specific ad campaign I’ve placed.

Summary and conclusion

As you can see, the median is the same for each.

The average user buys the same number of items regardless.

However, it is at the extremes where things diverge significantly.

In the case of the campaign, there are actually comparatively few average users.

These users either wind up leaving quickly or else finding what they want and buying a lot.

In general, those customers who found my site themselves behave broadly similarly to one another.

As you can see, the new metric type for distributions provides you with accuracy and control.

If you’re not yet taking part in the distributions beta, I invite you to do so.

We will be rolling out the additional control features in the coming weeks, but please do tap me on the shoulder on the floor or reach out to me or your account manager if you have any questions.