Introducing Metrics Without Limits™

Published: July 17, 2019

Michael: Your businesses run on metrics.

You might be instrumenting hundreds of microservices, and they might be orchestrated in containers from host to host.

But, you use custom metrics to measure values globally.

Time on page, dollars per customer, items bought, trips taken, widgets produced, data from fleets of mobile devices and smart sensors, all of those things by which you measure your business performance, in addition to application performance and infrastructure health.

The inspiration for Metrics without Limits™

Last year, we announced “Logging without Limits™.”

With this feature, we recognized that not all logs are valuable or at least, not valuable all of the time.

As customers, you send us arbitrary volumes of metric data and instrument up the stack, or embrace containerization and full match multitenancy.

We understand that not all timeseries are valuable all of the time either, and that you do not want to pay for a timeseries that you do not intend to query.

This is especially true of business metrics.

Business data is valuable in aggregate, not necessarily in detail.

You care about the customer, him or herself, not the shard or container their request was load balanced to.

For other metrics, you need the deepest level of granularity at all times.

Metrics without Limits™—and how it works

This is why I am excited to announce Metrics without Limits™.

Send us all your data and pay for what is valuable to you.

“Without limits” is not about the capacity to store and query, but about the freedom to use Datadog for everything that is valuable to you within the observability budget that you have established.

This effectively decouples the instrumentation of metrics from ingest and query.

We will ingest and process metrics at 10 cents for every 100 metrics, which makes it very affordable and easy to send us that data.

Then you can drop tags, and persist and query only the data that is useful to you.

This still includes the full 15 months at one-second granularity for the entire period.

In order to provide this freedom though, we also provide flexibility.

All aggregation for Metrics without Limits™ is done in-app.

There are no agent or code changes necessary.

You send us all the data that you always have, and then dynamically decide what to keep or drop on the fly.
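
To make the idea of dropping tags concrete, here is a minimal sketch of what aggregating over a dropped tag means. It is an illustration only, not Datadog's implementation: the tag names, the sample values, and the `aggregate` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical data points as (tags, value) pairs. Tag names are illustrative.
points = [
    ({"product": "shoes", "host": "h1", "customer_id": "c1"}, 20.0),
    ({"product": "shoes", "host": "h2", "customer_id": "c2"}, 35.0),
    ({"product": "hats", "host": "h1", "customer_id": "c1"}, 10.0),
    ({"product": "hats", "host": "h2", "customer_id": "c3"}, 5.0),
]


def aggregate(points, keep_tags):
    """Sum values across every tag that is not listed in keep_tags."""
    totals = defaultdict(float)
    for tags, value in points:
        # The series identity is the set of retained tag key/value pairs.
        key = tuple(sorted((k, v) for k, v in tags.items() if k in keep_tags))
        totals[key] += value
    return dict(totals)


# Keep only the product tag; host and customer_id are aggregated away.
by_product = aggregate(points, keep_tags={"product"})
print(by_product)
```

The raw points are left untouched, so a different `keep_tags` set can be applied later, which mirrors the "decide what to keep or drop on the fly" idea in the talk.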

Metrics without Limits™ in action

I’ll show you.

So, here I have instrumented my example from before, the amount of money each customer pays shopping on my site.

I might not want to pay for customer level granularity all of the time, because it’s not necessary to describe the health of my business.

Here, I’m gonna edit and drop the host tag.

I will also have…

Yeah, so here, I will drop the host tag. I also have a set of tags that are available by default, that come with the metric, and I have removed customer ID.

I don’t need this tag to report the health of my product line to my CEO.

So now, I’ll go to a dashboard, and I can see that I am aggregating out fewer timeseries and therefore paying for fewer timeseries.
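
As a rough sketch of why dropping tags means fewer timeseries, the snippet below counts distinct tag combinations before and after dropping tags. The tag names, tag values, and the `count_timeseries` helper are hypothetical and only meant to illustrate the effect described in the demo.

```python
import itertools


def count_timeseries(tag_sets, keep_tags):
    """Count distinct tag combinations after dropping tags not in keep_tags."""
    return len({
        tuple(sorted((k, v) for k, v in tags.items() if k in keep_tags))
        for tags in tag_sets
    })


# Hypothetical metric reported for 2 products, 3 hosts, and 4 customers.
observed = [
    {"product": p, "host": h, "customer_id": c}
    for p, h, c in itertools.product(
        ["shoes", "hats"], ["h1", "h2", "h3"], ["c1", "c2", "c3", "c4"]
    )
]

full = count_timeseries(observed, {"product", "host", "customer_id"})
reduced = count_timeseries(observed, {"product"})
print(full, reduced)  # 24 2
```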

I’ll drop the user ID here as well.

But, what happens if metric behavior crosses an unacceptable threshold, or I wanna do deep user research?

Here, I can see in my example that latency has gone above 300 milliseconds on a subset of my shards.

So, now I am gonna go back.

I’m gonna add user ID back in.

This way, I can find out who is load balanced to which shards and I can get in touch with them and let them know a fix is in flight.

I can also go ahead and fix my load balancing so that we fix the issue.

I am gonna leave my demo off here, but after my incident is resolved, I can go ahead and drop that tag again.

I am happy to present Calvin French-Owen, the CTO and co-founder at Segment, and a long time partner and customer of ours, who has helped us design and build this feature.

How Segment uses Metrics without Limits™

Calvin: Thank you Michael.

Now, I won’t belabor how important metrics are to a business, but suffice to say, it’s clear that if you are running a large high-throughput web application, you need to understand what’s going on in every layer of the stack, from your core infrastructure to your business logic, all the way up to that user experience.

And the reality that I have been hearing, and I hope you have too, from the other speakers, Aparna and Alexis, is that getting that level of visibility is really more complex in today’s cloud environments than ever before.

And at least in our case, we have tens of engineering teams who are building hundreds of microservices, who are running on thousands of EC2 instances, which are then running tens of thousands of Docker containers which come and go at a moment’s notice.

And at least for us, keeping track of all of that is incredibly difficult.

So, over the next five minutes, I’d like to share some of the challenges that we faced building out Segment and how Datadog and their metrics product have helped us solve those.

What is Segment?

Now, before I get started, I wanted to share a little bit about what Segment is and what Segment does, just to give you the context for the scale that we are operating in.

Kind of at the core of it, Segment provides our customers with customer data infrastructure.

We help thousands of businesses around the world understand who their users are and what they are doing, so that they can give a first, best-in-class customer experience to all of their users and understand the complete customer journey.

And concretely what that looks like, is something like what’s on the slide behind me.

And so, Segment helps you take data from different sources here on the left.

This might be your web application, it might be your mobile apps, or maybe it’s places where your code isn’t running, but your customers are still interacting with you, tools like Zendesk and Salesforce.

We help you get that data into one consistent place, and then we help you send it to whatever tools you might be using.

This could be a data warehouse like Redshift or BigQuery, or it might be an analytics tool like Google Analytics or Mixpanel.

Or maybe it’s an email tool like SendGrid or Customer.io.

And the key insight is that no matter what your data looks like or what tools you are using, your teams need to all understand the same information, which is just, “Who is my customer and what are they doing?”

And in terms of scale, it’s actually taken us to some really interesting places.

If we look at Segment by the numbers today, we are collecting 360 billion events every single month.

Now, inbound that translates to about 200,000 incoming HTTP requests, but of course we fan that out, and then outbound, it’s more like 400,000 concurrent requests.

Powering all that infrastructure is 15,000 containers running on ECS and 250 different microservices.

And today we are serving thousands of customers, from startups like Confluent, GitHub, and Atlassian to some of the world’s largest global brands like Intuit, IBM and Levi’s.

But of course, with that scale, comes a few different problems for us.

I’d like to focus on one today.

A lot of times we have a question like this.

“We are trying to send data into Salesforce on behalf of some sort of customer, and we see a lot of 401s coming back from it.”

And immediately, we wanna understand why.

The platform problem

Is it a single customer whose API keys have somehow expired?

They stopped paying for Salesforce?

Is it a bunch of customers who are all experiencing a global outage?

Is there some sort of network disconnect?

Did we push some bad code?

What’s going on?
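
One way to frame the questions above is as a grouping problem: are the 401s concentrated in one customer, or spread across many? The sketch below is an illustration under assumed inputs, not Segment's tooling. The `(customer, status_code)` record shape, the customer names, and the `customers_with_401s` helper are all hypothetical.

```python
from collections import Counter


def customers_with_401s(responses):
    """Return per-customer 401 counts and the share of customers affected.

    responses must be a list of (customer, status_code) pairs.
    """
    failures = Counter(customer for customer, status in responses if status == 401)
    customers = {customer for customer, _ in responses}
    share = len(failures) / len(customers) if customers else 0.0
    return failures, share


# Hypothetical sample: only one of three customers is seeing 401s.
responses = [
    ("acme", 401),
    ("acme", 401),
    ("globex", 200),
    ("initech", 200),
]

failures, share = customers_with_401s(responses)
print(failures.most_common(1), round(share, 2))
```

A small share points toward a single customer's credentials, while a share near 1.0 points toward something global such as an outage or a bad deploy. Answering this requires a per-customer tag on the metric, which is exactly where cardinality grows.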

And the reason this is so hard to answer at our sort of scale, is what I’d call the platform problem.

Kind of at the core, our state space is unbounded by what our users send us.

And to give you a sense of the cardinality here, essentially Segment is trying to collect data from 100,000 different data sources today.

We then send that data to more than 250 different destinations, and each one of them might respond with one of 1,300 different error codes.

And what that means is a lot of metrics: about 33 billion different timeseries that we could possibly be collecting.
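
The figure follows from multiplying the three numbers quoted in the talk. A quick check, assuming one timeseries per combination of source, destination, and error code:

```python
# Numbers as quoted in the talk; "more than 250" is treated as exactly 250.
sources = 100_000
destinations = 250
error_codes = 1_300

# One possible timeseries per (source, destination, error code) combination.
possible_timeseries = sources * destinations * error_codes
print(f"{possible_timeseries:,}")  # 32,500,000,000
```

That is 32.5 billion combinations, which rounds to the 33 billion mentioned.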

And so, obviously we don’t wanna be spending our time building out this metrics infrastructure, and instead, we rely on Datadog to do so for us.

Here is an example of one of our actual dashboards that we use to monitor our infrastructure.

Not only does it give us deep insight into how these individual customers are behaving, but it also lets us slice and dice by any of our different custom tags and metrics, using the tools that Michael just outlined before me.

Additionally, there are certain questions that we have all the time.

Things like “How is integration X behaving?” or “How much traffic are we sending to integration Y?”

And for that, we’ve started using this aggregation framework.

We want to be able to answer these questions really quickly without having to pay for the overhead of the deep dives where we only need the occasional insight.

Now, by the numbers, to give you a sense of how this looks within our organization, we are using Datadog everywhere.

To date, we’ve created more than 950 different dashboards and around 1,000 different monitors.

But really it’s spread throughout the whole organization.

Not only is our engineering team using it, but our customer support, our sales team, and our account managers are all going to this place for their data, to understand what’s going on in the infrastructure, so we can communicate that well to our customers.

Closing the loop

Of course, I’d say all these internal processes are really useful, but they are sort of table stakes when it comes to a business.

And really what we wanna figure out, is how do we then provide more value on top of those metrics to all of our users.

As I said, there is some of the table-stakes stuff, which is creating alerts when things go wrong, but more recently, we’ve been investing to understand where we should be spending our time as a business.

We’ve used Datadog’s SLO feature to understand how close we are to exhausting our error budget.

And literally, every team can see and understand where they should be spending their time, whether it’s building new features, or focusing on reliability.
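
For readers unfamiliar with error budgets, here is a minimal sketch of the common request-based calculation. It is a generic illustration, not a description of how Datadog's SLO feature computes its numbers. The function name, the 99.9% target, and the request counts are assumptions.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent.

    slo_target is the success-rate objective, for example 0.999.
    A negative result means the budget has been overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("a 100% target or zero traffic leaves no error budget")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical period: 99.9% target, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

With about three quarters of the budget left, a team in this example could reasonably lean toward feature work. A value near zero would argue for reliability work instead.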

Additionally, we’ve taken all of this data and actually made it available to our customers.

We baked it into the core part of the product.

If you go to our status page today, you’ll see these metrics here which actually are pulled from our live production Datadog metrics, which give our customers the confidence and insight into how their data pipeline is behaving.

And then probably the one I’m most excited about: we are just now announcing an integration with Datadog that lets customers see not only how the service is working at an aggregate level, but what is happening specifically with their own data.

They can understand the latency that it’s taking to get their data into Google Analytics, how long it takes to load their warehouse, and whether they might have a single failing API key. That lets them get first-class visibility into how their data pipelines are behaving.

Observability is key to Segment’s operations and business model

So, if I could leave you with one thought in closing.

Observability is really the key to how we run Segment, not only for our internal teams, but making its way all the way to our product, our customers and the value that they provide knowing that they are relying on a world-class customer data infrastructure that’s both reliable and meets their scale.

And really, Datadog and its metrics products, which we’ve helped develop, help us do that.

Thank you.

In conclusion

Michael: Thank you very much, Calvin.

What we’ve shown you here today, with Metrics without Limits™, is how you can provide additional value and control to your Dev and Ops teams, without sacrificing the ability to drill down into the finest grain details.

Metrics without Limits™ is available in technical preview now. I’ll be on the floor during the conference, or you can reach out to your account manager for details about how to sign up.

Thank you very much.