Network Performance Monitoring | Datadog


Published: July 17, 2019

The problem

Miranda: Many of us are migrating towards microservices and increasingly distributed systems. Your infrastructure is not only running the code that you wrote, but includes services from multiple cloud platforms, managed services, lambdas, and databases.

And it’s difficult to see the thousand-foot view of where all of that data is going.

At the end of the day, communication between these services is critical to your end-user experience.

So, how do you understand the dependencies between these components?

You have probably SSH’ed into a host to run NetTop and looked at raw bandwidth and packet counts.

But this only shows you one host at a time.

Cloud services do provide network bandwidth data, but how do you tie it back in to your services or applications that you’re running?

So, neither of these options gives you data that resolves to something that really matters to you.

And so, because of this, I’m happy to announce a new product from Datadog: Network Performance Monitoring.

What is Network Performance Monitoring?

Network Performance Monitoring provides visibility across any of your tags, from high-level tags like availability zones and services, all the way down to the container and even process levels.

This allows you to aggregate your network metrics across any two types of objects, across your infrastructure and applications, and whether or not this is hosted on-prem or in the cloud.

Datadog Network Monitoring helps provide immediate insight into your performance and dependencies.

How Network Performance Monitoring works

Datadog begins by collecting network metrics and creating aggregate views without any instrumentation.

In the first 30 seconds, you can see the thousand-foot view of your network connections.

Maybe one of your co-workers has added a managed service, which is inadvertently using up all of your bandwidth.

Or maybe you’re syncing all of your assets from a file server on the other side of the world, when you really should have just been reading them from your local cache.

You can view this data on the service level or a different aggregation, like availability zones or services, or even organizational elements like teams (seen here).

And on top of this, you can visualize your bandwidth data or TCP re-transmit counts.

How to use Network Performance Monitoring

If you’re running monitors on your services, you can identify the services that are currently in an error state.

And you can dive in to see both incoming and outgoing traffic to your specific node, as well as any additional dependencies.

If you’re looking to get more granular data, you can navigate to the network explorer here on the left.

And if you’re tasked with reducing cross-availability-zone traffic, you can select availability zone as both the source and the destination, and filter your source service by something like Elasticsearch.

Type it in here, or really any facet that you’re interested in.

And you can see here that Elasticsearch is talking across availability zones.

So, you can dig in more here to gather more data.

And you might be able to reduce your AWS transit costs by creating a duplicate instance per availability zone.
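The cross-availability-zone investigation described here can be sketched in a few lines of Python. This is an illustration under assumptions, not how the network explorer works internally: the flat record fields (`src_service`, `src_az`, `dst_az`, `bytes`), the sample values, and the `cross_az_bytes` helper are hypothetical. The sketch filters flows by source service and keeps only the traffic whose source and destination zones differ.

```python
# Hypothetical flat flow records; field names and values are illustrative only.
SAMPLE_FLOWS = [
    {"src_service": "elasticsearch", "src_az": "us-east-1a", "dst_az": "us-east-1b", "bytes": 5000},
    {"src_service": "elasticsearch", "src_az": "us-east-1a", "dst_az": "us-east-1a", "bytes": 9000},
    {"src_service": "elasticsearch", "src_az": "us-east-1b", "dst_az": "us-east-1a", "bytes": 2000},
    {"src_service": "web", "src_az": "us-east-1a", "dst_az": "us-east-1b", "bytes": 700},
    {"src_service": "elasticsearch", "src_az": "us-east-1a", "dst_az": "us-east-1b", "bytes": 1000},
]


def cross_az_bytes(flows, source_service):
    """Total bytes per (source AZ, destination AZ) pair for one service,
    counting only flows that leave their own availability zone."""
    totals = {}
    for flow in flows:
        if flow["src_service"] != source_service:
            continue  # filter by source service, like the facet in the talk
        if flow["src_az"] == flow["dst_az"]:
            continue  # same-zone traffic is not cross-AZ
        pair = (flow["src_az"], flow["dst_az"])
        totals[pair] = totals.get(pair, 0) + flow["bytes"]
    return totals


es_cross_az = cross_az_bytes(SAMPLE_FLOWS, "elasticsearch")
# Rank zone pairs by volume to see where a per-zone replica might help most.
ranked = sorted(es_cross_az.items(), key=lambda item: item[1], reverse=True)
```

Ranking the zone pairs by volume gives a rough starting point for deciding where a duplicate instance per availability zone would cut the most cross-zone traffic.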

We’ve partnered with Brent here at Cvent—we are Datadog, he’s from Cvent—while building out Network Performance Monitoring.

If you don’t already know, Cvent powered our registration here at Dash.

We’re excited to have him join us on stage and tell us about his experience using Datadog Network Performance Monitoring.

Cvent’s experience with Network Performance Monitoring

Brent: Thanks Miranda.

Our platform streamlines everything from creating an event website, to finding the right venue, to getting hotel rooms for your attendees.

This lets event planners focus on what matters, bringing people like us together to build those face-to-face connections.

Within Cvent, SREs set standards and provide tooling that empowers other teams to deliver reliable, scalable, and performant systems for our customers.

Over time, our systems have become increasingly complex.

We’ve been converting to microservices, moving to containers, and integrating acquisitions.

One thing that helps us manage this complexity is Datadog APM.

It provides our teams with the context they were missing in a highly distributed system, context that helps these teams troubleshoot and resolve issues.

It’s the same level of context that Datadog has delivered with Network Performance Monitoring.

As Cvent software has grown more complex, so have the underlying systems they run on.

We’ve grown from a single data center to running on both data centers and cloud environments across many regions around the world.

With this migration, we lacked visibility into what our infrastructure was actually doing and were forced to rely on our expectations.

With Datadog NPM, we gain visibility into what our underlying infrastructure depends on.

Because it uses data from the entire Datadog ecosystem, we can view and filter on our existing tags and gain additional context right out of the box.

This allows us to identify unknown dependencies or missed configurations, where we previously lacked visibility.

We can also use this to answer questions about our infrastructure and make informed decisions to improve our reliability, optimize for costs, or further improve our architecture.

We’re able to gain this visibility not only into the hosts, but also into the processes and containers that run on them; the agent is able to correlate data down to the process and the container it runs in.

This allows us to further understand where these dependencies originate.

It’s especially powerful for us, since we run most of our services in containers and are still in that migration.

This level of granularity allows us to start at the global scale and drill in to help pinpoint exactly what we’re looking for within our infrastructure.

The increased visibility isn’t just limited to the infrastructure that we own.

Some unexpected uses of Network Performance Monitoring

With Datadog NPM, we can see anything from inside our network or outside our network, including apps that are not built by us.

We’re very interested in how we can leverage this information to gain visibility into our changing security landscape as well.

As we add additional services and acquire new companies, there’s an increased risk that changes could impact our security posture.

Our hope is that Datadog NPM can help us provide improved visibility for our security health, as well as our system health.

We’re happy to be partnering with the Datadog team to help shape this product.

We believe the visibility that Datadog NPM provides will continue to help us improve and deliver new products to our customers.

And with that, I’m going to go ahead and pass this back to Miranda.