
Introducing always-on production profiling in Datadog

By Kai Xin Tai

Published: January 17, 2020

To complement distributed tracing, runtime metrics, log analytics, synthetic testing, and real user monitoring, we’ve made another addition to the application developer’s toolkit to make troubleshooting performance issues even faster and simpler. Today, we’re excited to introduce Profiling—an always-on, production profiler that enables you to continuously analyze code-level performance across your entire environment, with minimal overhead. Profiles reveal which functions (or lines of code) consume the most resources, such as CPU and memory. By optimizing these, you can reduce both your end-user latency and cloud provider bill.
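To give a sense of what “always-on” looks like in practice, here is a minimal sketch of enabling the profiler in a Go service with the dd-trace-go library. The service and environment names are hypothetical, and the option names reflect recent versions of the library rather than the beta described in this post, so treat this as illustrative rather than canonical setup instructions.

```go
package main

import (
	"log"
	"net/http"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Start the always-on profiler: samples are collected and uploaded
	// continuously in the background while the application runs.
	err := profiler.Start(
		profiler.WithService("checkout-service"), // hypothetical service name
		profiler.WithEnv("production"),
		profiler.WithProfileTypes(
			profiler.CPUProfile,
			profiler.HeapProfile,
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()

	// The application then serves traffic as usual.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```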

In Datadog, you can now:

- Visualize all your stack traces in one place
- Discover bottlenecks in your code at a glance
- Correlate profiles and distributed traces seamlessly
- Get actionable insights for performance improvements
- Zero in on profiles using tags

Visualize all your stack traces in one place

Profiling allows you to observe how your programs execute in production, so you can effectively diagnose and troubleshoot performance issues that occur under real-world conditions, such as OutOfMemoryError exceptions in Java and lock contention. It can also surface lines of code that you were not even aware were adding unnecessary overhead to your application.
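To make the lock contention case concrete, here is a small, self-contained Go sketch of the kind of anti-pattern a lock profile surfaces: many goroutines serialized behind a single mutex, with slow work held inside the critical section. In a lock profile, most of these goroutines’ time would show up as waiting to acquire the lock rather than doing useful work.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// A single mutex guards a shared counter, and slow work is (wrongly)
// performed inside the critical section, so all goroutines serialize on it.
var (
	mu      sync.Mutex
	counter int
)

func worker(wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < 100; i++ {
		mu.Lock()
		counter++
		time.Sleep(time.Millisecond) // slow work held under the lock
		mu.Unlock()
	}
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go worker(&wg)
	}
	wg.Wait()
	fmt.Println("counter:", counter)
}
```

The fix a profile points you toward is usually to shrink the critical section (move the slow work outside the lock) or to shard the shared state so goroutines contend less.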

Profiling collects representative samples of all your stack traces—regardless of whether they come from your code or third-party libraries—and visualizes them as a flame graph. Each frame represents a function, and frames are stacked vertically, from top to bottom, in the order in which the functions are called during the program’s execution. In a Java CPU profile, for example, the width of each frame corresponds to its resource consumption, while its color identifies its package.

Inspecting these stack traces can help you understand the different ways your functions are called—and which ones are consuming the most resources. As your application scales, optimizing these resource-intensive sections of code can significantly reduce end-user latency and infrastructure costs. Depending on the language your program is written in, you can explore a variety of profile types, including CPU, memory, lock, and I/O.
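As a simplified illustration of what optimizing a wide frame can look like, the Go sketch below contrasts a quadratic string-concatenation loop with a strings.Builder version. In a CPU or allocation flame graph, the first function would show up as a wide, allocation-heavy frame; the second does the same work and would narrow that frame considerably. The function names are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// buildReportSlow would appear as a wide frame in a CPU (and allocation)
// profile: each += copies the entire string built so far, so total cost
// grows quadratically with the number of rows.
func buildReportSlow(rows []string) string {
	out := ""
	for _, r := range rows {
		out += r + "\n"
	}
	return out
}

// buildReportFast produces identical output with a strings.Builder,
// which amortizes allocations; the corresponding frame narrows after
// this kind of fix.
func buildReportFast(rows []string) string {
	var b strings.Builder
	for _, r := range rows {
		b.WriteString(r)
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	rows := make([]string, 10000)
	for i := range rows {
		rows[i] = fmt.Sprintf("row %d", i)
	}
	if buildReportSlow(rows) != buildReportFast(rows) {
		panic("outputs differ")
	}
	fmt.Println("outputs match")
}
```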

Discover bottlenecks in your code at a glance

You can use the summary table in the right panel to filter call stacks by various attributes, such as method, thread, and package. For example, in a Socket I/O profile, you can view a list of threads, IP addresses, and hosts sorted in descending order by the amount of data read or written, or by the time spent on those operations. You can then filter the flame graph to show only the call stacks from a specific host and identify ways to optimize specific read or write operations.

Correlate profiles and distributed traces seamlessly

When developing Profiling, we wanted to ensure that it was tightly integrated with the rest of the Datadog platform. Distributed tracing and APM allow you to track the path of individual requests across all your services and identify which step is creating a bottleneck or causing an error. To provide even deeper context for debugging performance issues, we’re working to ensure that you can pivot seamlessly between APM and Profiling.

When investigating a particularly slow request in APM, you can pivot with a single click to the related profiles to identify this specific request’s resource bottlenecks. Similarly, within a profile, you can identify the most resource-intensive requests and inspect them in APM to understand how they fit into the bigger picture (e.g., what other services were called in this request?)—and how they impact your business (e.g., which customers are using the most resources?).
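Sketching how this pairing looks from the application side, the hypothetical Go setup below runs the dd-trace-go tracer and profiler side by side with matching service and environment metadata, which is what lets traces and profiles from the same window be cross-referenced. Again, the option names come from recent versions of the library and may differ from the beta.

```go
package main

import (
	"log"

	"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Start the tracer and profiler with matching metadata so the
	// backend can line up traces and profiles from the same service.
	tracer.Start(
		tracer.WithServiceName("checkout-service"), // hypothetical name
		tracer.WithEnv("production"),
	)
	defer tracer.Stop()

	if err := profiler.Start(
		profiler.WithService("checkout-service"),
		profiler.WithEnv("production"),
	); err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()

	// ... handle requests; spans recorded here can be cross-referenced
	// with profiles captured over the same time window ...
}
```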

Get actionable insights for performance improvements

Datadog automatically performs a heuristic analysis of your code and displays a summary of the main problem areas at the top of the Analysis view. For example, the analysis might show that the largest performance improvements can be achieved by addressing blocked threads, inefficient garbage collection, and a memory leak. For even more granular insight, you can view the complete analysis, broken down by categories such as code cache, class loading, and heap.

Profiling provides a powerful way to observe long-term performance trends since it collects data from all your hosts, all the time, without requiring access to individual machines. By pivoting to the Metrics tab from within a profile, you can get an overview of key metrics from the service, such as top CPU usage by method, top memory allocations by thread, and garbage collection time by phase.

With the time range of the profile overlaid on all the graphs, you can determine, for example, whether a spike in lock wait time is recurring behavior or a one-off, so that you can take the appropriate course of action. By correlating different metrics, you can get a more comprehensive view of your application’s performance—and if you find any interesting trends, you can add any of these graphs to your custom dashboards, or create alerts to notify your teams when a metric rises or falls beyond a critical threshold.

Zero in on profiles using tags

Since Profiling is built to be always on, developers can effectively debug issues during time-sensitive situations, such as outages, by pulling up profiles captured before and during any downtime. The Profile Search view displays all your profiles in one place and allows you to use facets to quickly slice and dice your profiles across any dimension—whether it’s a specific host, service, version, or a combination thereof. You can also use the controls in the sidebar to drill down to profiles with the highest CPU or memory consumption. Clicking on any profile then takes you straight to the flame graph of stack traces for a more detailed view.
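As for where those tags come from, here is a hedged sketch of how you might attach them at profiler startup in Go. Each key:value pair (the values below are made up for illustration) then becomes a facet you can filter on in the Profile Search view.

```go
package main

import (
	"log"

	"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)

func main() {
	// Every tag set here becomes a searchable dimension on the
	// resulting profiles (all values are hypothetical).
	err := profiler.Start(
		profiler.WithService("checkout-service"),
		profiler.WithEnv("production"),
		profiler.WithVersion("1.4.2"),
		profiler.WithTags("region:us-east-1", "team:payments"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer profiler.Stop()

	// ... application code ...
}
```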


Start Profiling

Together with distributed tracing, real user monitoring, network performance monitoring, synthetics, and log management, Profiling delivers yet another layer of visibility to help you understand how to improve the performance of your code in production and reduce cloud infrastructure costs. Profiling is currently available in private beta, and supports Java, Python, and Go (with support for .NET, Node.js, PHP, and Ruby coming soon). If you’re already using Datadog to monitor your infrastructure, you can sign up for access here. Otherwise, you can get started with a 14-day free trial today.