Tracing is a critical part of monitoring application performance, especially as organizations shift to deploying services using distributed systems, serverless computing, and containerized environments. Teams need real-time, end-to-end visibility into all of the traces relevant to performance issues such as an application outage or an unresponsive service, but managing tracing costs often results in gaps in valuable tracing data. These gaps can increase the time it takes to pinpoint and resolve an issue and turn a small anomaly into a worst-case scenario that could significantly affect your customers and business.
Traditional sampling methods may only sample a trace at the beginning of its path through your distributed services (i.e., “head-based sampling”), creating traces that are incomplete and missing the important telemetry needed to diagnose a problem.
Datadog APM and Tracing without Limits™ are designed from the ground up to address these problems. They ingest traces without sampling and make tail-based retention decisions, so complete traces are always captured, including all error and high-latency traces. This allows you to:
- ingest 100 percent of traces by default (requires Datadog Agent 6.19+ or 7.19+)
- live search and analyze every trace and span by any tag over a rolling 15-minute window
- keep traces with critical business context for 15 days with tag-based retention filters
Tracing without Limits provides unparalleled visibility into the performance of applications at any scale, enabling you to monitor customer transactions, deployments, bug fixes, and more in real time.
Datadog retains all error and high-latency traces for 15 days, along with any high-business-value traces that your retention filters capture automatically (such as shopping carts with higher dollar value, top merchants, or transaction IDs associated with key products). This enables you to easily troubleshoot critical issues reported by your customers.
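To make the difference between head-based sampling and tail-based decisions concrete, here is a minimal Python sketch. It is only an illustration of the two ideas, not Datadog's implementation: the `Trace` class, both function names, the 25 percent sample rate, and the 1,000 ms latency threshold are all assumptions made up for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: int
    duration_ms: float
    has_error: bool


def head_based_keep(trace_id: int, rate: float, seed: int = 0) -> bool:
    """Decide when the trace starts, before its outcome is known.

    The decision depends only on the trace ID and the sample rate,
    so an error or slow trace can be dropped like any other.
    """
    rng = random.Random(trace_id * 1_000_003 + seed)
    return rng.random() < rate


def tail_based_keep(trace: Trace, latency_threshold_ms: float = 1000.0) -> bool:
    """Decide after the trace completes, using what actually happened."""
    return trace.has_error or trace.duration_ms >= latency_threshold_ms


# Hypothetical sample data: trace 2 has an error, trace 3 is slow.
traces = [
    Trace(1, 120.0, False),
    Trace(2, 95.0, True),
    Trace(3, 2400.0, False),
    Trace(4, 80.0, False),
]

head_kept = [t.trace_id for t in traces if head_based_keep(t.trace_id, rate=0.25)]
tail_kept = [t.trace_id for t in traces if tail_based_keep(t)]

print("head-based kept:", head_kept)  # depends on the sampling draw
print("tail-based kept:", tail_kept)  # always includes the error and slow traces
```

Because the head-based decision never looks at the trace's outcome, whether traces 2 and 3 survive is a matter of chance. The tail-based decision keeps them every time.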
In this post, we’ll walk through how Datadog Tracing without Limits can help you troubleshoot problems in your application by:
- using Live Search to pinpoint the source of customer issues
- leveraging Live Analytics to visualize service performance in real time to determine how widespread the issue is
- creating retention filters to capture critical performance telemetry and business context for all of your traces
When you receive an influx of customer-reported issues, such as not being able to check out or purchase an item from your application or site, you need the ability to quickly find and resolve the problem before the issue becomes widespread. With Datadog APM Live Search, you can search and filter across all traces using any tag in real time within the last 15 minutes (rolling window). Datadog automatically streams all of your ingested traces for your configured services—regardless of their scale or throughput level—ensuring that you never miss a critical trace when troubleshooting production outages, deployment issues, or other types of incidents.
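Conceptually, a tag search over a rolling 15-minute window boils down to a time cutoff plus a tag match. The Python sketch below shows only that idea. It is not Datadog's query syntax or API, and the `live_search` function, the span dictionaries, the tag names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=15)  # rolling window described in the text


def live_search(spans, tags, now):
    """Return spans inside the rolling window whose tags match every given tag."""
    cutoff = now - WINDOW
    return [
        span
        for span in spans
        if span["timestamp"] >= cutoff
        and all(span["tags"].get(key) == value for key, value in tags.items())
    ]


# Hypothetical sample data; "now" is fixed so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
spans = [
    {"timestamp": now - timedelta(minutes=2),
     "tags": {"service": "web-store", "http.status_code": "500"}},
    {"timestamp": now - timedelta(minutes=20),  # outside the window
     "tags": {"service": "web-store", "http.status_code": "500"}},
    {"timestamp": now - timedelta(minutes=5),   # inside the window, wrong status
     "tags": {"service": "web-store", "http.status_code": "200"}},
]

matches = live_search(spans, {"service": "web-store", "http.status_code": "500"}, now)
print(len(matches))
```

Only the first span matches: the second falls outside the 15-minute window and the third has a different status code.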
The example query below reveals a sudden increase in 5xx errors from a web-store service, which correlates with the incoming reports of issues with customers’ checkout experience.
As seen in the example, you can then select a specific trace, explore its flamegraph, and view more details about the services that were affected and what caused the error. The example trace below shows that a payment service was unavailable, resulting in the internal server errors that were reported by customers.
Using the flamegraph, you can investigate any span to determine if the errors were related to application code, another dependent service, or an API call, and then seamlessly pivot to related logs for further investigation—providing a single, unified platform for resolving application issues. As you can see in the associated log stream below, the payment service was unavailable because the payment API exceeded its rate limit.
Once you identify the source of a reported issue, it’s important to assess how widespread it is. For example, you need to be able to quickly determine if a recent deployment introduced the problem to a larger subset of your customers. With Live Analytics and Datadog’s unified ‘version’ tag, you can analyze all traces for the last 15 minutes to correlate the error in question with a recent version release. The graph below shows a recent increase in error counts for both the “5.4.0” and “5.4.1” versions, with a significant spike in the latter as it gets rolled out to production.
Drilling further into version “5.4.1”, you can see which customers were most affected by the errors so you can follow up with them after you resolve the issue. Because the search covers all ingested traces, you will see every affected customer, regardless of trace volume.
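The analysis above amounts to two group-by counts: errors per `version` tag, then errors per customer within one version. Here is a minimal Python sketch of that aggregation. The span records, the `customer` field, and the customer names are made up for illustration, and this is not how Live Analytics is implemented.

```python
from collections import Counter

# Hypothetical spans from the last 15 minutes.
spans = [
    {"version": "5.4.0", "customer": "acme", "error": True},
    {"version": "5.4.1", "customer": "acme", "error": True},
    {"version": "5.4.1", "customer": "globex", "error": True},
    {"version": "5.4.1", "customer": "acme", "error": True},
    {"version": "5.4.1", "customer": "initech", "error": False},
]

# Step 1: error counts per version, to see which release correlates with the spike.
errors_by_version = Counter(s["version"] for s in spans if s["error"])

# Step 2: within the suspect version, rank customers by error count.
affected = Counter(
    s["customer"] for s in spans if s["error"] and s["version"] == "5.4.1"
)

print(dict(errors_by_version))  # {'5.4.0': 1, '5.4.1': 3}
print(affected.most_common())   # [('acme', 2), ('globex', 1)]
```

The counts are only as complete as the data behind them, which is why running this kind of aggregation over unsampled traces matters: a customer with only a handful of failing requests still shows up.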
If you have Continuous Profiler enabled, you can easily correlate the increases in error counts with code-level performance to confirm if they were caused by events such as a resource-intensive query or too many stop-the-world pauses.
Applications can generate large volumes of traces, which makes it harder both to manage the cost of keeping traces long term and to find the exact traces you need to pinpoint the root cause of a problem. To efficiently resolve application issues for your customers, you need greater control over the traces you keep while ensuring you never miss the traces critical to diagnosing application errors or service latency.
To solve this problem, Datadog enables you to create tag-based retention filters to keep only the high-value traces you need and applies Intelligent Retention Filters to automatically capture traces that are critical to monitoring the overall health of your applications (e.g., traces that indicate errors or high latency).
Retention filters can easily tie in important business context to traces, ensuring that you have all of the telemetry data needed to debug an issue. Your filters also enable you to decide which traces you don’t need, so you never have to pay to store traces that do not add value to your application performance monitoring.
For instance, you can create filters to keep all traces for:
- all credit card transactions over $100
- high-priority customers using a mission-critical feature of your SaaS solution
- a canary deployment of a critical application update
- specific versions of an online delivery service application
- traces from the latest version of your iOS application in a specific geography
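One way to think about filters like the ones listed above is as named predicates over a trace's tags, combined with an always-on rule for errors and high latency. The Python sketch below illustrates that model only. In Datadog, retention filters are configured in the product rather than written as code, and every name here (`RetentionFilter`, `should_retain`, the tag keys, the 1,000 ms threshold) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Tags = Dict[str, Any]


@dataclass
class RetentionFilter:
    name: str
    matches: Callable[[Tags], bool]


# Hypothetical tag-based filters mirroring two of the examples in the list.
FILTERS: List[RetentionFilter] = [
    RetentionFilter(
        "card-transactions-over-100",
        lambda t: t.get("payment.method") == "credit_card"
        and float(t.get("payment.amount_usd", 0)) > 100,
    ),
    RetentionFilter(
        "canary-deployment",
        lambda t: t.get("deployment") == "canary",
    ),
]


def is_error_or_slow(tags: Tags, latency_threshold_ms: float = 1000.0) -> bool:
    """Simplified stand-in for automatically keeping error and high-latency traces."""
    return bool(tags.get("error")) or float(tags.get("duration_ms", 0)) >= latency_threshold_ms


def should_retain(tags: Tags) -> bool:
    """Keep a trace if it is an error/slow trace or matches any custom filter."""
    return is_error_or_slow(tags) or any(f.matches(tags) for f in FILTERS)


print(should_retain({"payment.method": "credit_card", "payment.amount_usd": 250}))  # True
print(should_retain({"payment.method": "credit_card", "payment.amount_usd": 40}))   # False
print(should_retain({"deployment": "canary"}))                                      # True
```

Anything that matches no filter and is neither an error nor slow is simply not retained, which is how the filters double as a cost control.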
Retention filters help you sift through the large volumes of traces your applications generate, giving you complete control over the cost of retaining them and allowing you to add the necessary context for faster troubleshooting.
With Tracing without Limits, you can search and analyze all error, high-latency, and high-value traces in real time to debug application performance issues and better understand customer impact. Retention filters give you full control over the traces you keep and the cost of keeping them. Datadog also enables you to control the ingestion rate per instrumented application, so you retain complete transparency into service performance. Check out our documentation to learn more about tracing and Datadog APM. If you don’t have a Datadog account, you can sign up for a free trial.