Distributed tracing for AWS Lambda with Datadog APM

Author: Kai Xin Tai

Published: February 4, 2020

Since AWS Lambda was launched in 2014, serverless has transformed the way applications are built, deployed, and managed. By abstracting away the underlying infrastructure, serverless lets developers shift operational responsibilities to the cloud provider and focus on solving customer problems. At the same time, serverless applications pose new monitoring challenges: instead of system-level data from hosts, you need insight into function-level performance and usage, such as cold starts, throttles, and concurrency. To provide comprehensive visibility into the performance of your serverless functions, we’re excited to announce that Datadog APM now natively supports distributed tracing for AWS Lambda.

APM, fully reimagined for serverless

In serverless-first architectures, we know that performance—down to the millisecond—matters for your end-user experience and business objectives. So, we developed a lightweight solution that embraces the nuances of serverless and delivers the deep performance insights you need, without adding any latency. Datadog APM’s open source, framework-aware client libraries work with our Lambda Layer to trace your Lambda functions—and send these traces completely asynchronously as logs to Datadog APM through our Lambda Forwarder.

Datadog's architecture for APM and distributed tracing
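
To make the idea of sending traces "as logs" more concrete, here is a minimal Node.js sketch of the general pattern. It is not the Lambda Layer's implementation: the `SpanBuffer` class and the JSON shape are hypothetical and chosen only for illustration. The point is that spans are collected in memory and written out as a single log line, so no network call to a tracing backend sits on the request path.

```javascript
// Hypothetical sketch: buffer spans in memory and flush them as one JSON log line.
// The class name and payload shape are illustrative assumptions, not Datadog's wire format.
class SpanBuffer {
  constructor(write = (line) => console.log(line)) {
    this.write = write; // where flushed lines go; stdout by default
    this.spans = [];
  }

  // Record a finished span. Nothing is sent yet.
  add(name, startMs, durationMs) {
    this.spans.push({ name, start: startMs, duration: durationMs });
  }

  // Emit everything collected so far as a single log line, then reset.
  flush() {
    if (this.spans.length === 0) return;
    this.write(JSON.stringify({ traces: this.spans }));
    this.spans = [];
  }
}

// Usage: collect two spans, then flush once at the end of the invocation.
// A custom writer is injected here so the output can be inspected.
const lines = [];
const buffer = new SpanBuffer((line) => lines.push(line));
buffer.add("validate-coupon", 1000, 12);
buffer.add("load-campaigns", 1012, 48);
buffer.flush();
```

In a Lambda function, output written to stdout lands in CloudWatch Logs, which is where a log forwarder can pick it up after the invocation has already returned.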

With APM, you can get end-to-end visibility into requests that flow across Lambda functions, hosts, containers, and other infrastructure components—and zero in on errors and slowdowns to see how they impact your end-user experience. For instance, APM can help you determine if cold starts during peak traffic hours (or other issues that arise during your serverless workflows) are causing users to prematurely leave your website. Or, you can use APM to capture traces of background jobs—such as data processing or file indexing—and set up monitors to alert you when requests to downstream resources fail.

Native tracing with APM builds on the insights you get from our other monitoring features—including the Serverless view and the Service Map—to provide a comprehensive, context-rich picture of the usage and performance of your serverless functions.

Capture request traces, then drill down to investigate issues

Datadog visualizes the full lifespan of all your requests, regardless of where they travel. So whether a request begins on a VM and triggers a serverless function that writes data to an Amazon DynamoDB table, or flows across multiple microservices, Datadog captures it all to help you track critical business transactions and understand how performance issues impact end-user experience.

Datadog's Service Map for serverless visualizes how your Lambda functions fit into your services.
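
Following a request from a VM into a Lambda function depends on each hop passing its trace context to the next one. The sketch below shows that idea with plain objects. The header names are the ones Datadog tracers use for propagation to the best of our knowledge, and `injectContext`, `extractContext`, and `startChildSpan` are hypothetical stand-ins, not functions from the tracing library.

```javascript
// Sketch of trace-context propagation between two services over HTTP headers.
// The helper functions below are hypothetical stand-ins for what a tracer does internally.
const TRACE_ID_HEADER = "x-datadog-trace-id";
const PARENT_ID_HEADER = "x-datadog-parent-id";

// Caller side: copy the active span's identifiers into the outgoing headers.
function injectContext(span, headers = {}) {
  return {
    ...headers,
    [TRACE_ID_HEADER]: span.traceId,
    [PARENT_ID_HEADER]: span.spanId,
  };
}

// Callee side: read the identifiers back, or return null if the request is untraced.
function extractContext(headers) {
  const traceId = headers[TRACE_ID_HEADER];
  const parentId = headers[PARENT_ID_HEADER];
  return traceId && parentId ? { traceId, parentId } : null;
}

// A span started in the callee joins the caller's trace by reusing its trace ID.
function startChildSpan(name, upstream, spanId) {
  return { name, spanId, traceId: upstream.traceId, parentId: upstream.parentId };
}

// Usage: a span on a VM calls a Lambda function, which continues the same trace.
const vmSpan = { name: "checkout", traceId: "1001", spanId: "1" };
const outgoing = injectContext(vmSpan, { "content-type": "application/json" });
const lambdaSpan = startChildSpan("validate-coupon", extractContext(outgoing), "2");
```

Because both spans share one trace ID and the Lambda span names the VM span as its parent, a backend can stitch them into a single request trace.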

When it comes to observability, metrics tell us the “what,” while traces and logs help us paint a picture of the “why.” Say you’re running an e-commerce site, and as part of the checkout flow, your application makes a request to a Lambda function that validates coupon codes. When the coupon validation service exhibits a spike in maximum request latency (aws.lambda.duration.maximum), APM helps you pinpoint bottlenecks, so you can effectively resolve the issue at hand.

We can observe a slow span (in purple) at the bottom of this request trace that involves the loading of the campaign database.

In the trace above, we can quickly identify a particularly slow span at the bottom involving the loading of an in-memory coupon database. And if you need further context for troubleshooting, you can click on the “Logs” tab to pivot to the associated logs generated during this request. With your traces, metrics, and logs all in one place, you can get end-to-end visibility across your entire serverless infrastructure—without having to switch contexts or tools.

Slice and dice trace data across any dimension

Investigating errors and bottlenecks during critical periods, such as outages, can be challenging and time-consuming. App Analytics allows you to quickly search and filter by high-cardinality dimensions, or tags, to find the needle-in-the-haystack trace you need—without using a custom query language. In addition to any custom tags you’ve configured, Datadog applies tags to your traces based on automatically detected AWS metadata—such as the function name, region, and service—so you can drill down by attributes that matter to your business.

For instance, by filtering down traces by customer ID and inspecting one in more detail, you can track an individual customer’s journey and determine the impact of a performance issue on your business. By looking at the trace below, we can see that this user encountered a 5xx error on the payments page and was unable to check out, resulting in a direct loss of revenue for the business.

We can see from this trace that the user encountered a 5xx error when they navigated to the payments page.

You can also use these tags to analyze your application’s performance—and if you see anything interesting, you can dive into the relevant subset of traces for more detailed analysis. For example, Datadog automatically detects cold starts—an increase in response time when a Lambda function is invoked after a period of idleness—and applies a cold_start attribute to your traces.
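
For intuition, a common way to tell a cold start from a warm invocation inside a function is a module-level flag. This is a minimal sketch of that general technique, not necessarily how Datadog detects cold starts. The handler shape and the returned fields are assumptions for illustration.

```javascript
// Module scope runs once per execution environment, so this flag is true
// only for the first invocation handled by a freshly started environment.
let isColdStart = true;

function handler(event) {
  const coldStart = isColdStart;
  isColdStart = false; // every later invocation in this environment is warm

  // Illustrative only: report the flag under the same name as the trace attribute.
  return { statusCode: 200, cold_start: coldStart };
}

// Usage: the first call reports a cold start, the second does not.
const first = handler({});
const second = handler({});
```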

Graphing cold starts by function can help you understand the impact of this increased latency on user experience—and identify the specific times of the day when configuring provisioned concurrency would be beneficial. As you explore your APM data, you can easily export any useful graphs to your dashboards so that you can monitor them alongside other key datapoints from your serverless infrastructure.

You can graph cold starts by function name in App Analytics.
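
The aggregation behind a graph like this is a group-and-count over tagged traces. Here is a minimal sketch over in-memory records. The record shape, the `functionName` field, and the sample function names are hypothetical, and only the `cold_start` attribute name comes from the text above.

```javascript
// Count cold-start traces per function.
// Each record is a hypothetical, simplified trace: { functionName, cold_start }.
function coldStartsByFunction(traces) {
  const counts = new Map();
  for (const trace of traces) {
    if (!trace.cold_start) continue; // skip warm invocations
    counts.set(trace.functionName, (counts.get(trace.functionName) || 0) + 1);
  }
  return counts;
}

// Usage with made-up sample data.
const sampleTraces = [
  { functionName: "validate-coupon", cold_start: true },
  { functionName: "validate-coupon", cold_start: false },
  { functionName: "validate-coupon", cold_start: true },
  { functionName: "process-payment", cold_start: true },
];
const coldStartCounts = coldStartsByFunction(sampleTraces);
```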

Instrument your AWS Lambda functions

To get started, make sure you’re running the latest version of Datadog’s Lambda Forwarder and Lambda Layer. You can then instrument any Lambda function using the Node.js runtime, as follows:

index.js

 
const { datadog } = require("datadog-lambda-js"); // Import Lambda Layer
const tracer = require("dd-trace").init(); // Import Datadog's Node.js tracing library

// This function will be wrapped in a span
const couponValidation = tracer.wrap("validate-coupon", () => {
  // Coupon validation logic goes here
});

// This function will also be wrapped in a span (based on the current function ARN).
module.exports.hello = datadog((event, context, callback) => {
  couponValidation();

  callback(null, {
    statusCode: 200,
    body: "You've successfully added your coupon!"
  });
});
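
If it helps to see what "wrapped in a span" means, the sketch below is a simplified stand-in for a tracer's wrap function. It is not dd-trace's implementation: `wrapInSpan`, the span fields, and the `SAVE10` coupon code are hypothetical, and the sketch handles synchronous functions only. It times the wrapped call, records any error, and hands the finished span to a collector.

```javascript
// Hypothetical, simplified stand-in for a tracer's wrap(): times a synchronous
// function call and records the result as a span object.
function wrapInSpan(name, fn, finished = []) {
  return function wrapped(...args) {
    const start = process.hrtime.bigint();
    const span = { name, error: null };
    try {
      return fn.apply(this, args);
    } catch (err) {
      span.error = err.message; // keep the failure on the span, then rethrow
      throw err;
    } finally {
      span.durationNs = process.hrtime.bigint() - start;
      finished.push(span); // runs on both success and failure
    }
  };
}

// Usage: wrap a coupon check and inspect the span it produced.
const finishedSpans = [];
const checkCoupon = wrapInSpan("validate-coupon", (code) => code === "SAVE10", finishedSpans);
const isValid = checkCoupon("SAVE10");
```

The wrapped function keeps its original return value and still throws its original errors, so callers do not need to change.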

If you’re currently using our AWS X-Ray integration, we’ve made it easy for you to switch to APM. As you transition to APM, you can use the mergeDatadogXrayTraces option with your wrapper to merge your APM traces with the relevant X-Ray trace in Datadog, as shown below:

index.js

 
module.exports.hello = datadog((event, context, callback) => {
  couponValidation();

  callback(null, {
    statusCode: 200,
    body: "You've successfully added your coupon!"
  });
}, { mergeDatadogXrayTraces: true });

You can read more about instrumenting your Lambda functions in our documentation.

Serverless meets complete observability

With native, end-to-end tracing now available for AWS Lambda through Datadog APM, you can get deep visibility into all your serverless functions, without adding any latency to your applications. Correlating traces with metrics and logs gives you the context you need to optimize application performance and troubleshoot complex production issues. Our community-driven tracing libraries are part of the OpenTelemetry project, a unified solution for vendor-neutral data collection and instrumentation.

Datadog APM currently supports tracing Lambda functions written in Node.js, with support for Python, Ruby, and other Lambda runtimes coming soon. If you’re already using Datadog, head over to our documentation to begin instrumenting your functions. Otherwise, you can get started with a 14-day full-featured trial today.