Last year, we released native tracing for AWS Lambda through Datadog APM to provide deep visibility into serverless functions and surface performance issues such as cold starts and errors, without any added latency. But Lambda functions are only one piece of the puzzle in a rapidly growing serverless ecosystem, which includes message queues, data streams, notification services, and more. Developers often find themselves managing hundreds of loosely coupled components that power event-driven workloads, making it difficult to trace which components were involved in any given request.
To effectively debug event-driven serverless applications, you need to understand where an issue occurred—and how upstream and downstream services were involved. That’s why today, we’re excited to announce that Datadog APM now connects Python and Node.js Lambda functions to AWS managed services all in one trace. Datadog also now tags your function spans with additional information about incoming events, which you can use to quickly search, filter, and aggregate your data when troubleshooting issues.
With our latest enhancements to APM, even if a request triggers multiple functions and services such as Amazon SQS and Amazon Kinesis in an event-driven architecture, Datadog will follow the entire request from end to end and tie all the components together in a single trace. This way, if you notice a spike in Lambda errors or latency, you can easily identify the root cause (e.g., malformed requests) from the exact service that triggered the function. If the triggering resource is Amazon API Gateway, Datadog also captures the incoming endpoint’s URL path, request method, and status code.
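To make the API Gateway tagging concrete, here is a minimal sketch of how HTTP details could be read from an API Gateway REST proxy event and turned into span tags. It is an illustration only, not Datadog's implementation: the function name `http_tags_from_event` is hypothetical, and the tag names `http.method` and `http.status_code` are assumptions (only `http.url_details.path` appears in this post). The event fields `httpMethod`, `path`, and `requestContext.path` follow the AWS proxy event format.

```python
from typing import Any, Dict


def http_tags_from_event(event: Dict[str, Any], status_code: int) -> Dict[str, str]:
    """Build illustrative HTTP span tags from an API Gateway REST proxy event.

    Hypothetical helper: the Datadog Lambda Library does this for you.
    """
    request_context = event.get("requestContext", {})
    # requestContext.path includes the stage prefix (e.g. /dev/...), while the
    # top-level path does not, so prefer the former and fall back to the latter.
    path = request_context.get("path") or event.get("path", "")
    return {
        "http.url_details.path": path,                # tag name used in this post
        "http.method": event.get("httpMethod", ""),   # assumed tag name
        "http.status_code": str(status_code),         # assumed tag name
    }


# A trimmed-down sample event; real events carry many more fields.
sample_event = {
    "httpMethod": "GET",
    "path": "/checkout/cart",
    "requestContext": {"stage": "dev", "path": "/dev/checkout/cart"},
}
tags = http_tags_from_event(sample_event, 200)
print(tags)
```

In practice you would not write this yourself; the sketch only shows which parts of the incoming event the tags correspond to.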
Datadog brings your distributed traces into the same view as your infrastructure metrics and logs to provide detailed context around your event-driven architecture. Once you determine the scope of an issue—and how it affects your end users—you can prioritize fixes more strategically. For example, you may decide to fix issues in the services that support business-critical functionalities first before addressing those that run less time-sensitive tasks (e.g., cron jobs).
Let’s take a look at an example of how Datadog supports application-centric troubleshooting in an AWS serverless environment. Say we have a Node.js Express application for a theme park deployed on a Lambda function. With this application, users can purchase tickets, view ride information, get notified about wait times, and more. When we get alerted to elevated latency in our function, we start by navigating to the Serverless view and inspecting details of every invocation.
To dig deeper, we can navigate to the trace for a particularly slow invocation to see which other functions and services were involved in the same request. In this example, we can see that this invocation was the first in a sequence of Lambda functions connected by Amazon SQS and Amazon SNS. If we take a look at the http sections of the Tags tab, we can see that our Lambda function was invoked by a GET request from API Gateway to the /dev/checkout/cart endpoint. If we notice that the Lambda function latency is healthy when triggered by other endpoints, that may mean that there is an issue in the application code processing GET requests to the /dev/checkout/cart endpoint. In other words, while customers getting ride information are not impacted, others might be experiencing long page load times during checkout, which could result in abandoned carts and churn.
To determine if the issue with the /dev/checkout/cart endpoint is a one-off or recurring, we can pivot over to Trace Analytics. With our new span tags (e.g., http.url_details.path), you can meaningfully filter, aggregate, and analyze your serverless application data to find any patterns or anomalies. In this case, we can graph this function’s duration by URL path, which tells us that requests to the /dev/checkout/cart path consistently took longer than other paths. This is a strong indicator that we should further investigate the handler for this endpoint in our Express application. Since dozens of endpoints may be connected to any given Amazon API Gateway, having this insight into which endpoint is problematic saves us valuable troubleshooting time.
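As a rough picture of what grouping duration by URL path computes, the sketch below averages span durations per path and picks the slowest one. Everything here is made up for illustration: the function name `mean_duration_by_path` is hypothetical, the `/dev/rides` path and all duration values are invented sample data, and Trace Analytics performs this aggregation for you in the Datadog UI.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_duration_by_path(spans: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (path, duration_ms) pairs by path and average each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for path, duration_ms in spans:
        grouped[path].append(duration_ms)
    return {path: mean(values) for path, values in grouped.items()}


# Invented sample data: (value of http.url_details.path, duration in ms).
spans = [
    ("/dev/checkout/cart", 1800.0),
    ("/dev/checkout/cart", 2200.0),
    ("/dev/rides", 120.0),
    ("/dev/rides", 80.0),
]
averages = mean_duration_by_path(spans)
slowest = max(averages, key=averages.get)
print(slowest, averages[slowest])
```

A path whose average stays well above the others across many invocations points to a recurring problem rather than a one-off.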
Datadog APM supports a variety of AWS managed services in applications written in Python and Node.js, and it automatically propagates trace context through Amazon SQS and direct Lambda function invocations made with the AWS SDK, without any changes to your code. For other AWS managed services, including Amazon SNS, Amazon EventBridge, Amazon Kinesis, and AWS IoT, see our documentation for instrumentation instructions.
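Conceptually, propagating trace context through a queue means the producer attaches identifiers to the message and the consumer reads them back to continue the same trace. The sketch below shows that idea with SQS message attributes. It is not Datadog's implementation: the attribute name `_trace_context`, the context keys, and the `inject` and `extract` helpers are all hypothetical. The attribute shapes follow AWS conventions (`DataType` and `StringValue` when sending, `dataType` and `stringValue` in the Lambda event record).

```python
import json
from typing import Any, Dict, Optional

CARRIER_KEY = "_trace_context"  # hypothetical attribute name, for illustration


def inject(message_attributes: Dict[str, Any], trace_context: Dict[str, str]) -> Dict[str, Any]:
    """Producer side: return a copy of MessageAttributes carrying the context."""
    attrs = dict(message_attributes)
    attrs[CARRIER_KEY] = {
        "DataType": "String",
        "StringValue": json.dumps(trace_context),
    }
    return attrs


def extract(record: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Consumer side: read the context from a Lambda SQS event record, if present."""
    attr = record.get("messageAttributes", {}).get(CARRIER_KEY)
    if not attr or "stringValue" not in attr:
        return None
    return json.loads(attr["stringValue"])


# Simulate a round trip without calling AWS: what the producer would send,
# and the record shape the consuming Lambda function would receive.
context = {"trace_id": "123", "parent_id": "456"}  # hypothetical keys and values
sent = inject({}, context)
record = {
    "messageAttributes": {
        CARRIER_KEY: {"dataType": "String", "stringValue": sent[CARRIER_KEY]["StringValue"]}
    }
}
print(extract(record))
```

For services where propagation is not automatic, the instrumentation instructions in the documentation are the authoritative guide; this sketch only illustrates the inject and extract pattern.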
If you have already set up AWS serverless tracing, all you need to do is upgrade your Lambda Library to v28+ for Python and v49+ for Node.js. Otherwise, follow the steps here to set up AWS serverless tracing with Datadog APM.
Our new approach to distributed tracing embraces the complexities of modern serverless applications. By capturing the relationships between Lambda functions and other AWS managed services, Datadog APM gives you end-to-end visibility into your event-driven serverless applications to help you find and fix issues faster. If you’re not yet using Datadog, sign up for a 14-day free trial today.