Monitor AWS Step Functions with Datadog

Author Paul Gottschling

Last updated: September 14, 2020

AWS Step Functions is a service that abstracts distributed applications into state machines, with each state representing a component of an application. Not only does this automatically generate an architectural diagram of your application’s workflow, it also makes it straightforward to reorder your states as well as implement parallel execution, retries, and other tasks. Whether your states are AWS Lambda functions, Elastic Container Service tasks or traditional web applications, you can use AWS Step Functions to coordinate your workloads without changing their code.
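To make the idea of states and parallel execution concrete, here is a minimal sketch of a state machine definition in the Amazon States Language, built as a Python dictionary and serialized to JSON. The state names and Lambda ARNs are hypothetical placeholders, not part of any real workflow.

```python
import json

# Hypothetical Lambda function ARNs, used only for illustration.
RESIZE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:resize-image"
TAG_ARN = "arn:aws:lambda:us-east-1:123456789012:function:tag-image"
NOTIFY_ARN = "arn:aws:lambda:us-east-1:123456789012:function:notify-user"


def task(resource_arn: str) -> dict:
    """Return a terminal Task state that invokes the given resource."""
    return {"Type": "Task", "Resource": resource_arn, "End": True}


def build_definition() -> dict:
    """Build a definition with one Parallel state followed by one Task state."""
    return {
        "Comment": "Illustrative workflow: two branches run in parallel, then notify",
        "StartAt": "ProcessImage",
        "States": {
            "ProcessImage": {
                "Type": "Parallel",
                # Each branch is a small state machine of its own.
                "Branches": [
                    {"StartAt": "Resize", "States": {"Resize": task(RESIZE_ARN)}},
                    {"StartAt": "Tag", "States": {"Tag": task(TAG_ARN)}},
                ],
                "Next": "Notify",
            },
            "Notify": task(NOTIFY_ARN),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Reordering the workflow is then a matter of changing the `StartAt` and `Next` fields rather than editing the functions themselves.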

We are pleased to announce Datadog’s new integration with AWS Step Functions. You can use Datadog to monitor your state machines individually and alongside the rest of your infrastructure, reveal any states that are slowing down performance, and easily correlate metrics with distributed trace data from your states and functions for insights into errors.

Out-of-the-box dashboard for AWS Step Functions.

The state of your states

Since AWS Lambda functions process well-defined inputs into predictable outputs without maintaining data between invocations, they work particularly well as Step Functions states. Datadog’s customizable dashboards give you comprehensive visibility into your Step Functions state machines alongside the Lambda functions they run—and in the context of your infrastructure as a whole—to help you diagnose performance issues and identify invocation failures.

Datadog’s out-of-the-box dashboard shows how often state machine executions have succeeded or failed, and tracks the status of states that invoke Lambda functions. You’ll also see latency metrics for state machines and the states they execute.

The AWS Step Functions integration automatically tags metrics collected by Datadog’s AWS Lambda integration with the relevant step name, state machine name, and state machine ARN, making it straightforward to compare the performance of functions running as part of the same state machine. By cloning and customizing the out-of-the-box dashboard, you can use these tags to compare metrics from AWS Step Functions and AWS Lambda, as demonstrated below. In this case, the high level of state execution failures closely tracks the volume of AWS Lambda errors, so the failures most likely stem from the functions themselves rather than from, for instance, a misconfigured IAM role.

Custom dashboard showing both AWS Step Functions and AWS Lambda metrics.

You can then use these same metrics to set alerts and notify your team when your state machines fail more frequently than expected, or when they suffer from slower than normal execution times.
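As a rough sketch of what such an alert could look like, the snippet below builds a request body for a Datadog metric monitor. The metric name `aws.states.executions_failed`, the state machine name, the threshold, and the message text are assumptions chosen for illustration; check the metric names your account actually reports before using anything like this. The function only constructs the payload and does not call the Datadog API.

```python
import json


def build_failure_monitor(state_machine: str, threshold: int,
                          window: str = "last_15m") -> dict:
    """Build a monitor payload that alerts on failed executions.

    Assumes the failure count is reported as `aws.states.executions_failed`
    and tagged with `statemachinename`.
    """
    query = (
        f"sum({window}):sum:aws.states.executions_failed"
        f"{{statemachinename:{state_machine}}}.as_count() > {threshold}"
    )
    return {
        "name": f"Step Functions failures: {state_machine}",
        "type": "metric alert",
        "query": query,
        "message": f"Executions of {state_machine} are failing more often than expected.",
        "tags": [f"statemachinename:{state_machine}"],
        "options": {"thresholds": {"critical": threshold}},
    }


if __name__ == "__main__":
    # "order-pipeline" is a hypothetical state machine name.
    print(json.dumps(build_failure_monitor("order-pipeline", 5), indent=2))
```

A similar payload with an execution-time metric in the query would cover the slow-execution case.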

Visibility into every state and function

If your Step Functions states are failing or underperforming, you’ll want to find out as quickly as possible. Datadog supports distributed tracing and APM for Step Functions through our integration with AWS X-Ray. By tracing across an entire state machine, as shown in the screenshot below, you can see how long each state ran and whether any errors occurred while executing the workflow.

View traces from Step Function states and Lambda functions with Datadog's AWS X-Ray integration.

Datadog’s X-Ray integration also gives you visibility into trace data from your Lambda functions themselves, including how long they take to execute and how often they are returning errors. And since your state machines might be processing events at a high volume—over 100,000 per second in the case of Express Workflows, for example—you’ll need a way to find the most relevant traces for your investigation.

You can use the statemachinename tag in Datadog’s Serverless view to filter your AWS Lambda traces and see which functions are involved in a specific state machine. You can then sort the list of traces by error rate or execution time to identify functions that require your attention, letting you know where to act first to reduce the latency of your state machines, or which states might be causing an incident.

The Datadog Serverless view showing AWS Lambda functions that run within the same state machine.

Smarter error handling

AWS Step Functions gives you several ways to handle errors when executing steps, such as retrying an execution or passing the error to another state. Datadog’s AWS Step Functions integration helps you plan the most realistic error handling strategies for your state machines.
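For reference, retrying and passing an error to another state correspond to the `Retry` and `Catch` fields of a state in the Amazon States Language. The sketch below adds both to a Task state; the helper name, the Lambda ARN, the fallback state name, and the retry settings are hypothetical values picked for illustration.

```python
import json


def with_error_handling(task_state: dict, fallback_state: str,
                        max_attempts: int = 3) -> dict:
    """Return a copy of a Task state with Retry and Catch rules added."""
    handled = dict(task_state)
    # Retry timeouts and task failures with exponential backoff.
    handled["Retry"] = [{
        "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": max_attempts,
        "BackoffRate": 2.0,
    }]
    # If retries are exhausted, or any other error occurs, move to the
    # fallback state and keep the error details under "$.error".
    handled["Catch"] = [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": fallback_state,
    }]
    return handled


if __name__ == "__main__":
    # Hypothetical Task state and fallback state name.
    charge = {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
        "Next": "SendReceipt",
    }
    print(json.dumps(with_error_handling(charge, "HandleFailure"), indent=2))
```

Knowing which errors actually occur, and how often, helps you decide which error names belong in `Retry` and which should go straight to `Catch`.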

If something looks awry within an AWS Lambda function that is part of a Step Functions state machine—let’s say Watchdog has shown an increased error rate—you can explore your traces to get the context you need to start troubleshooting. From the Serverless view, just click the name of a Lambda function running as part of your state machine to see a list of traces—including any errors that Datadog discovers during execution. Once you know what kinds of errors your state machines are encountering, you can determine the best way to handle them.

Use the Datadog Serverless view to see error messages collected using distributed tracing and APM.

Full visibility, step by step

You can set up the AWS Step Functions integration right from your Datadog account to get full visibility into your state machines. And since Datadog integrates with other AWS services you can run with Step Functions, like Amazon Simple Queue Service and Amazon SageMaker, you can inspect every state of your workflows, along with the rest of your infrastructure, in a single platform. Don’t have a Datadog account yet? Sign up for a free trial.