Monitor AWS Step Functions with Datadog

Author Paul Gottschling

Last updated: August 7, 2023

AWS Step Functions is a service that abstracts distributed applications into state machines, with each state representing a component of an application. This not only generates an architectural diagram of your application’s workflow automatically, it also makes it straightforward to reorder your states and to implement parallel execution, retries, and other control flow. Whether your states invoke AWS Lambda functions, run Elastic Container Service tasks, or train AI/ML models, you can use AWS Step Functions to coordinate your workloads without changing their code.
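To make the state machine abstraction concrete, here is a minimal sketch of a definition written in Amazon States Language and built as a Python dictionary. The state names, function names, region, and account ID are placeholders invented for illustration, and `lambda_task` is a hypothetical helper. Only the field names (`StartAt`, `States`, `Type`, `Resource`, `Next`, `End`, `Branches`) come from Amazon States Language.

```python
import json

# Placeholder ARN prefix (assumption); substitute functions from your own account.
LAMBDA_ARN_PREFIX = "arn:aws:lambda:us-east-1:123456789012:function:"


def lambda_task(function_name, **transition):
    """Build a Task state that invokes one Lambda function.

    `transition` is either Next="<state name>" or End=True.
    """
    return {"Type": "Task", "Resource": LAMBDA_ARN_PREFIX + function_name, **transition}


definition = {
    "Comment": "Illustrative order workflow (all names are placeholders)",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": lambda_task("validate-order", Next="ProcessInParallel"),
        # A Parallel state runs each branch as its own small state machine.
        "ProcessInParallel": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "ChargeCard",
                    "States": {"ChargeCard": lambda_task("charge-card", End=True)},
                },
                {
                    "StartAt": "ReserveStock",
                    "States": {"ReserveStock": lambda_task("reserve-stock", End=True)},
                },
            ],
            "Next": "SendReceipt",
        },
        "SendReceipt": lambda_task("send-receipt", End=True),
    },
}

# Serialize to the JSON document that Step Functions expects as a definition.
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

In a definition like this, reordering the workflow means editing `StartAt` and the `Next` fields, while the functions behind each `Resource` stay untouched.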

We are pleased to announce Datadog’s native support for monitoring and tracing AWS Step Functions. You can use Datadog to monitor your state machines individually and alongside the rest of your infrastructure, drill down to a single execution to reveal any states that are slowing down performance, and correlate metrics with distributed trace data from your states and functions for insights into errors.

Out-of-the-box dashboard for AWS Step Functions.

The state of your states

Since AWS Lambda functions process well-defined inputs into predictable outputs without maintaining data between invocations, they work particularly well as Step Functions states. Datadog’s customizable dashboards give you comprehensive visibility into your Step Functions state machines alongside the Lambda functions they run—and in the context of your infrastructure as a whole—to help you diagnose performance issues and identify invocation failures. You can also instrument your Step Functions to get enhanced metrics generated by Datadog.

Datadog’s out-of-the-box dashboard shows how often state machine executions have succeeded or failed, and tracks the status of states that invoke Lambda functions. You’ll also see latency metrics for state machines and the states they execute.

Datadog automatically tags metrics with the relevant step name, state machine name, and state machine ARN, making it straightforward to compare the performance of functions running as part of the same state machine. By cloning and customizing the out-of-the-box dashboard, you can use these tags to compare metrics from AWS Step Functions and AWS Lambda. For example, if the dashboard shows state execution failures rising in step with AWS Lambda errors, the failures likely stem from the functions themselves rather than from, for instance, a misconfigured IAM role.

You can then use these same metrics to set alerts and notify your team when your state machines fail more frequently than expected or experience slower than normal execution times.
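As a rough illustration of what such an alert could look like, the sketch below assembles a Datadog metric monitor query string for failed executions. The metric name `aws.states.executions_failed`, the `statemachinearn` tag key, the ARN, and the threshold are all assumptions made for this example, and `build_failure_monitor_query` is a hypothetical helper. Check the metric and tag names that appear in your own account before relying on any of them.

```python
def build_failure_monitor_query(
    tag_filter: str,
    threshold: int,
    window: str = "last_5m",
    metric: str = "aws.states.executions_failed",  # assumed metric name
) -> str:
    """Return a metric monitor query that triggers when the summed count of
    `metric` over `window`, scoped to `tag_filter`, exceeds `threshold`."""
    return f"sum({window}):sum:{metric}{{{tag_filter}}}.as_count() > {threshold}"


# Placeholder ARN and assumed tag key, for illustration only.
query = build_failure_monitor_query(
    tag_filter="statemachinearn:arn:aws:states:us-east-1:123456789012:statemachine:order-workflow",
    threshold=5,
)
print(query)
```

Keeping the tag filter as a parameter lets the same query shape be scoped to one state machine, one step, or a whole environment.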

Deep visibility into each execution

If your Step Functions states are failing or underperforming, you’ll want to find out as quickly as possible. Datadog supports native distributed tracing and APM for Step Functions. By tracing across an entire state machine, as shown in the screenshot below, you’re able to visualize how long each state ran for and whether any errors occurred while executing the workflow.

new_step_functions03.png
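To illustrate the kind of per-state timing that a trace makes visible, the sketch below derives state durations from a list of execution history events shaped like those returned by the Step Functions `GetExecutionHistory` API. This is not how Datadog builds its traces. The sample events, state names, and timestamps are made up, and `state_durations` is a hypothetical helper that assumes each state is entered and exited once.

```python
from datetime import datetime, timedelta, timezone


def state_durations(events):
    """Return {state name: seconds spent} for states entered and exited once."""
    entered = {}
    durations = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if "stateEnteredEventDetails" in event:
            name = event["stateEnteredEventDetails"]["name"]
            entered[name] = event["timestamp"]
        elif "stateExitedEventDetails" in event:
            name = event["stateExitedEventDetails"]["name"]
            if name in entered:
                elapsed = event["timestamp"] - entered.pop(name)
                durations[name] = elapsed.total_seconds()
    return durations


# Made-up sample history: two Task states run one after the other.
start = datetime(2023, 8, 7, 12, 0, 0, tzinfo=timezone.utc)
sample_events = [
    {"type": "TaskStateEntered", "timestamp": start,
     "stateEnteredEventDetails": {"name": "ValidateOrder"}},
    {"type": "TaskStateExited", "timestamp": start + timedelta(milliseconds=400),
     "stateExitedEventDetails": {"name": "ValidateOrder"}},
    {"type": "TaskStateEntered", "timestamp": start + timedelta(milliseconds=400),
     "stateEnteredEventDetails": {"name": "ChargeCard"}},
    {"type": "TaskStateExited", "timestamp": start + timedelta(milliseconds=2900),
     "stateExitedEventDetails": {"name": "ChargeCard"}},
]

durations = state_durations(sample_events)
slowest = max(durations, key=durations.get)
print(durations, "slowest:", slowest)
```

With this sample data, `ChargeCard` accounts for most of the execution time, which is the sort of bottleneck a per-state view is meant to surface.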

Datadog’s native tracing also gives you visibility into trace data from your Lambda functions themselves, including how long they take to execute and how often they return errors. And since your state machines might be processing events at a high volume—over 100,000 per second in the case of Express Workflows, for example—you’ll need a way to find the most relevant traces for your investigation.

You can now also get a high-level overview of each state machine’s execution count, average duration, failures, and successful executions in our Serverless view. To drill down and better understand a state machine’s status, you can open the side panel to view recent traced executions, enhanced metrics, error tracking, and logs, all within a single, unified view.

Smarter error handling

AWS Step Functions gives you several ways to handle errors when executing steps, such as retrying an execution or passing the error to another state. Datadog’s AWS Step Functions integration helps you plan realistic error-handling strategies for your state machines.
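For context, these two strategies correspond to the `Retry` and `Catch` fields of a state in Amazon States Language. The sketch below shows a hypothetical task state that uses both, plus a small helper that computes the waits implied by `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The state names, ARN, and values are placeholders, `retry_delays` is a hypothetical helper, and it ignores optional settings such as a maximum delay or jitter.

```python
# Hypothetical task state: retry failed tasks with exponential backoff, then
# hand any remaining error to a fallback state (all names are placeholders).
charge_card_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
    "End": True,
}


def retry_delays(retrier):
    """Return the wait in seconds before each retry attempt of one retrier.

    Missing fields fall back to the Amazon States Language defaults:
    IntervalSeconds=1, MaxAttempts=3, BackoffRate=2.0.
    """
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    # The wait grows by a factor of `rate` after every failed attempt.
    return [interval * rate ** n for n in range(attempts)]


delays = retry_delays(charge_card_state["Retry"][0])
print(delays, "total wait:", sum(delays))
```

With these values, a persistently failing state spends 14 seconds waiting between retries before the `Catch` rule routes the error to `NotifyFailure`. Knowing which error types actually occur helps decide which ones belong in each `ErrorEquals` list.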

If something looks awry within an AWS Lambda function that is part of a Step Functions state machine—let’s say Watchdog has shown an increased error rate—you can explore your traces to get the context you need to start troubleshooting. From the Serverless view, just click the name of a Lambda function running as part of your state machine to see a list of traces—including any errors that Datadog discovers during execution. Once you know what kinds of errors your state machines are encountering, you can determine the best way to handle them.

Use the Datadog Serverless view to see error messages collected using distributed tracing and APM.

Full visibility, step by step

You can set up the AWS Step Functions integration and instrument your state machines right from your Datadog account to get full visibility into them. And since Datadog integrates with other AWS services you can run with Step Functions, like Amazon Simple Queue Service and Amazon SageMaker, you can inspect every state of your workflows, along with the rest of your infrastructure, in a single platform.

Don’t have a Datadog account yet? Sign up for a free trial.