What is AWS Step Functions? How it Works & Use Cases | Datadog

AWS Step Functions Overview

Learn how AWS Step Functions allows you to construct resilient application workflows.

What is AWS Step Functions?

AWS Step Functions is a serverless orchestration service that lets developers create and manage multi-step application workflows in the cloud. By using the service’s drag-and-drop visual editor, teams can easily assemble individual microservices into unified workflows. At each step of a given workflow, Step Functions manages input, output, error handling, and retries, so that developers can focus on higher-value business logic for their applications.

In this article, we’ll cover how AWS Step Functions works, the benefits and drawbacks of using this service, and the tools you can use to monitor your workflows and applications.

AWS Step Functions offers a visual builder for creating application workflows.

How AWS Step Functions Works

AWS Step Functions consists of the following main components:

State Machine

In computer science, a state machine is a model of computation that holds one of a set of states at any given time and moves between them based on inputs. AWS Step Functions builds on this concept and uses the term state machine to refer to an application workflow. Developers define a state machine in Step Functions by writing a JSON document in the Amazon States Language.
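To make the JSON definition concrete, here is a minimal sketch that builds a two-state Amazon States Language document in Python. The Lambda ARN, the state names, and the `build_definition` helper are hypothetical and exist only for illustration.

```python
import json

# Hypothetical Lambda function ARN, used only for illustration.
PROCESS_ORDER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder"


def build_definition(task_arn):
    """Return a two-state Amazon States Language definition as a dict."""
    return {
        "Comment": "Minimal example workflow",
        "StartAt": "ProcessOrder",  # name of the first state to run
        "States": {
            "ProcessOrder": {
                "Type": "Task",        # performs a unit of work
                "Resource": task_arn,  # what the task invokes
                "Next": "Done",        # transition to the next state
            },
            "Done": {"Type": "Succeed"},  # terminal state
        },
    }


definition = build_definition(PROCESS_ORDER_ARN)
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

The resulting JSON string is the kind of document you would supply as the state machine definition, for example through the console or the `definition` parameter of the `CreateStateMachine` API.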

Step Functions offers two workflow types. Standard workflows suit processes that are long-running (up to one year) or that require human intervention. Express workflows suit short-running (up to five minutes), high-volume processes.

State

A state represents a step in your workflow. States can perform a variety of functions:

  • Perform work in the state machine (Task state—see more information below)

  • Choose between different paths in a workflow (Choice state)

  • Stop the workflow with failure or success (a Fail or Succeed state)

  • Pass output or some fixed data to another state (Pass state)

  • Pause the workflow for a specified amount of time (Wait state)

  • Begin parallel branches of execution (Parallel state)

  • Repeat execution for each item of input (Map state)

The states that you decide to include in your state machine and the relationships between your states form the core of your Step Functions workflow.
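As a rough illustration of how states and their relationships fit together, the sketch below wires several of the state types listed above into one definition and checks that every transition points at a state that exists. The workflow itself, its state names, and both helper functions are hypothetical; the check only covers simple `Next`, `Default`, and top-level `Choices` transitions.

```python
# Hypothetical workflow: small amounts are approved, large ones wait and then fail.
definition = {
    "StartAt": "CheckAmount",
    "States": {
        "CheckAmount": {
            "Type": "Choice",  # choose between paths
            "Choices": [
                {"Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "CoolingOff"}
            ],
            "Default": "AutoApprove",
        },
        "CoolingOff": {"Type": "Wait", "Seconds": 60, "Next": "Reject"},  # pause
        "AutoApprove": {"Type": "Pass", "Result": {"approved": True}, "Next": "Approved"},
        "Approved": {"Type": "Succeed"},  # stop with success
        "Reject": {"Type": "Fail", "Error": "LimitExceeded", "Cause": "Amount too large"},
    },
}


def transition_targets(state):
    """Collect the state names that a single state can transition to."""
    targets = []
    if "Next" in state:
        targets.append(state["Next"])
    if "Default" in state:
        targets.append(state["Default"])
    for rule in state.get("Choices", []):
        if "Next" in rule:
            targets.append(rule["Next"])
    return targets


def undefined_targets(definition):
    """Return referenced state names that are missing from the definition."""
    names = set(definition["States"])
    referenced = {definition["StartAt"]}
    for state in definition["States"].values():
        referenced.update(transition_targets(state))
    return referenced - names


print(undefined_targets(definition))
```

A check like this is only a convenience for catching typos before deployment; the visual editor and the service itself validate definitions more thoroughly.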

Task State

A task state (typically just referred to as a task) within your state machine is used to complete a single unit of work. Tasks can be used to call the API actions of over two hundred Amazon and AWS services. Two types of tasks can be included in your workflows:

  • Activity tasks

    Activity tasks let you connect a step in your workflow to code that is running elsewhere. This external code, called an activity worker, polls Step Functions for work, completes the work asynchronously, and returns the results. Activity tasks are common in asynchronous workflows that require some human intervention (to verify a user account, for example).

  • Service tasks

    Service tasks let you connect steps in your workflow to specific AWS services. Step Functions sends a request to the service, waits for the task to complete, and then continues to the next step in the workflow. Service tasks work well for automated steps, such as executing a Lambda function.
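The poll, work, and report cycle of an activity worker can be sketched as follows. This is an illustration under stated assumptions rather than a production worker: `handle_one_task`, `handler`, and the default worker name are hypothetical, and the client is passed in so the function can be exercised without AWS access. The three client methods mirror the Step Functions `GetActivityTask`, `SendTaskSuccess`, and `SendTaskFailure` API actions as exposed by boto3.

```python
import json


def handle_one_task(sfn_client, activity_arn, handler, worker_name="example-worker"):
    """Poll once for an activity task, run `handler`, and report the outcome.

    Returns True if a task was processed and False if the poll returned no work.
    """
    response = sfn_client.get_activity_task(
        activityArn=activity_arn, workerName=worker_name
    )
    token = response.get("taskToken")
    if not token:
        return False  # the poll ended without any work being assigned

    try:
        result = handler(json.loads(response["input"]))
    except Exception as exc:
        # Report the failure so the workflow can apply its retry or catch rules.
        sfn_client.send_task_failure(
            taskToken=token, error=type(exc).__name__, cause=str(exc)
        )
    else:
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
    return True


# Possible usage with boto3 (assumes credentials and an existing activity):
#   import boto3
#   client = boto3.client("stepfunctions")
#   while True:
#       handle_one_task(client, ACTIVITY_ARN, verify_account)
```

Passing the client as a parameter is a design choice made here for testability; a real worker would also need to think about heartbeats, shutdown, and concurrency.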

Within your AWS console, you’ll be able to visualize and validate your state machine as a series of steps. As each step is executed, Step Functions logs its execution time, any input and output, the number of retries, and any errors that occur. This information allows engineering teams to easily understand which step or steps may have caused a workflow to fail and which steps led up to that failure.

Use Cases for AWS Step Functions

AWS Step Functions is useful for any engineering team that needs to build workflows across multiple AWS services. Use cases vary widely, from orchestrating serverless microservices, to building data-processing pipelines, to defining a security-incident response. As mentioned above, Step Functions can be used for both synchronous and asynchronous business processes. The following example shows an asynchronous Step Functions workflow for approving a credit-line increase. The workflow includes Amazon SNS components and several Lambda functions:

Sample Step Functions workflow for a credit line increase request requiring human approval.

Here’s an example of a synchronous workflow for running a data processing pipeline:

Sample Step Functions workflow for a synchronous data processing workflow.

In summary, AWS Step Functions can be used whenever teams need to define a business process as a series of steps.

Benefits of AWS Step Functions

No matter what use case is involved, AWS Step Functions enables engineering teams to economically construct complex workflows at scale. Billing is usage-based: for Standard workflows, you pay for each state transition, that is, each time a step in your workflow is executed, while Express workflows are billed by the number and duration of executions. Aside from pricing, the core benefits of using AWS Step Functions include the following:

  • Simplified orchestration of microservices-based applications

    AWS Step Functions orchestrates multiple steps in your application workflows. As your workflow executes, Step Functions tracks which step is being performed and which data is passed between steps, allowing your application to pick up where it left off in the event of a network failure.

  • Improved application resilience

    Step Functions manages the workflow steps, errors, and restarts to ensure that application tasks are executed as expected. With this improved application resilience, fewer user requests fail.

  • Reduced need for integration code

    When using Step Functions, engineers can spend less time writing integration code that defines the relationship between distributed application components. Step Functions automatically coordinates parallel processes, exception handling, retries, and timeouts based on your specified business logic.

  • Separate workflow and business logic

    Step Functions decouples the workflow logic that coordinates your application's steps from the business logic that each step implements. This separation allows teams to quickly modify workflows, scale components independently, and reuse workflow definitions across multiple applications.

Above all, the main benefit of AWS Step Functions is that it eliminates the need to manually orchestrate your application components and to manually define how they should work together. As a result, engineers can spend less time writing workflow code and focus more on higher-value business logic.
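The retry, timeout, and exception handling described above is declared on the state itself rather than written as integration code. The sketch below shows a Task state with `Retry` and `Catch` fields from the Amazon States Language, plus a small helper that estimates the waits between retries. The Lambda ARN, the state names, and the `retry_delays` helper are hypothetical, and the helper ignores optional settings such as a maximum delay or jitter.

```python
# Hypothetical Task state; the ARN and the "NotifyFailure" state are illustrative.
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ChargeCard",
    "TimeoutSeconds": 30,
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,      # number of retries before giving up
            "BackoffRate": 2.0,    # multiplier applied to each later wait
        }
    ],
    "Catch": [
        # After retries are exhausted, route any error to a fallback state.
        {"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}
    ],
    "Next": "Done",
}


def retry_delays(retrier):
    """Estimate the seconds waited before each retry under exponential backoff."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate ** n for n in range(attempts)]


print(retry_delays(task_state["Retry"][0]))
```

With the values shown, the estimated waits double on each attempt, which is the behavior the `BackoffRate` field is meant to express.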

Challenges of AWS Step Functions

Although AWS Step Functions makes it easier to create and manage complex workflows, the service also has the following limitations:

  • Application code that is harder to understand

    Decoupling business logic from workflow logic can make your application code harder to understand for teammates who need to modify or update it.

  • Proprietary language requirement

    State machines can be defined only in Amazon States Language, so engineers need to learn this language to use Step Functions.

  • AWS limits

    AWS imposes various functional limits on Step Functions. For instance, a maximum of 256 KB of data can be passed between states, the maximum execution time for a Standard workflow is one year, and execution history is retained for only 90 days.

  • Vendor lock-in

    If you decide to move away from AWS Step Functions in the future, you will have to redefine all of your application workflows manually or with a different vendor.

  • Monitoring limitations

    Each state machine exposes data to Amazon CloudWatch, but this built-in observability isn’t sufficient to monitor all of your functions and microservices. For end-to-end visibility, engineers must be able to see where the data originates before entering the state machine, and where the data ends up after exiting the state machine.

To avoid these challenges, some teams choose to manually orchestrate their application workflows. This option does provide more flexibility, but it can also be much more time-consuming and challenging to manually integrate several functions and services together.
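One of the limits listed above, the 256 KB payload cap, can be guarded against before data is handed to a workflow. The following sketch estimates the serialized size of a payload; the function names are hypothetical, and the size is an approximation because the exact bytes depend on how the payload is serialized when it is sent.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # the 256 KB limit cited above


def payload_size_bytes(payload):
    """Approximate the size of a payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_payload_limit(payload, limit=MAX_PAYLOAD_BYTES):
    """Return True if the serialized payload is within the limit."""
    return payload_size_bytes(payload) <= limit


small = {"id": 1}
large = {"blob": "x" * MAX_PAYLOAD_BYTES}
print(fits_payload_limit(small), fits_payload_limit(large))
```

When a payload does not fit, a common workaround is to store the data elsewhere, such as in Amazon S3, and pass only a reference to it through the workflow.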

AWS Step Functions Metrics

Amazon CloudWatch collects and reports the following basic metrics from Step Functions to help engineering teams keep tabs on their state machines:

  • Execution metrics

    The number of times a state machine started, succeeded, and failed, and the length of time it took for the state machine to complete its tasks the last time it ran.

  • Activity task metrics

    Granular metrics about how many activity tasks started, succeeded, and failed, and how long each activity task took to complete the last time it ran.

  • Service task metrics

    These metrics let you know how many service tasks have started, succeeded, and failed, and how long each service task took to complete the last time it ran.

  • Service metrics

    These metrics measure the load on the Step Functions service overall, as well as the service’s health and performance. For example, you can track how many requests all your Step Functions workflows are receiving during a specific period of time.

  • API metrics

    These metrics are associated with calls to the Step Functions API.
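As a sketch of how one of these metrics might be retrieved, the helper below builds the parameters for a CloudWatch `GetMetricStatistics` request for failed executions. Step Functions publishes its metrics under the `AWS/States` namespace; the function name, the five-minute period, and the ARN in the usage comment are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def failed_executions_query(state_machine_arn, hours=1, now=None):
    """Build GetMetricStatistics parameters for a state machine's failed executions."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,            # aggregate into five-minute buckets
        "Statistics": ["Sum"],    # total failures per bucket
    }


# Possible usage with boto3 (assumes credentials and an existing state machine):
#   import boto3
#   params = failed_executions_query(STATE_MACHINE_ARN, hours=6)
#   datapoints = boto3.client("cloudwatch").get_metric_statistics(**params)["Datapoints"]
```

Keeping the request as a plain dictionary makes it easy to swap in other execution metrics, such as `ExecutionsStarted` or `ExecutionTime`, by changing a single field.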

While these metrics provide some visibility into Step Functions workflows, most teams also need the ability to monitor each step and service within their workflows. For example, if your workflow includes a database call, you should be able to see traces and logs from the database service to identify the source of any errors. In addition, most teams need visibility into what happens before a request enters a Step Functions workflow and after it exits the workflow. Monitoring platforms like Datadog provide end-to-end visibility into application workflows.

Monitoring AWS Step Functions

Datadog allows teams to view data from their state machines alongside service and infrastructure data in one unified platform. Through Datadog’s AWS Step Functions integration, you can collect metrics related to your state machines’ starts, failures, and execution times, and the status of each state in a workflow. An out-of-the-box dashboard makes it easy to spot trends at a glance.

Datadog’s integration with AWS Step Functions gives you deep visibility into your state machines and states.

In addition, state machine metrics can be viewed in the context of metrics from other services (e.g., AWS Lambda, Amazon SQS, API Gateway, and many more) that your application calls before, within, or after your Step Functions workflow completes. When an error occurs, you can easily pivot between metrics, traces, and logs from any service to investigate issues and address them. With Datadog, you have everything you need in one platform to elevate application performance for your users.

In Datadog, teams can view end-to-end traces for Step Functions workflows that invoke Lambda functions and other AWS services.