AWS originally released Step Functions in late 2016. Despite the service having been available for five years now, there has been very little buzz around it on the web. Even after browsing through countless “Top x most commonly used AWS services” articles, you’d be hard-pressed to find a single mention of step functions, or any basic tutorial on how to set them up beyond AWS’ official documentation. In this post, I will guide you through what step functions are, how they are generally used, and their benefits and drawbacks.
To put it simply, AWS step functions allow us to visually create workflows that follow a set of predefined steps to fulfill a specific goal by the end of the workflow execution. A workflow can be built with the Workflow Studio or defined through a schema we provide. This schema is written in Amazon States Language (ASL), a JSON-based structured language for describing workflow states.
Step functions can handle failures and retries for us, which would otherwise be a significant time investment. This allows development teams to focus their efforts on higher-value priorities and business logic.
Step functions can execute tasks on plenty of AWS services, but for the sake of brevity, this article will mostly focus on their interoperability with Lambdas.
A step function starts off by reading the ASL schema provided to it.
Each state (step) requires a name that’s unique within the state machine’s schema and states are referred to by name. These states are essentially steps that the state machine will go through, each of them representing a task it must accomplish.
The schema will have the following 3 properties at its root:
- Comment: String - A quick description of the state machine and its uses
- StartAt: String - The starting point with the value being the first state’s name
- States: Object - All the states within the state machine, keyed by state name
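As a rough illustration of those root properties, here is a minimal sketch that assembles a one-state schema as a Python dictionary and prints it as ASL JSON. The state name `SayHello` and the comment text are made up for this example. Note that `States` is a JSON object keyed by state name, which is how states can be referred to by name.

```python
import json

# Minimal state machine schema with the three root properties.
# "SayHello" is a hypothetical state name chosen for this example.
definition = {
    "Comment": "A minimal example state machine",
    "StartAt": "SayHello",
    "States": {
        "SayHello": {
            "Type": "Pass",  # passes its input straight through as output
            "End": True,     # serialized as `true` in the JSON output
        }
    },
}

asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

The printed JSON is the kind of document you could paste into the console's definition editor, assuming the rest of the state machine is configured there.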
In general, most states will have at least the following two properties: Type and Next. The former is a string representing the type of work the state performs, while the latter points to the next state once the current one has finished executing.
The Type value must be one of AWS’ predefined values:
- Task - Executes a task to do some work, usually used to execute a Lambda function
- Choice - Branching paths, selects one based on the provided condition
- Wait - Wait for a specific amount of time or until a certain date
- Parallel - Executes multiple branching paths at the same time
- Map - Executes a set of steps for each item within a provided input array
- Pass - Does nothing, takes the input and sends it out as the output, helpful for debugging
- Succeed/Fail - Ends the execution, doesn’t require a Next property like the other states
When the state machine gets executed, it will first execute the state named in the StartAt property. Once that’s done, it will look at the Next property and work its way through the states until it reaches either a state with the property End set to true or one that has a Type of Succeed or Fail.
Take note that when creating a new step function in the AWS console, you’ll be presented with the option to pick between Standard (default) and Express workflows. The former is mostly used for long-running tasks that may take an extended period of time to complete, while the latter is mostly used for high-volume, frequent tasks.
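The traversal described above can be sketched locally in a few lines. This is only a toy walker for linear workflows, not how the service is implemented: it ignores Choice, Parallel and Map states and does not run any task. The state names are hypothetical.

```python
def walk(definition, max_steps=100):
    """Return the state names visited, following StartAt and then Next."""
    states = definition["States"]
    name = definition["StartAt"]
    visited = []
    for _ in range(max_steps):
        state = states[name]
        visited.append(name)
        # A state is terminal if it has End set or is a Succeed/Fail state.
        if state.get("End") or state["Type"] in ("Succeed", "Fail"):
            return visited
        name = state["Next"]
    raise RuntimeError("no terminal state reached")


example = {
    "StartAt": "First",
    "States": {
        "First": {"Type": "Pass", "Next": "Second"},
        "Second": {"Type": "Wait", "Seconds": 1, "Next": "Done"},
        "Done": {"Type": "Succeed"},
    },
}

path = walk(example)
print(path)  # ['First', 'Second', 'Done']
```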
There are many benefits to using step functions. Here are some that we would like to highlight:
- Heavily facilitates the orchestration of multiple Lambdas and services to perform a specific task.
- Allows developers to focus more on business logic while the state machine handles orchestration and error handling of the workflow.
- This separation will also allow Lambdas to be more reusable as they will no longer be strongly coupled and can be used independently.
- The visual graph generated from the step function schema allows developers to easily explain concepts and workflows to non-technical users.
- Integrates natively with other AWS services and scales automatically with changing workloads.
- (Standard workflows only) Provides execution logs detailing every state within a state machine’s execution for easier debugging (full input/output logs).
- Error handling can be done within the state machine rather than within the Lambda function, but a combination of both is ideal.
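To make the error-handling point more concrete, here is a sketch of a Task state that uses the ASL `Retry` and `Catch` fields. The Lambda ARN and the state names `HandleFailure` and `NextStep` are placeholders invented for this example, and the retry values are arbitrary.

```python
import json

# Hypothetical Lambda ARN and state names, for illustration only.
task_state = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:example-function",
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,      # give up after three retries
            "BackoffRate": 2.0,    # multiply the wait by 2 on each retry
        }
    ],
    "Catch": [
        # If the retries are exhausted (or any other error occurs),
        # route the execution to a dedicated failure-handling state.
        {"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}
    ],
    "Next": "NextStep",
}

print(json.dumps(task_state, indent=2))
```

With a setup like this, the Lambda itself can stay focused on its own validation and errors, while retry timing and fallback routing live in the state machine.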
Here are some drawbacks to consider:
- Extra costs - you can expect to pay $1 per 40 000 state transitions with Standard workflows, on top of the cost of the other services that the step function interacts with.
- Step function payloads are limited to 256 KB, so passing files through the state machine is not recommended (passing a presigned S3 URL instead would be better).
- Vendor lock-in. Amazon States Language is unique to AWS and since step functions handle all of the coordination between AWS services (Lambdas), moving away from AWS to another vendor will be harder since you’ll need to rewrite all of the orchestration done by AWS.
- Choice state types only support basic conditions such as number comparisons, whether or not a variable is present or if a string is equal to something.
More complex conditions are not supported. A workaround is to add a Lambda step that calculates the condition and includes the result in its response. The state machine can then use that result as the condition and remove it at the start of the next state.
- (Standard workflow only) Standard workflows are used for long-term tasks and are asynchronous. When a standard step function gets executed, we only get the executionARN and the timestamp of when the execution started instead of a response.
- (Express workflow only) Express workflows lack the detailed execution logs present with Standard workflows, so debugging them can become a nightmare due to the lack of information.
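The Choice-state workaround mentioned in the drawbacks above could look roughly like the sketch below. The handler `evaluate_condition`, the field `conditionMet`, the rule it applies and the state names are all hypothetical. The `next_state` helper is only a local stand-in that mimics how a Choice state with a single `BooleanEquals` rule would pick a branch.

```python
# Hypothetical Lambda handler: computes a condition too complex for a Choice
# state and returns it as a plain boolean field alongside the original input.
def evaluate_condition(event):
    words = event.get("message", "").split()
    event_with_flag = dict(event)
    event_with_flag["conditionMet"] = len(set(words)) >= 3  # arbitrary rule
    return event_with_flag


# Choice state that only needs a basic boolean comparison on that field.
decide_state = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.conditionMet", "BooleanEquals": True, "Next": "Proceed"}
    ],
    "Default": "Skip",
}


def next_state(choice_state, data):
    """Local stand-in for how the Choice state would pick a branch."""
    for rule in choice_state["Choices"]:
        field = rule["Variable"][2:]  # strip the leading "$."
        if data.get(field) == rule["BooleanEquals"]:
            return rule["Next"]
    return choice_state["Default"]


result = evaluate_condition({"message": "one two three"})
print(next_state(decide_state, result))  # Proceed
```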
As with any other service, you want to know how well it performs and how well it scales with a changing workload before using it in a production environment. You’ll be happy to know that there is very little negative impact on execution time when a step function is called from an API gateway. Step functions also scale automatically, so you won’t have to worry about that either.
Here’s a real-world example: let’s say we have an application that sends messages to connected users. Each message must be recorded, and it must be filtered before being sent.
We’ll be executing the state machine from an endpoint on an API gateway.
The step function itself is relatively simple, but involves a handful of other services such as DynamoDB read/write, S3 bucket access, Lambda calls and web socket message propagation. Do note that it is using the Express workflow.
First, it will pass the payload to a Lambda in Received Message, which reorganizes the data into an easier-to-read format. It will then execute two Lambdas in parallel: Save Message to DB and Filter Message.
Save Message to DB saves the message to DynamoDB, while Filter Message fetches a file from S3 to be used as a blacklist to filter the message content against. After the filtering, Propagate Message handles sending the message to every websocket client that is currently connected. Message Sent is simply a state of type Pass with End set to true to mark the end of the state machine.
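For reference, the workflow described above might be expressed along these lines. This is a sketch reconstructed from the description, not the actual schema used in the test: the region, account ID, Lambda function names and the name of the Parallel state (`Save and Filter`) are all assumptions.

```python
import json


def lambda_arn(name):
    # Hypothetical region, account ID and function names.
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


def single_task_branch(state_name, function_name):
    """Build a Parallel branch that consists of one Lambda task."""
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": lambda_arn(function_name),
                "End": True,
            }
        },
    }


definition = {
    "Comment": "Records, filters and propagates a message",
    "StartAt": "Received Message",
    "States": {
        "Received Message": {
            "Type": "Task",
            "Resource": lambda_arn("received-message"),
            "Next": "Save and Filter",
        },
        "Save and Filter": {
            "Type": "Parallel",
            "Branches": [
                single_task_branch("Save Message to DB", "save-message"),
                single_task_branch("Filter Message", "filter-message"),
            ],
            "Next": "Propagate Message",
        },
        "Propagate Message": {
            "Type": "Task",
            "Resource": lambda_arn("propagate-message"),
            "Next": "Message Sent",
        },
        "Message Sent": {"Type": "Pass", "End": True},
    },
}

print(json.dumps(definition, indent=2))
```

One thing to keep in mind with a Parallel state is that its output is an array containing one entry per branch, so the state that follows has to pick the filtered message out of that array.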
Now, let’s look at the K6 load test execution graph below:
We put the endpoint through a 12-minute load test that ramped up from 10 to 50 concurrent users, as shown by the purple line. As you can see, the execution time never shifts dramatically over the course of the test, with the exception of the first execution, which is slower because the Lambda goes through a cold start.
Here’s a more detailed look at the execution times
Despite the step function involving multiple different services and asynchronous operations, the average execution time was only 176ms with 95% of those 16 631 requests taking under 313ms.
To verify the integrity of the results, we ran the same test 5 times and determined a deviation of only 5 to 15ms.
Now, all of this is not to say that a step function is a free orchestration tool with no performance downside. Whenever a state transition takes place, the state machine triggers some pre- and post-execution activities. While the performance cost isn’t high, it’s still something to keep in mind, especially for state machines that will be executed frequently. See below:
This is the execution log of a very simple state machine with 2 states. The first state executes a lambda that simply returns a string and the next state is of type Pass which marks the end of the execution.
You’ll notice that there’s a 75ms delay between the state machine execution start and the Lambda being scheduled for execution, with an extra 19ms until the actual Lambda execution. Afterwards, you’ll notice that there’s a delay of 21ms between the end of the Lambda execution and the scheduling of the next state (step 6 to 7).
Before bringing up the numbers, it must be noted that Standard and Express workflows are billed differently.
Standard workflows are billed per state transition, that is, whenever the state machine goes from one state to another. The first 4000 state transitions each month are free, but subsequent transitions cost $0.000025 each, or $0.25 per 10 000.
Express workflows, on the other hand, are billed based on the number of requests and their duration. Each request costs $0.000001, or $1 per 1 million requests.
The duration of each Express request also adds to the price and follows these rules:
- $0.00001667 per GB-Second ($0.0600 per GB-hour) for the first 1,000 GB-hours
- $0.00000833 per GB-Second ($0.0300 per GB-hour) for the next 4,000 GB-hours
- $0.00000456 per GB-Second ($0.01642 per GB-hour) beyond that
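The Express pricing rules above can be turned into a small estimator. This is a simplified sketch that uses only the figures quoted in this post. It takes the total GB-hours as a given input and does not model how AWS measures memory or rounds durations. The function name and the sample workload are made up.

```python
REQUEST_PRICE = 0.000001  # USD per request ($1 per 1 million requests)

# (tier size in GB-hours, USD per GB-hour); None means "everything beyond"
DURATION_TIERS = [(1_000, 0.0600), (4_000, 0.0300), (None, 0.01642)]


def express_cost(requests, gb_hours):
    """Estimate the Express workflow cost for a given workload."""
    cost = requests * REQUEST_PRICE
    remaining = gb_hours
    for size, price in DURATION_TIERS:
        used = remaining if size is None else min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost


# 1 million requests consuming 1,500 GB-hours in total:
# $1 for requests + 1,000 x $0.06 + 500 x $0.03
print(f"{express_cost(1_000_000, 1_500):.2f}")  # 76.00
```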
You can find the full pricing guide on AWS’ website.
These numbers might seem low on paper, but they start stacking up quickly when you have a large user base. Let’s take the following scenario as an example:
- You have an application with 10 000 daily users
- Your application’s main feature uses a Standard workflow step function
- The state machine has 7 total state transitions it must go through
- A user, on average, will execute this state machine 12 times per day
Knowing this, each user triggers 7 × 12 = 84 state transitions per day, which costs $0.0021 daily per user. For a user base of 10 000, that comes to $21 a day, or $7665 a year. This price doesn’t even take into account all the other services your state machine will be communicating with.
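The arithmetic behind that scenario is easy to reproduce and tweak for your own numbers. The sketch below uses the Standard workflow price quoted earlier and, like the scenario itself, ignores the free tier. The function name is made up for this example.

```python
PRICE_PER_TRANSITION = 0.000025  # USD, Standard workflows


def standard_daily_cost(users, executions_per_user, transitions_per_execution):
    """Daily cost of state transitions alone, ignoring the free tier."""
    transitions = users * executions_per_user * transitions_per_execution
    return transitions * PRICE_PER_TRANSITION


daily = standard_daily_cost(10_000, 12, 7)
yearly = daily * 365
print(f"{daily:.2f} {yearly:.2f}")  # 21.00 7665.00
```

Doubling the number of state transitions per execution, for instance to add features, doubles the bill, which is why this figure is worth re-checking as a state machine grows.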
There are additional factors that can add to the price in the future, such as having a growing user base or an increase in state transitions per state machine to add features.
While step functions aren’t something you’d throw into every new app you develop, especially considering the extra cost, they are worth considering if you have a proper use case for them.
For example, a Standard workflow can handle cases where an administrator needs to approve a user’s action before the user is able to proceed. Such an approval is likely to take longer than a Lambda’s 15-minute limit, whereas a Standard workflow can run for up to 1 year. The state machine can be paused until the approval is given, after which it resumes and sends a notification to the user. All of this is handled by the state machine, so you won’t have to add a “status” field to your database to keep track of it.
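One way such an approval step could be modelled is with the task-token callback pattern, where a Task state pauses until it is called back with a token. The sketch below is an assumption about how the approval example might be built, not a description of an existing implementation: the function name `notify-administrator`, the payload field names, the next state `Notify User` and the one-week timeout are all hypothetical.

```python
import json

approval_state = {
    "Type": "Task",
    # The ".waitForTaskToken" suffix pauses the state until the token
    # is sent back to Step Functions.
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "notify-administrator",  # hypothetical Lambda
        "Payload": {
            "request.$": "$",                # the state's input
            "taskToken.$": "$$.Task.Token",  # token taken from the context object
        },
    },
    "TimeoutSeconds": 7 * 24 * 60 * 60,  # give up after one week (arbitrary)
    "Next": "Notify User",
}

print(json.dumps(approval_state, indent=2))
```

In this setup, whatever handles the administrator's decision would report it back through the `SendTaskSuccess` or `SendTaskFailure` API using the stored token, at which point the execution resumes.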
Step functions are best reserved for the following kinds of workflows:
- Business-critical logic (such as payment handling)
- Complex: a workflow that needs to orchestrate plenty of services that are spread out
- Potentially very long: some services may take a long time to process (see also the example above)
- Potentially requiring human interaction at some point (example above)
Step functions should only be implemented if you have a proper use case for them or if your application can benefit from their features. The criteria listed above are a good way to determine whether that is the case for your project.
Photo credit: Philipp Katzenberger