Using Step Functions to build CloudFormation custom resources
Photo by Serban Mestecaneanu on Unsplash
My team uses AWS CloudFormation to provision our cloud infrastructure using code. Most of the time we can get what we need with the set of resources that AWS provides.
However, sometimes CloudFormation support for a service or a particular feature takes a while to arrive, and we need to fill in the gap ourselves. CloudFormation gives us the ability to fill these gaps by building “custom resources” that can literally run any logic you need in AWS Lambda, and we’ve used these when we needed to.
Sometimes Lambda isn’t the right answer
Sometimes even Lambda fails us, though, as some resources can potentially take a long time to set up, and we don’t particularly want to have a Lambda function sitting idle or potentially timing out halfway through the resource setup.
The great thing about Lambda functions is that you only pay for them when they’re running. This makes them great for places where you need to run a quick task or handle an API request. You lose some of the benefits when your Lambda function looks like this:
    import time

    start_something()
    while not_done():
        time.sleep(30)  # the function is billed for every second it sleeps
    finish()
because you’re paying for the Lambda to run while you’re waiting for your operation to complete. Worse, Lambda functions have a maximum lifetime of 15 minutes, so if your process takes longer than 15 minutes, you have to do weird hacks to make it work with a pure Lambda solution.
When we found out that CloudFormation didn’t have support for creating DynamoDB Global Tables, our first thought was “we know how to do this, a Lambda function custom resource can handle it.” However, as we dug into the details and tried it out, we quickly learned that creating the initial replica set or updating the replicas could easily exceed the 15-minute Lambda timeout.
Step Functions to the rescue!
Here’s where AWS Step Functions comes in. Step Functions makes the task of orchestrating processes easier. It has built-in support for looping, waiting, and integrating with different functions and services, which makes it perfect for this sort of thing.
One of our team members put together this Step Functions state machine definition for creating a DynamoDB global table. It starts by checking the state of the table and waiting until the table is ready for updates, then compares the set of replicas with the desired set. You can only add one replica at a time, so the state machine repeats the UpdateReplicas step until the actual state matches the desired state. Each step is very small and self-contained, usually only one or two API calls, and all of the waiting is done by Step Functions instead of in the Lambda function, so we’re not paying for idle time! Best of all, a Standard workflow in Step Functions can run for up to a year, so we didn’t need to worry about the 15-minute timeout any more.
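To make the check, wait, and update loop concrete, here is a minimal sketch of what such a state machine could look like in Amazon States Language, built as a Python dictionary. This is not our team's actual definition: apart from UpdateReplicas, the state names, the Lambda ARNs, and the `$.table.status` and `$.table.replicasMatch` fields are assumptions made up for illustration.

```python
import json

# Hypothetical Lambda ARNs; the real functions are not shown in this post.
CHECK_TABLE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:check-table"
UPDATE_REPLICAS_ARN = "arn:aws:lambda:us-east-1:123456789012:function:update-replicas"

definition = {
    "Comment": "Sketch: converge a DynamoDB global table on a desired replica set",
    "StartAt": "CheckTable",
    "States": {
        # Small Lambda call: assumed to return the table status and whether
        # the actual replicas already match the desired ones.
        "CheckTable": {
            "Type": "Task",
            "Resource": CHECK_TABLE_ARN,
            "ResultPath": "$.table",
            "Next": "IsTableReady",
        },
        "IsTableReady": {
            "Type": "Choice",
            "Choices": [
                # Table is busy: wait and check again.
                {
                    "Not": {"Variable": "$.table.status", "StringEquals": "ACTIVE"},
                    "Next": "WaitForTable",
                },
                # Table is ready and replicas match: nothing left to do.
                {
                    "Variable": "$.table.replicasMatch",
                    "BooleanEquals": True,
                    "Next": "Done",
                },
            ],
            # Table is ready but replicas differ: add or remove one replica.
            "Default": "UpdateReplicas",
        },
        # The waiting happens here, inside Step Functions, not in a Lambda.
        "WaitForTable": {"Type": "Wait", "Seconds": 30, "Next": "CheckTable"},
        "UpdateReplicas": {
            "Type": "Task",
            "Resource": UPDATE_REPLICAS_ARN,
            "ResultPath": "$.update",
            "Next": "WaitForTable",
        },
        "Done": {"Type": "Succeed"},
    },
}


def transition_targets(definition):
    """Collect every state name that the start or some transition points at."""
    targets = {definition["StartAt"]}
    for state in definition["States"].values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets


if __name__ == "__main__":
    # The JSON printed here is what you would supply as the state machine definition.
    print(json.dumps(definition, indent=2))
```

Because UpdateReplicas loops back through WaitForTable and CheckTable, the state machine keeps changing one replica at a time until the Choice state sees that the actual and desired sets agree.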
There’s a small catch…
I wish I could use Step Functions directly to build CloudFormation custom resources instead of having to have a Lambda function that triggers the Step Function. #awswishlist
— Geoff Baskwill (@geoff_baskwill) February 24, 2021
CloudFormation doesn’t support direct integration with Step Functions as a custom resource provider yet, but we can use our old Lambda function trick to trigger the Step Function execution, and send the response back to CloudFormation when we get to an end state.
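A rough sketch of that trigger function might look like the following. It is an illustration under assumptions, not our production code: the `STATE_MACHINE_ARN` environment variable, the helper names, and the `failed-to-start` placeholder ID are all hypothetical. The event fields and the response body follow the CloudFormation custom resource request and response format, and the response is sent with an HTTP PUT to the presigned `ResponseURL`.

```python
import json
import os
import urllib.request


def build_execution_input(event):
    """Pass along everything the state machine needs to answer CloudFormation later."""
    keys = ("RequestType", "ResponseURL", "StackId", "RequestId",
            "LogicalResourceId", "PhysicalResourceId", "ResourceProperties")
    return json.dumps({key: event[key] for key in keys if key in event})


def build_response(event, status, physical_id, reason=""):
    """Build the body CloudFormation expects; status is "SUCCESS" or "FAILED"."""
    return {
        "Status": status,
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }


def send_response(event, status, physical_id, reason=""):
    """PUT the response to the presigned URL that CloudFormation supplied."""
    body = json.dumps(build_response(event, status, physical_id, reason)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # An empty content type stops urllib from adding a form-encoded default.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def handler(event, context, sfn=None):
    """Lambda entry point: start the execution and return right away."""
    if sfn is None:
        import boto3  # included in the AWS Lambda Python runtime
        sfn = boto3.client("stepfunctions")
    try:
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # assumed env var
            name=event["RequestId"],  # unique per CloudFormation request
            input=build_execution_input(event),
        )
    except Exception as error:
        # If the execution never starts, tell CloudFormation so the stack
        # does not sit waiting until the custom resource times out.
        physical_id = event.get("PhysicalResourceId", "failed-to-start")
        send_response(event, "FAILED", physical_id, str(error))
        raise
```

In this sketch the trigger only reports failures to start. The success or failure of the actual work would be reported by a final small Lambda step in the state machine, which could reuse `send_response` with the `ResponseURL` carried in the execution input.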
Wrapping up
When you love infrastructure-as-code and need a custom resource for something that CloudFormation doesn’t support, Lambda is usually a great solution. When you need a bigger hammer for complex orchestration or operations with lots of idle time, Step Functions can help get you there.
The goal is to retire this particular resource soon, as AWS tells us that they’ll have built-in support in CloudFormation for DynamoDB Global Tables in the very near future. That said, my team is happy that we were able to deliver an initial implementation with this workaround and provide value to our customers!