Thursday, February 18, 2021

AWS CloudFormation Multi-Region Failover using DynamoDB v2019 and Lambda

TL;DR

Multi-region failover with DynamoDB Global Tables (v2019) and Lambda, done entirely in CloudFormation. The working templates are posted on GitHub; see the Solution section below.

Rant

Scouring the internet for a solution for multi-region failover using a DynamoDB Global Table (v2019) in CloudFormation turned up basically nothing. Sure, I found some solutions using an antiquated version of DynamoDB Global Tables, and that sent me down a rabbit trail until I realized it was out of date.

One would think that AWS would have at least an example laying around somewhere so you aren't left trying to reinvent the wheel.

I tried posting a solution on Stack Overflow, but it got bounced because it references off-site resources (GitHub), and this example is far too complicated to fit into a single SO answer. I'm starting to see why it's so hard to find good resources on common AWS questions.

Solution

After lots of trial and error, I've come up with a solution and posted it on GitHub here.

I'm new to applying licenses to code, so let me know if I did it wrong or should've chosen a different license.

The repository is a collection of common templates that will likely fill out over the years.

Here were our requirements:

  • CloudFormation
  • NodeJS
  • Multi-Region failover
  • Serverless (Lambda)
  • Low-intensity check to see if we're running in the active region (we went with an environment variable that the Lambda could check)
  • No manual creation of resources in the AWS Console (production requirement)

The basic setup is a global config table (DynamoDB) that streams edits to every region we operate in. Each region checks what the new active region is and then updates its resources to match the new state. In this example, that just means updating the Lambda environment variable REGION_STATUS to either active or inactive.
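The region-update step can be sketched as a stream-triggered Lambda that inspects each change to the config table. The record shape follows DynamoDB Streams with NewImage enabled; the key name `id` and the `active-region` item are assumptions about the config table's schema, so adjust them to match the actual template.

```javascript
// Decide this region's status from a DynamoDB Streams record.
// Returns 'active', 'inactive', or null if the record isn't the
// active-region key (assumed schema: { id, value }).
function statusForRegion(streamRecord, currentRegion) {
  const img = streamRecord.dynamodb && streamRecord.dynamodb.NewImage;
  if (!img || !img.id || img.id.S !== 'active-region') {
    return null; // some other config key changed; ignore it
  }
  return img.value.S === currentRegion ? 'active' : 'inactive';
}

// In the real handler, a non-null status would be written back to the
// worker Lambda with UpdateFunctionConfiguration (AWS SDK), e.g.:
//   new AWS.Lambda().updateFunctionConfiguration({
//     FunctionName: 'worker',
//     Environment: { Variables: { REGION_STATUS: status } },
//   }).promise();
```

Since the Global Table replicates the write to every region, each region's stream handler runs independently and flips only its own Lambda's environment variable.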
In our final implementation, we're enabling and disabling alarms, schedules, events, etc. But those examples add a lot of extra complexity to the template, so I decided to leave them out in favor of a solution that's as simple as possible while still being complex enough to be useful.

After the stacks are deployed to the various regions, just add or edit the active-region key with the value of the region you want, and Bob's your uncle. After about seven seconds, the failover completes and everything is updated.
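Triggering a failover boils down to a single PutItem on the config table. The helper below builds the request parameters; the table name and the `id`/`value` attribute names are assumptions, so match them to whatever the template actually creates.

```javascript
// Build DynamoDB PutItem params that flip the active region.
// Writing this item is the entire failover trigger -- replication and
// the stream handlers in each region do the rest.
function failoverParams(tableName, newActiveRegion) {
  return {
    TableName: tableName,
    Item: {
      id: { S: 'active-region' },   // assumed partition key
      value: { S: newActiveRegion } // e.g. 'us-west-2'
    },
  };
}

// These params would be handed to the AWS SDK, e.g.:
//   new AWS.DynamoDB().putItem(failoverParams('global-config', 'us-west-2')).promise();
```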