Safely migrating a Ruby app from Heroku to AWS Serverless

May 04, 2022

This is a multi-part series on how to safely refactor and migrate a Ruby-based app from Heroku to running entirely serverless on AWS. Each step is incremental and self-contained, with special attention given to avoiding any wholesale lift-and-shift of the application and the risks such large changes introduce. This is the distillation of more than a year of lessons learned, though that timeline is mostly a by-product of my complete lack of urgency in completing it. You could follow this playbook as quickly as your own urgency and confidence allow.

I've a project that's been running happily on Heroku for many years now. It's mostly been a set-and-forget app, but from time to time it requires some attention. It's also a permanent fixture on my credit card statement every month, much to my chagrin. It's only a few hundred dollars, which isn't by any measure an enterprise-level spend, but it's always felt disproportionate to the actual needs of the app. It's very bursty in terms of the traffic it serves. When it gets traffic it might service several thousand requests per hour. But between those times... nothing. Literally nothing. There's no baseline load. Yet it's always running 2x dynos, a couple more for async workers, and a production database. This is exactly the type of app that's a perfect fit for a more serverless, scale-on-demand approach.

Step 0 - Define a pragmatic plan

Now, this is my plan. It worked well for me precisely because I wanted to re-architect the implementation anyway. It might not be the right plan for your app, though. I hope that in going through this you find some inspiration rather than a blueprint to copy. There are aspects of this that could likely be adapted.

Overall, what I hope you take away is that there's a path to migrating things safely. In small steps. With no downtime. And you can have it happen as quickly or slowly as you like. My migration took many months! Not because it was hard, but because there was no urgency. I reaped many of the cost and performance benefits early, and so the last bits just dragged on because... well, they could.

Step 0.a - Create a Terraform Cloud account

Trust me, it's going to be easier to let HashiCorp take care of some of the moving parts here. Everything we need is available on the free tier. Sign up for Terraform Cloud and create a new organization. We'll use it later on.

Step 0.b - Importing the existing Heroku config into Terraform

To make this transition easier, and to always have a reversible audit trail of the changes applied at any point, I imported the existing Heroku app into Terraform so that I could manage it via code. Besides giving me a single approach to managing the infrastructure irrespective of where it's running, it also made it more straightforward to connect the required contact points without manually copying and pasting values between different systems.

It starts by setting up a new Terraform project and adding the Heroku provider:

# terraform.tf
terraform {
  cloud {
    organization = "the-name-of-the-organization-you-created-on-terraform-cloud"

    workspaces {
      name = "the-name-of-your-app-or-project-goes-here"
    }
  }
}
# providers.tf
terraform {
  required_providers {
    heroku = {
      source = "heroku/heroku"
    }
  }
}
provider "heroku" {
}

Now that you've created the bare minimum required for this project, run terraform init to set up the project and load the providers. Next it's time to create the basic Terraform config for your Heroku app. Here's a pseudo-example of mine (consult the official Terraform Heroku Provider Docs for all of the available options):

# heroku.tf
resource "heroku_app" "your-app-name" {
  name    = "your-app-name"
  region  = "us"
  acm     = true
  stack   = "heroku-18"

  config_vars = {
  }
}

Now you can sync the current state of your app with the Terraform config you created by importing it:

$ terraform import heroku_app.your-app-name your-app-name

Step 1 - Switching the database to DynamoDB

With the knowledge that this will all eventually be a lambda-based workload, keeping Postgres in the mix was going to be a problem. It also wasn't going to solve the "paying for things that are always running but rarely used" problem. DynamoDB On-Demand, on the other hand, does exactly that.

This also happens to be one of the most impactful changes in terms of reducing costs from our perspective!

Step 1.a - Creating a DynamoDB instance

If you're not already familiar with DynamoDB this is going to be a big conceptual change! I'd very strongly recommend you get familiar with it first and realize how you'll have to rethink how you do almost everything! As you'll see, though, I ended up running Postgres and DynamoDB side by side, even having data for a single table split across both stores at various points, without the app missing a beat. So it's not like you need to burn everything down and start again. You do need to know your data access patterns though.

I'd also strongly recommend that if you're new to DynamoDB you watch a bunch of Rick Houlihan's videos on YouTube to see how you should adapt your thinking coming from an RDBMS world.
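The core of that mental shift: instead of arbitrary WHERE clauses, you fetch items by a known key, and every other lookup needs its own index. Here's a toy illustration in plain Ruby (hashes and example values stand in for a table and a GSI; none of this is real DynamoDB code):

```ruby
# A "table" keyed by partition key, like DynamoDB's primary lookup
accounts_by_id = {
  "uuid-1" => { "id" => "uuid-1", "email_address" => "a@example.com" },
  "uuid-2" => { "id" => "uuid-2", "email_address" => "b@example.com" }
}

# A "GSI": a second projection of the data, organised by email_address
email_index = accounts_by_id.values
  .group_by { |item| item["email_address"] }
  .transform_values { |items| items.map { |i| i["id"] } }

# SQL's `WHERE email_address = ?` becomes: query the index, then fetch by key
ids = email_index.fetch("a@example.com", [])
accounts = ids.map { |id| accounts_by_id[id] }
```

If a lookup pattern isn't served by a key or an index, it effectively doesn't exist, which is why you need to know your access patterns up front.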

Once you know how you're going to structure your data, it's time to start creating the tables! My first one was one of the most critical: my accounts table. As I had a fairly bursty workload I didn't want to pay for pre-provisioned capacity I wasn't going to use, so I opted for the pay-per-request pricing option. (All up, everything we end up with here costs me a few dollars per month, down from the multiple hundreds it cost on Heroku.)

resource "aws_dynamodb_table" "accounts" {
  name             = "accounts"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "id"
  stream_enabled   = true
  stream_view_type = "NEW_IMAGE"

  attribute {
    name = "id"
    type = "S"
  }

  attribute {
    name = "email_address"
    type = "S"
  }

  attribute {
    name = "active"
    type = "N"
  }

  # Every attribute defined here must back a key or an index, so define the
  # GSI used later to look accounts up by email address
  global_secondary_index {
    name            = "AccountsEmailActive"
    hash_key        = "email_address"
    range_key       = "active"
    projection_type = "KEYS_ONLY"
  }
}

Step 1.b - Granting permissions to DynamoDB from your Heroku app

And here's where the cross-provider magic of Terraform came in to shine. Defining a DynamoDB table in AWS is one thing, but I now somehow have to grant my Heroku app access to it. I'll explain the specifics shortly; here's how I did it:

resource "aws_iam_user" "heroku" {
  name = "heroku-user"
}

resource "aws_iam_access_key" "heroku" {
  user = aws_iam_user.heroku.name
}

resource "aws_iam_user_policy" "heroku-dynamo-access" {
  name   = "heroku-dynamo-access"
  user   = aws_iam_user.heroku.name
  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "dynamodb:*"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
EOF
}

So we create three resources: an aws_iam_user, an aws_iam_access_key for that user, and then we attach a policy using aws_iam_user_policy. That policy allows any DynamoDB action (dynamodb:*) on any resource. That's way too permissive and you should definitely lock it down, but it keeps the demo clean for now.

Next I updated the heroku.tf file I mentioned earlier to add some config vars:

resource "heroku_app" "your-app-name" {
  name    = "your-app-name"
  region  = "us"
  acm     = true
  stack   = "heroku-18"

  config_vars = {
      AWS_DYNAMODB_ACCESS_KEY    = aws_iam_access_key.heroku.id
      AWS_DYNAMODB_ACCESS_SECRET = aws_iam_access_key.heroku.secret
  }
}

This takes the access key we created earlier and adds it as config vars on the Heroku app! Huzzah! You're a terraform apply away from creating your DynamoDB table in AWS, creating a user that has access to that table, and adding the required config into your Heroku app. When the apply finishes it'll all be created, your app will restart, and in theory you can now access DynamoDB.

We need to update our app to take advantage of it now.

Step 1.c - Slowly moving data from Postgres to DynamoDB

I didn't want to take down my app to start moving data across. Again, bursty access patterns make this problematic. I never know when there will be a spike in popularity; it just happens! What happens if there's a sudden spike right in the middle of my attempt to migrate the accounts table? That's absolutely what would happen if I tried, because the universe is weird like that.

Instead here's the plan I wanted to implement:

  • Start saving data in both locations ASAP
  • Move lookups for account information to prefer DynamoDB, falling back to Postgres if no record is found
  • Migrate data in batches at my leisure
  • Once all data is in DDB and the Postgres fallback is no longer needed, remove that codepath
  • Rinse & repeat for all tables
  • Once all Postgres codepaths are removed take a final snapshot of the database and then shut it down
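The first bullet amounts to a write-through pattern: every save lands in both stores, so new and actively updated records migrate themselves. A toy sketch, with plain hashes standing in for the two stores (all names here are illustrative, not the real implementation):

```ruby
# Hashes stand in for Postgres and DynamoDB
postgres = {}
dynamo   = {}

# Dual-write: the existing save, plus a write to the new store
save = lambda do |record|
  postgres[record[:id]] = record  # the existing relational save
  dynamo[record[:id]]   = record  # the new after-commit write to DDB
end

save.call(id: "uuid-1", email_address: "a@example.com")
```

Once every save goes through a path like this, the remaining migration work is reduced to backfilling records that are never touched again.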

Here's a module I created that I included in my model to make the first step here easier and reusable:

require 'aws-sdk-dynamodb'
module Dynamo
  module ClassMethods
    def dynamo_resource
      @dynamo_resource ||= Aws::DynamoDB::Resource.new(
        region: 'us-east-1',
        access_key_id: ENV['AWS_DYNAMODB_ACCESS_KEY'],
        secret_access_key: ENV['AWS_DYNAMODB_ACCESS_SECRET']
      )
    end

    def dynamo_table(table_name = nil)
      if table_name
        @dynamo_table_name = table_name
        @dynamo_table = dynamo_resource.table(@dynamo_table_name)
      end
      @dynamo_table
    end

    # The name of the partition key in DDB, and the model method that
    # supplies its value; used by save_dynamo below
    def hash_key
      :id
    end

    def hash_val
      :uuid
    end
  end

  def self.included(base)
    base.extend(ClassMethods)
  end

  def save_dynamo
    item_hash = dynamo_params
    item_hash.keys.each do |key|
      item_hash[key] = {
        'value' => item_hash[key],
        'action' => 'PUT'
      }
    end
    self.class.dynamo_table.update_item(
      key: { self.class.hash_key => self.send(self.class.hash_val) },
      attribute_updates: item_hash)
  rescue Aws::DynamoDB::Errors::ValidationException => ex
    puts ex.message
    puts "Error: Can't save: #{dynamo_params.inspect}"
    raise
  end

  def dynamo_item
    result = self.class.dynamo_table
      .query(key_conditions: 
        { id: { 
            attribute_value_list: [self.uuid], 
            comparison_operator: 'EQ' } })
    result.items.first
  end

  def dynamo_item?
    !!dynamo_item
  end

  def sync_dynamo
    save_dynamo unless dynamo_item?
  end
end

And the way I included it into my model:

class Account < Sequel::Model
  include Dynamo
  dynamo_table 'accounts'

  def dynamo_params
    base = {
      id: self.uuid,
      ... # More key/vals in here
    }
  end

  def after_save
    super
    db.after_commit do
      save_dynamo
    end
  end
end

So here's the general idea of what I've implemented:

  • Update the model so that every save goes to DDB as well as Postgres. I do that by calling my save_dynamo method after the transaction is committed in Postgres. This also has the convenient property that it captures all new records as well as any updates to existing records. The moment this is deployed it freezes in time the complexity of any migration task. New data will be in DDB, and any actively updated data will be migrated just-in-time when it's updated.
  • Each model implements a dynamo_params method that will create a hash of what to store in DDB.
  • In the module: a centralised way to configure my access to DDB (using the config vars we set earlier) and for each model to specify the name of the table in DDB for that model (I avoided any inflection based naming magic and made specifying it explicit).
  • Finally, there's the dynamo_item method (to find an item by ID) and the dynamo_item? helper to report whether a matching item exists in DDB; both are only used by sync_dynamo, a function I could later use to migrate records one at a time across to DDB, or in batch with something like Account.all.each{|a| a.sync_dynamo }

Within a few hours data is flowing into DDB. Come back tomorrow and I'll have a sizeable amount of things migrated.

The next step was to adjust the way data was being read. I already had helper methods for most of these (rather than relying on dynamically generated methods from an ORM), which made it easier. And so I could update methods like:

class Account < Sequel::Model
  dataset_module do
    def active
      select(Sequel.lit('accounts.*')).where(deprovisioned_at: nil)
    end
  end
  def self.find_by_email(email)
    active.all.select{ |a| a.email_address == email }
  end
end

To something more like:

class Account < Sequel::Model
  dataset_module do
    def active
      select(Sequel.lit('accounts.*')).where(deprovisioned_at: nil)
    end
  end
  def self.find_by_email(email)
    res = self.dynamo_table.query(
      index_name: 'AccountsEmailActive', 
      key_conditions: { 
        'email_address' => { attribute_value_list: [email], comparison_operator: "EQ" }, 
        'active' => { attribute_value_list: [1], comparison_operator: "EQ" } 
      } 
    )
    if res.count == 0    
      active.all.select{ |a| a.email_address == email }
    else
      res.items.map do |item|
        Account.find(item["id"])
      end
    end
  end
end

There is a lot here not to like. The degenerate case is kinda expensive performance-wise. It'll run a query on DDB and, if the returned count is zero, it'll then run that query on Postgres and get the old results. If the first DDB query did find something, it then makes another query to get the full item (because not all of the data we need is available in the GSI I created to search by emails).

It's one of those cases where it's important to understand your usage patterns. While there's definitely inefficiencies here the numbers involved, for me, mean that end users would barely notice. We're still single digit millisecond responses for all of the database stuff. I was actually tempted to squeeze in a call to sync_dynamo for any row we pulled from postgres just to make sure that was the last time we'd ever retrieve that record from there.
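That tempting "sync on fallback" tweak is easy to sketch: whenever a lookup misses DDB and falls back to Postgres, write the row through to DDB so the next lookup never touches Postgres. A hypothetical illustration, with hashes standing in for both stores (the method name and shape are mine, not the app's actual code):

```ruby
# On a DDB miss, serve from the fallback store and backfill DDB
def find_with_backfill(key, dynamo:, postgres:)
  hit = dynamo[key]
  return hit if hit

  row = postgres[key]
  dynamo[key] = row if row  # the sync_dynamo equivalent: write-through on miss
  row
end

dynamo   = {}
postgres = { "uuid-1" => { "id" => "uuid-1", "email_address" => "a@example.com" } }

first  = find_with_backfill("uuid-1", dynamo: dynamo, postgres: postgres)  # falls back
second = find_with_backfill("uuid-1", dynamo: dynamo, postgres: postgres) # served from DDB
```

The nice property is that the fallback path pays the migration cost exactly once per record, and only for records that are actually read.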

Step 1.d - Migrating the remaining data

With this in place (and a couple of other find methods implemented) the app was now primarily using DDB for all account data! The last step was to write a little migration script that was barely anything more than:

Account.all.each do |account|
  account.sync_dynamo
end

I think I may have added some ordering to it, and chunked it out into a couple of threads, but that's the migration story. Just run through every record and call that method: if the record already exists in DDB it'll do nothing, and if there's no record it'll save it. When that script completes there should be no more lookups of data in Postgres.
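The chunked, threaded version might have looked something like the sketch below. AccountRecord and the SYNCED queue are stand-ins for the real model and the DDB write (so this runs without AWS); the idempotency of sync_dynamo is what makes overlapping or re-run chunks harmless:

```ruby
SYNCED = Queue.new  # thread-safe collector standing in for the DDB writes

# Stand-in for the real Sequel model with the Dynamo module included
class AccountRecord
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def sync_dynamo
    SYNCED << id
  end
end

accounts   = (1..100).map { |i| AccountRecord.new(i) }
chunk_size = (accounts.size / 4.0).ceil

# Split the backfill across four threads, one chunk each
accounts.each_slice(chunk_size).map do |chunk|
  Thread.new { chunk.each(&:sync_dynamo) }
end.each(&:join)
```

For a pay-per-request table the only real tuning concern is throttling; a handful of threads is usually plenty for a leisurely backfill like this one.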

Step 1.e - Stop writing to postgres

This is the biggest and riskiest change in the whole thing. We're querying as much as we can from DDB now, falling back to Postgres only where absolutely necessary. New data is still being written to Postgres even though we're reading it from DDB, so we can stop doing that. Doing so meant ditching Sequel, though, and that meant we also had to reimplement a couple of other features, like validation.

To start, I replaced Sequel::Model with Hashie::Mash, which gave me a hash-like structure (great, given that's how DDB items are returned) but with attribute accessors like Sequel had (great, because that's what the rest of the app is expecting):

class Account < Hashie::Mash
  include Dynamo
  dynamo_table 'accounts'

  def save
    save_dynamo
  end
end

Next was to make sure the way I created new records still worked. Again, I thankfully had helper methods already that made this straightforward. Without them I'd probably explore some decoration of the initialize method on each model:

def self.create_from_signup_form(signup_params)
  new(signup_params)
end

That save method above needed a bit more help though. There's some functionality we got for free from Sequel or Postgres that we'd need to re-implement ourselves, such as generating an ID and storing a created_at timestamp. Those apply to all of the models, so they went into the module, while the save method was updated to make use of them:

require 'securerandom'
require 'time'

class ValidationError < StandardError; end

module Dynamo
  def generate_id
    self.id ||= SecureRandom.uuid
  end

  # Sets created_at on first save, touches updated_at on every save
  def timestamps!
    now = Time.now.utc.iso8601
    self.created_at ||= now
    self.updated_at = now
  end

  def validates_presence(attrs)
    attrs.each do |attr|
      val = self.send(attr)
      raise ValidationError.new(attr) if val.nil? || val.empty?
    end
  end
end

class Account < Hashie::Mash
  include Dynamo
  dynamo_table 'accounts'

  def validate
    before_validation
    validates_presence [:id, :email_address]
  end

  def before_validation
    generate_id
    timestamps!
  end

  def save
    validate
    save_dynamo
    true
  end
end

There's some other custom validation logic and attribute setting I've removed for simplification, but you hopefully get the general idea. That's it though! That's enough to remove Postgres for this specific model. Much, much, much testing later, I pulled the trigger and deployed the changes without any issue.
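To sanity-check that validation flow without pulling in Hashie or AWS, a minimal stand-in is enough (FakeAccount and its attributes are hypothetical; the generate_id and validates_presence logic mirrors the module above):

```ruby
require 'securerandom'

class ValidationError < StandardError; end

# Minimal model stand-in: auto-generate an id, then require listed attributes
class FakeAccount
  attr_accessor :id, :email_address

  def initialize(email_address = nil)
    @email_address = email_address
  end

  def generate_id
    self.id ||= SecureRandom.uuid
  end

  def validate
    generate_id
    [:id, :email_address].each do |attr|
      val = send(attr)
      raise ValidationError, attr.to_s if val.nil? || val.empty?
    end
  end
end

valid = FakeAccount.new("a@example.com")
valid.validate  # passes: id is generated, email_address is present

invalid = FakeAccount.new
begin
  invalid.validate  # raises: email_address is missing
  failed = false
rescue ValidationError
  failed = true
end
```

Tests like this, runnable without any infrastructure, were the kind of thing that made the "much, much testing" phase cheap.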

And there it was: the biggest and most important table in my app no longer using Postgres. I repeated the exact same steps one table at a time. The hardest one was done, so the others were much quicker and easier. One at a time everything moved across until there was nothing left using Postgres at all.

Step 1.f - Decommissioning Postgres

🎶 "And now, the end is here, and so we face, our final curtain..."

You served me well, Postgres, but it was time to say goodbye (for this app at least). I took a backup, downloaded it, and stored it in a couple of different locations just in case. Then I got the DATABASE_URL and saved that somewhere else in case I needed to roll back. Because I'm extra paranoid I spun up a new empty database and attached it to my app. I then promoted it to be my primary (i.e., it became the database referenced by DATABASE_URL) and restarted my app just to be sure. My app was now pointing at a completely empty database, so if anything was going to break it'd be now. If it did, I could quickly switch back to the other DB that was still there.

But nothing broke. It all worked. And so it was finally time to run heroku addons:remove heroku-postgresql 😢.

Next Steps

This is just the start! But it was a huge step, both in terms of cost savings for my tiny app and in terms of the complexity of what had happened. I'd completely replatformed my database to a whole new technology without skipping a beat. Up next:

  • Step 2 - Replacing my workers with serverless functions
  • Step 3 - Adding a CDN to provide flexibility in migrating the front-end & APIs
  • Step 4 - Moving the API to API Gateway
  • Step 5 - Serving the static content from S3
  • Step 6 - Forcing everything else into a lambda container

Stay tuned for more details on how I approached each of these.

Hi, I'm Glenn! 👋 I'm currently Director of Product (Terraform) @ HashiCorp, and we're hiring! If you'd like to come and work with me and help make Terraform Cloud even more amazing, we have multiple positions opening in Product Management, Design, and Engineering & Engineering Management across a range of levels (i.e., junior through to senior). Please send in an application ASAP so we can get in touch.