AWS RDS to Aurora Migration
In this post, we will go over a complex migration I recently carried out, from a provisioned AWS RDS solution to the serverless Aurora offering.
Setting the scene
I had a number of environments in live service, including:
- prod, which users are using on a daily basis.
- test, which our QA team members use.
- uat, which is used internally to test changes.
- dev, which is used internally by the engineers.
Some of these environments are considered production grade, and others are not. In terms of the database, this meant that the production-grade environments were running RDS clusters with Multi-AZ enabled. Cluster is a bit of a misnomer here: there was a primary RDS instance with secondary replica instances. Non-production-grade environments only had a single primary RDS instance in a single AZ.
All databases make use of the managed RDS password solution, which means the password is stored in AWS Secrets Manager and rotated automatically.
Make a note of these characteristics as they become important later on.
Sketching this out looks like the following:
The requirement was that we wanted to migrate from AWS RDS to the AWS Aurora serverless v2 offering.
AWS Aurora serverless
A quick note about the AWS Aurora serverless offering before we move on.
We’re going to focus on the v2 offering, which has some pretty big caveats that you should be aware of.
I chose the v2 option for my use case due to the highly variable demand on the database.
The serverless v2 offering does give you a lot of powerful features including things like:
- Autoscaling capabilities
- Monitoring
- Maintenance
- High availability with failover
- Replication across multiple availability zones
And perhaps the most interesting one is the separation of the compute workloads from the storage layer. Combined with our business use cases and context, we decided to switch from RDS to Aurora.
However, Aurora has this concept of the Aurora Capacity Unit (ACU); each ACU corresponds to roughly 2 GiB of memory along with associated CPU and networking capacity. Aurora serverless v2 scales in increments of 0.5 ACUs, but the issue is that the minimum is 0.5 ACUs, which means there is a standing charge associated with Aurora serverless v2 even if you have no demand on your database.
But AWS have marketed Aurora as a serverless offering, which you might think feels… not very serverless.
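To make the scaling mechanics concrete, here is a minimal sketch of a serverless v2 cluster using the raw Terraform provider resources. All identifiers and capacity values here are illustrative placeholders, not the configuration from this migration:

# Illustrative only: a bare-bones Aurora serverless v2 cluster.
resource "aws_rds_cluster" "example" {
  cluster_identifier  = "example-aurora-sv2"
  engine              = "aurora-postgresql"
  engine_mode         = "provisioned"       # serverless v2 still uses the "provisioned" engine mode
  master_username     = "example_user"
  master_password     = "change-me-please"  # placeholder; don't hardcode real credentials
  skip_final_snapshot = true                # lets the sketch be torn down without a final snapshot

  serverlessv2_scaling_configuration {
    min_capacity = 0.5 # the floor, and the source of the standing charge
    max_capacity = 4   # scales up in 0.5 ACU increments under load
  }
}

resource "aws_rds_cluster_instance" "example" {
  cluster_identifier = aws_rds_cluster.example.id
  engine             = aws_rds_cluster.example.engine
  instance_class     = "db.serverless" # this is what makes the instance serverless v2
}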
Context
The problem I had was that I had multiple database solutions of varying configurations scattered throughout our managed environments.
This included a number of environments which were live and in use on a day-to-day basis.
There was a need for zero downtime and no data loss. And we're a pretty small dev team, to top it off! All of this meant that doing the migration with a snapshot would be tricky, to say the least.
The load on the database was also very read-heavy. But I did have the benefit of knowing exactly when writes were being made and where those writes were coming from.
Because of the way we designed our system, I also knew exactly which workloads needed write or read access to the database.
So I needed a migration path which I could orchestrate across all my environments. Preferably with a blue/green type approach in which I could keep the original database in place until I was absolutely ready to make the switch.
The proposed solution
With all this in mind, restoring from a snapshot was out of the question. Using AWS's Database Migration Service didn't appeal either, primarily because of its snapshot-style approach to replicating the data.
So I decided the better approach would be to:
- Add an Aurora cluster and have it replicate from the RDS primary as if it were a secondary read replica.
- Wait for the replication lag to fall to zero; at that point I'd know that the data had been fully replicated.
- When ready, connect all the read-type workloads to the Aurora reader endpoint.
- Promote Aurora to the primary and connect the write-type workloads to the newly promoted Aurora primary node.
- Remove the original / demoted RDS database(s).
Arguably this feels more long-winded, but it means there is little for me to orchestrate, and I can keep the service running at all times while being certain the data has been fully replicated.
It also meant that if I were pulled onto another, more urgent piece of work, I could safely leave the migration at any point up to step 4 without having to worry about anything.
The main cutover point is, of course, step 4. Once Aurora is promoted to the primary, it is effectively disconnected from the RDS database.
This means that, in the meantime, any writes made by the write workloads against RDS will not propagate through to Aurora. So at this point it was important to connect the write workloads to Aurora as soon as possible.
The problems
The first problem was the difference in database configurations between the production-like deployments compared to the non-production deployments.
The fact that production-grade environments had RDS read replicas was actually not a problem. The issue lay with the fact that the RDS primary had Multi-AZ enabled.
Unfortunately, AWS do not support replication from a Multi-AZ configuration to a replica cluster.
The other issue was that managed database secrets are also not supported for the migration path I’d chosen.
And finally, the last issue was that I define and deploy all of this infrastructure in Terraform. There is no corresponding resource that I could deploy which would perform the action of promoting Aurora to the primary.
Promoting to the primary must be done via the AWS console or the AWS CLI.
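For reference, the promotion itself boils down to a single CLI call, something along these lines (the cluster identifier is a placeholder, not the real name):

# Promote the Aurora replica cluster to a standalone primary.
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier app-aurora-main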
Bringing about baseline database configurations
The first step was to bring about a baseline configuration for the databases. Namely, all Multi-AZ RDS primary instances would need to be temporarily switched to single-AZ:
resource "aws_db_instance" "app_rds_primary" {
engine = "postgres"
multi_az = local.is_production_grade ? true : false
manage_master_user_password = true
# Truncated for brevity
...
}
This was an easy change to make via the multi_az parameter:
resource "aws_db_instance" "app_rds_primary" {
engine = "postgres"
multi_az = false
manage_master_user_password = true
# Truncated for brevity
...
}
With this change deployed, I could be sure that all the RDS databases were single-AZ only.
Instating a static database password secret
The next step was to switch the managed password to a static password.
I use AWS Secrets Manager to store all secrets and credentials associated with a deployment, so I created a secret in AWS Secrets Manager to persist the database password:
resource "random_password" "temporary_db_credentials" {
length = 20
min_lower = 1
min_numeric = 1
min_upper = 1
special = false
}
resource "aws_secretsmanager_secret" "temporary_db_credentials" {
name = "temporary-db-credentials"
}
resource "aws_secretsmanager_secret_version" "temporary_db_credentials" {
secret_id = aws_secretsmanager_secret.temporary_db_credentials.id
secret_string = jsonencode({
username = "some_username"
password = random_password.temporary_db_credentials.result
})
}
resource "aws_db_instance" "app_rds_primary" {
engine = "postgres"
multi_az = false
password = jsondecode(aws_secretsmanager_secret_version.temporary_db_credentials.secret_string)["password"]
# Truncated for brevity
...
}
With this in place, the password was persisted in Secrets Manager and remained static whilst the migration path was being executed.
Adding the Aurora replica cluster
To add the Aurora replica I defined something like the following. Note that some details have been omitted for brevity.
resource "aws_db_instance" "app_rds_primary" {
engine = "postgres"
multi_az = false
password = jsondecode(aws_secretsmanager_secret_version.temporary_db_credentials.secret_string)["password"]
# Truncated for brevity
...
}
module "aurora_db_main" {
source = "terraform-aws-modules/rds-aurora/aws"
is_primary_cluster = false
replication_source_identifier = aws_db_instance.app_rds_primary.arn # this is the change which tells aurora to replicated from RDS
manage_master_user_password = false
engine_mode = "provisioned" # Aurora serverless v2 actually has an engine_mode of provisioned...
instance_class = "db.serverless"
serverlessv2_scaling_configuration = {
min_capacity = 1
max_capacity = 10
}
instances = {
1 = {}
}
}
Impact on the CI/CD pipeline
The main issue I experienced was a problem which cropped up between steps 1 and 4. I have a CI/CD pipeline which deploys a fresh instance of the application into AWS and tears it down afterwards.
The problem was that once I had added Aurora as a replica cluster of the RDS primary, AWS do not allow you to delete, via the API, a replica cluster which is still pointed at an RDS primary instance. This meant that Terraform could not destroy that resource when running my CI/CD pipeline.
So for a short time, I had to manually delete those Aurora replica clusters in the AWS console.
Note that once step 4 was done, and the Aurora replica cluster was promoted, this was no longer a problem.
Follow-up steps
Once the Aurora cluster had fully replicated, I was in a position to promote it out of its replica role and point all workloads and connecting services at the newly promoted Aurora cluster.
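In Terraform terms, "pointing the workloads" amounted to handing the read-type workloads the cluster's reader endpoint and the write-type workloads the writer endpoint. A sketch, assuming the cluster_reader_endpoint and cluster_endpoint outputs exposed by the terraform-aws-modules/rds-aurora module; how the values reach each service depends on your own setup:

locals {
  db_read_endpoint  = module.aurora_db_main.cluster_reader_endpoint # read-type workloads
  db_write_endpoint = module.aurora_db_main.cluster_endpoint        # write-type workloads, post-promotion
}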
Once this was done, the original RDS database could then be safely removed.
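The end state in Terraform was, roughly, the Aurora module with the replication settings dropped and the aws_db_instance resource deleted entirely. A sketch of what the final module block looks like under those assumptions:

module "aurora_db_main" {
  source = "terraform-aws-modules/rds-aurora/aws"

  # replication_source_identifier has been removed: the cluster is now a
  # standalone primary rather than a replica of the old RDS instance.
  is_primary_cluster = true

  manage_master_user_password = false

  engine_mode    = "provisioned"
  instance_class = "db.serverless"

  serverlessv2_scaling_configuration = {
    min_capacity = 1
    max_capacity = 10
  }

  instances = {
    1 = {}
  }
}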
Summary
Because I had all the infrastructure defined in Terraform, I could break up the migration steps into separate and more manageable pull requests.
I could also leave the continuous deployment to be responsible for actually deploying each migration step to the environments.
All of this was made easier by virtue of the infrastructure as code approach. This allowed me to orchestrate the migration process across numerous environments at once.