Migrating a service with feature flags
Context #
I recently encountered a situation where I had to migrate some logic and production data from a legacy service (and database) into a new one.
I had already extracted all relevant code into a new service, ensuring its API was 100% compatible with the old service, then deployed it all the way to production: now I needed a strategy to migrate its data.
Data migrations #
Years ago I wrote about database migrations with AWS DMS, and that remains a solid approach for long-lasting migrations when:
- you cannot hope to update all your consumers quickly,
- the volume of data does not allow you to dump it manually, or
- you cannot afford downtime of any sort (read or write).
The situation I recently encountered had different constraints:
- the new service had to become, for business reasons, the source of truth within days (the sooner the better)
- read downtime was not acceptable, but 1-2 hours of planned write downtime were
- there were only a few tens of thousands of rows to migrate, distributed over a few relational tables.
Migration stages #
I started out by defining a set of migration stages to carry out the migration to the new service while respecting the above constraints.
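The set of stages can be modelled as a small enum whose values double as the feature flag's possible string values. This is a minimal Python sketch; the class and value names are my own, not from any particular flag provider.

```python
from enum import Enum


class MigrationStage(str, Enum):
    """The five migration stages, ordered; values double as flag strings."""

    OFF = "off"              # new service proxies everything to the legacy one
    READ_ONLY = "read-only"  # writes blocked while data is copied over
    SHADOW = "shadow"        # write to both databases, read from legacy
    LIVE = "live"            # write to both databases, read from new
    COMPLETED = "completed"  # new database is the sole source of truth
```

Using a `str` mixin means the flag value returned by the provider can be compared directly against the enum members.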
Off #
The old (= legacy) service behaves as usual…
… while the new one proxies all requests to the old one under the hood.
The source of truth remains the legacy service’s database (legacy database). The new database is not used at all, at this point.
We update all consumers and producers to use the new service: since it exposes the exact same API as the old service, this should in principle only require configuration changes (e.g. updating the URL they point to).
Read-only #
The new service allows read requests exactly as in the previous stage…
… but blocks write requests, returning an error to clients.
This is when the data migration happens:
- we copy all data from the legacy database into the new one, and
- we ensure that they are consistent (e.g. using validation scripts)
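A validation script can boil down to diffing the two datasets once each table has been loaded into memory (feasible here, given the small volume). This is a sketch under the assumption that each table is fetched as a dict keyed by primary key; how the rows are actually fetched from each database is left out.

```python
def find_inconsistencies(legacy_rows: dict, new_rows: dict) -> list[str]:
    """Compare one table loaded from both databases as {primary_key: row}
    and report every difference found."""
    problems = []
    for pk in legacy_rows.keys() - new_rows.keys():
        problems.append(f"missing in new database: {pk}")
    for pk in new_rows.keys() - legacy_rows.keys():
        problems.append(f"unexpected in new database: {pk}")
    for pk in legacy_rows.keys() & new_rows.keys():
        if legacy_rows[pk] != new_rows[pk]:
            problems.append(f"mismatch for {pk}")
    return problems
```

An empty result for every table means the copy is consistent and it is safe to move to the next stage.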
The legacy service now also returns an error to clients trying to use its write API, so no writes can slip in through the old entry point while the data is being copied.
Shadow #
The new service writes to both the legacy and new databases, to keep them in sync, but only reads from the legacy one: this is to verify that the write logic in the new service works, without any impact for clients.
Here it’s important to ensure that any errors when writing to the new database alert developers (they signal an issue that must be solved before going to the next stage), but do not impact the customer experience.
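The shadow write path can be sketched as follows: the legacy write stays on the request's critical path, while the mirrored write is wrapped so that failures alert developers without reaching the client. The function and parameter names here are illustrative, and the alerting is stubbed out with a logger.

```python
import logging

logger = logging.getLogger("migration")


def shadow_write(record, write_legacy, write_new):
    """Write to the legacy database (still the source of truth) and mirror
    the write to the new database without impacting the client on failure."""
    # Failures here propagate to the client, exactly as before the migration.
    write_legacy(record)
    try:
        write_new(record)
    except Exception:
        # Alert developers (e.g. via an error tracker); swallow the error so
        # the client request still succeeds.
        logger.exception("shadow write to new database failed")
```

Any alert fired here signals a bug in the new write logic that must be fixed before moving to the Live stage.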
Live #
The new service still writes to both the legacy and new databases, but now reads from the new one.
This stage helps to catch any issues in the read logic while preserving consistency across databases, allowing us to quickly go back to the previous stage by simply changing the feature flag value back to Shadow.
Completed #
The new service now reads from and writes to only the new database, which becomes the source of truth and starts diverging from the legacy one.
Once we confirm that there are no errors, we can decommission the legacy service and database.
Feature flags #
In its most basic form, a feature flag can be seen as a conditional statement whose evaluation can be changed while the service is running (see what are feature flags). What I needed for my purposes was a feature flag evaluating, at any point in time, to a string value selected from a set (= the migration stages).
I updated both the legacy and new services to change their behaviour according to the feature flag’s value (= the current migration stage): this enabled me to move between stages without being slowed down by deployments.
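Concretely, the new service can branch on the flag’s current string value on every request. This is a sketch of the read and write routing across the stages; the flag lookup itself is provider-specific and omitted, and the stage names and function signatures are my own.

```python
def handle_read(stage: str, read_legacy, read_new):
    """Route a read request based on the current migration stage."""
    if stage in ("off", "read-only", "shadow"):
        return read_legacy()
    return read_new()  # "live" and "completed" read from the new database


def handle_write(stage: str, record, write_legacy, write_new):
    """Route a write request based on the current migration stage."""
    if stage == "read-only":
        raise RuntimeError("writes are disabled during the data migration")
    if stage == "off":
        write_legacy(record)
    elif stage == "completed":
        write_new(record)
    else:  # "shadow" and "live": write to both to keep the databases in sync
        write_legacy(record)
        write_new(record)
```

Because the flag is re-evaluated at request time, flipping its value moves the whole fleet to the next (or previous) stage without a deployment.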
Considerations #
The advantage of the solution I just described is that it’s trivial (and fast) to get back to a “safe state” in case of issues. This is thanks both to the use of feature flags and to the fact that, from Read-only up to and including Live, both databases are in sync (assuming no bugs or other failures, of course).