Migrating a service with feature flags
Context #
I recently encountered a situation where I had to migrate some logic and production data from a legacy service (and database) into a new one.
I had already extracted all relevant code into a new service, ensuring its API was 100% compatible with the old service, then deployed it all the way to production: now I needed a strategy to migrate its data.
Data migrations #
Years ago I wrote about database migrations with AWS DMS, and that remains a solid approach for long-lasting migrations when:
- you cannot hope to update all your consumers quickly,
- the volume of data does not allow you to dump it manually, or
- you cannot afford downtime of any sort (read or write).
The situation I recently encountered had different constraints:
- the new service had to become, for business reasons, the source of truth within days (the sooner the better)
- read downtime was not acceptable, but 1-2 hours of planned write downtime were
- there were only a few tens of thousands of rows to migrate, distributed over a few relational tables.
Migration stages #
I started out by defining a set of migration stages to carry out the migration to the new service while respecting the above constraints.
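The set of stages can be modelled as a small enum whose values double as the feature flag's possible string values. This is a minimal Python sketch; the class and value names are my own, not from any particular flag provider.

```python
from enum import Enum


class MigrationStage(str, Enum):
    """The five migration stages, ordered; values double as flag strings."""

    OFF = "off"              # new service proxies everything to the legacy one
    READ_ONLY = "read-only"  # writes blocked while data is copied over
    SHADOW = "shadow"        # write to both databases, read from legacy
    LIVE = "live"            # write to both databases, read from new
    COMPLETED = "completed"  # new database is the sole source of truth
```

Using a `str` mixin means the flag value returned by the provider can be compared directly against the enum members.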
Off #
The old (= legacy) service behaves as usual…
… while the new one proxies all requests to the old one under the hood.
The source of truth remains the legacy service’s database (legacy database). The new database is not used at all, at this point.
We update all consumers and producers to use the new service: since it exposes the exact same API as the old service, this should in principle only require configuration changes (e.g. updating the URL they point to).
Read-only #
The new service allows read requests exactly as in the previous stage…
… but blocks write requests, returning an error to clients.
This is when the data migration happens:
- we copy all data from the legacy database into the new one, and
- we ensure that they are consistent (e.g. using validation scripts)
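A validation script can boil down to diffing the two datasets once each table has been loaded into memory (feasible here, given the small volume). This is a sketch under the assumption that each table is fetched as a dict keyed by primary key; how the rows are actually fetched from each database is left out.

```python
def find_inconsistencies(legacy_rows: dict, new_rows: dict) -> list[str]:
    """Compare one table loaded from both databases as {primary_key: row}
    and report every difference found."""
    problems = []
    for pk in legacy_rows.keys() - new_rows.keys():
        problems.append(f"missing in new database: {pk}")
    for pk in new_rows.keys() - legacy_rows.keys():
        problems.append(f"unexpected in new database: {pk}")
    for pk in legacy_rows.keys() & new_rows.keys():
        if legacy_rows[pk] != new_rows[pk]:
            problems.append(f"mismatch for {pk}")
    return problems
```

An empty result for every table means the copy is consistent and it is safe to move to the next stage.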
The legacy service now also returns an error to clients trying to use its write API, so no writes can slip in through the old entry point while the data is being copied.
Shadow #
The new service writes to both the legacy and new databases, to keep them in sync, but only reads from the legacy one: this is to verify that the write logic in the new service works, without any impact for clients.
Here it’s important to ensure that any errors when writing to the new database alert developers (they signal an issue that must be solved before going to the next stage), but do not impact the customer experience.
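The shadow write path can be sketched as follows: the legacy write stays on the request's critical path, while the mirrored write is wrapped so that failures alert developers without reaching the client. The function and parameter names here are illustrative, and the alerting is stubbed out with a logger.

```python
import logging

logger = logging.getLogger("migration")


def shadow_write(record, write_legacy, write_new):
    """Write to the legacy database (still the source of truth) and mirror
    the write to the new database without impacting the client on failure."""
    # Failures here propagate to the client, exactly as before the migration.
    write_legacy(record)
    try:
        write_new(record)
    except Exception:
        # Alert developers (e.g. via an error tracker); swallow the error so
        # the client request still succeeds.
        logger.exception("shadow write to new database failed")
```

Any alert fired here signals a bug in the new write logic that must be fixed before moving to the Live stage.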
Live #
The new service still writes to both the legacy and new databases, but now reads from the new one.
This stage helps to catch any issues in the read logic while preserving consistency across databases, allowing us to quickly go back to the previous stage by simply changing the feature flag value back to Shadow.
Completed #
The new service now reads from and writes to only the new database, which becomes the source of truth and starts diverging from the legacy one.
Once we confirm that there are no errors, we can decommission the legacy service and database.
Feature flags #
In its most basic form, a feature flag can be seen as a conditional statement whose evaluation can be changed while the service is running (see what are feature flags). What I needed for my purposes was a feature flag evaluating, at any point in time, to a string value selected from a set (= the migration stages).
I updated both the legacy and new services to change their behaviour according to the feature flag’s value (= the current migration stage): this enabled me to move between stages without being slowed down by deployments.
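Concretely, the new service can branch on the flag’s current string value on every request. This is a sketch of the read and write routing across the stages; the flag lookup itself is provider-specific and omitted, and the stage names and function signatures are my own.

```python
def handle_read(stage: str, read_legacy, read_new):
    """Route a read request based on the current migration stage."""
    if stage in ("off", "read-only", "shadow"):
        return read_legacy()
    return read_new()  # "live" and "completed" read from the new database


def handle_write(stage: str, record, write_legacy, write_new):
    """Route a write request based on the current migration stage."""
    if stage == "read-only":
        raise RuntimeError("writes are disabled during the data migration")
    if stage == "off":
        write_legacy(record)
    elif stage == "completed":
        write_new(record)
    else:  # "shadow" and "live": write to both to keep the databases in sync
        write_legacy(record)
        write_new(record)
```

Because the flag is re-evaluated at request time, flipping its value moves the whole fleet to the next (or previous) stage without a deployment.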
Considerations #
The advantage of the solution I just described is that it’s trivial (and fast) to get back to a “safe state” in case of issues. This is thanks both to the use of feature flags and to the fact that, from Read-only up to and including Live, both databases are in sync (assuming no bugs or other failures, of course).