Migrating Data Quickly and Safely (Nextdoor) | Datadog
Datadog's Research Report: The State of Serverless Report: The State of Serverless
Migrating Data Quickly and Safely (Nextdoor)

Migrating Data Quickly and Safely (Nextdoor)


Published: July 17, 2019
00:00:00
00:00:00

Thank you. I’m here from Nextdoor, and I’m here to talk to you about database migrations.

For most engineering organizations that are using relational databases these days, migrations are a pretty common operation but they’re definitely not trivial.

And some interesting problems with migrations are only gonna show up when you are operating at scale.

And I’m gonna talk about some of those problems that we ran into at Nextdoor and how we addressed them.

What are migrations?

But before I dive into that, I just wanna make sure we’re all on the same page and explain what are migrations.

Migrations are how you update the schema of your database.

You’re running some sort of DDL statement like a CREATE TABLE, ALTER TABLE, adding a trigger or an index.

In addition to those schema migrations, you also have data migrations, which is…a one-time update of the data in your database.

And over the course of your application’s lifetime, you’re gonna run a lot of migrations, and it can become tricky to manage all of these migrations.

Fortunately, most modern frameworks have really first-class support for managing migrations.

We use Django.

If you use Rails, it has really good support for migrations.

If you’re using Hibernate, there’s Flyway, Alembic.

There’s a lot of libraries and tools out there to help you manage migrations, and they all actually work on sort of the same basic principles.

How a migration works

Instead of directly running the SQL yourself, you define the changes, usually in terms of updates to your ORM models if your migration management system is integrated with your ORM.

Each of the migrations is a set of changes that should be atomically applied to your database, and they’re each uniquely identified and sequential.

So you might have one that’s “002001” and then you have one that’s “002,” and we can take a look at what one for Django looks like.

And this is pretty similar to how it would look in Rails as well.

You have some operation, in this case, it’s changing the length of a particular column on the model.

And when you run a particular migration, it gets inserted into a special database table that looks something like this.

And, you can query this table to identify what the state of your database schema is.

And then now that you know what the state of your database schema is, if you have any migrations that are in your code but not listed in this database table, you know that you need to run those to bring your database up-to-date with the current schema that your application expects.

Fortunately, you don’t need to run all of these commands to select out of this table and compare what is in the table versus what’s in your code.

This is all wrapped up in a convenient command, so you just run the one command.

It does all of this and brings your database schema up-to-date.

And this is all great.

It works really well.

It makes it very easy to manage a lot of different databases, like one on every developer’s laptop that might all be in different states as far as the current database schema that they have.

But with this deep integration between your application and your database schema, it does come with a cost.

You’ve coupled them very tightly together.

Your database schema is very tightly coupled to your application version, and that introduces some challenges, particularly if you are frequently updating your application version, like you’re doing continuous deployment for example.

If you’re more interested in the details of these migration systems and the development workflows that will let you leverage them effectively, I recommend this article on Martin Fowler’s site.

He has lots of great articles on his site, and it goes into a lot of more detail than I’m going to do today about these systems and how to use them effectively.

If you read through that article, you’ll eventually get to the part where Martin Fowler explains to you how to apply these changes in production.

And he says, ‘‘The easiest way is to take the database down.’’

If we wanna keep the application live, however, there are techniques, but they will require another article to explain.

If any of you can find that article in Martin Fowler’s site, please forward it to me.

I would have loved to have read it, but we had to develop our own techniques at Nextdoor.

Deployment

And before I can really talk about the techniques we use to deploy things, I need to talk about how we deploy things at Nextdoor and sort of how we got there.

Many years ago, we were not doing continuous deployment at all.

We did weekly releases off of a release branch and everything was run manually by an engineer like SSH’ing in and like generating a build artifact and copying it to EC2 instances.

And along the way, they would run the migrate command.

And this worked pretty well for a small team.

There was not a high scale, migrations could be applied quickly and reliably against this small database.

If anything did go wrong, you know, it was a small team.

The release engineer usually sort of knew what was going on.

And, this release engineer role wasn’t a particular person.

It rotated amongst the engineering team, kind of, like an on-call rotation.

So, now, we’re doing continuous deployment, though.

We deploy 10 to 15 times a day.

Each deploy usually has a batch of commits in it.

These deployments are running totally automatically from the master branch and migrations get run as part of this.

But it wasn’t like we just flipped a switch one day and went weekly deployment to continuous deployment.

If any of you have made this transition, that’s not how it works.

It took us many months to make this switch.

And along the way, we basically steadily increased our release cadence going to sort of twice a week and then every day and eventually switching to continuous deployment.

But over the course of this process, we learned a lot about running migrations.

And while this whole process of switching to continuous deployment was going on, there was a lot of other things that were scaling up too.

The engineering team was getting a lot bigger.

We were hiring a lot more people, so the engineering team doubled and then quadrupled in size, and the scale of the site was going up a lot too.

Backwards compatibility

We were getting a lot more users, a lot more content, a lot more activity on the database.

So we learned some things, and the first thing we learned really even before we even started this process, and that’s backwards compatibility.

Your migrations must be backwards compatible.

When you deploy a new version of the code, or when you deploy a new version of your application, if you’re doing something like adding a new field to a model, you have to add that column before you can start running the code that has the reference to that field.

Otherwise, your application is gonna look for it in the database and it’s not gonna be there yet.

So you have to run the migrations first.

But there are some migrations that are not backwards compatible like renaming or dropping a table or a column, and you can’t run those before you deploy the code because then you have a version of your code running that’s looking for the old name of the table or old name of the column and the database has the new one.

You can’t get around this.

It does not matter how fast your deploys are, whether you’re in went blue, green or whatever, you must run your migrations first and then deploy your application code.

So the fact here is that some migrations you just can’t really do.

It turns out that dropping a column, you can actually break down into a series of steps, each one of which is individually backwards compatible.

But some stuff like renaming columns, you really just can’t do it.

So, at Nextdoor, we’ve learned to just live with some poorly-named columns.

The second thing that we learned is that nobody wanted to run all these migrations anymore.

It had to become a real pain.

We’re doing more releases so, more often, you have to do this manual process of running the migrations.

There are more engineers who are generating more migrations, so it’s taking longer to do them.

And it also just increased the chances that this was gonna get screwed up.

This is a manual process, so people would log into the wrong server, they would fat-finger commands.

And we said, “Okay, you know, we need to figure out a way to make this less time-consuming and error-prone.”

Scaling your migration procedures

And so the first thing we tried was just…running the migrations, how hard can it be?

Just add manage.py migrate to our deployment scripts, and everything will be fine, right?

Nope.

The first issue we ran into was the issue of more data.

Database was much bigger, and something like adding a NOT NULL column to a database table means that you’re gonna have to rewrite that whole table to make sure that every row in that table has a value for the new column you’ve just added.

And you have to do it as part of that same transaction that you’re adding the column in.

And this isn’t a big deal if you have like hundreds or even thousands of rows in your table, just do it.

It might take some tens or maybe a couple of hundred milliseconds to rewrite a table with 3,000 rows in it.

But if your table has millions or billions of rows in it, that statement will take hours or even days to complete, if it ever does.

So if you go online and search for something like zero-downtime migrations, there is a solution out there to this.

You basically break it down into three steps.

You add the column without a null constraint so that you do not have to rewrite all the tables that could have a null value for that column. You then go ahead and backfill that column with new values and write to it for new rows, and then you add the constraint later.

So this seems fine.

And then when we went to do this, fail.

Did not work for us at all.

And so, you know what went wrong here turns out to be a little subtle.

It’s something that you’re only gonna run into when you have a lot of concurrent users and a lot of concurrent database sessions.

And, it has to do with the locks that the database is acquiring when it’s doing these operations.

So we’re gonna have to get a little bit into the weeds here.

I know it’s late in the afternoon and a long day, but, you know, get your machete or whatever, and let’s talk about lock contention.

So let’s take this simple scheme as an example.

We’ve got two tables, Profile, which is like the users of your application and Content, which is like a tweet, or a post, or whatever that they are writing and creating.

And we wanna keep track of what users have written, which pieces of content, so we’ve got an author_id column on our Content table that has a foreign key to our Profile table.

Pretty straight forward.

Locks

And if we go ahead and look at this PG locks view that Postgres provides (in front of MySQL there is an equivalent to this), we can see what locks get acquired by running a SELECT statement.

And actually, there’s a few of them.

There’s this virtualxid lock, which we’re not gonna talk about today.

There is a lock on this profile pkey relation, which is the index we’re using to run the query.

But the one we’re interested in today is this one in the middle, which is a ShareLock on the table itself.

And this is a ShareLock, so it can be shared with other locks.

Like, you can have lots of queries running against the same table at the same time obviously.

You can have lots of ShareLocks acquired on the same table.

And if we look then at what locks an ALTER TABLE statement will create, it gets a couple of other locks, but we’re interested in this one.

This is an ExclusiveLock on the table.

And exclusive means that it will conflict with any other box, other ExclusiveLocks, other ShareLocks, and this sort of makes sense.

If you’re altering a table, you can’t select out of it at that very second because you don’t know what columns it has at that point.

And I know this can all be a little bit hard to follow, so let’s sort of walk through an example here.

We’ve got two database sessions and the table.

The Read Session is running a SELECT query, so it requests the ShareLock and gets it.

The Migration Session then attempts to run the ALTER TABLE and cannot get the ExclusiveLock because it conflicts with the ShareLock.

When the SELECT statement finishes, the ShareLock gets released and the Migration Session can now acquire its ExclusiveLock and execute this ALTER TABLE statement, that green section that we see here now.

And if the Read Session attempts to acquire the ShareLock again while the ALTER TABLE statement is still running and that Migration Session still has that ExclusiveLock, it ends up blocked—and that’s the red section here means that that session is blocked.

And eventually, the ALTER TABLE statement finishes, Read Session can get its lock back.

And this kind of looks bad.

There’s a lot of blocking happening here.

But, in practice, this isn’t that big of a deal.

Reads are fast.

Like the median query time for read query at Nextdoor is like 900 microseconds or something like that.

We run a lot of really fast queries.

And the actual time that an ALTER TABLE statement takes to execute on Postgres at least is some tens of milliseconds.

And it doesn’t matter if the table has 10 rows or a billion rows in it, it’s the same 25, 30-ish milliseconds.

And, in practice, we found that if this is sort of what happens, like, we can handle a latency spike on our read queries of 25 or 30 milliseconds.

A real-life example of locks

I wanna walk through one other series though, which I think illustrates something that’s kind of important.

So now we’ve got another Read Session in the picture here, and it’s sharing this lock.

These are ShareLocks, so they can both have them at the same time.

And then our Migration Session attempts to acquire its ExclusiveLock.

And one of our Read Sessions attempts to acquire a ShareLock before the other ShareLock has released, but after the ExclusiveLock is requested to be acquired.

And, this blocks it.

It’s now gonna have to wait until the ALTER TABLE statement finishes.

Lock acquisition is serialized.

So if one lock request comes in and then another one does and they conflict with each other, the earliest that the second lock request will acquire its lock is at the same time as the first one.

And that’s only if they don’t conflict.

If they do conflict, they need to wait until that first lock is released.

So this looks a little bit worse now.

We have this longer blocking period, but this is still all very short timeframes.

You know, the maximum amount of time that this thing can possibly end up blocked is the time that it takes to do a READ statement, a SELECT statement, and run the ALTER TABLE, which is basically the time it takes to run the ALTER TABLE and just isn’t that long.

It’s not that big of a deal.

What happens if you bring a content table in?

Let’s bring that content table into the picture now, though.

So when we insert into that content table, this is the one with the foreign key, we acquire a whole bunch of locks.

We’ve got this lock on like the content_id_seq, which we’re using to generate the ID for the new row, and we have this lock on the content table.

And that profile_pkey index gets locked as well, and we also get a ShareLock on the profile table again.

This sort of makes sense.

We need to make sure that we don’t delete the row that we’re referencing with our foreign key out of that table while we’re trying to insert a row that references it.

And this means that if we look at a series like this, and it is a ShareLock so it can be shared with reads.

Reads and writes can coexist fine.

If we look through another example like this, the Write Session can end up blocking the ALTER TABLE.

And then the ALTER TABLE blocks the SELECT, and the SELECT has to wait for the ALTER TABLE to finish.

This seems like it might be worse.

So, our scenario, just to be clear, Session 1 opens a transaction, inserts a row into content, Session 2 opens a transaction and tries to add a column to profile, and now every read that we have on profile is blocked.

And I’ve got the Lock Queue here to show you what’s going on.

And this ends up being a much bigger problem.

Writes take longer than reads, and more importantly, we are frequently doing them inside of transactions.

And transactions run for as long as whoever opened the transaction leaves it open for.

So, it can end up being that the time that a read query ends up blocked is the length of your slowest transaction plus that ALTER TABLE statement time.

And it can be tough to track down every slow transaction that is writing to a table that has a foreign key to the table that you are actually trying to modify.

Lock acquisition volume > read/write volume

The real takeaway here is that if you’re trying to figure out if a particular DDL statement is going to be risky, it’s not the read or write volume on that table you need to be worried about.

It’s the lock acquisition volume and how long those locks are going to be held for that will determine if a particular migration is going to be safe to run.

And that’s a lot harder to figure out than just what’s the read or write volume on a particular table.

Now, you might get lucky.

Maybe there are no write statements ahead of you in the queue when you try and run that ALTER TABLE statement and everything will just go through fine.

We found that we weren’t getting lucky, and this seemed like a hard problem to solve.

It is a hard problem to solve.

So, we sort of gave up on this for the moment and decided to try some other things to reduce the load on our poor release engineer who’s still having to run all these migrations all the time.

And the first idea we had was the migrations release.

So this would be, once a week, we would run migrations.

We’d still deploy code every day, but those releases wouldn’t be allowed to have any migrations in them.

This is sort of like having a scheduled maintenance window, and this did not work at all.

It wasn’t just that this was like bad for development philosophy because now if you wanted to make a migration change, you’d have to wait for this window.

If you had this big batch of a week’s worth of migrations, the chances that one of them would not work for some reason was…it got higher and higher the more there were.

And if you couldn’t run one of the migrations, you couldn’t run any of the migrations after that because migrations are sequential.

They depend on each other, and you have to run them one after the other.

So that would block the whole deploy, and then we’d have to wait till the next day to try and get things out.

Now we’ve just lost the whole point of doing continuous deployment in the first place.

There was one other sort of issue with this, which was that no one would remember what day the migrations release was going to be, and they would constantly bug whoever was the release engineer about, “What day are you actually gonna run the migrations? When can I commit my PR that has a migration in it?”

What if everyone ran their own migrations?

So, this didn’t address the fundamental issue here, which is that we have too many engineers generating too many migrations for one person to be responsible for running them all.

So we have everyone run their own migrations.

The idea here is that whoever wrote the migration, they would get a container built that had their branch running in it.

They would get that container deployed into staging, they would run migrate inside that staging container, they would pick a time that they could do this in production, and then they would do the whole thing again in production.

They didn’t.

And it wasn’t just that everyone forgot to do this, that they forgot that, “No, the build isn’t gonna run your migrations anymore, you have to go run them.”

It was hard to deploy this weird container that we built off of somebody’s branch that wasn’t gonna get really deployed.

It was this extra path through our build scripts, which was not great.

So then we tried having people just run the SQL themselves so we didn’t actually have to deploy it.

You can generate the SQL that a migration would run with a command, but then you also need to insert into that Django migrations table.

And people basically never remembered to do this, so it didn’t really work.

It turns out there’s a reason that all those frameworks provide a command for you to do this because if you try and do it yourself, you’re just gonna mess it up.

And the other thing we learned is when we stopped trying to run the migrations on the deploy, if somebody had forgotten to run a migration, the deploy wouldn’t block it first.

So now we just deployed an application version that depends on a migration that hasn’t been run, and now lots of people are seeing errors and we need to roll back.

Some key takeaways…

So this whole thing was not very fun, but we at least did manage to learn some things on the way.

The first is some migrations are going to require locks that just make them hard to run.

I know this is called “Migrating Data Safely and Quickly,” and it turns out, well, it’s hard to do it safely.

You just need to get those locks sometimes.

The next thing is that all the other migrations, the ones that aren’t altering tables that have lots of foreign keys to them or modifying tables that have 200 million rows in them, those we can just run normally.

Bringing manage.py migrate worked fine for those still.

We don’t want code that…this is what I just mentioned, we don’t want code that depends on migrations that haven’t been run to get deployed at all.

We want the deploy to fail if that’s the case.

We also picked up that we don’t want really long deploys.

If your deploys are really long, then you can’t set aggressive timeouts, and aggressive timeouts are really important to running any system safely and effectively.

So some operations like CREATE INDEX concurrently for those of you who write in Postgres or just creating an index in MySQL, these are actually safe to run.

They won’t cause this locking problem, but they will take a long time, hours, days potentially.

And we decided we didn’t really wanna have our deploys running for hours or days.

And the last thing we learned is that we’d prefer if nothing changed in Dev and CI because while these problems in production are happening, we’ve got tooling to run manage.py in our testing environments, and everybody in Dev is used to just being able to run manage.py migrate to bring things up.

A simple workflow for migrations

So, we didn’t wanna like rewrite a whole complicated system to do this.

We wanted to just extend what Django was already giving us but try to decouple the database schema from our application version a little bit.

So here’s what we came up with, and I’m gonna try and move out of the way a little bit so that maybe you can see this.

This is the code, and it’s not very complicated.

We check if we’re currently running in an environment in this ENVIRONMENTS_TO_RUN array.

And if we are, we just call our superclass, which is just gonna apply the migration totally normally.

If we’re not in one of those environments, we iterate through this list of callables calling each of them and making sure that they return true.

If they all do, we return the project_state, which is gonna trigger Django to do the insert into that migrations table like the migration had just been applied, even though we didn’t just apply the migration

And if any of the safety checks do not return true, then we raise the SafetyCheckException, which fails the migration and fails the deploy. And the Dev tools engineer gets paged, and he reverts your command and sends you an angry Slack message.

I didn’t explain what these safety checks are because they’re really the key to making this work.

And basically, the idea here is that instead of inspecting that migrations table for what our current database schema is, we inspect the database schema directly.

Postgres provides a bunch of views in this information schema that let you inspect like what columns are in the database.

So if you wanna know if a particular column has been added to a particular table, you can just query that view and find out.

And there are equivalents that you can do for like whether indexes exist, what the column type is, whether it’s nullable, if a trigger has been added or dropped.

And if you’re running MySQL, there are similar features of MySQL that will let you directly inspect the database schema.

A sample migration

So here’s what the whole thing sort of looks like.

This is an example of adding a column.

By convention at Nextdoor, we add the DDL that we’re actually gonna run in production as a comment to the migration just so that people can see what has been done in production a little bit easier.

The safety check here, of course, is checking that the column exists in the database.

And when we run this migration in production, if that column exists in the database, it will insert throughout the migrations table and then continue.

And in Dev, it will actually do it.

It will actually insert the row for us.

So what does this workflow look like?

You create a PR, you generate the SQL using that SQL migrate command that I mentioned earlier to see what SQL migration would run.

Once it’s approved, you are responsible for running the SQL in staging and production.

You can then merge the PR.

It gets deployed normally as part of our continuous deployment process.

The safety checks will run.

And if they pass, the deployment proceeds and everything’s fine.

And if not, angry message from a Dev tools engineer.

So that’s…and hopefully, most migrations don’t end up involving this actually, right?

Most times we say, “You know what, that migration, adding a new table, that’s always safe.”

So, for those, we don’t even do this.

We just let the Django migrations run normally and you get all the normal goodness that you expect from your migration management system.

Lessons learned

So I wanna just leave you with a few takeaways here.

The first is that running migrations at scale is tricky.

It’s hard to figure out if a particular ALTER TABLE statement is going to trigger that locking behavior.

You have to analyze not just the read and write volume on that table, but the entire schema of the database and all the transactions that are potentially happening against it.

So, sometimes it’s better to just be safe and just run things as an environment where migration do that ALTER TABLE statement in the middle of the night or during scheduled downtime.

The next thing is like if you are not using migrations tools, a managed system to migrate these things, like, do that for sure.

It will make your life much easier.

Don’t give those up.

Even if you run into issues like we do, try and extend them instead of replacing them.

The tight coupling, unfortunately, that you get with those migration tools between your application version and your database schema can cause some problems, particularly if you’re continuously deploying.

And automate everything.

You will notice that I brought up that we weren’t able to run some manual processes here a lot in this talk.

It’s hard to consistently run things manually.

So, don’t do it.

Take the time to write the tools you need that you can make processes as automatic as you can.

Even if there are some things like adding columns to very heavily used tables that you have to do manually in scheduled downtime, off-hours, you live with that.

But everything else really needs to be automated.

So that’s all I got.