The Service Map for APM is here!
Move to the cloud, double in size, or automate MySQL scaling: Pick three (Shopify)

Move to the cloud, double in size, or automate MySQL scaling: Pick three (Shopify)


Published: July 12, 2018
00:00:00

From flash sales to rapid scaling

If you don’t know Shopify, if you bought something online from somewhere which isn’t a marketplace, so an independent retailer in the last year, chances are pretty good that you bought it through Shopify. Our mission is to make commerce better for everyone, online, in person, craft fairs, and some of the biggest retailers in the world. We’re also probably the number one platform for flash sales.

If you don’t know what flash sales are, which I didn’t before I joined Shopify, they are limited releases of exclusive items that are timed, and that are normally associated with a celebrity. So, this is an example. Kylie Jenner sells Lip Kits cosmetics on Shopify, and we also host flash sells for Kanye West, and for various sneaker brands. They’re a big one.

People desperately want there their sneakers and makeup. A tweet from Kanye or an Instagram post from Kylie Jenner can do this to our databases. We can have 10 seconds’ notice to go from 1,000 or 2,000 queries per second to 100,000 queries per second. And then it can take several minutes to recover from this load.

A lot of this traffic, unfortunately, is generated by bots, but an enormous amount of it is still people. And when those people can’t get their limited releases, there’s always going to be people who are unhappy. But it’s very different to tell people something is sold out than it is to serve them a 500 error page.

So this presents some interesting challenges for Shopify and our scaling, because we can go from needing one times to 20 times the servers in less than the time that it can take to boot most virtual machines.

So, hi, I’m Aaron, I’m a Production Engineer on the datastores team at Shopify. This talk is going to largely talk about datastores and how our data was moved to the cloud. There is a bigger effort that’s outside of all that, but I’m going to try and give you a general overview of what Shopify did.

So I’m going to talk to you basically about what the last two years were like for my team, and for our department, as we moved from traditional data centers to a public cloud.

What is Shopify?

So, the obligatory number slide to put the rest of this in perspective. We have about 600,000 active merchants. That’s 80,000 requests per second during peaks. 26 billion US dollars of GMV, which is gross merchandise volume, which is basically sales, through our platform in 2017. I believe the total number of sales for our platform, forever—the project has been going about 14 years—is in the region of 60 billion US dollars.

So, we did nearly half of our all-time traffic—our all-time sales last year through the platform. There are about 800 developers working on Shopify. Approximately half of those work on core products. And we also have lots of services around the edges. And we do about 40 deploys a day, which is neither high nor low depending on you know, who you compare yourself to.

Good, there we go. We are primarily a Ruby on Rails shop, although, like a lot of other medium-sized startups, we use a bit of everything. There’s certainly Android and iOS for mobile apps, but there’s Python, there’s lots of other things.

Shopify Core, which is the main application that generates most of our revenue, which we refer to as our majestic monolith, is the direct descendant from the 2004 Rails application that our CEO wrote after being frustrated trying to sell snowboards online, and not being able to find software to sell snowboards with that met his standards for ease of use. Yeah, it’s never been rewritten, and it is probably one of the oldest Rails code bases in existence.

It predates the release of Rails 1.0. It actually predates the release of Rails, because Toby was just given a zip file of an early version of Rails by DHH, so it’s pretty old.

We run MySQL as our primary datastore. We actually run a lot of MySQL. We run six nodes per shard, and I’ll get to that in a moment, and that’s three in each location.

So we’ll have an active location and a passive location. Those locations map onto different failure domains. Historically, that’s meant a data center in a different city or a different state in the US. Nowadays, that means different regions with a cloud provider.

And the region will be a whole region, not just one availability zone or one zone, depending on your provider. We will write off, for example, all of us-east-1 if we start seeing issues, and not just try and reduce things to a single failure—availability zone. And we run over 100 shards. So, just even that basic napkin math—and I’ll save you, it’s about 600—is a lot of MySQL instances. Those are for the core application primarily virtual machines. They’re not any hosted solutions. They are not RDS. They are not Google Cloud SQL.

We also run hundreds of other MySQL instances which support the other applications that make up a lot of Shopify around the edges, selling on Amazon, selling on Facebook, those sorts of things. Shopify is sharded, so it’s a multi-tenant application like a lot of software as a service applications.

So each shop—and we’re very lucky in this instance, because pretty much everything that we need to store belongs to a shop. So our sharding key can just be “shop ID”.

There’s very few things which need to actually operate across multiple shops, although those things exist. And we have a master catalog, which you can see on the right there, which knows where each shop is in relation to each shard. So you know, given a shop ID, it can figure out which shard ID has that data on. Big shops, small shops, you know, we just try and tessellate and bin pack them into these different shards.

And Shopify is also podded. Podding is the process of basically taking Shopify and turning it into smaller Shopifys. So we have unshared a portion of our infrastructure. So, MySQL is the obvious one, through sharding, but also Redis for our job queues, our Memcached service, and the distributed replacement we use for cron, which actually happens to be a Resque scheduler. Those are all isolated to a pod. So we’ll put x thousand shops on a pod, and those are isolated from the rest of Shopify. If the pod has an issue, the rest—the other pods, which as I said is like up to 100, are not going to—generally speaking, are not going to see any issue from things that are isolated to one pod.

There are resources which are shared. Search happens to be one of them. And in order to deal with the flash sales that I showed you at the start, the web and worker resources are largely shared. Which means that as long as you know, multiple sellers are not having flash sales at the same time, we can use the spare capacity in that great big pool of web to help smooth out large flash sales.

For clarity, the Shopify pod is not the same thing as a Kubernetes pod. Unfortunately, we chose the name before we moved to Kubernetes, before Kubernetes was as popular as it is now. We actually do run all of our stateless workloads on Kubernetes, so the “Do you mean Shopify pod or do you mean k8s pod?” is a regular question that comes up in our war rooms, in meetings, in every single situation. But it’s probably too embedded now for us to change our minds about this. So if I say pod today, I’m generally speaking—talking about a Shopify pod. I’ll make it very clear if I’m talking about Kubernetes. Sorry.

So, each of these pods can live in multiple—Each pod exists in at least two failure domains. And they can be flipped between one failure domain or the other. As I mentioned, they—resources which are podded, are isolated from each other. So, a huge burst in Redis activity in one pod is not going to affect merchants in another.

Because of this isolation, there wasn’t a single second in 2017 where the whole Shopify platform took an outage. Which is not to say that we didn’t have issues, and we didn’t cause merchants pain. But it just was the case that the work to isolate all of these things, which was largely done in 2016, really paid off for us in 2017, by avoiding the kind of situation where you take the entire platform down, or one enormous sale, or one DDOS takes down the entire platform.

Before the cloud

So, 2016 was the last full year where Shopify didn’t run any of its production services in a public cloud. So, we’ll talk a little bit about what the infrastructure was like back then. It looked like a data center. These were pretty much textbook pets, and this is pretty specific to datastores.

So by this point, Shopify was a containerized application. The Rails, web workers, and jobs workers are containerized, running on our own container orchestration. But the databases are still running on extremely powerful—not Dell R610s—but extremely powerful machines with the fastest disk located in the same chassis, low latencies on a very stable network.

These were machines which only basically took outages for hardware maintenance, critical security updates. So we would see uptimes in the hundreds of days, and a lot of Shopify relied on the stability that both the machines and having a stable controlled data center network provided.

So, for example, we took out—Rails has as part of its connection tooling, it will do a check by default to like ping a connection once it checks out to go, “Is this connection still valid?” We removed that, because why would you need to check that? Clearly the database is going to be on the other end of that socket. It’s just a waste of CPU and network.

These are the kind of assumptions that moving to the cloud kind of teased out, because where you have variable latencies, where you have variable disk speeds, things that you always assumed would be faster, always assumed would complete in a certain amount of time, do not necessarily do so. And these machines were of a small enough number and with a small enough rate of change that they were pretty much configured just by making changes to Chef cookbooks that lived in GitHub through a PR process and review. This is how we handle failovers. We have virtual IPs backed by BGP.

This is our—the six-node setup that I showed you in the previous slide. Things with “(R)” are potential readers for where we send just SELECTs and reads. Things with “(W)” are potential writers. Only one side is ever truly right at load time. It’s not necessarily easy to see, but it’s the one with the black circle around it. Pretty much the worst thing that can happen to Shopify is to end up with two databases being written to at the same time. That’s going to cause a split brain where we end up with different sets of data that we cannot reconcile into one thing.

It’s considerably better for Shopify to either be read-only—and bear in mind, sorry, I should say—a pod of Shopify to be read-only or be completely down than to let a split-brain scenario happen. We used to open thousands of direct connections to the databases. This meant that as we scaled out machines for workers and for web capacity, we were putting more and more load on the same basically fixed-size databases that were backing those connections.

You can see the pink bars here line up with the end of the deploy, when this was taken. Each deploy takes about seven minutes to roll out. We’re deploying about every 15 minutes, so during this time period, we were basically deploying 50 percent of the time. Nobody really—excuse me—was running the deploy code for the other 50 percent. This worked well with virtual IPs, which is the reason that we kept this situation, was because nothing needed to know the state of how to reach a database, other than some routers on the edges of our network. The performance MySQL did suffer from doing this, but it’s definitely a frog-boiling kind of problem, you know. We increased the heat, we added more connections.

Performance got a little bit worse, and no one really, you know, had any trip wire with it that they crossed, where they’re like, “We need to fix this thing.” And you can also—just to point out—you can see the connections never actually dipped to zero during the deploy because we’re only deploying about a quarter of our capacity at any given time. So we roll out a quarter, that boots up, then the next quarter goes down, which is why connections never hit zero.

Back in 2006, our failover tooling was a solid but incredibly unfriendly bit of Perl called NHA. It’s a CLI tool. This is a typical output, and this is typically when you get woken up at 3 o’clock in the morning, you have to figure out, “What are the right parameters that I need tell it for the dead host?” What is the new host? And there’s considerable amount of configuration that’s actually hidden in those config files as well. So it’s definitely something that’s taxing for someone to wake up and have to do.

And it’s a high-risk operation. So these failovers of databases—I mean they represent pretty much some of the most important stuff to Shopify, which is our customers’ data. Once you’ve done a failover—and you don’t need to read this slide. There’s a lot of text there. But again, remember that I was reading this at 3 o’clock in the morning. You need to take something from the output and then apply that manually to other machines in order to fix the replication chains where you failed over a database that has failed.

This used to be really one of my biggest fears, would be wake waking up, having to do this, and getting pretty much any detail of this wrong was going to cause problems further down the line, and may even lead to data corruption. So, pretty much all a high-risk, intricate, brittle situation, things that seem like they would be a great fit to automation.

The move to the cloud

That’s where we started from, moving from a known environment with known speeds and known capacities, to an unknown environment, even one which we had bench-tested and done our due diligence on, meant that we’re going to have to build confidence over time in that platform. We couldn’t just move our paying customers and hope that everything worked.

When you were kids, did anyone ever tell you that if you cut a worm in half, both ends grow back and you end up with two worms? That, through the course of looking it up for this talk, is not true, so don’t do that. You’ll just end up with one short, maybe angry worm that has to grow its tail back. Where it is true, and why I needed it for this talk, is in databases. This is our standard six-node replica setup or six-node setup, five replicas, three in each of the failure domains.

In 2016, how Shopify grew was by basically growing a new tail on these databases and adding 12 nodes. So that’s another six nodes that need to have their—backups restored onto them, replication set up and be put into this replication topology. During the maintenance window, we’re basically going to cut that line down the middle, and we’re going to say, “Half of the shops are going to live on the left side, and half the shops are going to live on the right side.”

So these are going to have shard numbers. Let’s call them “Shard one” and “Shard two.” We update the master catalog. Once you’ve created Shard Two, basically off the back of Shard One, and we decide 50 percent of the shops around them are going to move, we update their master catalog. We update the master catalog to say which shops have moved.

So the same data actually exists on both shards, even when we separate that line in the middle, and we basically create the new shard, the old data from the other half is still on there, and needs to be cleaned up asynchronously. So we don’t reclaim a lot of disk space by doing this. The reason that this is safe is that every request that comes into Shopify, through our NGINX load balancers, is doing a lookup to go, “Okay, I have this shop. Where is the shard for this?” Which maps onto, “Where is the pod for this?”

So, that requires us taking a pretty considerable maintenance window, where we make everything read-only, while we perform the split. And that, as I’m just going to show you, is a lot of steps. So between building those machines and automatable things like setting up DNS, Chef, informing people, creating a notice of maintenance, because we’re going to disrupt tens of thousands of customers. So we put home cards, which are a little personalized notifications we could put in people’s shops. So, that’s not a particularly scalable way to carry on doing things.

And at this point, we had approximately in the tens of shards, and we knew that we were going to have in the hundreds, which we do now, two years later, and that we’ll probably have in the thousands. It’s not scalable to have a 103-item, human-driven process. This used to take three people approximately two days of work. And then the final steps required them to get up very early in the Eastern Time Zone morning to execute this. And you know, it still annoys customers.

So shard splits, which is what that process is, are clearly disruptive. They require a short outage, but they affect an enormous amount of merchants. They’re the sort of thing that we can only really afford to do a few times a year without adversely affecting our service-level objectives or agreements with customers potentially.

They also, because we can only run so few of them, they basically require us to move half a shard at one time, which is still thousands and thousands of shops. When we’re trying to build confidence in a new provider, we don’t necessarily want to move thousands and thousands of customers, only to find out that, “Oh we’re having some issues.” We either need to move them back with another maintenance window, or those customers are going to effectively suffer because of infrastructure choices that we’ve made.

So how about moving one shop at a time? So every ID and every table in Shopify is unique, which means that it is possible to just move people from one shard to another. We don’t have to worry about a duplicate ID or duplicate row. This is the historical version of the shop mover, the 2016 version. We lock a shop to make sure that no changes happen. We copy all of their data over, row by row, table by table, for a given shop ID, into the new database on the other side. And then we basically update the master catalog to say that person has moved and we unlock—I’m sorry, that’s me showing the copy step.

So yeah, it’s literally as simple as you know, selecting from one table and inserting into basically the same table, but in a different database. The reason for the lock, if that isn’t clear, is to make sure that if there is a new order created on the old shard on the left side, that we don’t miss it when we’re copying to the new site. So we basically have to stop everything in order to move it, or we did in 2016. Once it was done, it would unlock the shop, it would update the master catalog.

And new customers coming in are now coming in, and they’re hitting a new shard on the right-hand side. So the downtime for this could be considerable. It could range from minutes to hours to—for very large shops in which we didn’t try to move this method—to actually days of time where they’re read-only. And during the time they’re read only, that means nobody can buy anything, they can’t update their stock, they can’t add new applications, they really can’t make any changes at all.

So pretty much the first lesson that we learned is that real data doesn’t stay still. For a products company where our number one feature of our product is that it’s available, it’s not possible to take a large maintenance window. There’s also a correlation between the size of the data and how long it takes to move, but there’s a correlation between the success of the merchant and the size of their data. And there’s a reverse correlation between the success of a merchant and how much of their downtime they’re willing to suffer. Some of the larger shops the data centers took 100 hours to move. And I can think of very few large merchants and basically no small merchants who would be willing to take five days of downtime to move their shop.

So Shop Mover affects a few merchants, but it affects them for a very long time. So it’s a much smaller blast radius but it’s a much larger impact. What we’re really looking for, the sweet spot, is something which is going to affect a small number of merchants, or a variable number of merchants, as small as we would like, for short of time as possible.

So we need a third option, which is Ghostferry. Ghostferry is a Shopify tool, developed in 2017, and open sourced this summer under MIT license. It’s the Swiss Army knife of my MySQL migrations, is how we market it. And the algorithm has been formally verified using TLA+, so basically a finite model. Which is not the same thing as saying that it’s been proved, but we have gone through the effort of going, if you’re writing to this table and you’re, you know, performing these actions, we can with a degree of confidence backed by some math that I don’t understand, because I didn’t do this work, say that you’re not going to lose any data.

So this is pretty similar to the previous slide that I showed you about shop moving, but you’ll notice that there’s no lock. We’re going to run several threads of execution at the same time. We start doing the copy. The copy is still slow. The copy can still take four or five days, but the important thing is that there’s no disruption to the merchant while its happening. We haven’t locked their shop. They can still do everything that they want to do. They can add products, they can sell things, and they can update stuff levels.

At the same time that that’s happening, we are tailing the replication logs, so there’s a replication log that MySQL is keeping, which is basically a record of every change to every bit of data. Ghostferry connects to MySQL, pretends to be another MySQL database that wants to replicate. And it’s reading those records in, and it’s only really looking for records that’s relevant to it. So basically records that have a shop ID that matches the shop that it’s looking to move. And while the copy is still ongoing, it’s streaming new updates into the new shard. Once the copy is done—yeah, it’s okay.

Once the copy is done, the stream is going to continue running until basically the lag between the old shard and the new shard converges down to zero, or is close to zero as we’re happy with. It doesn’t need to be exact, because in the final stage, which is called “cutover,” is that we lock the shop for a very short period of time, we’re just making it read-only. We wait for replication lag to definitely hit zero, and then we update the catalog, and we unlock the shop. So the total time the shop needs to say “read-only” is usually seconds.

This is a post from Unicorn which is our internal system for sharing praise with our colleagues. I can’t say who the merchant was, but their shop was pretty much largest in Shopify, 1.1 terabytes in size. And we moved them, while it was being written to extremely heavily, with 13 seconds of downtime. So the blood, sweat, and tears that I mentioned in this come from fact it did take several attempts to do this, and we learned a lot while doing it. But the important part is that those blood, sweat, and tears are on the engineering team’s side. The customer did not suffer for our attempts to move them, which was an important part of moving our infrastructure.

You know, our customers should not be disadvantaged by the fact that we have chosen to move to a cloud provider. So, Ghostferry was the name of the generic open source version of that tool. Why it says “pod balancer” on here is that there is basically a Shopify wrapper, which does some Shopify specific things. It knows about shop IDs, it knows how to announce locks on a shop.

This is the tool that pretty much meant that we could move Shopify without having to take multiple-hour maintenance windows for customers and deal with the wrath of people on Twitter when we did that. And as you can see, it’s the ideal fit for the problem. I’ve kind of loaded that for you by making everything green.

Background jobs

Another problem during this move is that there’s actually a lot of background tasks that happen on Shopify. Some of those are initiated by customers and they might happen—they’re on a per-shop basis, but there’s also—and that might include things like re-indexing in Elasticsearch. There’s also things that happen across shops which might be like, “We need to update a template globally.” So you’ll have a job that just works its way through Shopify, making these changes.

There is no such thing as a quiet time for Shopify, and there is no point where we can take an outage window for these. The way that payments work on Shopify is they actually go in as background jobs as well, which means that effectively, background jobs have exactly the same SLO requirements that interactive requests from customers have or merchants have. So, it was going to be necessary to have developers involved in this process.

So, it’s a complex code base, as I mentioned. It’s basically 14 years old. Hundreds of developers work on it. One of our teams, which exists to give developers the tools to not have to think about their infrastructure and not have to think about how it scales, so there’s less cognitive load for our developers. We basically allow our developers to mark a job as unsafe during shop moves. That means that if any moves are going on, that job won’t work. For normal jobs, which are not marked as unsafe during job moves, jobs aren’t going to work on the shop while that shop is being moved. They’re just going to wait until the job is finally moved. They’re going to get renqueued again and then they’re going to work on the shop once the shop has moved to its new shard, its new location.

There are certain jobs which, through having a 14-year-old code base, nobody necessary thought that, hey, we’re going to be in a situation where we’re maybe taking a small outage window on a shop-by-shop basis. That we gave them a tool to go, “As long as you’re okay with your shop not running, and bearing in mind this could be for 80, 90 hours, your job not running,” then you have the tool to do that.

Empty shards

So now that we can safely move data without causing too much disruption, we needed somewhere to move it to. Ever since Shopify was sharded, there existed a new shard script, it had never been run because every time that we’d ever created new shard, we had taken it from—using the worm analogy—from the tail of another shard.

We went to use that tooling and, as you can see from the bullet point, spoiler: it didn’t work. So we needed to come up with our own tooling to do this. As I said, we were going to move from doing tens of these in a year to doing tens of these in a day. So there was considerable pressure to make that tooling as automated as possible.

This was the first draft of it. 23 items is still a lot less than 103, but it’s also still a lot. And a lot of these items are also the super-time-consuming, high-touch things like, “talk to this team,” “make sure that no maintenance is going on with this thing.” There are things that are hard to make a computer do. Some of these things involve opening PRs, or they involve GUI tools that are slightly harder to automate.

It’s still a reduction and it’s still win, but we felt that we could do better.

Moving targets

And early on in the move, pretty much every time we tried to follow that checklist we would find something that didn’t work, and that led to more checkboxes being added to future checklists. Working around maintenance was a particularly tricky thing. It’s hard to make a computer aware of when it’s not okay to do something because there’s a disturbance in the force somewhere in Shopify. And also we were blocking developers by doing this, so the longer that a pod was in the state where it had been added to Shopify, but wasn’t being actively used, it meant that migrations couldn’t work. People couldn’t ship features.

So we needed to be aware of timelines around features which have been announced and promised. For example we have Unite, which is a Shopify-specific external conference, and there are features that need to launch with Unite. We had GDPR, which a lot of companies here will have had to deal with. And that’s an immovable deadline that we had to comply with. So those are the sorts of things that actually make it the hardest to automate.

The computer stuff is basically the easy thing. So we refer to this as maintenance-replacement automation. You’re basically taking things that a person manually does, and you’re assisting that person by taking the easy parts, effectively, or the most automatable parts, and making those scripted. But you’re still talking about having a person in the driving seat, and initially we felt like this is a bit of a failure. We had really been trying to follow the SRE model and automate everything, but there are some things which are not necessarily the highest impact to automation.

This was something which, if the automation—you know, if this went wrong, it doesn’t cause pain to our customers, it just causes inconvenience for the engineer who’s running these scripts. So it’s not glamorous to allow some tool to remain, but we felt like in this case it was the right business decision. Which is not saying that we gave up on automating anything. Pretty much the name of the slide is “Automate anything.”

Automate anything

When faced with something where you don’t feel like the task can be fully automated, our initial feeling was like, “Okay, well, this can never be done.” And that’s not what you should do. I think that you can break the task down into, “These are things that are easy to automate, so let’s do those, these are things that are hard to automate.”

Well, let’s use the time that we saved on automating the easy things to work on either making those things easier to automate, have those difficult discussions, talk to other people and say, “Do we even need this step?” Or just to make it so that when you’re actually having to go and create these things, that at least all the easy stuff is done, at least all the automatable stuff is done. And all you’re left with is the annoying high-touch parts.

And it’s okay if that work never gets finished. It’s okay if you reduce down to as few manual steps as possible. And there’s just a baseline of manual steps that you just can’t get rid of. At least you’ve avoided the pointless parts of the toil. Cool. There’s a mouse cursor there, which is confusing me. I’m not going to play this whole video. I’m just going to have it going. I’m not going to play any of this video apparently. What that would have been was the process of signing up for a store.

This was something that, historically, we’ve always done. You create a new shard on Shopify. You create a new pod on Shopify. You sign up a store for yourself. You set a Pingdom to point to that store. It is really a backstop, last-ditch bit of monitoring if for whatever reason you know, your Icinga, your Datadog, whatever other monitoring systems you have in place, at least you’re literally doing an HTTP check to go front to back, to go, “The stack is working. We are serving web pages.”

We got so good at automating the easy stuff, that the most time-consuming part of adding any shard was signing up for a Shopify store, launching it by basically taking a password off it, and then adding it to Pingdom. Which was always, like historically was the most trivial part. Like if you did a 103-item checklist, and you spent three days on it, adding something to Pingdom was no big deal. But once you’ve automated 100 and so of those items, you’re actually left with the Pingdom part being the stickiest.

So it was actually worth spending engineer time on—particularly as you know you’re going to do this hundreds of times, you can spend five minutes, and it does take about five minutes doing this. Doing it 100 times, it’s easy to make that justification that an engineer should actually spend their time doing something which kind of seems trivial, like signing up for a store, automating the sign-up for a store, and adding to Pingdom. And this is pretty much the last version of that checklist. There are still checkboxes there, ready to be ticked, but it’s an achievable amount, and no one needs sign up for a Shopify store anymore.

Resiliency experiments

While we’re doing all of this, we ran resiliency experiments. Again, this is an old, existing code base that’s constantly worked on but never rewritten. We discovered when we crossed from nine shards to 10, that we actually had regular expressions in Shopify with a \d that were looking for a shard ID. And by switching to double-letter, double-digit shards, we had suddenly broken some functionality.

So there had never been a shard removed from Shopify before. We wanted to make sure that the first time that we did that wasn’t going to be in production with live customers who were going to see a failure. So we ran experiments. First of all, we skipped a number. We skipped adding 49. We went from 48 to 50, and caused some confusion, but no real problems. And then we added a shard, put one shop on it, moved the shop off, and then removed the shard again to see what happened.

Some pagers went off. Some monitoring was too twitchy or misconfigured. And we were able to tell those teams that shards going away from Shopify is going to be a common practice in future, and that’s something that your monitoring and your automation need to deal with.

Double in size

Double in size. Not doing well for time by the way. At the rate that we were moving merchants to the cloud, and based on our natural growth, by the time we had hit the 50 percent mark, our merchants which were still left in the in data center had also grown by approximately 50 percent. Which meant that one year into our move, we had pretty much the same amount of data to move as we did when we first started.

That was also the time where it basically became clear that we were going to go through Black Friday and Cyber Monday in the data center and the cloud. So BFCM is the biggest day for commerce in North America, and actually, increasingly, English-speaking parts of Europe. It’s kind of our movable—immovable feast and will be a day where we’ll see our traffic more than double.

That meant that we’re going to have to run both locations at peak performance. We were going to have our—we were in the process of moving out of the data center, but we had to put new equipment in, and we had to have the best BFCM that we were ever going to have in the data center. Which also means that nothing is basically deprecated while it’s in production.

It’s very tempting to look at your old technology stack that you’re moving away from, and consider that frozen, “We’re not going to make any more changes to that.” That wasn’t a situation that we felt we could be in. Initially, we thought that we were, we created new cookbooks for the cloud, and we started making all of our improvements in the cloud. But when a move goes from six months to 18 months to maybe two years, you don’t know when you’re making those decisions how long you’re going to have to deal with that technical debt.

Best before removal

So that meant that our data center had to be in its best shape even though we knew that months later were going to be decommissioning it, getting rid of the hardware. When the datastores decided to—Yes, covered that. Sorry. So we ended up merging the two versions, the data center and the cloud versions, over automation of our Chef. We merged those together. We actually ended up refactoring the data center code, again, knowing that it was going to be deleted shortly, to basically leave it in the best position that—best factor you know, best tested that it had ever been, shortly before it was deleted.

And before leaving the data centers we prepped up on one last scanning problem. Again, bear in mind that half of our customers are still in the data center. So it was important that we not again not disadvantage any merchant just because we had decided to make a technology change. And we didn’t want to disadvantage them because through random luck they ended up being in the data center instead of being in the cloud. As I mentioned MySQL doesn’t like having thousands of connections open.

And as we were heading up to the last BFCM in the data center, we’re going to get to the point where those connection counts I showed you earlier on were starting to get critical. So we tested and deployed a new proxy layer between the application servers and the databases. Which had a huge performance benefit. The left-hand side of this chart and the right-hand side of this chart show MySQL doing the same work. What we’re actually seeing here is the threads inside, or the threads which are actually executing inside the MySQL kernel.

So the overhead of just having tens of thousands of connections doing almost nothing, versus having a few hundred connections which were actually doing work, meant that we were spending—we were able to actually improve the performance of the databases and avoid hitting this sort of C10k, you know, connection—issues with having larger numbers of open sockets. The important part of this, why it’s relevant, is that we actually did this work twice. The environments are different enough, in our move to the cloud to Kubernetes, that we ended up doing this work once in containers, using Kubernetes services, both cases using ProxySQL. And then we also did it once in the data center with Chef-managed hardware, bare metal, virtual IPs.

It was worth putting in engineering effort to effectively build a new feature in a data center which was only going to leave live for a few more months.

The people problem

And it’s not just databases that doubled in the last two years. So has the team. Again, it was very early on. It was tempting to say that the people who joined the team after this transition shouldn’t be burdened with having to learn how the data center used to work. But that is a mistake, which we rectified. We went down that route. And what it leads to is an increased burden on the people who are the old guard, who have been there forever and know how the old systems work. But also it stops people who are new and being on-boarded from feeling like they’re having the same impact as people who’ve been there. So while you’re running two systems, while you’re Roman riding, I think it’s important that everyone in your team feels empowered to work on both the old and new systems.

Building trust

Okay, so this was the starting point of our failover automation. Someone would get paged. Someone would perform some actions directly on the database. And then eventually—sorry—someone performs some actions in the database. Again, as I mentioned at the start, these are things which felt like they were ripe for automation, but they were also the things which were the highest possible risk to Shopify. They could cause large outages. They could cause data corruption, which as I mentioned is the worst thing that can happen.

The easier part was setting up code to perform these automated failovers. The hard part is actually building trust in those tools that you feel that you can take a publicly-traded company’s data, and hundreds of thousands of merchants’ data, and trust it to a tool that’s going to make decisions on your behalf. And do something which used to always be done by DBAs who know what they’re doing.

So how can you use verification to build trust? I mentioned earlier that Ghostferry is formally verified. It does have the advantage of having a fixed set of inputs and a fixed set of outputs, so it’s easier to do that. In reality, our application, like a lot of applications, is basically a distributed system made up of other distributed systems, which break. And all the interesting things happen during the points where they break.

One way to start is to include a person in the decision. This is Orchestrator. It’s a failover tool written by Shlomi Noach at GitHub. Earlier in the year, when MySQL went down, someone would still get paged, and they would go to this web page, and there would be a button. And they click the heart button and that recovers MySQL for them. Why make someone click the button? Well, sometimes they clicked but and the wrong thing happened. Sometimes they clicked the button and nothing happened, and sometimes the button didn’t appear. In each of those cases, we would run a root-cause analysis to figure out what went wrong with this.

Similar to the NASA talk that we all went to earlier. It would be, “What went wrong?,” “What went right?,” “How can we prevent this from happening?” After having a string of these, or after having a long period of time where we didn’t have to run one of those investigations, we felt like now is the point where we can be confident in letting computer press the button for us. Spoiler. What do you do when you have told the computer to perform failovers for you, but it can do that at a speed that you’re not expecting? So you have given it a clearly defined task to do, and it’s going to do it at a scale and speed that you weren’t expecting.

One pattern that we had, which Orchestrator also has, and other tools of ours do, is to use a cool-down timer. If you’re performing a dangerous operation, and you need to perform—If you just performed a dangerous operation, and need to perform another dangerous operation after, then don’t. There is a five-minute window where after doing the database failover, if the database tool thinks it needs to perform another failover five minutes after the first one, the chances are the automation is broken, and not that another machine is broken. That’s a window and a risk that everyone’s going to need to tune for their own organization. What we don’t want to have is 100 machines fail, or think that they failed, and 100 failovers happen, which is a disruptive process for our merchants and for our systems.

Holistic checks

If you want to be more resilient to unreliable or asymmetric networks and network partitions, you can check the whole health of the service from multiple points. This part of the database cluster, this is a traditional check-based monitoring system, we’re checking writer, can we reach it on 3306? If that check goes red, the easy thing to do is to go, “That writer is down. We should perform a failover.”

What we actually do is, we’ll check with the things which are connecting to that and go, “Can those things see that the database is down?” In this case we’re talking to the replicas, and if the replicas are able to talk to the master, then we can consider that the master is up.

Moving from this low-latency and reliable network to the cloud, which has its advantages and disadvantages, meant that we could no longer assume that, just because you can’t reach something, does not mean that it’s actually down. How many failures do you need before you page someone? If you’ve ever configured an active monitoring tool like this, you’ll know that you have a trade-off between two out of the last three failures, or x out of the last n failures is enough to wake someone up.

What we have moved towards, and it’s like a Datadog plug, is actually monitoring the outcomes of our application. How many times does a database report an issue? How many times does the Rails worker having an issue connecting database? Those are the things that should wake someone up, not whether port 3306 is reachable by Icinga.

Sanity checks

So instead of reacting to the latest state change, our tools also do a sanity check to make sure that the state of the whole system is basically the same where it can perform safe action. If a single database node fails, it’s pretty reasonable to mark it as failed and take—yeah. If every database node fails, there is not necessarily a safe action, and the correct thing to do in this situation is nothing. And even, going back to when the first node fails, that may be a monitoring glitch. Like, just because something, in this case, our service discovery, reports the database has gone away, doesn’t necessarily mean that it has. So we actually check the database health directly, and we will not remove a healthy database, even if service discovery thinks that it is down. And then we page someone to let them know that the automation has failed.

You will fail (and that’s okay, sometimes)

I’m going to have to wrap up. I’m a little over. Pretty much the important thing to take away is that automation is a matter of trust, and that trust can go up and down. Things will fail, and that has to be acceptable. You set an error budget, you set an SLO with your stakeholders, and you say that this is acceptable or this is not. When things fail, you run root-cause analysis into why did this happen. And it’s very possible at some point, even after adding automation to your system, that you realize your failure rate is too high, and that you need to take something which has been automated and put a human back in the mix. I’m just going to skip to the last couple of slides, and say thanks.