Well, hi, everyone. I’m so excited to be here.
Thank you all for coming. How are you all doing? We’re excited to hear about some system crashes, a little bit of chaos.
I’m gonna tell you about two system crashes.
It’s specifically about a system, a tool, a solution that we built and rolled out to deal with what we thought was mostly just a lightweight problem, just your standard thing, and the accidental chaos that followed along with it.
But before we go on, let me introduce myself, give a little bit more context. My name is Bonnie. I use she/her/hers pronouns, and I’m a Software Engineer at Flatiron Health.
Who here has heard of Flatiron Health?
Great, okay. Well, for those of you who don’t know, Flatiron Health is a health tech company based in New York. And what we do is we leverage technology to better transform how cancer is treated and understood.
And at Flatiron, I work on what we call the Linked Data Infrastructure team, Link Infra for short.
Before I was at Flatiron, I was a student at Yale, and I was studying Architecture and Computer Science.
Our tool and the problem
And so, what was our tool and the problem that came along with it?
Well, in order to really understand, like, our tooling, our infrastructure, everything that goes on, you really need to understand Flatiron’s mission, how we function, and our needs as a company.
And this is Flatiron’s mission: to improve lives by learning from the experience of every cancer patient.
That can mean creating software for electronic health record systems or dealing with and leveraging and analyzing patient data.
And since we’re dealing with a lot of patient data, that means we have to be HIPAA compliant.
There’s a lot of rules and regulations in place that deal with patient data, so we have to follow all that.
And not only do we as employees have to be compliant, all of our technology has to be as well. So there’s a certain subset of all the great open-source and other tech that’s out there that we are allowed to use.
One of the main ones that we use is AWS, Amazon Web Services. Some of you are familiar with it. Some of you guys just met it yesterday at the game day.
But we use it a lot. One of the main features is their EC2 instances, which if you don’t know what that is, it’s basically just like VMs.
But also, because of this HIPAA compliance that affects our technology, we also have a lot of homegrown tech that we make ourselves. It looks really similar to other open-source software out there, but we build it and make it ourselves.
So my team, Linked Data Infrastructure, we support a subset of Flatiron tech teams called Real-World Evidence, RWE teams, and those teams are a bunch of engineers. They analyze and process a lot of data.
So day in, day out, the main feature of their job is they’re running data pipelines, ETL pipelines. They’re doing it like all day, every day.
A peek beneath the hood of Crane
So, we build tooling for them. And meet Crane.
This is one of the tools that we built for them. You can see it has this really cute little UI. And it’s our internally-built tool. It’s a Flask application. And its sole job in life is to just run data pipelines day in, day out. That’s all it does year after year the whole time that it’s been up.
And it’s really great for our engineers because it allows them to run data pipelines remotely. They don’t have to worry about a massive data pipeline running on their local computer and killing it. They can just submit it to Crane, let it do its thing, go and find it later, get the results, etc.
The way it was set up is that it has N EC2 instances that we call workers. It’s a static number, so probably just 20 instances, up at all times, just running all of these jobs.
And there’s a single primary machine that takes in jobs from users, runs them, schedules them to workers, and that primary also services the UI.
So to go into just to visualize that architecture a little bit more, we have that primary instance. Like I mentioned, it services the UI. It also handles all the scheduling logic, all of like the logic of displaying it, etc.
And it’s an EC2 instance.
And there’s also a series of workers, right, N number of workers static. Like pretend this is 20. I couldn’t fit 20 on the screen. They’re also all EC2 instances.
And it doesn’t change unless we’re, as an engineer, sitting there bringing one up or taking one down.
And so every 10 to 20 seconds, a worker is asking the primary, “Is there a job I can take? Is there something for me to do?”
It asks this all the time, no matter what, even if it isn’t able to take any jobs. And the primary checks for, basically, three things.
Either there are no jobs available; or you can’t take a job because you have too many jobs right now, you’re at maximum capacity; or yes, here’s a job. Take the job.
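To make that polling exchange concrete, here’s a minimal sketch of what the primary’s answer might look like. The names and the two-jobs-per-worker capacity are my assumptions for illustration, not Flatiron’s actual code:

```python
from collections import deque

MAX_JOBS_PER_WORKER = 2   # assumed capacity: each worker runs two jobs at a time

queue = deque()           # unscheduled jobs held by the primary
jobs_per_worker = {}      # worker_id -> number of currently running jobs

def handle_poll(worker_id):
    """The primary's answer to a worker's 'is there a job for me?' ping."""
    running = jobs_per_worker.get(worker_id, 0)
    if running >= MAX_JOBS_PER_WORKER:
        return "at capacity"           # you're at maximum capacity
    if not queue:
        return "no jobs available"     # nothing to do right now
    job = queue.popleft()              # yes, here's a job; take the job
    jobs_per_worker[worker_id] = running + 1
    return job
```

The key point is that the same three outcomes fall out of two cheap checks, which is why the primary can afford to answer every worker every 10 to 20 seconds.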
And so, users or engineers or downstream clients, they interface solely with the primary. They don’t see any of this background logic with all the workers and the scheduling and asking for jobs.
All they see is that UI. All they see is the results of their jobs and the logs that they can get from that. They don’t see any of this other stuff.
And so, when a worker is asking for a job, it’s still doing this.
Let’s say this user submits a job and they submit it to the primary. They can either do that through that UI we just saw or through the command line.
And the primary sees that it’s a job, and a worker is asking for a job, so it assigns the job to a worker. The worker runs it, takes however long to execute it, and then returns an exit code to the primary machine. The exit code will either be a success or a failure.
If it’s a success, then it also returns an ID for that job it just ran so that whoever ran it can like access it, look at the logs, look at it in more detail, and see what happened.
This architecture was fine. I mean, it was more than fine, it was great. I mean, we still use it, but like it wasn’t an issue until something started happening over the past year.
And in less than a year, those real-world evidence teams that we service, that we support, that depend on this tool every day, they went from having 12 engineers to 26 engineers, which not only means that there’s more engineers, but also means that the load on Crane dramatically increased.
Where Crane began running into problems
And so, we began to see things like this Slack message on top, of someone asking at 4:00 pm why their job from 8:30 in the morning still hadn’t run. Or we’d see this queue length from the UI saying, “You still have 200 jobs that haven’t been run.
It’s all backed up, and people have been waiting eight hours for it.”
This happened because our data pipelines are really spiky. And so what that means is that most of our data pipelines, they take less than a minute to run, like less than 60 seconds, super quick, super easy, no big deal.
But there are some that take like hours to run, as in like, eight hours. You kick it off in the morning, you come back the next morning, and then you can get your results then, and like that’s how long they take.
And when you run all those jobs at the same time, those big huge jobs are inevitably going to be blocking those tiny little jobs.
And therefore, you’re getting all these tiny jobs backed up or you’re having people who can’t run their jobs at all for eight hours because these massive pipelines are taking up all the resources that Crane has.
But then, those are on the days where everything’s spiking. It’s a spiky day.
On non-spiky days, the machines, they’re just running some of those quick jobs, but otherwise, they’re just kind of chilling. They’re not doing anything. They’re not running. We’re not using the resources. So we’re wasting money on all of these machines that aren’t being utilized properly.
What are some possible solutions?
So, as an engineer, Crane user, you have two options.
One, you can do that whole song and dance of waiting for your job to run hours and hours and just hoping that it gets run in time before you go home for the day, which is not the best solution.
It’s not really viable, especially the more and more this happens, or you can file a ticket with my team asking us to add workers, more instances to the Crane fleet.
Also, not a great thing because now not only are you waiting and like wasting time, we’re also sitting there bringing up these machines.
And also now, these downstream clients are dependent on another team in order to get their work unblocked, and that’s not an ideal situation at all.
And so, we desperately needed a solution to this problem.
AWS Auto Scaling
And that came in the form of AWS Auto Scaling.
This is a feature in AWS EC2, perfect for us because Crane is all just EC2, so it fits right in.
And it will automatically resize your EC2 fleet, your Auto Scaling group, based on a set of predetermined conditions that you get to control.
It helps with fleet management and it replaces unhealthy instances. So now you’re saving money because not only do you not have machines that you no longer need, but you’re not wasting money on machines that aren’t functional at all.
And so, this is hands-free EC2 management. And this was a perfect solution for our spiky pipelines because now whenever we had a lot of jobs and there was this huge queue, we’d just bring up a bunch of workers, and it’d be great, and they would run all the jobs.
But when there weren’t any jobs being run, then we would just have no workers. We weren’t wasting money on idle machines that were just sitting around doing nothing.
And so, Auto Scaling plus Crane, we adorably dubbed the project Auto Crane and still call the system as such.
And this is how it’s set up. And this is how it exists now.
There’s a Crane primary still. It’s a single EC2 instance that services the UI, schedules all the jobs, gets the jobs back, etc.
And then now there’s the Auto Scaling group.
So that N number of EC2 instances, those 20 instances we saw before, they’re now handled by the Auto Scaling group, but instead of being 20, it’s variable. It might be 20 one minute and then like five the next minute.
And so, this, as a unit, was just called Crane.
Auto Crane’s architecture
And every five minutes in order to control this Auto Scaling, we had a Python script kick off. And within this Python script, we had a method called The Reporter.
It gets a report on the status of Crane: it pings the Crane primary every five minutes and asks, “What’s your status? What’s going on?”
And the primary would respond with…its fleet status.
It returns the number of unscheduled jobs it has (the queue length), the number of empty workers it has, the number of empty job slots that exist in the system, the number of job slots per worker, and the current number of active workers, the current number of EC2 instances in the Auto Scaling group.
The Reporter takes all this information and passes it to the decider, which makes a decision on what to do with this information.
And the decider then makes a decision based on a series of if/else conditions, the first being: do we have enough empty slots to handle the queue, the number of unscheduled jobs, and also maintain a reserve pool of instances (just in case we get like 100 jobs submitted between Auto Scaling script runs)?
If the answer to that is, no, scale up by however many machines you need to address this first condition and make this condition true.
Else, do we have an empty machine that we can terminate, and can we scale down while maintaining this first condition?
If yes, you scale down by one machine. Otherwise, you don’t need to scale. You’re in the current proper state.
The decider passes the delta of machines that it wants to scale by to the Scaler.
And the Scaler takes it and sets that desired capacity on the Auto Scaling group, directly in AWS.
And if the decider has said that we’re terminating a machine, then it tells the Crane primary the worker ID of the machine being terminated.
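Putting the reporter/decider/scaler pieces together, the decision step might be sketched like this. The reserve size, slots-per-worker value, and function names are assumptions for illustration, not the real configuration:

```python
SLOTS_PER_WORKER = 2   # the talk's workers each run two jobs
RESERVE_SLOTS = 10     # assumed reserve pool, in case jobs land between script runs

def decide(unscheduled_jobs, empty_slots, empty_workers):
    """Return the delta of workers to scale by: positive = up, -1 = down, 0 = hold."""
    needed_slots = unscheduled_jobs + RESERVE_SLOTS
    if empty_slots < needed_slots:
        # Scale up by however many machines cover the shortfall.
        shortfall = needed_slots - empty_slots
        return -(-shortfall // SLOTS_PER_WORKER)   # ceiling division
    if empty_workers > 0 and empty_slots - SLOTS_PER_WORKER >= needed_slots:
        # We can terminate one empty machine and still satisfy the first condition.
        return -1
    return 0
```

Note the asymmetry: scale up by as much as needed at once, but scale down by only one machine per run, which matches the decider logic described above.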
This is what Auto Crane looks like now. This is not our first iteration. This is what happened after a lot of testing, a lot of failures, a lot of confusion around what was going on in the best-case scenario.
But we made all of these design choices based on our preparations leading up to this.
And so, obviously, a lot of technical preparations went into place.
Avoiding job loss
And one of the main questions that we faced was about job loss.
How do we avoid job loss, where job loss is: a job gets submitted and it just disappears? It goes into the ether; somewhere, somehow, it got lost.
And so, a hypothetical scenario of how this could happen goes as such.
The Auto Scaling script, it kicks off, it does what it’s supposed to do, it’s running every five minutes, and the primary machine then sends back its state because this is how it goes.
And then the script says, “Crane, you have too many machines. You don’t need all those machines, so let’s scale down.”
And the script chooses a machine to terminate.
AWS starts a termination process for this machine because we’ve sent that information that we want to kill the machine to AWS, and it kicks off the termination.
While this termination process has started, the primary schedules a job on the machine that’s about to be terminated.
And while that machine is starting to do this job, the machine gets killed by AWS and the job is lost.
This is one scenario. There’s a lot of different ways where this can happen, but this is one of the main ones that we drew out, and this is the biggest one that we are concerned about.
So how do you avoid this scenario or any kind of job loss from happening?
We ended up putting in three preparations.
One, you only kill empty workers. That way, you don’t have to worry about a job being on it: is it going to finish in time? Do we have to drain this worker? What’s going on?
Two, you don’t let the primary assign jobs to terminating workers. That’s why we send the primary the ID of the worker that’s about to be killed. We tell it what worker is about to be killed so that it knows not to assign jobs to that worker.
And three, this one came a little bit out of left field. It’s not something that was in our Auto Scaling code or those preparations in that Python script, but it came from how AWS Auto Scaling is set up.
When you tell AWS Auto Scaling that you want to kill some number of machines, it chooses a random selection of EC2 instances to kill and just kills those. You don’t really have any control over it, which would render our first two preparations moot.
And so what do we do about that? Well, in AWS Auto Scaling, you can also protect your instances, meaning those instances can’t be killed, unless you run a specific command to unprotect that instance, and therefore it can be killed.
So, we set our entire Crane worker fleet to be protected by default.
And that way, you can only kill the machines that we have chosen based on the first two conditions when we run that command to unprotect that instance.
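The “protect by default, unprotect only the chosen victim” flow can be sketched like this. The `FakeASG` class here is a stand-in I wrote so the sketch is self-contained; in the real system these would be the AWS Auto Scaling API calls `set_instance_protection` and `terminate_instance_in_auto_scaling_group` (via something like boto3), and the helper names are made up:

```python
class FakeASG:
    """Stand-in for the Auto Scaling API, just to show the call sequence."""
    def __init__(self, instances):
        # Every instance launches protected from scale-in by default.
        self.protected = {i: True for i in instances}
        self.terminated = []

    def set_instance_protection(self, instance_id, protected_from_scale_in):
        self.protected[instance_id] = protected_from_scale_in

    def terminate_instance(self, instance_id):
        if self.protected[instance_id]:
            raise RuntimeError("cannot terminate a protected instance")
        self.terminated.append(instance_id)

def scale_down_one(asg, empty_workers, notify_primary):
    """Kill one empty worker: tell the primary first, then unprotect, then kill."""
    victim = empty_workers[0]     # preparation one: only ever kill empty workers
    notify_primary(victim)        # preparation two: primary stops assigning to it
    asg.set_instance_protection(victim, False)   # preparation three
    asg.terminate_instance(victim)
```

With protection on by default, a randomly chosen instance can never be killed; only the machine that passed the first two checks ever becomes eligible.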
And, you know, beyond all this job loss questioning, we also stress-tested the system.
We ran our biggest data pipelines, made sure it could do it in a timely manner, made sure that like other things could run at the same time as this.
Some other considerations for EC2
We answered a couple of questions like how do you launch 50 EC2 instances at the same time with unique but human-readable, understandable, and easily searchable URLs, or how do you launch these machines in a timely manner?
Because if you have a job that takes 60 seconds, and you’re bringing up a machine to address it, and that machine takes 20 minutes to come up, it doesn’t really make sense in the context of that 60-second job.
So you want to launch your machines as quickly as possible.
And beyond technical things, we also prepared a lot of communications.
This is our doc that we wrote on how we planned on productionalizing this system.
There’s a couple of things in it, but you can see that we have an exact schedule for testing, when we’re rolling it out, monitoring, and when we’re going to finally deem it, “This is good. We’re going to take a step back.”
It covers how Auto Crane is technically implemented and the technical concerns that came with it, like the race conditions, the job loss, and all the concerns about terminating Auto Scaling instances.
And then our roll-out plan. What were we going to actually do to roll it out? What were the preparations we were taking? What were the communications that were going on?
And we wrote this and we communicated it to our downstream teams. We sent it to them in both Slack and email.
Making sure Crane users are all on the same page
And we try to communicate this as much as possible to those teams that we’re improving Crane for because they relied on Crane every day.
We don’t want to roll out Crane and then break it when they need it the most when they’re relying on it really heavily, when they’re trying to do things within Crane.
And in this communication, we also included a rollback plan for Auto Crane.
Basically, what are we going to do if something goes technically, horribly, incredibly wrong with Auto Crane when we roll it out?
Basically, our decision in that was, we’ll just keep the Crane fleet, the static fleet, as it is. We’ll just push it to a different URL. It will live at that URL. Auto Crane will live at that main URL.
And then if something goes horribly, horribly wrong, we’ll just switch the URLs. Minimal downtime, easy fix.
And after a couple of weeks of sending a lot of updates, updating the doc, telling people, “Auto Crane is coming, get ready!”—we rolled out Auto Crane.
And in our initial roll-out, we ended up having a bug. And the symptom of that bug was the exact kind of job loss that we had worked so hard to avoid and had written a complete section, its own section, in the doc for.
The cause was that third preparation, the one that came from AWS rather than from our own Auto Scaling code. We had accidentally launched the whole Auto Scaling group without protecting any of those instances.
So despite the fact that we were telling Crane, the primary, “We are killing this specific worker,” the Auto Scaling group was just like, “Nope, we’re gonna kill this other one, because that’s just the one that I found.”
We rolled it back. We executed that rollback plan that we described under Communications.
It went really well. We’re super proud of ourselves, and we fixed it. We fixed the bug, we made sure that that bug was not gonna happen again, and we rolled it back out.
And after a couple of hours of monitoring, everything seemed very okay.
Some unexpected obstacles
At this point, I want to say life was good, but the title of my presentation implies otherwise, and so, these are the bad, unexpected things that happened.
The first thing I’m gonna share with you, it’s not necessarily bad, but it was incredibly unexpected for us.
This is a graph of our Auto Scaling group at any given time.
The X-axis is dates. This is a week in December, a couple of weeks after our launch.
And then the Y-axis goes from zero to 50, it’s a number of EC2 instances in the Auto Scaling group.
The blue line, the really spiky blue line, those are the number of EC2 instances at that time.
And the solid red bar here, that’s the number of machines that we had before Auto Crane, that static number, which is like it’s just a solid little brick.
And this solid red line here that’s going up and down a bit, that’s the average number of machines across each day.
You can see here that the average number of machines is higher than what we saw with our static Crane fleet. The number of jobs that people submitted to Crane drastically went up once we rolled out Auto Crane, because they now started to see Auto Crane as this reliable, robust system that can handle all their jobs, that will execute all their jobs, and now they can really happily use it.
And that was a mistake on our part because our understanding was, “Oh, our downstream teams, they just want to use more Crane sometimes.”
It turns out they just wanted more Crane all of the time.
And so, this wasn’t a bad thing. We’re pretty proud of ourselves. I mean, great. Our teams are really happy.
But then Crane began going down, and Crane began failing because we never expected the need for Crane to scale the way that it does and did with Auto Scaling.
This giant text block here, this is part of the JSON file that stores Crane state.
Basically, all the workers, all the jobs, the state of those jobs currently, which workers have what jobs.
That JSON file was okay with a couple of hundred jobs per day. With Auto Scaling and thousands and thousands more jobs, a couple of thousand jobs per day, it drastically blew up.
And the primary machine began failing, because this JSON file went from being tiny to being an 8.1 megabyte JSON file that the primary has to serialize and deserialize for every single worker ping, every 10 to 20 seconds.
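A quick back-of-envelope shows why that kills the primary. The 8.1 MB figure is from the talk; the 150-worker peak and the 15-second polling midpoint are my assumptions for illustration:

```python
STATE_SIZE_MB = 8.1    # size of the Crane state JSON file (from the talk)
WORKERS = 150          # assumed peak fleet size after Auto Crane
POLL_INTERVAL_S = 15   # workers poll every 10-20 s; use the midpoint

polls_per_second = WORKERS / POLL_INTERVAL_S
mb_serialized_per_second = polls_per_second * STATE_SIZE_MB
print(f"{polls_per_second:.0f} polls/s, "
      f"~{mb_serialized_per_second:.0f} MB of JSON handled per second")
```

So under those assumptions the primary is churning through tens of megabytes of JSON every second just to answer polls, before it does any actual scheduling work.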
The UI went down, people couldn’t submit jobs. It was a whole mess.
It was not a great look, but eventually, you know, we’re like, “Okay, we have this system of this is how we’re gonna deal with it. Here’s our solution for the short-term, also, here’s our solution for the long term.”
And everyone was really happy with it.
Oh, and there were other issues. The workers also weren’t built to scale this way. There were issues with setting up the Python environments and where the packages were installed, and some race conditions around there.
But like I said, we found solutions to those. We presented those to our downstream teams. They were very happy with the timelines and the solutions that we came out with and we moved on.
Auto Crane’s first Git outage
And that brings us to our first outage, which is our internal Git server. We have an internal Git server at Flatiron.
And, you know, we’ve rolled out Auto Crane.
And the day after the launch, a couple of days after the launch of Auto Crane, people began to complain, “I can’t use Git right now. Git fetches aren’t working. Git pull isn’t working. Everything related to Git is not working.”
Everyone was concerned. I mean, for Git to go down, that’s kind of a big deal.
And so that became an incident, but then it resolved itself after 20 minutes. Everything just kind of blew over.
Some initial investigations happened, but we were like, “Oh, I guess Git was just processing something really, really big, but it got through it. Now, we’re moving on.”
Several days later, Git went down again. For Git to have gone down twice in such a short period of time in the same manner that it had the last time when this had never ever happened before, it needed a closer examination.
Even if it was going to resolve itself, at this point, why would you just let it sit and maybe it’ll resolve itself after 20 minutes, but then it’s gonna happen again? You don’t wanna do that.
And this was brutal for us because engineers fundamentally cannot do work anywhere that requires Git. That means testing on your virtual machines that you need to push changes to, or uploading new code that you need to roll out. Anything like that was disrupted because of this Git outage.
And one of these Git outages ended up coinciding with one of our new hire classes, and especially at Flatiron, if you’re an engineer on-boarding on your first day, on-boarding fundamentally requires that you have access to Git so you can clone all of these very important repos that you need to also set up your machines.
Not a great first impression and not great for us, and we also can’t do work.
And this whole time, our team is just kind of sitting there and we’re like, “Interesting.”
The first time Git went down we’re like, “A little bit of a coincidence but like, coincidence? It resolved itself. We don’t need to worry too much about it, right?”
No, no, no. And then Git went down again, and once Git went down again, we’re like, “It’s a coincidence, probably not a coincidence. We should investigate this for sure at this point.”
And lo and behold, we found that Auto Crane had caused our Git server to go down.
This is what it looked like. The top, that’s on our actual internal Git server. You can see that the CPU is maxed out. The memory is also maxed out. And from this lovely Datadog dashboard, you can also see that the CPU usage, while spiky before, suddenly, just spiked and did not go back down. It just maintained being maxed out.
And the memory, I don’t know if you can see this, but also had a spike after being fairly low.
The clash between Auto Scaling and Git
So what happened? What did we find when we did that investigation on Auto Scaling and Git in that connection?
Well, Auto Scaling logic dictates that when there’s a number of jobs in the queue, you should bring up however many machines you need to address that number of jobs in the queue.
So, for example, if the queue is a hundred jobs long, then we would bring up 50 machines because each machine, each worker, can handle two jobs. So 50 times 2 equals 100, it covers all those jobs.
Since workers poll for jobs every 10 to 20 seconds, within 10 to 20 seconds of the workers being brought up, they start asking the primary for jobs, and all of those hundred jobs are scheduled.
And every time a job starts, the worker does a Git fetch, one fetch per job. So that means a worker can be doing two Git fetches at any time. It doesn’t matter whether this is the first job this worker has run, or the 200th, or the 50th. Every time it runs a job, it does a Git fetch for that job.
And so if we’re using the example from before where we have a hundred jobs in the queue, we just brought up 50 machines to address all of those, now, in 10 to 20 seconds, the Git server has just gotten a hundred requests for a Git fetch, and this happened.
Pretty easy way to take down Git. I recommend not doing that. We fixed this. I mean, we talked to our DevOps team, and we fixed it. Our temporary solution was, basically, we will not scale up by however many machines that we need. We’ll scale up by, at most, five machines because Git can handle that kind of load. We’ll be conservative about it.
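The temporary fix is simple to sketch: keep the decider’s math, but clamp each run’s scale-up. The cap of five is from the talk; the function names and the two-slots-per-worker figure are assumptions:

```python
SLOTS_PER_WORKER = 2       # each worker runs two jobs, so two Git fetches
MAX_SCALE_UP_PER_RUN = 5   # conservative cap so the Git server survives

def scale_up_delta(queue_length):
    """Workers needed to cover the queue, clamped to the per-run cap."""
    uncapped = -(-queue_length // SLOTS_PER_WORKER)   # ceiling division
    return min(uncapped, MAX_SCALE_UP_PER_RUN)

def burst_git_fetches(new_workers):
    """Git fetches hitting the server within one 10-20 s polling window."""
    return new_workers * SLOTS_PER_WORKER
```

With the hundred-job queue from before, the uncapped decider launches 50 workers and fires roughly 100 near-simultaneous fetches; the capped version launches 5 and fires about 10, trading queue latency for a Git server that stays up.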
It’s better than just killing every single internal server that Auto Crane needs to access.
In the meantime, they will be working on a solution for it to make sure that we can scale it by however much we want, but then also Git won’t go down.
Ultimately, ironically, their solution to this problem was adopting our auto-scaling model and using it for Git. Great for us.
Losing all hosts
And so let’s fast forward six months to our next outage, which is almost all of our hosts.
At Flatiron, our current paradigm for launching EC2 instances is: you go to your computer, you go to your terminal, you type in your command to launch an EC2 instance, and a Slack channel that tracks your EC2 instance launches, well, everyone’s EC2 instance launches, will then tell you when it started provisioning the machine, give you the IP, and tag you to make sure that you definitely know this is happening.
And then when it finishes, it’ll tell you when it finished and whether it was successful or not. So we can see here that we brought up a bunch of machines. I mean, this is exactly what it looks like.
There’s an IP address, a hostname starting with linkfra-eng, and then however many minutes later, it succeeded. It’s done. All hosts were manually launched.
An engineer was sitting there in front of their computer making sure that everything went up properly, everything was dealt with properly, like that these machines came up fine. They are waiting for the succeeded message.
Auto Crane was the first example of this not being the case. It was the first example of EC2 instances being launched automatically, without an engineer behind them, and no one is monitoring every single launch every five minutes.
And it also logged to the same channel as these manually-launched hosts, and people were kind of like, “Well, that sucks. It’s a little bit annoying. It’s spamming the channel, and everyone just enters and leaves it as soon as they can. But whatever. Shrug. Let’s move on.”
One day, one fateful Tuesday at 5 p.m. (it was actually at 5 p.m.), we noticed that the Slack channel went from looking like this to looking a bit more like this.
Unfavorable, definitely not what you want to see. Definitely, not what you want to see.
And this was really bad because you would scroll up in the channel and there were dozens and dozens of Auto Crane machines that are failing to launch.
And our team began investigating, and we figured it was an Auto Crane problem because we had no other context. I mean, at 5 p.m. on a Tuesday, no one else is launching EC2 instances and testing them and making sure they work.
And so all we saw was all of these Auto Crane failures.
In the end, what we found was that it actually was something that affected all of our hosts at the most basic level.
There’s a part of our provisioning where it pulls a key to verify a package, and that method of accessing the key used to verify installed packages had changed.
And so this wasn’t an Auto Crane problem. It wasn’t my team problem. It wasn’t a Flatiron problem. It was literally anyone in the world who was trying to access this package’s problem.
And so unintentionally, Auto Crane had become this canary in the coal mine.
You know, we obviously had other monitoring in place to track like, “Oh, machines are going down.”
And also, when manually launched machines are the only thing you have, if someone notices that their machine goes down, then they’re right there on hand to say, “I noticed that this is happening. Why does it keep happening? I’m just going to go investigate.”
But Auto Crane showed more of these failures than any of the monitoring did, and more than any manually-launched instance would show, because it was launching automatically, all the time, every five minutes.
If any change happened in between, then Auto Crane would catch it immediately.
And it was significant because all of this alerting happened on such a massive scale, because it was failing with so many instances all at once.
Normally, you expect a little bit of smoke like one or two people’s instances failing, and they would raise something, and then more complaints would start to trickle in.
Auto Crane, there was no smoke. It was just full-out wildfire. That was it.
And that’s what it ended up looking like. And so, two pretty major system outages.
Some relevant takeaways
Hopefully, we learned some things, and we did. And hopefully, you will take these and make sure that this does not happen to you.
And the first lesson that we learned, and that I found really important, happens a bit on a micro level. It’s really important to take time to understand the importance and impact of your changes, and not just for yourself, not just your team, not just for your downstream teams, but for everyone across the company.
Both of these system outages had effects for everyone. Git outages don’t just affect my team and the teams that use Crane. They affected everyone, because everyone at the company used Git. Even the new hires had to deal with that outage.
So do EC2 instance launch failures. Everyone’s launching EC2 instances. That affects all of them.
And you’ll see that just all of these ripple effects began happening that weren’t just directly related to myself and the downstream teams.
And that feeds into the next point. Investment in infrastructure is crucial for development and innovation across all levels of the company because you cannot build a system that is more reliable than the infrastructure it’s built on top of.
So we can compare this to the idea of nines when it comes to high availability. For anyone who doesn’t know what that is: when you say something has some number of nines of high availability, you are promising that percentage of up-time for a given year.
So if something has one nine, it’s 90% up-time, two nines, 99%, five nines is 99.999%.
And to like actually understand what that means, if you have one nine, 90%, you’re allowed 36.53 days of downtime across the year.
If you have five nines, 99.999% up-time, you’re allowed 5.26 minutes of downtime across the entire year.
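The downtime budgets above follow directly from the definition. Here’s a minimal sketch of the arithmetic, using a 365.25-day year:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60   # ~525,960 minutes

def allowed_downtime_minutes(nines):
    """Downtime budget per year for a given number of nines of availability."""
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * MINUTES_PER_YEAR

# One nine (90%):       ~36.5 days of allowed downtime per year.
# Five nines (99.999%): ~5.26 minutes of allowed downtime per year.
```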
If the internet is only four nines, you cannot promise that your website on the internet, that needs the internet in order to exist, has five or six nines.
You cannot say that your website is gonna be more reliable than the internet that it relies on.
So similar to availability, we can’t build and innovate and push our systems and make sure that their scaling can handle all of this expansion if what we’re building on top of hasn’t been similarly developed to do so and handle that kind of usage and scale and development.
So infrastructure—investing in it—is vital to maintain any kind of engineering velocity because otherwise, everything that they start to build will begin to crumble because it can’t handle it.
And that’s, again, related to the last point.
I’m communicating all of that because so many of those ripple effects we experienced could have been avoided if the teams who own the infrastructure we were building on top of had known what we were going to do.
We had assumed a lot, but they’re more knowledgeable about what their systems can handle than we are.
A bit of a heavy point to end on, and I know I’ve just been talking about all the bad things Auto Crane has done, but, ultimately, Auto Crane really has been a huge success for us and for our downstream team and it has been amazing.
Before Auto Crane, we only had 20 machines up, and that usage, it wasn’t the greatest, but people tolerated it.
But now, every day, Auto Crane brings up 150 machines to handle all the load that people are throwing at it.
And since we launched Auto Crane in November, we’ve brought up over 32,500 machines to handle that load.
It was originally a lightweight solution to something that was a blocking annoyance to our downstream teams and to us.
It ended up becoming this heavyweight tool that’s extremely successful and has changed how Crane has become used.
Ultimately, a huge success.