
Building a culture of shared outages at Segment


Published: July 12, 2018

Building intuition

So, I’d actually like to start off by sharing a little bit of the genesis for this talk, which came from my friend Evan, who’s on the data team at Stripe.

Now, Evan’s been at Stripe for about six years, and the company is well over 800 people now, but he joined when it was just 15 or 20.

And Evan and I were trading stories and strategies about what makes for a good on-call setup, and he told me some interesting things about how he runs the on-call for their data team.

And the data team, in particular, is responsible for all their core data that lives in Mongo. It’s things like payments, charges, etc.

So, every time they have an outage, they’re potentially losing millions of dollars in transactions.

And Evan walked me through their process, where he said, you know, “We do all the regular things: we have fire drills, gamedays, etc.

But there’s really one big hurdle for new on-calls, which separates the folks who’ve been there for four or five years from the people who are new.”

And that’s this concept they called building intuition—being able to take a bunch of different alerts and have a sense for what might be going wrong within your infrastructure.

So, over today’s talk, I’d like to explore that topic and ask ourselves, how can we build intuition? And particularly, how can we build intuition in an environment of rapid growth?

Where, maybe for a lot of you out there, the number of engineers on your team is doubling or even tripling year over year.

About Segment

So, for this talk, I’d like to first cover a little bit of the background, where I’m coming from, what our scale is at, what I’m seeing, then talk about the two major tools that we’ve used and had a lot of success with at Segment: developing entrypoints and exploratory tools.

Finally, I’ll wrap up with running through an outage in production and showing how all of these techniques come together.

And then, finally, ways that you, each of you out there, can reprogram your organization and take some of these lessons and implement them.

So, first off, a quick sense of scale. To give some background for those of you who might not be familiar with Segment, we are a single API to collect data about your customers.

This might be things like page loads; customers adding items to their cart; customers, in our case, signing up, logging in, inviting teammates; if you’re running a music app, maybe it’s something like playing songs or adding items to a playlist.

Segment takes all of that data about your users that you send once and we send it to over 200 different tools: places like Salesforce, Marketo, Google Analytics, a data warehouse.

And how that manifests is it looks kinda like this: we take data from various sources and move it over to various destinations.

By the numbers, that means that we’re processing roughly 300 billion events every single month, with peak incoming request volume of about 350,000 events per second.

On the outbound side, that translates to well over 400,000 outbound HTTP requests every single second, sending data to hundreds of different services and thousands of different customer-supplied webhook endpoints.

To run all that infrastructure, we’re running on 16,000 containers, primarily run within AWS using their ECS service. And we’re running for over 4,000 customers, with tens of thousands of free users who are sending data.

Scaling infrastructure + on-call

And while that sounds like a lot of infrastructure, where it gets even trickier is how we’re growing as an engineering team. Today, we’re about 70 engineers. This graph’s a little out of date, but by the end of the quarter, we’re trying to get up to 80, and then, we’re likely to double over the next 12 months.

And what we’ve been struggling with is that when infrastructure and personnel both change at the same exact time, building that intuition becomes really hard; you have to invest a lot to do it.

So, what should we do about it?

Well, there’s actually a lot of good material out there and kind of a standard answer to this question.

Generally, people will tell you, “Well, you have to really start with observability. You have to understand what’s going on in production.

From there, you can get to a root cause.

Then, once you understand the root cause, you can get to a fix.”

And fortunately, there’s a lot out there on observability.

There’s a lot of good blog posts—people have done a lot of thinking about what you should measure, what’s important, how to reason about it.

But I think there’s still a piece missing from the discussion, and that’s that, underneath observability, you have to understand what your systems are even doing.

If you don’t have that, then you might have all the observability and instrumentation in the world, but you aren’t really going to understand what it all means, how to reason about it, and how to put the systems together.

So, for the focus of this talk, I’d like to share some techniques that we’ve learned at Segment for building that understanding.

Tool 1: Entrypoints

So, the first weapon that we found really effective in our quest to build understanding is entrypoints.

In particular, entrypoints which tell our engineers: one, how does the system even work when you have hundreds of different microservices? Two, what should I be looking at? And three, how do I solve the problem when I’m getting paged at 2:00 a.m.?

Onboarding

Now, I’m going to start with the basics, which is probably something that all of you do for engineering onboarding.

For onboarding, we didn’t have a set program until about six months ago.

And, at this point, we actually did a really major rev on it to answer some of these questions.

We started with general onboarding, where you kind of learn the toolset, you understand what’s going on, but then, a new engineer actually meets with every single team every two weeks.

These are automatically assigned, they’re rolling sessions.

Once it’s been created, the process just sort of runs itself.

And, for us, what that looks like is a set of paper documents, which people go through, and they understand, “Hey, here’s what I should be doing in terms of onboarding; let me stand up my own service.” But they also meet with members of each team so they can understand how their team fits into the whole.

Additionally, each engineer goes through a set of detailed architecture diagrams so they can understand, on a quarter-by-quarter basis, where we’ve been and where we’re going.

Of course, that base level of knowledge doesn’t give you greater insight, right? There’s only so much that you can learn within a two-week period.

Directory

And so, the next question engineers usually have is, “Well, how does X system work?”

Like, how does the outbound part of Segment work versus the API, versus the deduplication system in the middle?

And for that, we built a second piece of tooling that we call our Directory.

Now, this is a system that isn’t open source yet but we’re hoping to open source soon, which you can think of as the entrypoint to various systems within Segment.

You can click over to a tab for all the systems and search by each one, in particular, depending on what it is that you want to find out.

And I should note here that this system—or Directory itself, the tool—is grouped by systems because, oftentimes, the system will comprise 3, 5, 10, 30 different services.

And so, it’s important to give your engineers a good logical grouping for what a system is.

In our case, here, we have this new visibility feature, which we just rolled out and created a Directory entry for, and there are a couple of main entries in the Directory that I wanted to call out.

The first is that we link to high-level documentation on what the system is and what it’s supposed to be doing.

We’ve included both API-level documentation in case you want to access it, but also internals in case you’re operating it.

We also link to the code and configuration for that system.

So, if you’re thinking, “Hey, I understand at a high level what the system does, now let me dig in and really read the code and see what’s been going on recently,” this is where you go.

And then, finally, we have links to runbooks and monitoring. These tell you, “There’s something going wrong—what should I be looking at?”

Additionally, we pull in data from a bunch of different third-party tools that we use.

One of them is ECS, which we’re using to actually run all of our services.

The second is RDS.

But we also embed stuff like Datadog graphs, which give you a high-level view that this system is working properly.

Finally, we also embed the system diagram, so that if you’re an engineer who’s on-call, you can actually reason about that component and understand what’s going on.

So, Directory has helped solve this problem of getting to the next level of depth and fidelity within our organization.

It’s a curated entrypoint split by individual systems that embeds a bunch of the popular tools that we’re using, that engineers are already producing documentation in, but this pulls them into one place so they’re actually useful.
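To make that concrete, here is a rough sketch of what a single Directory entry might hold, written as a Go struct. The field names (and the visibility example above) are hypothetical, since the tool isn’t open source yet; the point is just that each entry groups a logical system with its curated entrypoints.

```go
package directory

// Entry is a hypothetical sketch of one Directory record: a logical system
// (which may span many services) plus the curated entrypoints for it.
type Entry struct {
	System       string   // logical grouping, e.g. "visibility"
	Services     []string // the 3, 5, 10, 30 services that make up the system
	DocsURL      string   // high-level and API documentation
	InternalsURL string   // internals docs for the people operating it
	RepoURL      string   // code and configuration
	RunbookURL   string   // what to do when something goes wrong
	MonitorURLs  []string // Datadog monitors and dashboards
	DiagramURL   string   // the embedded system diagram
	ECSServices  []string // services pulled in from ECS
	RDSInstances []string // databases pulled in from RDS
}
```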

Characteristics of good alerts

The next question, once you have a sense of the system, is, “Hey, there’s an outage, what things should I be looking at?”

And there’s actually a very simple answer to this that for a long time we didn’t leverage effectively, and that’s thinking of alerts as entrypoints, too.

Now, I was thinking about showing an example of a bad alert here, but I’m sure all of you have been paged at some point by a bad alert: one that woke you up in the middle of the night, told you nothing about what was going on in the system, and left you stumbling around trying to figure out who created it, and why, in the first place.

But we found that actually good alerts can be significantly better for very, very little effort.

In particular, we found that a two-sentence description at the top of the alert, just telling you what’s going on, helps lead you towards the problem.

We link directly to the runbook so you can see exactly what’s going on, and we also point you towards dashboards and a one-liner to pull logs for that service.

This way, if you’re an engineer who’s on-call, and maybe you’re not familiar with the system, in a case where minutes matter, you’re able to pull the information you need quickly.

Additionally, in many of our alerts, we’ve categorized previous root causes and included the places where we think the system might be going wrong, along with the adjustments you might want to make, based upon prior outages that we’ve had.

For some of these, we’re automating our way through them, but for the ones which don’t have automation yet, updating the alert itself is a really simple way to get a similar level of quick response.

So, good alerts tell you where the problem might be, point you at other places, give lots of links, they list potential mitigation options, and most importantly, they tell you what to look at.

I always say that good alerts are the ones that you want to read at 2:00 a.m. If you have to be woken up, the alert might as well be good.
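As a hypothetical illustration of that shape (not a real Segment alert), here is the kind of monitor message body we aim for: a two-sentence summary, a runbook link, a dashboard link, a log one-liner, and previously seen root causes. The service name, URLs, and commands are all made up.

```go
package alerts

// AlertMessage is a hypothetical example of the structure described above: a
// two-sentence summary, a runbook, a dashboard, a log one-liner, and
// previously seen root causes. The service, URLs, and commands are made up.
const AlertMessage = `
The dedupe service is falling behind on its Kafka partitions, which delays
end-to-end delivery. If the lag keeps growing, customers will start to see
late data.

Runbook:   https://internal.example.com/runbooks/dedupe-lag
Dashboard: https://internal.example.com/dash/dedupe
Logs:      logs dedupe --since 15m | grep ERROR   (hypothetical one-liner)

Previously seen root causes:
- A stuck partition after a broker restart (restart the affected consumers)
- Under-provisioned workers during a traffic spike (scale the ECS service up)
`
```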

Finally, there’s the question which I always have for the first time that I get paged: “What changed?”

And for that, we keep a couple of systems around for basically changelogs.

At a top level, if you’re using ECS, you can use this dashboard that we built called Specs.

It gives you a first high-level window into the exact services that are running in production, what versions of containers, what tasks, how many of them.
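I can’t show Specs itself here, but as a rough sketch of the idea, this is roughly the data it surfaces, pulled from the ECS API with aws-sdk-go. The cluster name is hypothetical, and pagination and batching are elided.

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ecs.New(sess)

	cluster := aws.String("production") // hypothetical cluster name

	// List the services in the cluster, then describe them to see which task
	// definition (i.e. which container versions) each one is running and how
	// many tasks are up. Real code would paginate ListServices and batch
	// DescribeServices in groups of ten.
	list, err := svc.ListServices(&ecs.ListServicesInput{Cluster: cluster})
	if err != nil {
		log.Fatal(err)
	}
	desc, err := svc.DescribeServices(&ecs.DescribeServicesInput{
		Cluster:  cluster,
		Services: list.ServiceArns,
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range desc.Services {
		fmt.Printf("%s\t%s\trunning=%d desired=%d\n",
			aws.StringValue(s.ServiceName),
			aws.StringValue(s.TaskDefinition),
			aws.Int64Value(s.RunningCount),
			aws.Int64Value(s.DesiredCount))
	}
}
```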

Dashboards as entrypoints

The second, which is a little more general-purpose, is this idea of a superdash: a single dashboard as an entrypoint for your organization.

We found that a lot of the time, it’s not the configuration which changes, but instead, some difference in your load pattern.

In order to get that difference at a high-level view, there’s probably three to five top-level metrics that will influence your thinking on how to deal with an outage.

I’ve heard at Facebook, it’s just like pure number of 200s coming back from their edge. In our case, it’s end-to-end latency across the pipeline. That’s the only high-level metric that we really care about.

Additionally, we use tools like Terraform internally to manage all of our change control. We also use Terraform Enterprise, which will give you a nice dashboard into which systems have been changed and what’s been applied recently.

And we also set up a specific system of change control, where every single change that goes into production actually has an accompanying JIRA ticket.

In our case, what we found is that JIRA tickets help us to think through changes before they actually happen.

For each one, you’re required to say what the change is, what you expect to change in terms of metrics or behavior, and then, what your rollback procedure is.
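As a tiny, hypothetical sketch, the required fields on one of those change tickets look something like this:

```go
package changes

// ChangeTicket is a hypothetical sketch of what we require on every
// production change: what it is, what you expect it to do, and how to undo it.
type ChangeTicket struct {
	Description    string // what the change is
	ExpectedImpact string // what you expect to change in metrics or behavior
	Rollback       string // the rollback procedure if things go wrong
}
```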

And actually, since we implemented this a month and a half ago, we’ve seen a greatly reduced number of outages.

The entrypoint stack

So, obviously, there’s a lot of tooling for this, but the most important thing in growing your on-call team is that you have a good entrypoint into what’s changed, not only from code and configuration but also in terms of data and load.

So, I would say this is our entrypoint stack.

They guide you toward solutions, they give you a little bit of a window, but point you toward many other places to dig in deeper.

We only have a handful of them at the top level so that every engineer can be well-versed and only have to keep track of five to 10. And then, we try to layer them effectively.

So, we have onboarding as this base. If people want to know more info, they move to Directory. If they are getting paged, they move to alerts, and then, they can move to changelogs, logging tools, exploratory tools, which I’ll cover in a minute.

And, at every step of the way, we put a premium on usability and developer experience.

Tools are only useful once they’re really usable and the team is working on them.

So, obviously, entrypoints alone won’t get us to this place of understanding, right?

That was the whole pitch at the beginning of the talk. You need something deeper.

Tool 2: Exploratory tooling

And once you have your base, the next part where we focus as an engineering organization is on exploratory tools.

You have the base, you understand how things should work, what about digging into that next level to understand what’s actually going on?

And, I’d say, the two places where we’ve nailed exploration are, first, combining multiple perspectives, which I’ll explain in a second, and second, making sure that our tools are both: A, composable, and B, shareable.

The importance of multiple perspectives

So, multiple perspectives.

If you only have one view of your data, it’s really hard to find something outside of that view.

And I’m kind of reminded of the children’s story, where I think there are maybe 8 or 10 blind mice who are all looking at this elephant.

And one of them says, “Oh, it’s a palm tree, I see this big trunk,” which is the elephant’s foot.

Another says, “Oh, no, it’s a rope,” and they’re only looking at the elephant’s tail.

Basically, you wanna have a bunch of different ways of slicing and dicing the same data so you can put it all together and you can really get a holistic picture of what’s going on.

And that’s the whole idea behind multiple perspectives: multiple ways that you’re measuring data, and multiple ways that you’re visualizing that data.

The customer view

And we see this across basically four levels of the stack.

At the highest level, we have the customer view, then we have client and internal views, and then we have kind of a system-level view.

So, starting with the customer view, I’d say the biggest learning that we’ve come away with from this past year is that, no matter what, you wanna be measuring how your customers measure.

If that’s their exact page loads and the JavaScript errors that they’re running into, that’s one thing. Or maybe, if you’re running an ad website, it’s the number of successful requests that they’re able to bid upon.

In our case, it’s end-to-end alerts for latency.

We want to be able to send data in and verify, using the exact same APIs that our customers are, that all that data made it back out.

So, we’ve actually invested a lot in this.

We created our own QA tool which treats Segment effectively as a black box.

It sends data in with request IDs which are correlated and it logs those to an attached RDS instance that it’s running.

And from there, we wait to see that data webhooked back out, so that we can actually ensure that we received 100% of it, that we received no duplicates, and that it made it there quickly and on time. We’ve set alerts on all of that.

And what’s great about this is it actually tells us, ahead of our customers, when something is going wrong so that we can alert them proactively, not the other way around.
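To give a feel for how a tool like that works, here is a compressed, hypothetical sketch of the loop: send events carrying correlated request IDs into the same public API customers use, then wait for each ID to come back out on a webhook we control. The real tool records state in an attached RDS instance and emits metrics; a map and print statements stand in for those here, and the endpoints and intervals are made up.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

type checker struct {
	mu   sync.Mutex
	sent map[string]time.Time // request ID -> time sent
}

func (c *checker) send(id string) {
	c.mu.Lock()
	c.sent[id] = time.Now()
	c.mu.Unlock()
	// POST the event to the same public API customers use (omitted here); the
	// payload carries the request ID so it can be correlated on the way out.
}

// webhook receives events that made it all the way through the pipeline.
func (c *checker) webhook(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")
	c.mu.Lock()
	sentAt, ok := c.sent[id]
	delete(c.sent, id)
	c.mu.Unlock()
	if !ok {
		fmt.Println("duplicate or unknown event:", id)
		return
	}
	fmt.Printf("event %s delivered end-to-end in %s\n", id, time.Since(sentAt))
	// In the real tool, this latency is emitted as a metric and alerted on.
}

func main() {
	c := &checker{sent: map[string]time.Time{}}
	http.HandleFunc("/webhook", c.webhook)
	go func() {
		for i := 0; ; i++ {
			c.send(fmt.Sprintf("qa-%d", i))
			time.Sleep(10 * time.Second)
		}
	}()
	http.ListenAndServe(":8080", nil)
}
```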

Additionally, for each one of these places where we’re measuring in the customer view, we put it up publicly on our status page.

This is hosted by Statuspage.io, and the metrics are powered by Datadog, but this gives us the level of transparency that our customers are demanding, and it also holds us up to a higher level of service.

The client and internal views

Second, we have the client and internal view.

And this is basically saying, “Okay, instead of the system just working overall properly, what’s the next level?”

And, for these, we use a couple of different techniques, but the biggest one is by publishing metrics via StatsD.

In our case, we use a combination of the Datadog Agent and this tool that I have up here called Veneur.

Veneur is a drop-in replacement for the Datadog Agent from a StatsD perspective, and it’s incredibly fast.

Which is important if you’re publishing data from hundreds or thousands of containers on an instance, like we are, where you might have a lot of contention on that single process.

The way this works, from an under-the-hood perspective, is that each of our programs, or each of our containers, is outfitted with a stats library, and they’re publishing UDP packets into each of these StatsD sinks.

And then, from there, the StatsD sink is sending that data up to the Datadog API at the same time that our Datadog Agent is actually posting those checks to the API as well.

What this does is it gives us the ability to correlate those two metrics and effectively have programs self-report on everything that’s going on.

And because we use StatsD as our common interface, it means that everyone is recording metrics in roughly the same way.
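For reference, the self-reporting side of that looks something like this inside a service, using the datadog-go StatsD client to send UDP packets to whichever sink (Veneur or the Datadog Agent) is listening locally. The metric names, namespace, and tags are hypothetical.

```go
package main

import (
	"log"
	"time"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// Each container sends UDP StatsD packets to the local sink listening on
	// the instance (Veneur or the Datadog Agent).
	client, err := statsd.New("127.0.0.1:8125",
		statsd.WithNamespace("pipeline."),           // hypothetical prefix
		statsd.WithTags([]string{"service:dedupe"}), // hypothetical tag
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	start := time.Now()
	// ... handle a message ...
	client.Incr("messages.processed", []string{"topic:events"}, 1)
	client.Timing("messages.latency", time.Since(start), nil, 1)
}
```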

The system view

Finally, there’s the system view.

And these are all the metrics that are automatically pulled via CloudWatch as well as metrics pulled by the Datadog Agent itself.

This also comes in the form of a pprof view, which we’ve built our own server for.

For those of you who are running Go programs, this will give you an automatic window into what a program is doing based upon its pprof endpoint.

It’s one of the most powerful tools that we’ve adopted, and I’d highly recommend that anyone running Go set this up, because it gives you information in places where stats won’t.

Let’s say that your program is hung, waiting on a lock somewhere, or perhaps it’s unresponsive because it’s using too much CPU. This view will give you that window in places where your stats or logs might be missing.

It can tell you exact Goroutine stack traces, it can tell you memory usage, it can tell you lock contention. It’s incredibly powerful.
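For anyone who wants to do the same, exposing that endpoint in a Go service is only a few lines via net/http/pprof; the port below is arbitrary.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve pprof on a separate, internal-only port. From here you can pull
	// goroutine stacks, heap profiles, CPU profiles, and mutex contention,
	// e.g. with: go tool pprof http://localhost:6060/debug/pprof/profile
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the service runs as usual ...
	select {}
}
```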

And by combining those perspectives, it gives us a window into the full system.

Composable views

What about making these tools composable and shareable?

It’s kind of the next step, where once you understand what’s going on, you’ve dug a little deeper, you’ve cross-correlated against multiple views, now you want to actually be able to tell your teammates about it.

And for that, we have this principle that each view of your data should be easy to export and put into another view.

That takes a couple of different forms.

For logging, we found that form to be just via the CLI.

If your log provider doesn’t have a CLI tool, you’re effectively limiting the lingua franca of what your developers use every single day.

We’ve basically found no better interface for developers than just grabbing logs quickly and being able to pipe them through grep, cut, jq—whatever it is that you wanna be using—and we’ve open sourced a library to do this with CloudWatch Logs.
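As a generic sketch of the same idea (not Segment’s actual library), fetching recent logs for one service and printing them so they can be piped through grep or jq might look roughly like this against the CloudWatch Logs API; the log group name is hypothetical.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := cloudwatchlogs.New(sess)

	// Fetch the last 15 minutes of logs for one service and print them to
	// stdout, where they can be piped through grep, cut, jq, etc.
	start := time.Now().Add(-15 * time.Minute)
	err := svc.FilterLogEventsPages(&cloudwatchlogs.FilterLogEventsInput{
		LogGroupName: aws.String("/ecs/dedupe"), // hypothetical log group
		StartTime:    aws.Int64(start.UnixNano() / int64(time.Millisecond)),
	}, func(page *cloudwatchlogs.FilterLogEventsOutput, lastPage bool) bool {
		for _, e := range page.Events {
			fmt.Println(aws.StringValue(e.Message))
		}
		return true // keep paging
	})
	if err != nil {
		log.Fatal(err)
	}
}
```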

Datadog Notebooks

The second place that we like for exploring is Datadog Notebooks.

This is probably the most underleveraged and underused Datadog feature I know of, but what it does is allow you to combine metrics in a really lightweight way when you’re trying to explore and say, “Here’s everything that I’m seeing in this particular notebook,” kind of like an IPython Notebook. “Now, let me use those metrics and make some guesses by seeing the multiple views on them.”

And what this does, by combining our views and pulling all this data into one place, is give us a kind of dashboard we’ve never had before.

We’ve never been able to see what’s going on in our programs, versus what’s going on from an EC2 perspective, versus what’s going on from JMX perspectives for things like Kafka.

But now, we can.

We can see that Kafka itself is producing individual metrics at kind of a program level, we’ve got service-level metrics which are coming from ECS, EC2, and these other services, and we’ve got host-level metrics which are coming from the Datadog Agent.

And all of these are combined on a single dashboard to give us that full 360-degree picture.

Shareable views + tools

The second piece which I’ll only touch on a little bit is just whether these views are shareable, which we also put a very high premium on.

We wanna make sure in each case that it’s really easy to answer the question, “Are you seeing what I’m seeing?”

And so, for that, we put these in Slack, obviously, but we also have shared alert channels.

So, all alerts for a given team always go to the same channel so everyone on that team knows what’s up. And additionally, PagerDuty alerts go to that as well.

So, if you want to see who’s getting paged, when, why, everyone has the same access to that.

I already talked about the shared dashboards that we keep, but we also do shared tooling, which I think is something that’s underleveraged across the majority of organizations that I’ve seen.

In our case, we’ve set up our own Homebrew installation for automatically installing tools on macOS, and we’ve set up similar tooling via various packaging libraries for Linux, so that developers can just run brew install and immediately get the same tools that everyone else is running.

So, these are the three pillars of our tooling here.

First, we have the entrypoints to build that understanding. Second, we have the exploratory views to dive in a level deeper. And third, we need to share that tooling so that everyone on the team understands it.

A real-life example: Incident 8

So, to see how all of this works in an actual real-life scenario, I’d like to discuss “incident 8” that happened maybe a month and a half ago.

And as a reminder, Segment takes all this data from various sources and sends it to various destinations.

And most of our customers, as long as we do that within an hour or so, they’re totally fine with it.

But there’s a certain amount of customers who actually rely on Segment as an internal pipeline or internal service.

And for them, they’re monitoring us constantly, where they’re sending in data, making sure it comes out the other side.

And if it goes more than 20 minutes before they receive their data, they’ll actually block their entire build system.

And if it gets bad enough, they’ll escalate, and it’ll block every engineering team’s build system.

So, an outage of this magnitude is really bad.

The symptom: End-to-end delivery time spiking

In particular, like I said, this is the metric that we pay attention to most: the time to deliver data.

Most of the time, it’s a couple of seconds, but, occasionally, it can go wrong.

And so, at about 9:20 a.m., we actually get this alert.

Remember, this is measuring from the customer’s point of view, where it says, “Hey, this end-to-end data looks bad.

It seems like it’s gone above three minutes, which is our paging threshold—seems like things are going wrong and not on the road to recovery.”

Issue 1: DNS cache memory leak

So, we start digging into the problems, and actually, the first problem we’re able to spot pretty easily: it happens with DNS.

We’ve recently migrated over to a new DNS cache service called CoreDNS, which runs on every instance.

And, as it turned out, CoreDNS had a memory leak that would only show up after a little bit of time.

And so, all of our instances were running out of memory at about the same time. It would crash the process—suddenly, DNS timeouts were happening to every single service in our infrastructure.

So, that problem was really easy to identify. We kind of got the alert for it.

So, we made a change which we posted in a Slack channel and said, “Hey, just switch out the base image that we’re using to use this newer good version of CoreDNS.”

And Achille, right in that same alert channel, says, “Hey, I’ve got this—I’m booting 10 new instances with the new AMI. Everything should go back to normal.”

Just so we can categorize follow-ups, I end up posting in this #announce-sev channel, which actually everyone in the company is hanging out in, or at least all the customer-facing folks plus engineers, and let them know, “Hey, there’s something going on with our infrastructure, you should all probably understand it.

And if it affects you, you can join the Slack channel where we’ll be keeping ongoing discussion.”

Issue 2: Rate limiting on AWS APIs

With that, we set this description using this tool called Blameless, which I’ll cover in a second, and we think the problem has gone away for the most part, until we hit problem number two, which is actually rate limits which come back from the AWS APIs.

So, not long after that, because all these services are timing out, we start seeing really anomalous traffic.

And this is the volume of data flowing through on a per-second basis.

Obviously, this graph is not a good graph.

And what we find is that we’re actually having problems making changes to our infrastructure, because we’re seeing rate limits within the AWS UI, and those rate limits are also causing problems booting up new containers within our infrastructure.

And so, because of a combination of these factors, we’re getting even more and more delayed.

So, we try to scale up, and we loop in some more of the on-call folks, folks from the platform team, folks from the destinations team, and more and more people are starting to join this outage and understanding what’s going on.

Issue 3: Deadlock in transaction database

And as we’re doing that, we actually hit this third problem, which happens because we’re cycling through these periods of having very few containers able to run, and then very many, which suddenly overwhelm what we call our transaction database.

Now, to understand how this works, Segment has this single transaction database which is keeping track of messages which are read off Kafka, and those messages are sent into this other system to deliver data.

And we’re basically using this database to keep track of Kafka ranges, where we say, “Hey, here’s 100 messages, send them along. We checkpoint them, they’re good, here’s 100 more.”
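Very roughly, and with a completely made-up schema, that bookkeeping looks something like the sketch below: record a range of offsets in the transaction database, hand the batch to the delivery system, and checkpoint it once it has been sent. Note that if delivery stalls while the transaction is open, the range stays stuck, which is the shape of what we saw next.

```go
package checkpoints

import (
	"context"
	"database/sql"
)

// checkpointRange is a hypothetical sketch of the bookkeeping described above:
// a batch of Kafka offsets is recorded in the transaction database, handed to
// the delivery system, and marked as checkpointed once it has been sent. The
// schema, states, and batch size are made up.
func checkpointRange(ctx context.Context, db *sql.DB, partition int, lo, hi int64,
	deliver func(partition int, lo, hi int64) error) error {

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// Record that this range is in flight.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO ranges (partition_id, lo, hi, state) VALUES (?, ?, ?, 'pending')`,
		partition, lo, hi); err != nil {
		return err
	}

	// Hand the batch (e.g. 100 messages) to the downstream delivery system.
	// If this stalls, the transaction stays open and the range stays stuck.
	if err := deliver(partition, lo, hi); err != nil {
		return err
	}

	// Checkpoint the range so the next batch can be handed out.
	if _, err := tx.ExecContext(ctx,
		`UPDATE ranges SET state = 'checkpointed' WHERE partition_id = ? AND lo = ?`,
		partition, lo); err != nil {
		return err
	}
	return tx.Commit()
}
```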

And what we were finding is that actually some of these transactions were hanging around for 3, 5, 10 minutes, which meant that this range of messages was just not being processed.

And what we found is that, okay, it seems like there’s actually a deadlock, which is happening due to too many pending transactions, which we found through Datadog.

We posted to Slack and tagged it there, and, as an important point, once we knew about it, that got auto-recorded by Blameless.

And we thought at first, “Hey, this is definitely due to client timeouts.”

It turns out that the database itself has a different timeout than the client, and the database was setting this timeout for eight hours while the client was disconnecting, and it would result in this deadlock where the database wouldn’t properly give up the connection.

So, we make the change, Jeremy jumps in, has a change that everyone can see, everyone can understand. It gets logged, but it seems like that’s not quite everything.

The smoking gun

And so, Rick over here asks, “Hey, do we have this database access and status?”

And we paste in the thread and that’s where we actually find the smoking gun.

We see, hey, what’s actually happening here is that we’re waiting for this table metadata lock. And it turns out that if you have this create table if not exists—which we have in a number of places in our infrastructure—and a bunch of containers all come up at once and try to run that command, it will actually deadlock the database because, for each one, it has to take a lock around this metadata.

And if you’re creating multiple of those in a single transaction, it’s possible that you’ll just remain locked for a while.
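To make the pattern concrete, here is a hypothetical sketch of the kind of startup code that bites you: every container runs the same conditional DDL when it boots, and each statement has to take the table’s metadata lock. The driver, DSN, and table are assumptions, and the usual fix is to run this kind of DDL once, from a migration step, rather than in every container at startup.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // assuming a MySQL-compatible database
)

// A rough illustration of the pattern: every container runs the same
// "create if not exists" DDL at startup, inside its own transaction, so when
// dozens of containers come up at once they all queue on the table's metadata
// lock, and long-running transactions can hold everything up.
func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(db.internal:3306)/pipeline") // hypothetical DSN
	if err != nil {
		log.Fatal(err)
	}

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(
		`CREATE TABLE IF NOT EXISTS ranges (
			partition_id INT,
			lo BIGINT,
			hi BIGINT,
			state VARCHAR(16)
		)`); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
}
```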

And so, Rick finally identifies this as the root cause. Achille says, “Yep, we can fix that,” and we manage to start recovering. But obviously, there’s a large amount of lag here.

And so, if we actually look at the overall damage from this incident—which we’ve also gotten from our Datadog monitor here, where we can actually check our SLA—from the time period of about 9:15 until 1:15 p.m., some amount of data was delayed for up to an hour and a half, which is way above our SLA.

And after the dust had settled and we dug into why, it came down to this: one bad DNS deploy caused containers to crash due to timeouts, which caused rate limits against AWS services, which, when everything came back up, suddenly caused this transaction timeout, which caused multi-hour delays.

Problems are no longer simple

And that’s really the hard part of running a distributed system like this, right? Problems are no longer simple.

You might have one problem that leads to another problem that causes this pathological behavior among different parts of your system that you didn’t even anticipate.

And what’s more, the alerts for this, when every alert is firing at once, can basically be red herrings.

And at a time when minutes matter for this data, it’s really hard to know what’s going on.

And how do we solve this complexity problem?

Well, in our experience, it’s building that understanding.

It’s something we’re still getting better at and we’re still trying to improve as an engineering organization, but, going forward, we’re trying to experiment with ways that we can further reinforce that behavior.

Lessons and takeaways

And so, I’d like to end with a few lessons that we’ve learned and, hopefully, takeaways that you can use and borrow throughout your own organization to improve your uptime.

And I’d say the two biggest ones, or the biggest improvements that we’ve had over the past six months, are, first, around how we run retros, and second, how we run incident review.

Running retrospectives

So, for retrospectives, the number one goal of any retrospective should be, do we all understand the root cause?

And this sounds fairly simple and benign, but most of the time, I walk out of a room, and I’d say 7 out of 10 people understand the root cause, but a handful still don’t.

And only when you get to that point should you focus on, “What should we do about it?”

So, the first step is, we always just build the model. Someone will whiteboard and explain, “Here’s the different systems, here’s what they do, here’s how they interact.”

And then, second, we’ll actually establish the timeline for the outage.

We’ll say, “Hey, here’s exactly what happened, here’s the commands that were run based upon our knowledge of the situation.”

In our case, we started using this tool called Blameless to actually autocomplete that timeline for us. The tool listens on whatever Slack outage channel we’ve created, and, in many cases, it will actually pull out the most important pieces of that discussion.

So that if, later, you’re trying to find the signal amongst all the noise and the false paths which people went down, you can use it to get kind of a clean view of the situation.

It’s still early, but it seems really promising.

Then, finally, we will establish the root cause.

And for this, typically, we’ll create a Datadog Notebook which has, again, for the outage period, the exact metrics we were looking at and what caused us to believe that was the root cause.

And then, we’ll document it.

And the documenting is really important because five, six months later, you want to make sure that you actually knew what the root cause was.

Incident reviews + prioritizing cleanups

The second big improvement that we’ve had in our outage process in terms of upgrading reliability is our incident review meeting.

And this is a weekly meeting which happens, where we review the biggest outages and the biggest issues that we’ve run into.

And there are a couple of important pieces to it.

The first is that, for every single outage, we take stock of what the action items are and we make sure that we actually do those action items.

Typically, we’ll create JIRA tickets, which each have a due date that we’re going to do them by, as well as what we call a DRI or a directly responsible individual. And this person is basically on the hook for making sure that these changes happen.

And when it comes to priority, for a long time, we’d punt on these follow-ups: we’d say we would do them, and then we wouldn’t actually do them.

For all of you, as leaders of your engineering organizations, I think it comes down to setting a clear priority.

And the only prioritization that we’ve found to work is basically putting this follow-up and cleanup above everything else that drives your engineering organization.

So, you’ll put new features aside, you’ll put other work aside, you’ll say, “No, instead, the one thing that we’re doing this week is focusing on the cleanups.”

And only once you’ve gotten there, then start doing maintenance and fixing bugs—and then, from there, new features.

Conclusion

So, overall, these are our lessons learned from dealing with hundreds of outages over the past six years.

We try to build understanding above all else. It really just solves a bunch of the problems related to new engineers trying to understand what’s going on quickly.

Our biggest wins come from focusing on the entrypoints. Provide user-friendly places where your engineers can go to understand the system, how its pieces interact, and where the problems are.

Provide multiple perspectives for exploration.

If you only have one view of the data, you’re probably missing something.

Invest in tooling and training on that tooling from day one for new engineers. And good UX actually matters a ton.

If a tool isn’t usable, your engineers aren’t going to add to it, they’re not going to develop it, they’re not going to edit wiki pages, whatever it is.

Any knowledge you get is going to get lost because it’s not easy.

And then, finally, the checklist and process should really reinforce what matters.

If reliability is the most important thing to your organization, that’s the place you should focus.

So, to leave you all, I’m hopeful that these lessons can be taken advantage of at all of your organizations as well, and that together, we can actually build a more reliable internet.

Thank you.