Hi, I’m Ben and I’ve spent the last few years working on reliability and observability at Airbnb. Before I get into this more, I’d also just like to note that last week we turned off our Graphite main box that was servicing our StatsD. So just clicked the Terminate button.
Incident Response Team
So I’m going to talk about how our Incident Response Team grew out of our engineering org, and how its unique characteristics influence our use of alerting. We’ve been using Datadog for many years, and it’s become a core component of how we think about our monitoring.
So we’ve grown extremely rapidly as both a business and as an engineering team. As you can see from this chart, the uptick has been dramatic. It was easier than getting it cleared with press. This makes staying on top of things particularly challenging. So we’ve grown, this has shaped the way we do engineering, and it has influenced some of the structures that we’ve built to support it. In the early days, long before my time, you really only had one person handling systems and operations for what was fast becoming a tidy, little business.
This engineer was great, so things were usually fine unless he wanted to go camping or take a break, or really have any sort of life whatsoever. We’re a product-driven company, and so we were hiring more product engineers because that’s what we needed at the time. We weren’t focused on building out a large infrastructure team to pass the pager around. So it just wasn’t where we were. Still, it wasn’t fair for this one guy to always be on call.
So to help share the load, some other engineers volunteered to learn about our systems and how to operate them. They were then able to take over the pager. These engineers came from all different teams. They did not necessarily have any experience with incident response. Some of our most enthusiastic were front-end engineers, so this was pretty interesting for everyone. But there were plenty of opportunities to learn in these chaotic times, and so people quickly ramped up on skills and gained all of this experience and they dubbed themselves the “SysOps”. Hats were made.
So our growth frequently tipped things over. It was not uncommon to see this group congregate around a cluster of standing desks, known as “Standy Land”, and they were trying to figure out why the site was down and how to fix it. It was actually a pretty fun time, all things considered. People were just like yelling around, just like on a trading floor or something, just yelling theories and brandishing graphs, and eventually things worked out. So as new engineers joined the company, they’d see the spectacle and some would want to join. As it turns out, people will go to great lengths for hats.
So another aspect of this is that these volunteers were getting really good at operating our systems and responding to incidents. They were growing as engineers, and also taking that knowledge back to their regular work. This was going very well, so we kept it and built on it. We formalized the lessons into training sessions that covered topics like the different components of our infrastructure, our monitoring tools, and approaches to fix common problems. These trainings are held in our very serious-looking, Dr. Strangelove–themed conference room at the office, which is shown here.
It definitely makes the content seem way more serious, but it became apparent that the content here was broadly relevant. So we opened these trainings up to anyone with an interest, even if they weren’t planning on joining the on-call rotation. These days, we have about 50 people on the rotation, and 30 percent of engineering has attended the training. We also made the switch from hats to hoodies. We also, literally, as a group, went yak-shaving once.
Monitoring and alerting
So all of this does tie into how we do monitoring and alerting, for the simple reason that with 50 people on the rotation, any individual will only spend a few weeks out of the year on-call, and these tours may be rather spread out. This means that we can’t depend on institutional knowledge to know which pages are okay to ignore. We’d also probably have a harder time attracting volunteers if we kept waking people up at 3:00 a.m. All pagers go off at 3:00 a.m. It’s the rule. Don’t question it. This would be especially true if it’s because of disk full alerts on some misconfigured test box.
Additionally, the SysOp will very likely have never seen the particular alert before, and may not have even heard of the system that is alerting. But we still need to keep things under control and get a handle on it. As such, we have a very high bar for alerts of the pager class. We tend to reserve these for system or business metrics that show that we very likely have a problem. These include things like top-line application error rate, and things like reservations not going through or messages not being sent. We strive to keep these sensitive, but still to minimize false positives.
So we have a screenshot of our Slack notifications during an incident. Messages created dropped and new listings dropped are about as serious as they sound. So we’re glad to get that notification pretty quickly. But we also have a lot of non-paging, sort of informational alerts that can either inform us about cleanup tasks, or give us situational awareness when an incident is ongoing, so we can kind of see what else might be leading to that and get to a diagnosis faster.
So since the SysOp may lack context around the alert, we try to include all of that context in the alert message. This often comes in the form of links to other dashboards, to saved Kibana searches, or to runbooks. This will hopefully be enough to determine the impact, and how to either attempt a resolution or determine whom to escalate to. So in this example, we walk through the fairly convoluted steps of getting a backtrace with Ruby debugging symbols from a particular misbehaving process that just likes to stall out sometimes. Then, we list some common remediations. I’m sure you’ll never guess what the most common fix to these problems is.
So on top of all of this, individual engineering teams are also encouraged to set up their own on-call schedules and direct alerts of both the paging and non-paging variety at themselves. We’re constantly building new services and adding new metrics, all while our traffic and scale changes. We have to keep adding to and updating our alerts to keep pace with reality. So this ends up with a ton of people working on alerts, and also depending on them. It also makes common standards, shared knowledge, and review of how we’re doing things even more important.
Monitoring as code: Interferon
So to make alert management easier, one of my colleagues, Igor, wrote a tool called Interferon that allows us to manage our Datadog monitors via a repository of executable alert specifications. Monitoring as code. So we really like the pattern of storing as much of our configuration as possible as code. Almost everything that we might want to sync up with an external provider or even some of our internal configuration is at least stored in Git and is very often executable.
So this allows us to use the tools that we’re familiar with and already use everywhere else when we’re working on whatever. It also means that we don’t have to depend on our providers to implement things like versioning, search, or the right APIs to pull out all of the configuration values that we clicked in through the web interface. So we also like allowing the configuration to be executable, because it allows for more creativity. You can then adapt things inside of the actual alert specifications based on your needs at the time, rather than depending on the framework creator to have anticipated all of them.
We’ve seen in the past, when it’s been like pure data, people just generate ad hoc scripts to create massive amounts of data, and that’s hard to maintain and confusing as well. But allowing for code here also does allow things to get messier. You can’t just take from one JSON format, pass it through a simple transformation, and output in a new, clean format. You have to account for all of the different patterns and all of the arbitrary code that could be included in here.
That said, the tradeoff usually works to our advantage. Really, I can’t overemphasize how important grep and Git, and all of the ecosystem around Git, like our internal GitHub and pull requests, and all of that stuff is for knowing how things evolve and change over time. So Interferon is our DSL for specifying Datadog alerts. It’s open sourced on our GitHub, so you can check that out. But let’s walk through creating an alert with this framework. If you are familiar with Datadog’s monitor management interface, you might see some similarities.
So first, we define the metric. In this case, we’re looking at the average of bytes sent from each machine with the Chef role “thumbor,” and thumbor is an image resizing proxy. So then we set the alert conditions. These are thresholded as a 10-minute average of less than 200 kilobytes per second. If we’re serving less than that, we’re probably not serving images and that’s a problem.
So the Datadog UI actually has really good tools for exploring the metric versus the alert condition. So we were probably cheating in plugging that in to preview. So next, we need to describe what is happening in the alert. We go into detail with information on where to find logs, advice to try turning it off and on again, and a link to a higher level dashboard. Finally, we set who gets notified. This is important. Images are a huge part of Airbnb, so we want this to notify the SysOp’s PagerDuty.
But we also want to separately notify team members of the team that directly own this service. So we hit Save on this file, we open a pull request, deploy our changes, and then we’ll see that our monitor is synced up on Datadog. But things get way more exciting, because we can do more from inside of this alerts repo. We can write alert specifications that use additional information that’s gathered about our infrastructure from various sources that we might not have available in Datadog, or might not be as intuitive to use from inside of Datadog.
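The four parts of the walkthrough above (metric, condition, message, notification) can be sketched as plain data. This is a Python illustration, not Interferon's actual Ruby DSL. The class `AlertSpec`, the tag `role:thumbor`, the metric name, the notification handles, and the reading of "200 kilobytes per second" as 200,000 bytes are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertSpec:
    """Hypothetical container for the four parts of an alert."""
    name: str
    metric: str        # what we measure
    scope: str         # which hosts we measure it on
    window: str        # how long the condition must hold
    comparator: str
    threshold: int     # bytes per second in this example
    message: str       # context for whoever gets notified
    notify: List[str] = field(default_factory=list)

    def query(self) -> str:
        # Shaped like a Datadog metric-monitor query. Treat the exact
        # string as an assumption and check it in the monitor editor.
        return (
            f"avg({self.window}):avg:{self.metric}{{{self.scope}}} by {{host}} "
            f"{self.comparator} {self.threshold}"
        )


thumbor_alert = AlertSpec(
    name="thumbor is serving too little traffic",
    metric="system.net.bytes_sent",   # assumed metric name
    scope="role:thumbor",             # assumed tag for the Chef role
    window="last_10m",
    comparator="<",
    threshold=200_000,                # assumed: 200 kB/s as bytes
    message="Check the logs, try restarting the process, see the dashboard.",
    notify=["@pagerduty-sysops", "@media-team"],  # hypothetical handles
)

print(thumbor_alert.query())
```

Keeping the spec as a small object in a repository is what makes the pull request and review steps possible: the diff shows exactly which threshold or recipient changed.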
So one of our key concepts here is the host information source. We use this to pull in data about every instance, RDS database server, Dynamo table, and even far weirder things. Like we have some set of offline jobs that we know about and monitor, and pull in data as though they were hosts. So in this example, this is some of our DynamoDB monitoring code, and the key thing with Dynamo is that you provision the throughput. That’s the knob that you twist with this. Once you hit 100 percent, you have a bad time. But then, you just click a button and 15 minutes later it’s scaled.
So we set alerts when we go over 80 percent of the read capacity. We create these alerts for every table that we have, and this is automatic: we pull the list of tables from the AWS API and go over that. We can use the other information we get from the AWS API, like what region the table is in, to define the alert message. The really important thing here, though, is that we know what the provisioned read capacity is. So we can pull that value down, multiply it by 0.8, and thus know when we’ve crossed 80 percent of it.
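As a rough sketch of that per-table expansion, assuming the table list has already been fetched from the AWS API into plain dictionaries: the function name `dynamo_read_alerts` and the dictionary keys are hypothetical, and the real specifications are written in Interferon's Ruby DSL rather than Python.

```python
def dynamo_read_alerts(tables, fraction=0.8):
    """Build one alert per table at a fraction of provisioned read capacity.

    `tables` is assumed to be a list of dicts with the keys
    'name', 'region', and 'provisioned_read_capacity'.
    """
    alerts = []
    for table in tables:
        provisioned = table["provisioned_read_capacity"]
        alerts.append({
            "name": f"DynamoDB read capacity high on {table['name']}",
            # The threshold is derived from the live provisioned value,
            # so it follows the table when capacity is scaled.
            "threshold": provisioned * fraction,
            "message": (
                f"Table {table['name']} in {table['region']} is above "
                f"{fraction:.0%} of its provisioned read capacity "
                f"({provisioned} units)."
            ),
        })
    return alerts


example_tables = [
    {"name": "listings", "region": "us-east-1", "provisioned_read_capacity": 1000},
    {"name": "messages", "region": "us-east-1", "provisioned_read_capacity": 250},
]
for alert in dynamo_read_alerts(example_tables):
    print(alert["name"], alert["threshold"])
```

Because the threshold is computed rather than typed in, one specification covers every table, including ones created after the specification was written.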
Using tags for targeting
This also demonstrates for the notifications, we’re using tags that we’ve set via the AWS Console on the table to hold the team name and the names of other owners of it. We just plug those in to the notification settings to know who to bug about this. So this approach also allows us to code review alerts via GitHub. So in this example, an engineer had noticed that we didn’t alert on an error spike in an application, and proposed to set up alert conditions. Multiple members of her team reviewed and signed off on this change. So this helps get exposure to know that people are working on alerts, as well as helping make sure that we’re alerting on the right things and using common standards.
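The tag-based targeting mentioned at the start of this section might look roughly like the following. The tag keys `team` and `owners`, the comma-separated owner format, and the fallback handle are assumptions for illustration, not Airbnb's actual tagging scheme.

```python
def notification_targets(tags, fallback="@sysops"):
    """Turn resource tags into a list of notification handles.

    `tags` is assumed to be a dict of tag key to tag value, with an
    optional 'team' tag and an optional comma-separated 'owners' tag.
    """
    handles = []
    team = tags.get("team")
    if team:
        handles.append(f"@{team}")
    for owner in tags.get("owners", "").split(","):
        owner = owner.strip()
        if owner:
            handles.append(f"@{owner}")
    # An untagged resource still needs to notify someone.
    return handles or [fallback]


print(notification_targets({"team": "payments", "owners": "alice, bob"}))
print(notification_targets({}))
```

The fallback matters in practice: a table that nobody tagged should not silently end up with an alert that notifies no one.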
So this was overall a resounding success. We have 730 alert specifications that expand out to over 11,000 Datadog monitors. The internal adoption was actually such that we ended up having so many alerts that syncing them was running into performance problems. So one of our SREs, Jimmy, has been working on fixing this, and has pretty much solved the problem for us. He also wrote a parser for the Datadog query syntax so that we can catch syntax errors in alerts while we’re running continuous integration on them, before we push them out.
We’re still working on getting these changes cleaned up, but we hope to have these open sourced and pushed to our GitHub soon. So with all of these monitors, reducing alert noise is very important. It would really be terrible if we had 11,000 monitors triggering constantly. It takes very little noise before trust and the sense of urgency degrade. It’s also way easier to configure email filters than it is to go through, evaluate, and fix alerts. With alerts that page, at least they annoy people into fixing them and making them better. We also start from a smaller and more severe set to begin with.
Practicing alert hygiene
For informational alerts, it’s harder and it’s a constant struggle. Sometimes we’re doing very well with it, and other times we’re not. We found that the only way to really tackle this is to have someone who owns this, who’s constantly iterating on both the alerts and the thresholds, and following up, and making sure they’re relevant and they’re triggering for the right things, and just really watching to make sure that it stays useful. This is important, because no one likes filling out that line in the post-mortem where you discuss the relevant alert, the one whose trigger would’ve prevented the whole thing, but which was buried under the 50 other things that were irrelevant and also triggered at the time.
So one of my colleagues, Willy, has really owned this at Airbnb. It’s really advanced the state of the art for us. But there’s another reason that this needs to be owned, and that’s so that our monitoring can evolve with the new capabilities from the vendor. We use Datadog and a lot of whatever-as-a-service, in large part because their offerings continue to get better, even when we’re not looking. But we still need to incorporate these new features to take advantage of them.
So a great example of this is the support for the anomaly detection that you guys just heard about. So our metrics experience pretty severe seasonality layered upon an overall trend of up and to the right. So our metrics are highest on Mondays, when people are planning their trips. But gradually decrease as the week goes on, hitting their lowest points when people are actually on their trips on the weekend. You can see the difference between peak and trough is substantial.
So the anomalies function is particularly useful for the business metrics that we favor for paging. We’ve worked with Datadog to take advantage of this, and it’s quickly become the function of choice for a lot of the things that we page on in a lot of our most critical alerts. It actually took me a while to find an example of an alert that paged that wasn’t using the anomalies function when I was preparing this presentation. But the success case for this is kind of boring. It’s like we alert when we should, and more of the time we don’t alert when we shouldn’t.
Failures are more fun, and we actually ran into a great example of where our understanding of anomaly detection didn’t quite match up with reality and resulted in things that definitely should’ve fired not firing. This all occurred on Friday, when we were all being DDoSed by household objects. So before I go any further, the case we hit was well-documented and well-discussed, and you just heard about it a few minutes ago. We thought our understanding of it was complete, but it was not.
There are some subtle things here, and anomaly alerting requires a shift in mental models and in your mindset, compared with a threshold-based alert or a stable, percentage change–based alert. So anomaly detection is visualized as this gray band representing the expected range of values for a metric. So in this case, in the weeklong view, things look pretty good. The gray band hugs our metric, and we can see the anomaly on Friday during the Mirai DDoS. However, because of data-smoothing, this does not show the whole story.
It’s not just the shape of the line that is changed by the data-smoothing. It’s also the shape of the expected bounds. So we zoom in to a few hours around the incident, and things still look pretty good. The gray line is nice and tight, and we see the anomaly in the metric. It actually even comes out clearer here, so things are pretty good. But just to be diligent, the evaluation window for the alert is set to 30 minutes. So let’s zoom in even more just to check, and the anomaly has escaped our detection.
It’s escaped our detection even as the metric is hitting up against the origin. This illustrates that it’s not just the metric line that changes. It’s the bounds that change based on the time window. So what may be anomalous when rolled up may seem to be within the normal range when expanded. So the other thing to note is that the expected band in the week view is roughly a third as wide as the expected band when zoomed in.
So in the weeklong view, the band’s width is about 40 percent of the midpoint value of the metric over that time period, whereas for the 30-minute period the width is 130 percent. That allows the metric to go to zero without anything being out of place. This was just a choice of the wrong algorithm in this case. The previous expected band was generated with the Agile algorithm, and it’s designed to update quickly to account for intended level changes. It’s explicitly less robust in the face of longer-lasting anomalies, like what we experienced.
We’re also very likely in the worst case for this algorithm, as we were at lower traffic time and the degradation came on somewhat gradually. So we fix this by switching to the Robust algorithm, which is a better fit for what we wanted to accomplish, and the results are shown here. It does not dramatically adjust its values based on the ongoing incident. So this is a comparison over the weeklong view, and you can see the difference in algorithm style. When the metric is on the downswing in the Agile model, we give it plenty of room to continue on that trend, whereas Robust is keeping it a bit tighter and doesn’t expect it to deviate as much.
Learning from false negatives
But the real point of this is not that we misunderstood what the algorithms were meant to do. The real point here is that you need to examine and understand your false negatives. Or put another way, never let a good downtime go to waste. We would not have known that we weren’t protected by this alert if my teammate, Jason, hadn’t investigated what was going on. So perhaps it was a good thing that Twitter and Reddit were also having problems.
But I do wonder if there is like an opportunity for some sort of simulation or some scenario-based thing, so that when you’re setting these alerts you can kind of simulate how the anomaly detection will react to different sorts of circumstances. So when working with anomaly detection alerts, it’s probably a good idea to keep your existing alerts on the metric in place initially, or to even add some simple safety rails threshold-based alerts just to kind of catch strong deviations that you might otherwise miss. Then, during the evaluation period, when one alert fires but the other doesn’t, compare and see what was the correct behavior, and try to understand what was going on there. This really prevents being complacent about the alert that will never fire.
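One way to run that comparison during the evaluation period is to record the time windows in which each alert fired and look at where they disagree. This is a minimal sketch: the function name, the idea of labeling windows with strings, and the example data are all hypothetical.

```python
def alert_disagreements(anomaly_fired, threshold_fired):
    """Compare when two alerts on the same metric fired.

    Both arguments are sets of window labels (for example, timestamps
    rounded to the evaluation window) in which that alert triggered.
    """
    return {
        # The simple threshold caught something the anomaly alert
        # missed: a candidate false negative worth investigating.
        "threshold_only": sorted(threshold_fired - anomaly_fired),
        # The anomaly alert fired alone: either a subtle real problem
        # or a candidate false positive.
        "anomaly_only": sorted(anomaly_fired - threshold_fired),
    }


report = alert_disagreements(
    anomaly_fired={"window-03", "window-07"},
    threshold_fired={"window-07", "window-12"},
)
print(report)
```

Every entry under `threshold_only` is a case like the one described in this talk, where a strong deviation stayed inside the expected band.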
So looking towards the future, there’s still a lot that we can do to improve our alerting situation. There’s the obvious stuff like we can continue reducing noise and making our alerts more relevant, provide better context. But there’s also some bigger shifts that we can make. So we’re very much in the email and Slack world of notifications still. But we could take much more advantage of Datadog’s webhook alerting support, as well as their other integrations.
So I saw Cory’s JIRA tickets, and I want them. We frequently have these cleanup tasks that can be done later. They don’t necessarily need to be done immediately. The DynamoDB example that I showed actually would be a prime candidate. We only need to drop everything and handle it if it’s rapidly on its way to 100 percent. If it hit 80 percent, and it was at 79 percent yesterday, and will be at 81 percent tomorrow, we have some time.
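The page-or-ticket decision for that DynamoDB example can be sketched with a simple linear extrapolation. The function names, the six-hour paging cutoff, and the assumption that yesterday's growth continues at the same rate are all illustrative choices, not something we have in production.

```python
def hours_until_full(utilization_now, utilization_day_ago):
    """Estimate hours until 100 percent, assuming linear growth.

    Utilization values are fractions of provisioned capacity (0.8 = 80%).
    Returns None when utilization is flat or falling.
    """
    growth_per_hour = (utilization_now - utilization_day_ago) / 24.0
    if growth_per_hour <= 0:
        return None
    return (1.0 - utilization_now) / growth_per_hour


def route(utilization_now, utilization_day_ago, page_within_hours=6.0):
    """Return 'ok', 'ticket', or 'page' for a capacity reading."""
    if utilization_now < 0.8:
        return "ok"
    eta = hours_until_full(utilization_now, utilization_day_ago)
    if eta is not None and eta <= page_within_hours:
        return "page"       # rapidly on its way to 100 percent
    return "ticket"         # over 80 percent, but there is time


print(route(0.80, 0.79))  # slow creep across the 80 percent line
print(route(0.95, 0.50))  # steep climb
```

With 80 percent today and 79 percent yesterday, the estimate is roughly 20 days of headroom, which is exactly the kind of task that belongs in a ticket rather than a page.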
Apparently, Datadog has an integration specifically for JIRA, so we’ll be exploring that. We also frequently have alerts around things to give us situational awareness. But when stuff is going wrong, a lot of these are firing at once, so it can be kind of hard to filter through that. So we almost want a way to group, aggregate, and surface all of these secondary alerts together to give us a context dashboard and also just avoid getting overwhelmed during an incident.
If we have an alert that says no requests are going through to this service, and possibly with a more specific reason why, it would be terrible if that was completely overwhelmed by alerts that the CPU was idle on 150 hosts. Like we’d get 150 emails and not really be able to go through that. It would also be great to include links in the alert messages that you can click to say whether the alert was helpful or a false positive, and kind of categorize it from there. I also liked Cory’s feedback links there. I think collecting statistics on when an alert fires incorrectly could really help us focus our attention on the noisiest ones.
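If those feedback clicks were collected as (alert name, verdict) pairs, ranking the noisiest alerts would be a small aggregation. This is a sketch of the idea only: the verdict strings, the `min_votes` cutoff, and the function name are assumptions, and nothing like this exists in our tooling yet.

```python
from collections import Counter


def noisiest_alerts(feedback, min_votes=3):
    """Rank alerts by the share of feedback marking them a false positive.

    `feedback` is an iterable of (alert_name, verdict) pairs, where
    verdict is 'helpful' or 'false_positive'. Alerts with fewer than
    `min_votes` responses are skipped as too thinly sampled.
    """
    helpful = Counter()
    noisy = Counter()
    for alert_name, verdict in feedback:
        if verdict == "false_positive":
            noisy[alert_name] += 1
        elif verdict == "helpful":
            helpful[alert_name] += 1

    rates = []
    for name in set(helpful) | set(noisy):
        total = helpful[name] + noisy[name]
        if total >= min_votes:
            rates.append((name, noisy[name] / total))
    # Highest false-positive rate first; ties broken by name.
    return sorted(rates, key=lambda item: (-item[1], item[0]))


clicks = (
    [("cpu idle", "false_positive")] * 4
    + [("cpu idle", "helpful")]
    + [("no requests", "helpful")] * 3
    + [("disk full", "false_positive")]
)
print(noisiest_alerts(clicks))
```

The minimum-vote cutoff keeps a single annoyed click from putting an otherwise useful alert at the top of the cleanup list.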
So in conclusion, much of our alerting and monitoring culture has evolved out of the challenges of keeping the on-call rotation accessible to people from across engineering. Our volunteer SysOp group has helped the entire org share both knowledge and the burden of keeping Airbnb up. It’s fostered links between teams that would otherwise be rather disconnected. Engineers have picked up new skills and gained exposure to detail that would otherwise be hidden. We’ve worked together to achieve greater reliability.
This background has led to our adoption and creation of tools that play to our philosophies. Datadog gives us a metrics and alerting platform that we trust that requires little from us to maintain. We built Interferon on top of this so that we could easily create shared and standardized alerts, and make alerting very easy and auditable. But there’s still plenty that we can continue to do for the future, and we still constantly need to improve what we’re doing and make our monitoring more relevant, and make better use of the data that we generate. Thank you. Are there any questions?
Q&A
Man 1: Do you deploy the code automatically, the Interferon code?
Ben: Okay, yeah. So we have an internal tool called Deploy Board that we use for pretty much everything. It has a really fun interface, so that caused us to think of everything as a deploy, because there’s lots of boxes that light up in cool colors. And you really feel like you’re doing something when you deploy. So what we have is we put it on a box. We have a box that runs the alerting code, and when you make a change you push that out. It does the SSH loop, gets the artifact in there, does a run of the Interferon framework, and that syncs up with Datadog.
Then, to catch changes in the underlying information, I believe we also have all of the alert specifications reevaluated every hour or so. So this picks up changes, like when we’ve added new DynamoDB tables or added new hosts. We don’t have to do a deploy to add monitors to them. They’ll just pick up the changes automatically during the scheduled period.
Man 2: We’ve got time for just this last question, I think, and then we’ll try to get back on track.
Man 3: You seemed to show that there’s a code review–like process for deploying a new monitor. Who does the code reviews? For example, do I get to say that before you set up and I get paged at 3:00 a.m., I get to do a review of the monitor you’re setting up for that?
Ben: So I mean, it doesn’t usually come out to that. So it varies team to team. We try and kind of democratize that. So in that case, the only people that were involved in that pull request were members of that team. If you start adding the SysOps’ PagerDuty or other teams to get paged, we might have to have a chat. But overall, it hasn’t been an issue. If you want feedback on it, you can tag the observability team or the SRE team and kind of integrate that.
But really, probably in some ways, because we have the high bar for paging, sometimes people even forget that that’s an option. Luckily, we have a good base level of coverage. People really don’t like waking each other up. It’s kind of nice.
Man 4: You said that you guys have 50-plus volunteers as part of the on-call. What motivates them? Like as an engineer, what motivates me to sign up for on-call?
Ben: Did you see our hoodies and our hats? But in all seriousness, it’s a very good way to get exposure to the organization and a specialty. Also, when you respond to an incident, you get to be the big damn hero, and that feels pretty good too sometimes. A lot of people just have an interest in learning about back-end systems and more of what happens when an incident goes on. We have a very supportive culture and write post-mortems about everything. Sometimes they veer to the side of sagas celebrating the heroism of the SysOps, and I think that’s just fine. It’s just a good opportunity to meet and work with people that you wouldn’t otherwise. Again, you’re on call for a couple weeks a year. It’s actually not too bad at all.
Thank you.