The journey to chaos engineering begins with a single step
Published: March 14, 2017
This little guy, Chaos Monkey
Bruce: How many people have ever heard of “chaos engineering” before?
All right. Awesome. So, no talk about chaos engineering is complete without talking about this little guy, Chaos Monkey. Chaos Monkey was built in 2009 by a good friend of mine. His name is Greg Orzell. I actually went to high school with Greg. That’s how I ended up at Netflix. And so, if you don’t know what Chaos Monkey does, Chaos Monkey looks for groups of servers and terminates one randomly every day.
And so, this is, kind of, the start. And it was the start of Chaos Monkey, and the start of that kind of culture, and notion, and strategy of purposefully injecting failure into a system.
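As a rough illustration of the behavior described above (find groups of servers, terminate one at random), here is a minimal Python sketch. It is not Netflix's implementation. The group names, the instance IDs, and the `terminate` callback are hypothetical, and nothing here calls a real cloud API.

```python
import random


def pick_victims(groups, rng=None):
    """Pick one instance at random from each non-empty server group.

    `groups` maps a group name to a list of instance IDs. This data shape
    is an assumption made for the sketch.
    """
    rng = rng or random.Random()
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances  # skip empty groups, there is nothing to terminate
    }


def run_daily(groups, terminate, rng=None):
    """Terminate the chosen instance in every group via the `terminate` callback.

    `terminate` is a hypothetical hook; a real tool would call the cloud
    provider's API here.
    """
    victims = pick_victims(groups, rng)
    for instance_id in victims.values():
        terminate(instance_id)
    return victims


if __name__ == "__main__":
    fleet = {"api": ["i-a1", "i-a2", "i-a3"], "worker": ["i-w1", "i-w2"], "empty": []}
    run_daily(fleet, terminate=lambda i: print("terminating", i))
```

Passing the termination step in as a callback keeps the random selection testable without touching any real servers.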
And so, fast forward a bit. I like to say, this is Netflix’s journey of chaos engineering. And I usually title this talk, “The journey of chaos engineering begins with a single step.”
Okay. For Netflix, it started in 2009 with Chaos Monkey. I joined in 2010, and I was quickly acquainted with the likes of Chaos Monkey. I was on call, I was carrying my pager, and I would just see my servers randomly go away. So, I got used to that very quickly. I got acquainted with the notion that things in the cloud can fail, that software fails, that everything fails.
A picture of a beach
And so, fast forward a bit. In 2012, Netflix built Chaos Gorilla. Chaos Gorilla took out entire availability zones. Why? Because availability zones can fail. And then later on, I actually built a team called Chaos Engineering. I built a team that set out to double down on that strategy of failure injection. And we’ve delivered Chaos Kong to take out entire regions. Why? Because regions can fail.
Anyone on call during the recent S3? Fun? Yeah.
And so, the point that I like to make here (and I put the years here) is that chaos is a journey of a thousand miles. Chaos takes time. Resilience takes time.
And so, I wrote a few tech blog posts about this at Netflix, and ReadWrite followed up with a piece about how chaos engineering should be mandatory everywhere. And, you know, that actually got me to really stop and reflect. While I loved working at Netflix, I loved working at an entertainment company, and I loved the team that I had just built, it kinda made me stop and realize that perhaps there were applications of chaos engineering that were more important than streaming TV shows and movies.
That’s what brought me to Twilio. So, Twilio is a cloud communications platform that enables developers to build rich communications into their apps. What does that mean? Think of some of our customers: it’s an Uber driver picking you up and calling you, it’s Airbnb confirming your reservation, or it’s PagerDuty waking you up at 3:00 in the morning.
And so, I joined Twilio, and I have this MO, it’s chaos engineering. And I’m there learning about the products, I’m learning about the customers, I’m learning about the use cases, and I’m strategizing: how do I get chaos engineering started here, and what could we use chaos engineering for?
The original title slide I had is actually a picture of a beach. And the quote is, “A journey of a thousand miles begins with a single step.” Now, when I talk about chaos engineering, I talk about Chaos Kong killing regions. So the picture of a beach in paradise is probably not the picture you have in your head. You probably have a picture in your head that looks a lot more like this: you’re on this journey, there’s a rickety bridge, you could fall to your doom, you’re going through a mountain range with a lightning storm.
Right? And she’s probably looking and thinking, “Really?” Like, “You really wanna do this? You really wanna kill the servers and stuff?” And so, I will say that chaos engineering is definitely a journey of a thousand miles and you don’t have to get started with killing U.S. east.
We have a saying at Twilio that I really, really like. And that saying, it’s one of our leadership principles, and that is, “Progress over perfection.” It is prioritizing making progress on a problem over coming up with the perfect solution.
And I really stopped to think about this: What does “Progress over perfection” look like for chaos engineering? And so, I designed this talk for the tired, for the pager fatigued, for anyone who’s ever gone to a Netflix talk or read a Netflix blog post and gone, “That’s cool, but I could never ever do this.”
Master of Disaster
And so, progress over perfection actually is present in many, many professions, not just our profession. Whether you’re a pro athlete, whether you’re a medical professional or a professional musician, each of these has a long journey. You could say that they have their own journey of a thousand miles.
And they also have a bunch of firsts, whether it’s that first competition, whether it’s that first patient or that first performance. You see, it takes years to develop all the knowledge and skills, and it really takes practice. Knowledge is not a replacement for skills and practice.
And so, to help me tell you the story about the journey that we’ve been on at Twilio, I’d like to invite up James Burns. He’s a tech lead on my team at Twilio.
James: So, our journey into chaos engineering as a team started when Bruce came to me and asked if I wanted to be a Master of Disaster for a Chaos Game Day. And I had the same question that maybe you have, which is, “What does that possibly mean?” So, to explain: a Chaos Game Day is a controlled exercise where the team knows that there’s going to be an incident, they’re set up, they’re following their standard incident procedures, and then the Master of Disaster causes the failure, usually in the stage environment. If you like to go big you could do it in production, but probably not starting off.
And so, when he told me this, I thought about all the different kinds of failures that I’ve seen, I thought about the craziest things, and I thought, “How would I take that to the next level?” I was thinking, “I’ll be like this guy and I’ll be sitting across from the team.” I’d be like, “Network partition, drop prime number packets,” all kinds of crazy ideas.
And so, I talked to Bruce and asked him, “You know what, what should we do?” He’s like, “Let’s start simple.” So that’s what we did.
sudo halt. Sit down across from the team to close that incident. I shut down the box. It should be easy, right?
The team, sitting there, and they’re looking at their dashboards, and their graphs, and they’re like, “What did he do? What did he do?” And they’re not seeing that I shut down a box. They knew I’d done something, but they couldn’t see it because of the way the things are built.
And so, what did I do? What do you think? I shut down another one and another one, and another one until there are no boxes left. They’re like, “Oh, boxes are shutting down.”
And so, after that, we practiced the next skill that you need to have, which is not just incident response, but postmortems, and then using postmortems to drive improvements. So, we sat down and did a postmortem of an incident in stage. As part of that postmortem we went through the steps of defining the timeline: “Here’s when James started shutting down boxes,” and all the different steps of that. We looked at how we could generate betterments out of that. And one of the major betterments was that we needed to develop instrumentation that reflected the health of the whole system, not just certain subparts of the system where you wouldn’t see the impact until you’re at 100 percent down.
So, we got that done. A few weeks later, all the betterments were coded and deployed, and we decided we were going to do another Chaos Game Day.
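To make the instrumentation betterment concrete, here is a minimal Python sketch contrasting two signals. The talk does not say what the original dashboards measured, so both functions are assumptions for illustration: `any_host_up` stands in for a signal that only changes at 100 percent down, and `fleet_health` stands in for one that moves as soon as the first box is halted. The 0.95 threshold is arbitrary.

```python
def any_host_up(healthy_hosts):
    """A coarse signal: stays True until every host is gone."""
    return healthy_hosts > 0


def fleet_health(expected_hosts, healthy_hosts):
    """Fraction of the expected hosts that are currently healthy (0.0 to 1.0)."""
    if expected_hosts <= 0:
        raise ValueError("expected_hosts must be positive")
    return min(healthy_hosts, expected_hosts) / expected_hosts


def should_alert(expected_hosts, healthy_hosts, threshold=0.95):
    """Alert when fleet health drops below the (arbitrary) threshold."""
    return fleet_health(expected_hosts, healthy_hosts) < threshold


if __name__ == "__main__":
    expected = 4
    # Simulate halting one box at a time, as in the first Game Day.
    for healthy in range(expected, -1, -1):
        print(healthy, any_host_up(healthy), fleet_health(expected, healthy),
              should_alert(expected, healthy))
```

With four expected hosts, the coarse signal only flips when the last box goes away, while the fleet-health signal crosses the threshold after the first halt.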
Eventually, full failure
So, same thing, I was sitting there trying to figure out all the crazy things I could do. And I decided, “You know what? Let’s just start with sudo halt.” And because of the changes that they had made, and how they now looked at the system, they immediately saw the issue, they were immediately able to respond and replace the capacity I removed, and we declared the end of that incident.
So, we went on to the next thing: third-party APIs. The system that we were building, which was still pre-production, had dependencies on a few different third-party APIs. I was like, “I wonder what will happen when those go down?” Because third-party APIs go down, that’s what they do. And so, I took an off-the-shelf traffic shaping solution, slowed down a box, and started dropping 30 percent of packets to one of those critical APIs. And I had the same stats up, all the new, fancy dashboards, and graphs, and everything.
I was looking across at the team and normally, I try and have this poker face so they can’t see what’s going on. I was so shocked that I was like, “That’s not what I expected to see.” Because I couldn’t see it. I knew I was creating a failure, I knew we had built all these dashboards so that we could see failure, and it wasn’t there. And so, I turned it up from 30 percent to 50 percent, 50 percent to 70 percent, 70 percent to 98 percent. Eventually, full failure.
Then I could see it. We had a full failure of that API, and we could see it impacting the system in the way that we expected to see. And then, of course, we declared End of Incident, I rolled back the changes, and we then went through the postmortem process. And out of that postmortem process we said, “Hey, we should do more instrumentation. We should build graphs that show the performance of these third-party APIs.”
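One way to get the per-dependency data that such graphs need is to wrap every call to a third-party API and record its latency and outcome. The Python sketch below is an assumption about how that could look, not the team's actual code. `DependencyStats`, `instrumented`, and the dependency name are hypothetical, and a real system would ship these numbers to a metrics backend instead of keeping them in memory.

```python
import time
from collections import defaultdict


class DependencyStats:
    """In-memory call counts, error counts, and latencies per dependency."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.latencies = defaultdict(list)

    def record(self, name, seconds, ok):
        self.calls[name] += 1
        self.latencies[name].append(seconds)
        if not ok:
            self.errors[name] += 1

    def error_rate(self, name):
        total = self.calls[name]
        return self.errors[name] / total if total else 0.0


def instrumented(stats, name, func, clock=time.monotonic):
    """Wrap `func` so every call is timed and counted under `name`."""

    def wrapper(*args, **kwargs):
        start = clock()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure, then let the caller handle the exception.
            stats.record(name, clock() - start, ok=False)
            raise
        stats.record(name, clock() - start, ok=True)
        return result

    return wrapper
```

Because failures are recorded before being re-raised, a partial failure such as 30 percent of calls erroring shows up as a rising error rate for that one dependency, even if the system as a whole still looks healthy.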
Do you think those numbers are going to help you?
So, let me talk to you really quick about numbers. Lots of people, when they’re doing presentations, like to show off big numbers. You have these numbers and it’s 2:00 in the morning; do you think those numbers are going to help you? So, once you’ve gone past that, once you’re like, “Okay. I need numbers, but I need them in context,” you end up with something like this, a line. Then you’re looking at that line and you’re like, “Hey, that line is pretty good. It showed me this incident, but it didn’t show me this other one.” So, then you have a bunch of lines, maybe you try a heatmap, maybe you try some stacked bar graphs, whatever it is.
And part of the beauty of Datadog is it gives you all these different tools that allow you to build these highly customized visualizations for the different kinds of metrics. But part of that power is that you need to use it appropriately, you need to be able to find the appropriate way to visualize the problem that you have, so that when you’re sitting there at 2:00 in the morning and you’re looking at all these different graphs, you actually can come to a conclusion about what’s going on. So, what you need to ask yourself is this question: “What visualization will I need to see this kind of outage?” And one of the easiest ways to do that is to cause the outage and see whether you would see it or not. That was one of the main lessons and skills that we developed as part of this chaos engineering exercise.
Next one is, “How do I validate my visualization?” And the same goes for alerting as well. How do I take this thing that I built? I built instrumentation in my code, I built instrumentation around third-party APIs, whatever it is, how do I validate it? And last, can I tell a story about my system using these dashboards? Can I say, “I’ve got a failure,” and then go through and ask particular questions of the dashboards that will give me the story of what’s causing that failure, get me to root cause, and get me to mitigation to stop my customers’ pain?
So, these are some of the things that we learned during our Chaos Game Days. One of the first is you need instrumentation. If you don’t have the data, you can’t graph it and you can’t understand it. You need to make sure that you’re instrumenting all the different places. One of the interesting developments that we saw was that our engineers started thinking about failure first. They thought, “You know what? I’m going to need a metric here because this is probably going to fail, and I’m going to turn that metric into a graph here, and then I’m going to validate that during the Chaos Game Day.” You also want to understand the SLAs that you need from the APIs that you depend on, whether those are internal APIs or external APIs. And they’re probably not going to be the same as what the API provider actually offers. So, you need to be able to monitor the performance of the APIs you depend on, understand them, and then make decisions based on that.
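The point about SLAs can be sketched as a simple comparison between what your system needs from a dependency and what you actually observe. The Python below is an illustrative assumption, not something shown in the talk: the `Requirement` fields, the choice of p99 latency and success rate as the measures, and the nearest-rank percentile method are all picked for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Requirement:
    """What our system needs from a dependency (hypothetical fields)."""
    max_p99_seconds: float
    min_success_rate: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with 0 < pct <= 100."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_requirement(latencies, successes, total, req):
    """Compare observed latency and success rate against what we need."""
    if total <= 0:
        raise ValueError("total must be positive")
    p99 = percentile(latencies, 99)
    success_rate = successes / total
    return p99 <= req.max_p99_seconds and success_rate >= req.min_success_rate
```

A check like this only answers whether the dependency meets your own requirement. It says nothing about what the provider promises, which is the gap the talk warns about.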
And that was this last thing, architectural change. When we instrumented, not the API that I’d made fail, but some of the other APIs, we found that they weren’t providing the performance that we thought they were going to. And so, we ended up talking with a lot of different people in the organization, and we ended up getting to their product manager. And they’re like, “Yup. That’s not what the API was meant to do. It’s not going to have the SLA you want. You should probably use something else.” Which is actually a great answer, because then we aren’t building a system that will fail.
We made a change. We started using a different API for that particular part of the system. And as a result of that, in the first two weeks after we rolled this out to production, we scaled 100x with no problems. Because we were ready, because we had done the testing, because we caused the failures, we understood how our systems were going to scale, and we understood how our dependencies would scale as well.
We’re rock stars now, you know?
And so, since then we’ve also gone through x, passed that. So, no problems, lots of confidence in the team, and the skills to understand when there are failures. And so, we got all this done. The group’s like, “We’re rock stars now, you know?” So, we decided to do another Chaos Game Day, and same thing, I’m sitting there going, “Now, I can’t do sudo halt, that’s probably not gonna work.”
And so, what I did, I sat down, and Bruce was looking over my shoulder during this. I’m like, “What if I turn down the quota on a dependency that I know about? It’s a secondary dependency of this other API over here. I bet that would get them, I bet they won’t be able to see it.” So I did it, the alerts fired, the graphs changed. The engineer found it in five minutes, mitigated it immediately, and knew exactly where to go, and that was it. But that’s success. It’s success that even with the cleverest thing that I could try, I couldn’t slip it past them.
So, I suggest you all try this. It’s an easy process. And back to you, Bruce.
Outages make engineers better
Bruce: Thanks, James. So, one of the things that James didn’t tell you about was a conversation I had with him. We were recapping the journey of our Chaos Game Days, our sense of where the team was at, and what skills were being built. And James turns to me, looks at me, and he goes, “You know, it was really easy being Master of Disaster the first time, but it’s actually gotten very hard to break the system in a way that the system won’t be resilient to and that we don’t have instrumentation for.” And I turned to him and I looked at him, and I said, “Good. That’s what we want. This is success.”
And so, back to the professions. Whether it’s your first competition, or you’re a medical professional (even medical professionals call it a practice, right?), or you’re a musician, it takes years and years to perfect these crafts, and it takes years and years to perfect our craft. Execution is not luck, it’s skill. And so, generally speaking, outages make engineers better. Outages help you understand how the system works or doesn’t work. They help you understand complexity. They help you understand engineering. The difference is, chaos lets you learn all that without affecting your customers. And so, I noticed that it actually helped level up the team’s abilities and skills.
This is Sneha. Sneha went to UMass. She interned at AWS on the EC2 team, and she was a new grad who had just joined my team. And we saw her grow over these Chaos Game Days tremendously. And so, thanks to these game days, Sneha has become extremely efficient, proficient, and confident in her on-call responsibilities. But more importantly than that, she keeps failure top of mind. And so, when she’s developing that new feature, when she’s working on the system, she’s actually thinking about what instrumentation we need to add to this. She’s thinking about designing the system for failure and resilience.
How many of you guys are hiring? Anyone hiring in the room? Right. Okay. How many of you are hiring people out of college? Can you imagine how many outages it takes for an engineer to become proficient, right? Everyone remembers their first outage. I remember my first outage. I remember the first time I carried a pager, right? And so, this is a way to actually give your teams and your talent a way of leveling up, a way of experiencing failure without actually worrying about, you know, killing the customer.
Chaos gives you that accelerant
So, kind of a recap. The thing that we realized in retrospect of this, chaos helps your culture and your skills. By experiencing this, we develop skills faster. It’s a good accelerant.
The first time we did this the team was a little bit nervous, the team was like, “I don’t know.” Like, “I don’t want people to see that I don’t know how to do things.” Like, they’re really nervous and timid. By the end of that first one, the team came back to me and was like, “Hey, can we do this like, every week? Because it makes us better.” As a manager, when your team is asking you to do it like—it’s like hygiene, right? It’s hygiene or doing the dishes. When your team is asking you like, “Can we do the dishes all the time?” Like, “Yeah, of course.”
This is also about talent. So, it’s about developing your talent. Everyone has talent. Everyone’s hiring new people. Whether you’re a senior engineer or you’re a junior engineer, you need to learn about that system. The systems are complex in nature. The systems that you are coding for take a long time to learn. Chaos gives you that accelerant. It accelerates your learning curve for any system, and ultimately it gives you resilience. Because it keeps failure top of mind. And because it’s top of mind, and because you also understand the system better and how it fails, you’re proactive in thinking about the resilience of the system.
And so then, the system doesn’t actually go down very often because you’re thinking about that up front, and you’re validating that your code does or does not work in the face of failure. And so, the net result is you actually need to continue doing game days because the system is not going down very often. This is about retaining and maintaining your skills on call, to know how to answer that pager at 3:00 in the morning.
And so, the other journey I’ve been on is with Twilio. Well, Twilio powers the Ubers and the Airbnbs of the world. It also powers a few other organizations like these. The Polaris Project built an app called BeFree to rescue people from human trafficking. Trek Medics International built a community-driven emergency medical response for countries that don’t have 911 infrastructure. And Crisis Text Line answers texts for people reaching out in crisis. And to quote CEO and founder, Nancy Lublin, “Crisis Text Line dispatches active rescue eight times a day.” Lives are changed and lives are saved, thanks to organizations like these. And so, in retrospect for myself, I’ve kinda realized that this is why we have to do chaos, this is why chaos is important for Twilio to do.
So, in closing, when you wish upon a blue moon, may your CPUs overheat, may your hard drives crash, may your humans make mistakes, and, of course, may your networks partition, partial and full. Thank you.