Another journey of chaos engineering (Stitch Fix)
Published: July 12, 2018
Bruce: Scared, timid, anxious. These were some of the feelings my team experienced during their first chaos engineering gameday. You see, we had just spent six months building what would turn out to be Twilio’s largest-scale system ever built, and we were getting ready to launch. And while I was able to influence and launch chaos engineering at Netflix years ago, this was my first time trying to launch chaos engineering at a new company. Different people, different products, different technology—and to be honest, part of me was anxious. It was time to prove whether I could do it again or whether it was luck the first time.
My name is Bruce Wong. This is James Burns, and this is a collection of stories and lessons that we’ve learned integrating chaos engineering into Netflix, Twilio, and Stitch Fix.
Chaos engineering going mainstream
How many people have ever heard of chaos engineering? Awesome. The first time I gave this talk, that was never the case. There were only a few people.
For those of you who might not know, there’s a startup called Gremlin. They define chaos engineering as this: chaos engineering is a disciplined approach to identifying failures before they become outages. By proactively testing how a system responds under stress, you can identify and fix failures before they end up in the news. You know, I never thought I’d be able to actually say that there’s an entire startup around chaos engineering to help bring this practice everywhere.
Back in 2014, I launched the team, the vision, and the charter for chaos engineering at Netflix. And to be honest, I never thought it was gonna be anything but a cool team at Netflix. ReadWrite did a nice write-up on that launch, and it had a bold claim: that Netflix’s chaos engineering should be mandatory everywhere. And I will say this got me curious: curious if this strategy could actually indeed travel to other companies and if I could be that catalyst elsewhere.
Chaos engineering at Twilio
So back to the Twilio story. I still remember the first time I explained to James what we were going to do. He was the tech lead on the team, and I wanted him to play the role of master of disaster. I have never seen anyone so happy to get a chance to break the system. He had all these ideas about the mayhem he could inflict on the system: partial network partitions, dropping prime-numbered packets, random kernel panics. He was like a kid in a candy store, dreaming.
So we all got into a conference room. The team had their Datadog dashboards up, and I declared the start of the incident. We were all ready, James had his poker face on, and we heard James type a single command. For those of you who don’t know, that command shuts down a box. The team looks at me, the team looks at James, they look at their dashboards, and I’m like, “Why are you looking at me? I’m just here facilitating.” And so James does it again, and again, and eventually he killed every single box that we had, and we were hard down.
We were only in stage at the time, because we hadn’t launched yet. We were getting ready for launch. And then, the team saw the impact 45 minutes into it. We learned a lot that day, and what started out as a scary, timid, anxious exercise—it actually ended with excitement. It ended with a sense of learning. It ended with a challenge, and that excitement didn’t stop. In fact, the team later asked me if we could do this regularly, and so we did. We did a gameday every single sprint, before sprint planning, so that we could take the resilience stories and prioritize them right after that.
And we had an amazing launch. We launched the largest-scale system at Twilio. We 50x-ed our traffic in two weeks, and we had zero issues. That success actually led to other teams around Twilio starting to do chaos engineering themselves. But we didn’t stop there. We kept going even after launch. We indeed attempted to chaos-engineer all the things.
The best part of the story and the best part of chaos engineering is when you actually get to see your work pay off. Eventually, we had the annual AWS region-wide outage. And the most interesting thing happened. For our system, it failed exactly as designed. We had perfect observability to prove it, we had confidence in that observability, and our postmortem had zero surprises and zero follow-up items for an AWS region-wide outage. At this point, I felt pretty validated that chaos engineering was now growing organically across Twilio, my small team had just built the largest-scale system with the highest availability, and we had zero surprises from a region-wide outage. So perhaps it wasn’t luck after all.
Chaos engineering at Stitch Fix
That brings me to Stitch Fix: another new company, new challenges, new people. For those of you who don’t know what Stitch Fix is, I like to say we are pioneering our way through what I call the personalization economy. The way it works is, you sign up—you fill out a quick profile, it takes about 15 minutes. We send you what we call a fix, which is a box of clothes that we think you’ll love. Try it on, you keep what you want, return what you don’t. In some sense, we are bringing the mall to you. This journey was even more validation that chaos engineering is here to stay.
I knew that awareness had grown substantially since I had launched it in 2014. But this was one of the most bad-ass adoption stories I have ever, ever seen. And this, I had nothing to do with, which is fantastic. Our platform team wrote Chaos Monkey into the container platform from day one. So we migrated all our apps to containers, because that’s what everyone’s trying to do, right? And some of the software engineers noticed that their containers mysteriously vanished. And so they asked the platform team, and their response was, “Oh, sorry. That kind of happens sometimes. The underlying host sometimes dies. Just assume it’ll happen and, you know, make sure your software can handle that.”
Three months later, the team actually revealed that we were running our own version of Chaos Monkey to terminate hosts randomly all the time in production on everything. So, effectively speaking, Stitch Fix got to 100% Chaos Monkey adoption as soon as our migration was complete. There was no opt-in, there was no opt-out. Just on by default. So at this point, I’m convinced that chaos engineering is here to stay. There’s a thriving community, there’s numerous open source projects, and there are many companies taking action to invest in roles and teams to do chaos engineering. And when I reflect on the impact that chaos engineering has had at the companies that I’ve been at, it really changes the way these companies think about resilient systems, DevOps, and even engineering talent. Ultimately it enables teams to move faster and safer.
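The platform team’s approach can be sketched in a few lines. This is a hypothetical, simplified sketch, not Stitch Fix’s actual implementation: `pick_victims`, `run_monkey`, and the `terminate` callback are assumed names, and a real version would invoke a cloud provider’s termination API on a schedule.

```python
import random

def pick_victims(hosts, fraction=0.05, rng=random):
    """Choose a random subset of hosts to terminate.

    Every host is eligible -- no opt-in, no opt-out -- mirroring the
    on-by-default behavior described above. At least one victim is
    always chosen so the exercise actually happens.
    """
    count = max(1, int(len(hosts) * fraction))
    return rng.sample(hosts, count)

def run_monkey(hosts, terminate, fraction=0.05, rng=random):
    """Terminate a random slice of the fleet via the supplied callback."""
    victims = pick_victims(hosts, fraction, rng)
    for host in victims:
        terminate(host)  # e.g. a call that kills the container's host
    return victims
```

Because every host is eligible on every run, teams can’t opt out; they have to build software that tolerates host loss from day one, which is exactly the adoption dynamic described above.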
What chaos engineering means for the industry
And so as we attempt to look into our crystal ball, James and I have been thinking a lot about what this means for our industry and our professions. And we’ve realized that the role of senior engineers is actually changing. And with that, I’d like to invite up James Burns, and he’ll take us through some of the myths and reflections on what it means to be a senior engineer.
The evolving role of the senior engineer
James: Thanks, Bruce. So, as Bruce said, I was excited about breaking things, about being a master of disaster, but I was also a little worried, because master of disaster sounds a little bit like being a disaster. And you know, who wants to be known as the disaster? But when I saw the growth of the team, when I saw what chaos could accomplish, I was sold. I was bought in.
But it made me start thinking, because my team had accomplished things that I thought were what senior engineers did. When I started this process, I thought I was a really good senior engineer, and I thought that was because of my experience. I thought it was because I’d worked in kernels and networking stacks, security, embedded systems. I thought these were the things that made me able to be effective as an operator. But when I saw my team members starting from all different levels of experience be able to accomplish the same things, I started to wonder, “Well, what is a senior engineer? What have I been believing about senior engineering that just has been proven not to be true to me?”
So before we get to these three myths, let’s look at how people think about what senior engineering is. This is a word cloud from the descriptions of senior engineer posts at places like Microsoft, Google, Facebook. And the words that pop out are things like software, experience, engineering, development, technology. And some of the particular line items are things like 12 to 15 years of experience, or a degree component like BS/MS in computer science, or perhaps something like proven experience building, shipping, and operating reliable distributed solutions. While I benefit from this definition, I’ve come to see that it’s wrong.
Myth 1: Senior engineers fail less
So let’s look at the myths. Myth number one: that senior engineers fail less.
So this is an accident in Japan involving 20 supercars and a Prius. I’ll get back to that. There’s this idea that becoming senior is a process of developing specific skills, of paying dues, of being paged at 2:00 a.m., at 5:00 a.m., at 7:00 a.m.—that this is how you learn how to operate systems. And, more generally, that this is a process of learning how not to make mistakes. That when you start as a junior engineer, you make a whole bunch of mistakes all the time, you develop into senior engineer and you make no mistakes. But let’s talk about reality.
So first, a poll for you all: How many of you have ever made a mistake? These are the humans. How many people have broken production? These are the senior engineers. Last question. How many people have cost more than your yearly salary in a single incident? These are the architects. But really, the reality is that, like these cars, we’re very expensive, and like these cars when we crash, it’s also very expensive.
Myth 2: Senior engineers prevent failure
Myth number two: that because senior engineers are failing less, that they can also prevent failure. The idea is that our experience, our time, gives us some kind of foresight into the future. And so we’re expected to do reviews, we’re supposed to look at a design and say, “Oh, I see you’re following this pattern here, and in two years, it’s going to cause the system not to scale.” Or more often, in code reviews, we’re supposed to look at a particular line of code and see, “Oh, I see this is != instead of ==.” Or “I see that you’re missing a comma after the bracket, and that means that the object is going to be instantiated differently.” Or “I see that even though all the tests are passing, you failed to require a dependency. And when you put that into production, that’s going to cause an outage.” These are not random examples. These are all code reviews that I approved and broke production.
The reality is that our experience doesn’t give us some kind of mystical foresight. Things are changing, the landscape is changing, designs are changing, and the review process that isolates senior engineers to a certain part means that we don’t even get to learn from our mistakes in real time. Somebody else ends up deploying that to production and they say, “Hey, you approved the bad code review.” It’s not effective for our learning, either.
Myth 3: Senior engineers can understand the system
So myth number three: that senior engineers, because they’re preventing failure, are doing it by understanding the system. There is the belief that they can keep track of more things, that they can effectively know what they don’t need to keep track of, or that they can just reason better. The reality is that the systems we deal with every day are so complex that no one person can understand them. And overconfidence in seniority is the path to extended outages.
So, the three myths of a senior engineer: that senior engineers fail less; that because they fail less, that they can prevent failure; and that they prevent failure by understanding the system. So, as I saw how these were myths, as I saw how the team developed, as I saw what this meant for my career, for my development, I wondered, “What is it that senior engineers are supposed to be doing? What am I supposed to be doing? How can I be the most effective helping my team, my company?”
Senior engineers make failing safe
And after a lot of reflection, I came to the conclusion that the fundamental role of a senior engineer is to make failing safe. Let me say that again. The fundamental role of a senior engineer is to make failing safe. If there’s one thing you take away from this, let it be “make failing safe.” So let’s look at how they make failing safe.
Senior engineers run gamedays, because it’s safer to fail at 3:00 p.m. than 3:00 a.m. It’s much safer to fail when there’s no customer impact. And it becomes safer to fail in production when you have validated the observability.
The process is simple. It’s the regularity, it’s the cadence, it’s bringing failure in dialogue with features so that when you decide what the best thing for your customer is, you’re not deciding that based on what failed in the last week. You know where your system is with resilience; you know what your customers need; you can make the right choice. Senior engineers drive postmortem culture, because when customer-impacting failure happens, and it is when, they know that words matter. That if the question asked is, “How can we make this never happen again?”, people aren’t going to talk. People aren’t going to volunteer. People are going to feel like they’re the failure. When the question is, “How are we surprised?”, they can be open, they can feel safe, they can start thinking about what assumptions they had—where their expectations went wrong.
The transformation I saw for this was with junior engineers especially. They felt comfortable; they felt safe writing out real timelines. So, one of the things I’ve seen over the course of my career is that timelines end up being optimistic when you’re doing postmortems. And the reality is that, usually, there was a gap in your observability well before the start of incident—maybe hours, maybe days, and in the recent case, it was months. When people feel comfortable writing that down, writing down the real timeline, then you can actually see that. You can see your gaps in observability, you can see the gaps in your process, and you can fix them.
Senior engineers facilitate collaboration
Senior engineers facilitate collaboration. They don’t make proclamations of architecture. They don’t distribute design documents. The team designs the system; the team validates the system; the team operates the system. The transformation I saw with this was that my one-on-ones with team members went from real-time code reviews, talking about specific pieces of code, how they might or might not be behaving, and recent incidents, to design reviews, but of the best kind.
All my team members, from people straight out of college to people with a few years of experience, were able to bring me designs saying, “Here’s a refactor I’d like to do. Here’s a new feature I’m working on.” And they would say, “This is how I think it’s going to fail. This thing that I’m working on—I think it’s going to fail this way. I’m planning to be resilient to it in a particular way, and here’s my plan. Here’s how I’d like to validate that in the chaos gameday.” When all your engineers can do this, can do the best kind of architectural resilience work, it’s transformative to your whole development process.
So to recap, senior engineers make failing safe by running chaos gamedays, by driving postmortem culture, and by facilitating collaboration. The results are the things that we all want.
How the organization benefits
First, we end up with a confident on-call engineer: one who knows they have the tools to investigate, knows how to mitigate issues, knows when to escalate and ask for help, and knows how to communicate. One of my team members was a woman named Sneha, and she joined us straight out of college. And I saw her grow with chaos engineering, in a matter of a few months, to not only being able to be on-call, but being effective on-call. And not just effective but calm and cool and on-point in her communications. Amazing result.
You end up with resilient systems, built by teams that think of failure first, can talk about how they plan to be resilient to it, and can validate that. You end up with effective designs that accomplish their goals and express their trade-offs. And the end result of all this is productive development, because you don’t have to build the same system five times. You build it once, you validate it, you make changes, you validate those.
So, if these are the results that everyone wants, and the path to this is re-conceptualizing how we think of senior engineering, what might our job descriptions look like? What should we be looking for? Maybe the word cloud starts looking like this. Maybe the things that pop out are things like learn, grow, mentorship, safety, people. Maybe the requirements are things like “values creating experiences,” “values learning and changing their mind,” “values working to keep the team healthy,” “values making failing safe.” Imagine a world where senior engineering looks like this. These have changed the way I think about career development. I invite you to do the same. Here’s Bruce for some closing thoughts.
Conclusion: The time is now
Bruce: Thanks, James. The famous last words are, “Well, that’ll never happen.” We all try to prepare for those blue moon events, or maybe we don’t. But the organizations that actually prepare for these—the ones that understand the importance of practicing chaos engineering—effectively inoculate their systems against even the largest-scale events. Companies are all at different stages in their own journeys with chaos engineering.
And so I saw that you all heard of the term. How many people are actually doing chaos engineering at their companies? All right. So next year, I hope that that’s like everyone, right? So whether you’re getting started, whether you’re a senior engineer, whether you’re a junior engineer, the time is now to invest in chaos engineering.
You know, it’s interesting. Every place I’ve been, the question I always get is like, “Well, I need to make the system ready. I need to get the system ready to do chaos engineering. I can’t just start doing it.” Well, what I say to that is that that’s like the gym and working out. You don’t get in shape to go to the gym, you go to the gym to get in shape. And likewise, you do chaos engineering to make your systems resilient.
When you’re thinking about that next job—because, let’s face it, we’re in tech, and you all are going to think about another job at some point—consider joining a company that has actually adopted chaos engineering practices. And for those of you in your current roles who might be decision makers, reconsider the role of senior engineers, directors, and VPs, and how chaos engineering might play a critical part in it. So wherever you are on this journey, here are some additional resources for you.
Kolton Andrus is the founder and CEO of Gremlin. I mentioned them earlier. He’s done a number of amazing talks on chaos engineering. Tammy is also at Gremlin now, and she’s actually run a number of different hands-on workshops for chaos engineering, and she’s open-sourced all of the resources. So you can just go there, and she has all the resources to actually stand up a Kubernetes cluster, get an app on there, get it running, and actually practice chaos engineering on that. If you liked hearing me and James speak, we actually did another talk on chaos engineering a couple of years ago—actually at another Datadog event. So I invite you to watch that.
And if you’re curious about Stitch Fix and you want to expand your wardrobe beyond free tech swag, I’ve included a promo link that’ll waive the first fee and get you some credits. And with that, I think we had a little bit of time for Q&A.
Should chaos testing be continual?
Audience member 1: Thank you for the talk. It was very cool. What are your thoughts on encouraging developers to not only do gamedays, but to include resiliency testing in their testing itself? Like unit tests that simulate a database being down, or Redis being slow, or something like that?
Bruce: Yeah. So, the question was, what are my thoughts on not just gamedays but also resilience testing on an ongoing basis? I would say that’s actually the goal. So, you start with gamedays because it’s a team exercise. Everyone’s there. And the goal is actually to automate all of this so that you’re constantly introducing this type of resilience testing. So that as you make changes, you know when something broke your circuit breaker or broke your alerts, because you actually have that testing and all that automation built into your system.
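As a concrete illustration of baking resilience checks into ordinary tests, here is a minimal sketch in Python. `fetch_recommendations` and its fallback are hypothetical names, but the pattern (inject a failing dependency with a mock and assert the degraded behavior) is what the answer above describes.

```python
import unittest
from unittest import mock

# Hypothetical service code: fall back to a cached default
# when the database call fails, instead of propagating the outage.
FALLBACK = ["default-item"]

def fetch_recommendations(db):
    try:
        return db.query("SELECT item FROM recommendations")
    except ConnectionError:
        # Degrade gracefully: serve a stale-but-safe default.
        return FALLBACK

class RecommendationsResilienceTest(unittest.TestCase):
    def test_database_down_returns_fallback(self):
        db = mock.Mock()
        db.query.side_effect = ConnectionError("db unreachable")
        self.assertEqual(fetch_recommendations(db), FALLBACK)

    def test_healthy_database_passes_through(self):
        db = mock.Mock()
        db.query.return_value = ["item-1", "item-2"]
        self.assertEqual(fetch_recommendations(db), ["item-1", "item-2"])
```

Tests like these run on every commit, so a change that breaks your fallback path fails CI rather than failing in production.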
Can blameless postmortems convey the seriousness of a problem?
Audience member 2: Hi. I have a question around making it safe to fail. And related to that, around running a blameless retro. So, how do I ensure that engineers are understanding the seriousness of the situation while still keeping it not personal and blameless?
James: Let’s see.
Bruce: Repeat the question.
James: Yeah, I didn’t hear it.
Bruce: So, if I understand your question right, it’s so when you’re making failing safe, how do you have senior engineers balance like their personal side of things and the objectiveness of making failing safe? Is that right?
Audience member 2: Right.
Audience member 2: [inaudible]
Bruce: Gotcha. You can take it.
James: Sure. It’s interesting, because this is the tension when people talk about blameless postmortems too: people want to make sure that everyone feels appropriately responsible. But I think the safety we’re talking about is at least twofold. In one sense, it’s that you follow this practice before the failure, so that when you do fail, you at least know that you failed, and you have the observability to see it.
I mean, a lot of it is… if it’s to the point where people don’t feel like they can take responsibility safely, or people feel like they’re expected to be responsible for things that aren’t their fault, those are general organizational dysfunctions. The goal is more about creating that culture, about being able to ask that question, “How are we surprised?” Because responsibility is shared: everyone who worked on that system shares the fault, even if it’s one person who made the mistake.
I have not experienced this dysfunction myself, but I’ve heard about it: when people don’t feel safe, instead of accepting more responsibility, they just lie about impact. And if that’s not what you want, then you still have to say, “Yes, we know this happened. Yes, this was bad. Yes, this cost more than someone’s annual salary.” That’s reality. It’s going to happen again. Getting rid of that person, or saying the team who built that system isn’t any good, is not effective.
You need to create that culture of continual development, that culture of, that you’re not starting from perfection, that you’re trying to progress towards something. That’s still isn’t perfection but that’s better, and that safety is part of that. And if you have people who aren’t responsible, that’s sort of a different problem. You need to make it safe for all the rest of the people so that they continue to be more responsible. That’s my two cents.
Chaos testing in CI/CD pipelines
Audience member 3: Okay. Do you have the chaos principles, let’s say as a part of your CI/CD pipeline? Or is it something that runs in external [inaudible]? So do you relate it somehow to deployments as well or it’s totally isolated?
Bruce: So your question’s kind of around, is it integrated with CI/CD? How do you think about chaos testing on an ongoing basis?
Audience member 3: Well, if it’s related, let’s say. So, for example, is there a time sequence over there or is it totally isolated from what happens to your delivery pipeline?
Bruce: Is that chaos on the delivery pipeline or as part of the delivery pipeline? Sorry.
Audience member 3: Well, the question is pretty much, let’s say that I change a component that actually makes something really bad happen to my infrastructure, right? So if the interval is not affected from this delivery of the component, it means that the first time I ran, my suite worked pretty great, so I get all greens, but because I had breaking changes after this component and I didn’t run the suite yet, I cannot recover anymore.
Bruce: Gotcha. Okay. So the way I would put this is, I think about chaos as something that you need to do more frequently than you deploy. Because in complex systems, somebody’s always deploying, right? Once you get into the order of thousands of microservices, dozens of teams and dozens of services are changing beneath you. And the reality is, those are just the ones you know about. If you use anything like SaaS, those are all doing deployments that change everything as well. So you just have to account for change happening, and you want to run this as aggressively and as often as you possibly can in order to catch issues as soon as possible. Yeah.
So the way to think about this is twofold. Your resilience strategy is usually around two dimensions: containment or isolation. And so you’re thinking about your blast radii. The idea is that you want to test multiple aspects of all of that. So whether it’s isolation or containment, it’s an onion. If you only have one resilient path, then yeah, you’re gonna end up in that boat. But if you actually have a number of different measures in place, one of them will catch it. It’s just a question of how big that failure is going to be. And that’s why you need to exercise these things.
Chaos in the enterprise
Audience member 4: If I understand what you guys said right, it sounded almost like a self-contained, what I would call a pioneer team who just went out and did it. And then, sort of by emulation, folks said, “I want the same results,” and it started to organically grow. But in larger enterprises, things get kind of a little wedged there, and you typically have to do quite a bit of building support.
So, I just wanted to hear what you guys had to say about what I think of as the optimistic pioneer spirit in a smaller organization with emulation, versus the other end of the slider, and how hard or doable is it to build support and roll it out effectively.
Bruce: Yeah. Now, that’s a great question. So the question was around adoption at like larger-scale companies. So Netflix has about like 2,000 to 3,000 engineers. Twilio is getting close to like 800 or 900 engineers at this point. By any means, they’re not small—like it’s not a small 20-person startup.
And so, you know, the thing is, even at the highest levels, at your VP level, when they see results that are undeniable, when they see uptime that you can’t argue with, when you see this team survived an entire region outage and why didn’t everyone else, they start beating that drum too, right? And so part of this is trying to do things organically from the ground up, but when you see results and you see those type of results and like, “Okay, well, this team has half the engineers, they’re delivering their features on time and they are three times more resilient than everyone.” I don’t know what kind of executive wouldn’t want that across their organization. Right? And so that’s where you start getting executive support and stuff like that. So part of this is the art of telling those stories and making that visible.
James: Usually, you can start smaller, too. The idea that chaos has to be a full-blown program before you can see positive effects… I mean, that first chaos gameday, the reason it took 45 minutes, is that we had no business metrics, basically. And you can discover that, and that can have real impact on how you operate your system, just once. There are more benefits from running a full-blown program, but it’s been a while since I’ve worked at a really, really big company. The ability to shut down one machine should be in reach, though, and you can see material benefits from just these small things. Which is why we talk about gamedays and not, “You should run Chaos Monkey everywhere day one.” That’s nice, but the reality is that you can see a real benefit by going back and doing this in your staging environment and seeing what you can’t see, because there’s usually something you can’t see.
Bruce: You know, the other organic benefit is, when the team that’s doing chaos is not pager-fatigued and they’re like, “Oh yeah, it’s just on-call,” and the rest of the organization is like, “I hate on-call,” guess what happens, right? Like, what are you guys doing that we don’t know about? So there are benefits. There are results for the executives, right, because we’re talking dollars when we’re talking downtime. And there’s also just a more humane lifestyle for the rest of the people who carry the pager.
How does chaos change the way you architect systems?
Audience member 5: Concrete examples about how chaos engineering changed your thinking about architecting systems?
Bruce: Yeah. So your question is about concrete examples about how chaos engineering changed how we think about architecting a system? Which one do you want to do?
James: If you use SaaS and you’re not closely monitoring that, and you think that’s always going to be up, you should chaos that. Because it won’t be up sometime, and if that’s a critical dependency, you’ll be very sad.
Bruce: We used to work for a SaaS company. It won’t be up all the time.
Plans and roles for a chaos gameday
Audience member 6: Hello. Hi. This might actually be sort of intimately related to what the other gentleman asked. But if you’re just getting started and you are a small company, do you just kind of introduce a gameday and say—you know, you said earlier in the presentation, you were like a kid in a candy store. Do you just sort of let people come up with whatever kind of situation they want, and break things in however way they want, or do you start with a direction?
James: I mean, practically, it’s remove capacity and make sure you can see it and make sure it has the impact you expect. So shut stuff down, and then fail network connections, and you get like at least 80 percent of the benefit. By blocking network connections between different things, that will usually show you the things that are most likely to happen and the things that are probably going to impact you most.
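One lightweight way to “fail network connections” without touching the network layer itself is to wrap a client call in a fault-injecting proxy. This is a hypothetical sketch: `FlakyConnection` is an assumed name, and a real gameday might instead block traffic with firewall rules or a service mesh.

```python
import random

class FlakyConnection:
    """Wrap a client call and fail it with a configurable probability,
    simulating a blocked or partitioned network path during a gameday."""

    def __init__(self, call, failure_rate=1.0, rng=random):
        self.call = call
        self.failure_rate = failure_rate  # 1.0 simulates a hard partition
        self.rng = rng

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected network failure")
        return self.call(*args, **kwargs)
```

Set `failure_rate=1.0` on one dependency, watch your dashboards, and check that the impact looks the way you expected; that covers the “remove capacity and verify you can see it” exercise described above.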
Bruce: I would say, it’d be good to get everyone to agree on what they’re gonna do and roles. Someone’s got to fix the system; someone’s got to break the system. That’s why we do these talks, so that you can have your boss watch it, and stuff like that. Awesome.
James: All right.
Bruce: Thank you very much.
James: Thank you.