Monitoring the Cloud's Critical Infrastructure

Published: March 14, 2017

Different approaches to monitoring

Moderator: So, you know, many of you had just sort of chatted a little bit…

I mean, you’re all in fairly large engineering organizations, or at least, you know, where the engineering organizations sorta outscale the individual observability or monitoring teams.

And so, you know, “monitoring” is often in your titles, or “reliability” is often in your titles, or “insights,” similar terms.

Where does that…where does the responsibility of monitoring and observability fall within your organization?

Is that something that your teams are, you know, handling for the rest of engineering and product and…you know, in business or is it falling, you know, with you?

And maybe we start at the end with Datadog and kinda work back this way.

Chris: Sure. So, you know, we’re dogfooding our own product all the time.

We try and…the SRE team itself tries to set some baselines, is always concerned about capacity planning.

But we really, you know, as a company, want everybody involved setting up metrics, monitoring, observability.

Bruce: Yeah, so I think the…we don’t have an SRE team, we don’t have a NOC, we don’t have an operations team.

So, that means that every engineer who writes code, runs code.

And so I like to say that the person who’s on call for your product is really the person who should own the instrumentation, monitoring of that said thing.

And so that’s certainly the case at Twilio.

We have, I think, on the order of 60 small teams at this point, and so my team kinda sets up guidelines and best practices, does a lot of education.

But from a responsibility standpoint, it’s certainly on the engineers and on the individual teams to decide how they wanna tackle this problem.

Josh: We’re set up the same way.

Obviously, me being a team of one, I’m not gonna do it all.

I’ve never wanted…

You know, like, I’m part of the SRE org, so there are other people on the team.

We strive very hard to make sure that everybody on SRE doesn’t necessarily have to know every single service.

We try to push back to the engineers as much as possible.

They’re the ones doing integration testing, they’re figuring out how it works.

And we help educate and set things up and we try to kinda preach an idea and a goal for us to get to and kind of crowdsource getting there.

Eric: I guess very, very similar to all of the above.

The individual teams are responsible for setting up their own metrics and monitoring.

We saw the example earlier with these JSON files that have that captured in code so that we can talk about it and review it in pull requests across the board.

Different teams are also responsible for kind of common frameworks sometimes, where it’s like multiple teams are involved in a service.

So somebody will basically put out, “Here’s the framework that we’ll be watching, these metrics.”

It’s on the individual service owning teams to make sure that they’re defining the right metrics for them.

The relationship between QA and monitoring

Moderator: So, Josh, you mentioned integration testing, and so I’m kinda curious, do you see a relationship between, you know, QA and monitoring?

Or how do those two fit together, is it two sides of a similar coin, like, the same coin or…

Josh: Yeah, I think they’re very related.

I’ve always tried to get monitoring in as soon as possible.

I’ve come from companies where it’s the last thing, you know, where I started at the company and found something that’s been running in prod for two years and is not monitored.

So I try to work with companies to say…or work with the engineers to say, “Let’s get it in right away.

I don’t care if it’s dev metrics we have to throw away later or if they’re meaningless because you’re experimenting, but get them in now."

And then you get an idea of how it’s gonna, you know, trend over time.

Moderator: Cool. So I mean so when we’re…

I think it’d be…many of us, if we were to discover a service that’s being deployed to production, we’re like, “Oh, there’s no unit test.

There’s no integration tests."

You know, “Where’s the test coverage on this,” we’d say….

We’d kind of throw our hands up in the air and sort of maybe run, screaming from the room.

How do, you know, how do those tests relate to, you know, the types of things that you’re putting together monitoring-wise?

Josh: I think for the stuff that doesn’t have any existing, you know, framework for it, you start to just look at the behaviors and you kind of attack it from outside in, and then you figure out who the service owner is and start to get your internal instrumentation done.

But, you know, for a lot of the services, you know, anybody uses these days, you start to look at, you know, RPS and latency and availability, and you can kind of look at every single internal service as a web service and kind of figure out if it’s healthy or not, and then figure out how to get inside there.
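As a rough illustration of the outside-in approach Josh describes, the sketch below treats any service as a web service and summarizes one window of traffic as RPS, latency, and availability. The `Request` record, the `summarize` function, the choice of p99, and the rule that a status below 500 counts as a success are all assumptions made for this example, not anything the panel specified.

```python
from dataclasses import dataclass
from typing import List, Optional, TypedDict


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    timestamp: float   # seconds since epoch
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP-style status code


class Summary(TypedDict):
    rps: float
    p99_ms: Optional[float]
    availability: Optional[float]


def summarize(requests: List[Request], window_s: float) -> Summary:
    """Summarize one window of traffic as RPS, p99 latency, and availability."""
    if not requests:
        return {"rps": 0.0, "p99_ms": None, "availability": None}
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 99th percentile, computed with integer arithmetic.
    rank = max(1, (99 * len(latencies) + 99) // 100)
    # Assumption: anything below 500 counts as a successful response.
    ok = sum(1 for r in requests if r.status < 500)
    return {
        "rps": len(requests) / window_s,
        "p99_ms": latencies[rank - 1],
        "availability": ok / len(requests),
    }
```

The same three numbers can be computed from load balancer logs or client-side probes before any internal instrumentation exists, which is the point of starting from the outside.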

Moderator: So do you see, like, the heads-up integration test that, like, your QA teams might run, is that with those…do you think…do you see those as being potentially good sources for monitoring places to start?

Josh: Right, yes, so…

Right, when you’re doing your testing you’re saying, “Yes, I know this software works.

I’m ready for it to go to stage or to prod," you map what you saw to make it work into your monitoring so that you can follow that through the system.

Moderator: Nice, cool.

You know, similarly on a QA approach, I’m kinda curious how are you all…

You know, we talked about sort of…we de-centralized monitoring.

We’ve said it’s the problem…you know it’s, “As a service owner, you’re responsible for your monitoring.

You’re responsible for keeping your service online and deployed."

You know, it’s sort of a double-edged sword, though, as well, right?

We’ve all worked with a team that gets 10,000 alerts a week, and you say, “Are you checking all of those?”

And they say, “Yes.”

And you say, “Well, that’s more alerts than there are…you know, than there is time in the day. How are you doing this?”

So how do…you know, what’s sort of the feedback loops that you’re using as you’re trying to figure out how to avoid pager fatigue, make sure that the alerts are, you know, strong signal to noise ratio?

Are there any strategies that you all have adopted?

You know, Bruce, I think you had some strong opinions on this.

Bruce: So I think, you know, the pager fatigue and, like, signal/noise ratio is actually, like…it self-incentivizes yourself to actually go do something about it, but I think there’s an education gap and I think there’s a tooling gap.

And so that’s why introducing something like chaos or failure in an earlier stage helps you…it gives your teams the tools they need to actually iterate on that signal to noise ratio.

If you don’t cause failure, trying to iterate on that signal to noise ratio, it’s like, “Great, I think I tuned the alert, and then let’s wait two months for the next outage to happen.”

And “Oh, nope.

Still got 10,000 alerts."

And so…

Moderator: [Inaudible] another way around, didn’t get any alerts.

Bruce: Right, right exactly.

And so actually being able to validate that is part of the…I think it’s part of the inherent problem.

And I think once you give developers an easy means to actually do the right thing and taking away a lot of that cognitive load, a lot of this will just happen.

You don’t need to, like, tell people, like, “Hey your signal to noise ratio is bad.”

Like, they know.

They’ve gotten paged, right?

But I think it’s understanding and having empathy for, like, “Why is this difficult?

Why is it hard to validate whether you have good alerting or not?"

Moderator: You know, we’re…I’m here representing Datadog, and, obviously, data is in the name.

So I’m, you know, big on metrics.

Of course, you know, yes, pain-driven…you know, sort of a pain-driven operations or development is one way to go about it.

And I think, you know, folks will definitely respond to the negative stimuli and try to improve the situation, you know.

But I often see it as our roles in the world of observability to help give people the tools, like you mentioned, Bruce.

And I’m curious if anybody has any stories, maybe on how you’ve built that feedback loop or ways that you’re…maybe things that you’re doing to help your customers within…your internal customers do a better job with their alerting and their monitoring?

Chris: Sure, so for Datadog, for example, we took…you know, Cory gave a great talk.

He’s from Stripe, and it was about, you know, creating a feedback loop, giving a form for every alert, actually saying, “Is this actionable?

Is it clear to me?

Do I know what to do with it?"

And you know, collecting that data up front and then being ruthless about having on-call hand-offs and going through the list.

You know, “This one, you know, 10 times it has alerted.

It’s got no action."

Well, get rid of it.

You’re not doing anything with it.

So, really, cut the noise.
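The review Chris describes can be sketched as a small tally over the feedback forms: count how often each alert fired and how often a responder marked it actionable, then list the ones that fired repeatedly with no action. The `Feedback` record and the `alerts_to_remove` function are hypothetical names for this illustration; the threshold of 10 comes from the example in the discussion, and a real form would likely capture more than one yes/no answer.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Feedback:
    """One submitted form for one alert firing (hypothetical shape)."""
    alert_name: str
    actionable: bool  # answer to "Is this actionable?"


def alerts_to_remove(responses: Iterable[Feedback], min_fires: int = 10) -> List[str]:
    """Return alerts that fired at least min_fires times and were never acted on."""
    fires: Dict[str, int] = defaultdict(int)
    actions: Dict[str, int] = defaultdict(int)
    for r in responses:
        fires[r.alert_name] += 1
        if r.actionable:
            actions[r.alert_name] += 1
    return sorted(
        name for name, count in fires.items()
        if count >= min_fires and actions[name] == 0
    )
```

Running this at each on-call hand-off gives the team a concrete list to go through rather than a general sense that the pager is noisy.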

Moderator: And I think you’ve built some tooling around that, just sort of POCs within the world of Datadog.

Chris: Right.

Moderator: Maybe afterwards, if folks are interested, you know, Chris is around, can probably tell you a little bit about how he built that and…

Anybody else have any thoughts or ideas?

I mean, obviously, there’s…

I know you’re with PagerDuty, Eric.

You’ve probably got some awesome tools to tell us about.

Eric: Yeah, it’s a…we page and people get tired.

No.

So there are…kind of two competing pressures on alert thresholds. One is our Failure Friday, our version of a game day, and those are where the thresholds kind of get adjusted up because that’s, again, a controlled environment where we’re inducing some sort of failure.

And one of the first checks is, “Okay, you know, sudo halt.”

And then five minutes later, everyone’s kinda staring at each other, “Did you get it?

I didn’t get it.

Did you get a page?

I didn’t get it."

And so that selection pressure kind of pushes the alert thresholds up.

Then we’ve got, you know, our reporting and our metrics, and all of the team…either a team lead or the engineering manager for each team will basically, on a weekly or bi-weekly basis, look at how many…you know, again, “How many did we get?

Did we get 10,000?

Okay that’s…"

They’re kind of proactive about, “I’m not even gonna wait for my team to complain about this.

I’m just gonna go start looking at these with the team every week."

And, “Does this feel like a lot?

Is this not enough?"

And that drives the alerting thresholds back down, usually.

Big-picture monitoring

Moderator: Cool.

So, you know, one of the things that…you know, I look around the room and at the folks that are in the audience and the folks that are on the panel here.

And you kinda get the feeling that monitoring is this engineering-centric or IT-centric domain, right?

At least in the way that we often talk about it.

It’s like, “Oh, I got an alert in the middle of the night about a web server being down or maybe about a cluster, you know, behaving slightly differently.”

You know, as we’re focusing on observability, though, the idea is to, you know, get a sense of the inputs and outputs of our organization as a whole, not just…not necessarily just individuals, you know, individual hosts or containers or APIs, but how is the organization as a whole in terms of health.

Have you had any successes or challenges in getting some of these metrics to be more relevant to the wider organization?

Is there anything that you’re, you know…anything that you’ve been able to do to kind of get this in front of…again, in front of product folks, in front of, you know, folks on the sales and business side and other parts of the organization.

You know, I think we talk about DevOps and this idea, “Oh, it’s just Dev and Ops.”

And in reality, it’s this idea of working across all of the organizational silos.

So love to hear any stories you all might have and kinda how we might do it differently or the same.

Josh: Yeah. I don’t know that I could come at it from a win perspective.

But we run, I think, right now 25 different monitoring systems in order to catch internal external BGP threshold around the world.

And so where I’ve built, like, feedback loops into our Datadog alerts to say, you know, “Was this useful,” that’s one out of 25 of our systems.

And not all of them are set up to page.

Some of them, you know, are just Slack notifications that people mute, I assume, looking at the volume of them.

So, you know, I think that there is a place there to get more analytics on multiple systems as a whole, and to figure out exactly, you know, what we’re looking at internally and externally.

Trying to figure out, “Is our service healthy?” is, you know, far different than, “Is this machine healthy?”

You know, you should be able to let machines fall apart, no worries, but if your latency is growing around the world, then you know you’ve got something else to look into.

Moderator: Have you been able to…have you had much success in terms of…

So as you’re defining these metrics that are important to page, you know, yourselves and your teams, have you found metrics that sort of fit into what the business side of the organization might be…you know, to make them relevant to them as we’re bringing them all into the same sort of idea of, “Let’s share all the data and make good decisions as a team”?

Josh: I think for…you know, for Fastly, in particular, looking at it from the customers’ perspective.

So we watch the error rates and origin hits and are we going to melt somebody’s origin server because our cache isn’t holding on to something.

And that’s very big on the business side, you know.

That’s what the customer’s gonna call them about, those specific stats.

And so we make sure that we track those first and foremost, and they are side-by-side with system metrics at this point, but trying to get them pushed together a bit.

Moderator: Would you call those the work metrics?

I think we all learned from Jason earlier today about work metrics and resource metrics.

Josh: Yeah, yeah those are the big ones.

Those are the…

You know, we put…our homepage has our requests per minute in real-time, all the time on there.

So that’s always fun, you know, when you go to look on there.

And we just recently hit five million RPS consistent, at peak.

And so we’re pushing a lot of data and handling a lot of requests.

And we can do that with allowing machines to fall apart and you know, switches to die and regular…just hardware things to happen.

So keeping those external metrics up is what you’re looking for because that’s what your customer is looking for and that’s what your salespeople are pushing when they get out there.

Bruce: So I was gonna add pager pain solves a plethora of sins.

At Twilio, we actually put our product managers on call.

And so our product managers get paged also.

So when that product is down, they’re getting paged.

It’s amazing what happens to your backlog of, like, reliability and instrumentation betterments when you put a product manager on call.

You know, the other thing I think that helps is as an infrastructure provider, as a SaaS provider, we all have status pages.

And I would hope that the product manager is the best person equipped to communicate about how the product fails or does not fail.

And so I think that actually helps improve our process and improve how we think about communicating externally, how we think about communicating internally, and also, like, what telemetry do we need to have confidence in that communication about any issues that we might be seeing.

Moderator: So are you able to loop them in, for example, on, like, things like as you’re deciding how to gracefully degrade, what does that…you know, how to make those calls or…

Bruce: Right, so I think there’s a measure of education about failing open, failing closed, what the differences are, so educating product managers about that.

Because I think those are product and business decisions about how you need to fail open or closed.

And that’s a call that a product manager or a business person actually needs to be educated and make.

If you let engineering make all those decisions, then it may or may not be what’s best for the business.
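To make the distinction concrete, here is a minimal sketch of the same dependency check under the two policies. The `check_with_policy` helper and the simulated `broken_dependency` are hypothetical and only illustrate the idea: when the dependency itself is unavailable, failing open lets the request through and failing closed rejects it.

```python
from typing import Callable


def check_with_policy(check: Callable[[], bool], fail_open: bool) -> bool:
    """Run a dependency check; if the dependency itself errors, apply the policy."""
    try:
        return check()
    except Exception:
        # Fail open: let the request through. Fail closed: reject it.
        return fail_open


def broken_dependency() -> bool:
    """Simulated outage of the dependency (for illustration only)."""
    raise TimeoutError("dependency unavailable")


allowed_when_open = check_with_policy(broken_dependency, fail_open=True)
allowed_when_closed = check_with_policy(broken_dependency, fail_open=False)
```

The code is trivial either way. Which value of `fail_open` is right for a given check is the product and business decision Bruce is talking about.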

And so there’s a level of education that’s necessary, but I also think from just understanding…

Like, we have many sayings at Twilio, one of the sayings we also have is, “Wear the shoes of your customers.”

And we just feel like this is one of the most important ways that a product manager can connect with the experience our customers are having, if they actually feel that pain.

And, you know, one of the things I always say at Twilio is, you know, “You might be pissed at 3 a.m. in the morning when you got paged, but guess what.

Our outage just set off everyone else’s pagers."

It’s like, “You know when there’s another cloud that goes down and we get paged about that?

That’s the same thing that happens."

And so building that empathy in your organization, all the way from the top, all the way through product, all the way through your engineering, I think that’s where pager pain really solves that.

Eric: Related, something we’re kind of discovering internally in PagerDuty is this goes to more the broader business.

Pager pain translates to things that aren’t traditionally considered pager pain.

So we have a site that’s just a, you know, kind of a static content site or whatever, and it’s not really something that the SRE team’s gonna dive into.

It’s not something that’s gonna page anybody on the infrastructure engineering teams at 3 a.m. because it’s just hosted third party, but the marketing team really cares about it.

And they really care about it and, like, both they want it to be up and available and they’re responsible for calling the provider when it’s not.

They also have Google AdWords spend and they wanna immediately turn that down because they’re now literally throwing money, you know, into the trash because they’re buying this traffic or whatever with their AdWords and then sending it to a site that’s down.

And we’re finding…they were a little unnerved by the idea of being on call, but then it was explained to them that they worked at PagerDuty.

They went on call and they…again, it’s some of the things where the metrics are important for the business and then you can connect it to the rest of this tool chain that’s, like, trying to get that attention and trying to get that right level of effort on it and it works pretty well.

Bruce: You know, I’d also add that I think it also starts from the leadership and starts at the top.

I’ve been in post-mortems where Jeff, our CEO, will come into the post-mortem and participate.

And so, you know, I think in all of my years at Netflix, I never saw the CEO go to a single post-mortem, right?

That doesn’t mean it’s not important, but I think it sets a very, very different tone, especially in a start-up setting when the CEO and founder is actually attending and even running the post-mortems.

Josh: We get that a lot.

Artur is in most of our meetings.

The one thing I’ve really loved about working at Fastly is that the…everybody there is very excited when it’s working right and very worried when it’s not.

And it’s across the entire organization.

So when something does go wrong, admin folks come in and say, “Do we need to order pizza?

We need to get some drinks."

Like, take care of people and you end up with a ton of people in your incident room, which is actually almost a problem in itself, that we have too many people in there chatting, too many business people saying, “Hey do I need to reach out to this customer and that customer?”

And you say, “Okay, that’s a side conversation for you guys to figure out.

Let’s figure out the actual problem in here."

But everybody’s very involved and it’s been…

You know, whereas we don’t have to put people on call, they’re generally…they have some metric that alerts them out of Salesforce or something weird that they pop in when you don’t expect them, and it’s generally very helpful to have somebody that’s, you know, tuned to customer communication, you know, hop in there at times.

Post-mortems

Moderator: So, I mean, we talked a little bit about post-mortems here.

There was…you know, Jason led an interesting session earlier about post-mortems.

And you mentioned…you know, Bruce, you mentioned your CEO popping in, you mentioned your CTO popping in, Josh.

How are you included?

Are you… do you find yourselves including product and other organizations within…parts of the organization within your post-mortems, you know, like, purposefully?

Are they on the invites for these things or are they popping in because they’re interested?

How are you increasing engagement with the wider organization?

Josh: Our last big post-mortem that we did, we made, like, a live Zoom meeting.

And so we had people in the room that were there to deal with the details and go over the incident that happened, but we left it open for anybody to drop in and start listening and watching with the request that they didn’t start chiming in all the time, but, you know, kind of just be a passive audience and they get an idea of how we go through an incident, how we address it, certain issues.

And they may come back later with their own ideas.

But we try to leave it open to pretty much anybody that will…that is interested in looking at that.

Moderator: Are you able to loop them in when you’re…you know, so you’re coming up with action items and, you know, as somebody who occasionally plays, you know, product manager and also occasionally plays SRE among other things, it’s sort of, it’s…there’s always a balance, right, of, like, “I want to make all the things stable so we’re gonna put all of the new features that we’ve promised to our customers on the backlog, and we’re gonna only work on stability.”

How are…I imagine there’s a little bit…not…and, of course, on the product side, you actually want stability as well.

Everybody wants stability.

But is…you know, are you able to have those conversations during the post-mortems as you’re assigning the priorities and the schedules or is that something that happens afterwards?

Josh: I think that would be an after conversation.

We try to keep the post-mortems as pointed as possible to solve the problem and to see what new monitoring we need, you know, to prevent the problem from happening again.

After that, teams go amongst themselves and then reorganize their priorities, but…

Eric: I was gonna say it’s a very similar process and the action items out of our post-mortems pretty much go straight into JIRA.

I have seen, you know…because the service owning team tends to own the post-mortem for that service and in PagerDuty, I have seen the service owning teams kind of negotiate, where instead of the action item being, “Do exactly the thing that would counter this particular incident,” they’ll know that they’re refactoring the service in two sprints anyways, so the item will kind of get changed in the middle.

I’ve seen that be both successful and not.

So but having…again, that’s the multiple people on that team kind of sitting there going, “What’s the right thing to do going forward?”

Josh: Right. We’ve been investing a lot in our incident command program.

And generally, the people who run incident command are not service owners.

So it gives checks and balances in a post-mortem because the incident commander runs the post-mortem and a service team then comes back with what they can do to make things better.

That gets around that negotiation a little bit.

Moderator: So, Bruce, you talked a little bit about putting other, you know, parts of the organization on call and paging them, as did you, Eric.

Does that mean we’re doing game days for product managers and sales folks and other folks in our organization or…

And what does that look like?

Bruce: Yeah, I think that’s something that we’ve been toying around with, like, upping our game on the level of game days.

Like, you know, losing an entire AWS region, what does that look like from an incident response perspective?

We happened to just practice that a couple weeks ago. But…

Moderator: I think everybody in the room had that experience.

Bruce: Um, but I think that’s, you know…I think, you know, don’t wait for the next one, right?

I think for our…

I think what’s unique…a little bit unique about us in our incident management is we actually have two separate roles.

We have an incident commander, who’s much…who’s very focused on the traditional incident command coordination, coordinate the attack.

But we actually have a separate role that’s all comms-focused.

It’s communications-focused, so it’s about understanding how the customer is impacted, how we need to communicate to our customers about this impact, and, you know, how do we actually be helpful in that case.

And that’s actually a really good role for a product manager.

Product managers should understand the kind of terminology of your said product and what is wrong and what is not working.

And so we found that that’s a pretty good role for product managers.

Moderator: I mean, other folks on… Chris?

Chris: Yeah. I mean, so we take a similar approach.

So, you know, have somebody focused on, you know, customer communication, assessing the impact, and then, you know, have a team of people focused on driving closure to the issue.

And following up on the earlier discussion, really get a solid post-mortem that you could actually sit in the room and invite everybody over and, you know, have a brown bag and present, you know.

This is a learning experience, learn from it and get feedback from the rest of the teams as you’re prioritizing the follow-ups.

Failure testing

Moderator: So I’m curious.

I think…does everybody on stage run some sort of a game day or, you know, test, you know, failure testing?

I mean, we know Bruce does.

He and James gave us a fairly long talk about that earlier.

But, I mean, the rest of you?

Chris: Sure. I mean, one of the pain points for us would always be, you know, database failovers.

If it’s painful, do it more.

Just keep doing it, doing it over and over, so…

Bruce: He just said break things more.

Chris: Yeah, I broke my mic.

Gameday for the mic.

Moderator: I don’t know if we have HA mics in here, but we’ll…[crosstalk]

You know, you’re…on the Fastly end, I mean, you’re…

You know, as Bruce was talking about earlier, he mentioned Netflix is 35% of the internet.

I imagine that Fastly’s, you know, somewhere in that range.

I won’t ask you to quote numbers, but, you know, that’s…QAing at that scale is probably tough.

So, I mean, how do…

Josh: We don’t do any actual game days.

I think the natural course of the internet provides us plenty of opportunities to troubleshoot.

There’s a number of different things we have to tackle on any given day.

I think we’re looking into how to make things better, you know, always long-term, especially as we move more towards Cloud services.

You know, we can do more in our control plane in that type of area.

And all the hardware on the edge stuff, we try to run those as hot as we possibly can.

So, you know, it’s hard to say, “Take one out, you know, right now for fun,” because chances are, it’s gonna come into some problem, you know.

Some S3 thing will go down or, you know, there’d be something out there that will affect us.

Moderator: So, you know, I’m just curious, I mean, we’ve heard… it sounds like we have 100% game day participation or implementation here on the stage to some extent or another, depending on your organization.

I’m just curious from the audience, how many of you all are running…

I mean, earlier, we learned that pretty much everybody’s running post-mortems.

How many of you are actually running, you know, game days and introducing this type of failure or chaos in production?

So I see Mia is raising her hand.

Cool. So I hope by the next time we’re doing Datadog Summit here, whether it’s, you know, on the east coast or here on the west coast, we’ll see way more hands.

I imagine if we came here about a year ago, we’d have heard the same thing about, you know, some of the other…whether it would have been post-mortems or some of the other lessons that we talked about today.

So I hope we’ll be able to see more of that as time moves on.

Service resiliency and reliability

Moderator: So, you know, I’ve heard the sort of cloud provider outages alluded to two or three times since we got started here today.

And I imagine that’s a whole different ball of wax when you are the cloud, both in that, you know, you’re depending on these, you know, services underneath you that you need to…you’re looking to provide, you know, several nines of SLA to your customers, even if sometimes the folks underneath you may not offer that.

There’s also…I mean, I’m looking around the stage and I can just see the interdependencies between the four Cloud providers that I’m looking at.

You know, Datadog probably detects an issue, which then Eric dutifully tries to page you on, but in turn calls up Bruce, who makes the call for him, and then you all come back to look at your Datadog dashboards, which are powered by Fastly.

So it’s…

And I’m sure, you know, if I go down the line here, it’s sort of the same way, just in different directions.

So has this impacted just the…how has this impacted the ways that you’re designing your services, you know, for failure, that you’re running your game days, you know, where does this…you know, you have this responsibility to wake me up at night when things break or the folks in this audience, you know, and to be available.

So how do we make sure that we do that reliably when we have these entanglements?

Eric: Yeah. At PagerDuty and pretty much for any of these services, it’s you kind of have to control your…you have to control your dependencies or they’ll control you.

You know, if you don’t have any mechanism to prevent or to mitigate issues, you know, you will only be the, you know, sum or whatever it is of the SLAs of the services you’re depending on.

So especially at PagerDuty, you know, we do multiple services, so we use Twilio and, you know, not Twilio for, you know…

We do these things so that we can make sure that we can always meet our commitment to our customers, and then circuit breaker patterns and things like that so that we can make that decision either automated or manually.
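A minimal sketch of the pattern Eric mentions, assuming a list of interchangeable delivery providers: each provider gets a circuit breaker that stops sending to it after repeated failures and allows a trial call after a cool-down, and delivery falls through to the next provider. The `CircuitBreaker` class, `send_with_fallback`, and the thresholds are illustrative assumptions, not PagerDuty's implementation.

```python
import time
from typing import Callable, List, Optional, Tuple


class CircuitBreaker:
    """Opens after max_failures consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only allow a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


Provider = Tuple[str, Callable[[str], None], CircuitBreaker]


def send_with_fallback(providers: List[Provider], message: str) -> str:
    """Try providers in order, skipping any whose breaker is open.

    Returns the name of the provider that delivered the message.
    """
    for name, send, breaker in providers:
        if not breaker.allow():
            continue
        try:
            send(message)
        except Exception:
            breaker.record(False)
            continue
        breaker.record(True)
        return name
    raise RuntimeError("all providers unavailable")
```

The breaker here makes the decision automatically. The manual version of the same decision is an operator removing a provider from the list.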

Bruce: And actually I can only be a PagerDuty customer because PagerDuty uses our competitors also as fall back.

If that wasn’t the case, then I couldn’t.

And so that’s actually one of the things that we do in…whenever we talk to any SaaS vendor, is a technical due diligence call.

And so that’s asking about which providers you’re in.

How do you handle this certain situation?

And so, whether you’re completely resilient to everything or not, right, it’s actually knowing where your failure zones are in your SaaS providers and then making a conscious decision as a purchaser, as a customer, “Okay, this is acceptable.”

Like, “Now I’m gonna look for…”

I might be looking for another vendor to supplement this that has a completely different failure zone than my primary, so…
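The arithmetic behind this can be sketched under one simplifying assumption, which is that failures are independent. A service that needs every one of its dependencies is at best as available as the product of their availabilities, while two vendors with genuinely separate failure zones are only both down when each is down at the same time. The function names and the 99.9% figures are illustrative, not numbers from the panel.

```python
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability when every dependency must be up (independent failures assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availabilities: Iterable[float]) -> float:
    """Availability when any one provider is enough (independent failures assumed)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Two hypothetical 99.9% dependencies that are both required: about 99.8%.
both_required = serial_availability([0.999, 0.999])
# Two hypothetical 99.9% vendors where either one will do: about 99.9999%.
either_one = redundant_availability([0.999, 0.999])
```

If the two vendors share a failure zone, such as the same cloud region, the independence assumption breaks and the redundant figure is far too optimistic. That is why the due diligence call asks which providers a vendor runs in.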

Chris: And it’s pretty important to incorporate those failure models in your game day and practice it and, you know, don’t just have a plan, actually test the plan.

Moderator: If I recall correctly, Datadog monitors Datadog, right?

Chris: This is true. There are some other systems at play. [crosstalk]

Moderator: I mean, this is something you have to think about, right?

So, cool. Josh, any sort of closing thoughts there?

Josh: I think for us it’s a…we’ve done special contracts with everybody that…you know, “Sorry, you guys are Fastly users, you cannot cache us.”

You know, that way we’re not relying on our own systems to let us know if something’s wrong, just becomes the snake eating its tail and you never get anything right.

So I think with, you know, most everybody up here, we have that special contract in place in most of our monitoring groups, too.

And it’s almost…it’s been one of those exciting things to see, but also very frustrating that every single SaaS vendor I’ve been talking to recently is also like, “We’re also a customer.

That’s gonna be great.”

And you go, “Well, all right, now we gotta work around that a little bit if we wanna be able to use it.”

Moderator: I mean, as I think about…I mean, I think about things like Fastly, you don’t always realize that you’re necessarily depending on you, right?

I mean, there’s…you can’t…if I recall correctly, you provide CDN service for things like, you know, for many of the open source repositories out there as well, so things like rubygems.org or, you know, some of the other services that we all just assume…like, “Yep, I will apt-get or gem install or, you know, pip install,” or whatever it might be.

And you realize that that’s actually [inaudible] yourselves.

You know, I’ll open it up.

I’d love to open it up to some, you know, Q&A from the audience as well, but, you know, with that in mind, I’m just curious, have you had…you know, what are the types of strategies you’ve had to take in terms of your deployments or your operations in order to…

You know, on the one end, we’re all in this…we’re all SaaS providers, we’re all Cloud providers.

We want to make sure that, you know, we’re eating our own dog food and, you know, doing what we preach, which is, you know, outsource some of these services that are…maybe not core to our business but that we think others can, you know, maybe do better based on their expertise.

At the same time, you have this sort of other challenge, which is you are the Cloud, how do you make sure that you can work in isolation?

So are there things that you’re doing there that maybe the rest of us take for granted?

Josh: You know, I’m not sure exactly there.

I think, you know, we run multiple providers, multiple data centers over multiple peering points and, you know…

So I think as long as the complete Internet is not fundamentally broken, you’ll be able to get to us somewhere.

We’re very, very careful about deploys.

We do a lot of Canary work.

We happen to have a lot of POPs that go into interesting areas, which allows for Canaries.

You know, we’re getting larger in Australia.

We just launched in South Africa and in Brazil.

You know, so you have some that you can kind of test small amounts of traffic with easy fallback points, you know, back to Miami or back to, you know, something that’s larger.

And so we do a lot of Canary releases, and we do very small releases as much as possible to make sure that we’re not, you know, putting out some gigantic thing that’s gonna fall over that we have to then tear apart and figure out what happened.

Try to, you know, just limit the number of changes in each release and increase the number of releases.

I think overall, it actually still takes us, I think two to three weeks to deploy a new cache software, new updates after that because we roll so slowly to make sure that everybody’s safe.
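The canary approach Josh describes (small releases, tested first on low-traffic POPs with an easy fallback) can be sketched as a staged rollout. This is only an illustration under assumed names: `staged_rollout` and the `deploy`, `healthy`, and `rollback` callables are hypothetical stand-ins for whatever tooling actually performs those steps.

```python
def staged_rollout(stages, deploy, healthy, rollback):
    """Deploy stage by stage, smallest first.

    `stages` is a list of lists of POP names. If any POP in a stage is
    unhealthy after deployment, everything deployed so far is rolled back
    in reverse order and the rollout stops.
    Returns (succeeded, pops_that_were_deployed).
    """
    done = []
    for stage in stages:
        for pop in stage:
            deploy(pop)
            done.append(pop)
        if not all(healthy(pop) for pop in stage):
            for pop in reversed(done):
                rollback(pop)
            return False, done
    return True, done
```

Ordering the stages from the smallest POPs to the largest keeps the amount of traffic exposed to a bad release small, at the cost of a slow rollout, which matches the two to three weeks mentioned above.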

Bruce: I was gonna say on that note, like, I think the thing that we all share as cloud providers is, I think we really understand the value proposition of reliability and that reliability is actually something we’re selling, and that’s not secondary to the product.

That is primary to our offering as a Cloud provider, and I definitely…I think that was one of the most refreshing things when I joined Twilio, to see how refreshing, like, everyone was…like, the priority around resilience and reliability was there and it doesn’t come, like, second to product.

Like, that is what we’re selling.

We’re selling this, we’re selling reliability. So…

Josh: And I think, as you brought up, you know, that we’ve all mentioned the big Cloud outage, the fresh wound from a couple weeks ago, I think the reason that is so impactful to us is, when was the last time before that that S3 went down?

You know, it just became, like…it’s like breathing, you know?

It’s been around so long.

Moderator: You just assume that it’s…it will be there like the sun coming up in the morning and the moon coming up at night.

Eric: So that’s one of those things where it’s absolutely…you’ve gotten too used to…if a service has 100% uptime, it doesn’t.

It just means it hasn’t failed lately.

That becomes one of those ideal victims for the next game day or the next automated failure injection.

Josh: It’s like the overdue volcano.

Bruce: If you bring their SLAs down proactively, you know, you make sure your software is resilient to that.
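One simple way to "bring their SLAs down proactively" in a game day is to wrap calls to a dependency so that a chosen fraction of them fail on purpose. The sketch below is a generic illustration, not any panelist's tooling; `with_failure_injection` and `InjectedFailure` are hypothetical names.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during a failure-injection test."""


def with_failure_injection(call, failure_rate, rng=random.random):
    """Wrap `call` so that roughly `failure_rate` of invocations raise InjectedFailure.

    `rng` must return a float in [0, 1); it is injectable so tests can be deterministic.
    """
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise InjectedFailure("injected failure")
        return call(*args, **kwargs)
    return wrapped
```

Wrapping the client for a dependency that "never fails" with a small `failure_rate` in a test environment shows whether the fallback paths around it still work.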