
Fixing out-of-hours on-call at Intercom


Published: July 12, 2018

Improving on-call

Hi, hello, so I’m Brian, and it’s amazing to be here at Dash today.

So I realize it’s a long day of presentations.

So I’ll try and keep this interesting.

And so this is a presentation about how we improved on-call at Intercom.

So chances are, wherever you're working, they probably need their on-call improved too.

And so hopefully this presentation can provide some inspiration and ideas.

And first of all, I'm gonna try a bit of creative engagement: is anybody on-call right now?

Cool.

So, I don't know, maybe we can help out on your page, that could actually be a better idea for a talk, like just, we all help out on some alert that just paged.

Anyway so I’m gonna introduce myself and where I work, and then I’ll talk about the problem with on-call that we solved and how and why we solved it.

So if you’ve read the agenda already, there might be some spoilers in there, and I also wrote a blog post about this, which some of you might have seen.

So that has got even more spoilers, but I’ve got some juicy new material to share today as well.

So it’s not all spoiled even if you’ve come across that stuff before.

My background in on-call

So who am I?

I work in Intercom’s Dublin office in Ireland.

And so we have three engineering offices in Dublin, London, and San Francisco.

Dublin’s probably the primary engineering location though we’re hiring in all three locations.

And so my background is very much in like fixing things.

So my first job was doing Solaris technical support many years ago.

Solaris is kind of like Linux except for banks during like the .com boom.

And so I really enjoyed helping people, learning Unix internals, networking, troubleshooting, fixing stuff.

And so skipping forward a few more years, I worked at Amazon.

Running like Amazon’s DNS and load-balancing infrastructure.

So not the AWS services, they were kind of built while I was there.

It’s the stuff predating it and the stuff that AWS has built on top of.

So even back then, Amazon was a reasonable-sized company.

But obviously, AWS exploded while I was there.

So I got to do a lot of really interesting work with people, like fixing things and building things.

So like on the retail websites, with Kindle, with lots of different folks from AWS.

And so I did a lot of on-call at Amazon, but I also participated as a call leader, which is a virtual team of like-minded individuals, tasked with resolving problems anywhere across Amazon.

And with Intercom, Intercom is a lot smaller, but the work I’m doing is quite similar.

So it’s building things that break or try not to break.

And I’m building great teams to run these things, which shouldn’t break most of the time.

And I help out where I can organizing work, and also, like, using the cloud that I used to kind of run, or something like that.

So it’s a lot easier to use the cloud on the outside than it is on the inside.

What is Intercom?

So what is Intercom?

Thanks to the internet, people communicate drastically differently than they did you know, 10, 20, even like 5 years ago.

And so these days people use messengers like WhatsApp for their day-to-day lives, and that switch is happening in businesses too.

So Intercom builds a suite of messaging products that allows internet businesses to accelerate growth through all parts of the customer lifecycle: customer acquisition, engagement, and support.

So a lot of the time, what that means is we have a little button at the bottom right-hand corner of your website.

And you can talk to the business you’re engaging with there.

And there's a lot of other features on the backend for the businesses who are on the other side.

And so technology-wise, Intercom in my time there has had a fast-growing engineering team.

And we're built largely with well-known core technologies: a Ruby on Rails application, a lot of MySQL, built almost exclusively on top of AWS.

So we try and use well-known existing technologies and tools.

So we try not to solve novel problems or hard technical problems ourselves, to keep us focused on our mission to make internet business personal.

So we try and run as little software as possible ourselves and rely on other SaaS businesses to run great services for us.

So this is the background information.

The bright side of on-call

So I’m gonna start by talking generally about on-call.

And so on-call is brilliant…sorry, I love on-call, it’s been like an important part of my career and growth.

And the on-call function itself in most businesses is like a perfect storm of technical challenges, ownership, deep responsibility for serving your customers’ requests, or fixing things for your customers.

And as well as like continuous improvement for your organization and people.

So being on-call like connects you umbilically, to your technical stack and to your customers.

And so if your customers can't use the software services they rely on to do their job, and you're on-call when it's broken, you're the one doing something to fix it.

So you’re directly doing something that your customers can’t do, so that’s great.

This is really satisfying stuff.

So being responsible for delivery of the value that your business gives to your customers, it’s pretty thrilling.

And it’s satisfying to fix things, so like you get that endorphin rush for having brought up the service or saved the database or whatever.

And so you’ve got a nice immediate payoff and you get to be a bit of a hero at times.

The challenges of on-call

And so on-call can be good. But at the exact same time, on-call is also absolutely awful.

It can make your life considerably worse than it was before you started doing on-call, especially if the organization you're working for is not paying attention to the quality of its operations and the quality of on-call.

So fundamentally, nobody likes getting woken up in the middle of the night.

So just the act of getting paged in the middle of the night can cause like physical and mental stress.

I can still vividly remember the ring tone of the phone I had at Amazon.

In the first few years they used to page me, and it was not in a good way.

And even now, despite having done, I don't know, hundreds of on-call shifts, being pretty comfortable with our technical stack and the Intercom stuff, and having years of experience of doing this, and like, I've got two young kids and I run almost every single day, so there's loads of reasons why I should be able to sleep, but I still sleep badly whenever I am on-call.

I’m just like wondering, did I miss a page or I don’t know, dreaming about servers going down or something.

So there are opportunities here with on-call, you know, even when it's bad and you get paged, there are lots of learning opportunities.

But like, when you get paged about something, there's also follow-up that needs to happen: you need to log a ticket, and then dig into it maybe the next day.

And this kind of follow-up and due diligence on all your alerts is work that you didn't have to do beforehand.

And you have to build enough slack into, you know, your week when you're on-call, or your day when you're on-call, to be able to follow up on things.

And depending on where things are or what you're doing, this could cause some organizational friction. You might miss some commitments you made for that week, or not ship something that some other team was depending on.

So this in itself can cause you personal grief in meeting your own targets.

And going on-call makes you very aware of not just every technical problem that’s happening in the company, but also like organizational problems.

So this can become rapidly overwhelming, because you pretty much get the viewpoint that everything is on fire all of the time.

And it becomes difficult to see how you can even start to fix things.

So examples of organizational problems are like if you have a team who are not paying enough attention to their services, or operations, or maybe they're not learning lessons from earlier incidents, or they don't review their alerts methodically.

Or maybe their services are just going through rapid growth, and the systems aren't scaling, and the implementation decisions that they made are no longer applicable.

Or they’re at the edge of what the system was engineered for, or even worse.

Maybe a team whose stuff you’re on-call for, they got reorganized, they don’t exist anymore.

Oh, now it’s like your job to kind of figure out this stuff, and this can become demoralizing very rapidly.

History of on-call at Intercom

So, history lesson: on-call at Intercom grew organically over time.

So Intercom has been a fast-growing startup for the whole of its life.

And so typically with a lot of stuff in fast-growing startups, things just kind of happen or things happen organically.

And we can figure out things later when we get a bit bigger.

So the history of on-call is that when things started off, as is kind of normal enough in small companies, the founder did on-call: Ciaran, our CTO, he did the on-call, he kept things running, it was good.

And as we grew, pretty early on we started building out an infrastructure operations team.

You know, we knew we were gonna be all in on the cloud from pretty early on in Intercom’s lifetime.

And so we started basically building a team around Ciaran, and scaling out that function.

So we had like a small infrastructure team, who were the default team for on-call.

And as we continued to grow, lots and lots of teams started to go on-call.

So there’s a culture of strong ownership in Intercom, which partly grew out of other people coming over from Amazon into Intercom as well, like myself, and this seemed to make sense at the time. We would grow a team, they would build a thing, and then they would go on-call for the thing.

And so this is good, putting developers on-call for the stuff that they build is pretty powerful, because it really does…it gets the right people looking at alerts.

It gets the right people looking at these things.

And you don’t have a lot of communication overhead with having to hand over stuff to new teams and that kind of thing.

So we had the infrastructure team still being like the catchall team for a bunch of shared services and databases.

And then we had all the sort of lowish-level product teams who were dealing with product concerns and features, and building out data sources and services to operate all that stuff.

But we just kind of drifted into this, there was no plan as such here.

Teams did it, it was good, kind of worked.

And like the kind of problems that the teams were dealing with was like, say our email delivery team, they need to know if we’re on a bunch of spam blacklists or if the data stores that they’ve built are still working.

The user data storage team needs to know if MongoDB is down.

Which, you know, it has been for a bit.

Operational problems with on-call

And so there were real operational problems with on-call here. And kind of like the proverbial frog in boiling water, well, unlike the proverbial frog, we did notice some problems.

So, there were too many people on-call. We probably had like six or seven people on-call over a weekend, with varying degrees of how serious their on-call shift was.

And so some teams were paged a lot, they’re well rehearsed in like being online, being ready to go.

Other teams were paged infrequently, but they still kind of had a pager, which would directly page an engineer.

But there were still too many people who, you know, on a Friday, would go off somewhere thinking in the back of their heads, "Oh, I'm gonna get paged," or like, "I shouldn't go swimming with the kids," or whatever.

And so the number of people on-call for Intercom seemed mismatched with the scale of our operations and the size of the business, in comparison to, say, when I was working at Amazon: the number of people who would be on-call for a service like S3 or something like that was of a similar order.

Inconsistent on-call across teams

And we also noticed that we had inconsistent operational quality between teams.

So because all these things kind of grew organically on an ad-hoc basis, a bunch of people knew what a good alarm should look like.

Other teams wouldn’t necessarily have the kind of operational background, or experience with things, or maybe they did and they just didn’t prioritize this stuff.

And so it was completely okay for teams to take their eye off the day-to-day operations and kind of go by the seat of their pants a bit.

We also noticed that there was a high tolerance for out-of-hours pages, and teams seemed to be okay, with like knowing if there was an alert spike or an error spike on their server or something.

Engineers would get the page and kind of acknowledge it, and they'd follow up with the team the next day and stuff like that.

But it wasn't treated as a big deal to get paged, when in reality it is a big deal and you have to have a zero-tolerance policy towards these kinds of things.

And on the quality side as well, we had inconsistent approaches to, say, documentation and alarming: some alarms had a runbook, most did not. And all of these kinds of problems led to a very different engineer experience between teams.

You could end up on, say, the email delivery team and suddenly you've got a lot of on-call.

And those pages were taken quite seriously, but the previous week you might have been working on, say, a frontend team which didn't have the same on-call responsibilities.

So this in effect becomes like a barrier to engineer and organizational fluidity.

And Intercom is a fast-growing company.

And so we want engineers to be able to switch teams without a lot of friction, so we can reorganize things depending on how things are changing, how the priorities of the business are changing.

And so this kind of inconsistency in itself created a bit of a barrier.

Because certain folks may not want to do on-call, due to completely normal life reasons or preferences about what they wanna do, and it didn't feel good keeping those people away from the kinds of teams where they could have some great impact.

So the variance of on-call experience and expectations across teams was a problem.

And so we wanted to get away from this mix of best-effort and kind of professional on-call teams.

And so in short, on-call was eating our engineers: it was reducing our productivity, and there were engineers who weren't able to be as impactful.

We couldn't move people around as easily. And there was a lot of fun in building stuff in kind of scrappy startup fashion, keeping things running by the seat of your pants is fun and all.

But Intercom has kind of transitioned out of that phase, and we're moving more towards a place where we want a sustainable, rewarding environment to be a part of.

Fixing on-call

So let’s try to fix these.

So these slides make this look super easy, but this was months of work.

So we built a virtual team of volunteers.

So, we asked our engineering organization for volunteers and maybe prompted a few people to come forward.

And we started thinking about the problem at hand, and so we started from first principles.

And like, why even have on-call? Do we need on-call? Like, unfortunately, we couldn’t get away from having on-call.

And so once we accepted that, we started figuring out what values we were going to apply, and what kind of goals and things we could measure that would show whether the things we did were successful or not.

New standards and processes for alarms

And so we started to build a bit of a plan together and we wrote standards and processes effectively to describe what an alarm is, how an alarm can get moved from a team into this new virtual team, and what these alarms should look like.

So you know, we gave guidance around not alerting on low-level symptoms, or maybe things that have alerted in the past but aren't representative of the real, true customer experience.

And it wasn't just that the customer experience had to be degraded for us to alert on something.

It also had to be actionable.

So for example, a five-minute spike of errors on our most important API, that's important, it's customer impacting.

And we do want to know about it, but we don't need to get somebody out of bed just because a database failed over correctly, gave a bunch of spikes and errors, and then repaired itself.

We don’t need to know about those things, we don’t need to get people out of bed.

So the questions we asked were pretty tough, and they were: is this actionable, and is this meaningful?

And we also built out like a Terraform pipeline to control our alarms.

So we put our alarms in configuration with Terraform.

And this meant that we had programmatic control, like we were able to use the git flow for reviews on alarms.

And so we had change control over what alarms changed, and when, and why.

It made things very quickly auditable as well, so we weren't clicking around the Datadog UI to figure out what the status of things was, or when things changed.

We had it all nicely in a GitHub repository.

So that made things easier to review, and made the process of handing things over to the new team dependent on approvals and that kind of thing.

And it's just generally a good thing to socialize the decision making behind why certain alarms are at different thresholds and so on.

And so this was one of the technology things we did, and it wasn't strictly necessary for the rollout, but it's a nice thing we did at the same time.
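To make that concrete, here is a minimal sketch of what one such Terraform-managed monitor could look like, using the Datadog provider's `datadog_monitor` resource. The metric names, thresholds, runbook URL, and notification handle are all invented for illustration, and exact field names can vary between provider versions; this is not Intercom's actual configuration.

```hcl
# Sketch only: metric names, thresholds, runbook URL, and the PagerDuty handle
# are hypothetical, not Intercom's real configuration.
resource "datadog_monitor" "primary_api_errors" {
  name = "Sustained error spike on the primary API"
  type = "metric alert"

  # Alert on a sustained, customer-facing symptom rather than a low-level one:
  # errors on the most important API over the last 10 minutes.
  query = "sum(last_10m):sum:api.requests.errors{service:primary-api}.as_count() > 500"

  monitor_thresholds {
    warning  = 300
    critical = 500
  }

  # Every page carries a runbook link so a first responder who doesn't know the
  # service deeply can still make progress.
  message = <<-EOT
    Sustained error spike on the primary API; this is customer impacting.
    Runbook: https://example.internal/runbooks/primary-api-errors
    @pagerduty-out-of-hours
  EOT

  tags = ["team:virtual-oncall", "service:primary-api"]
}
```

Because the monitor lives in a GitHub repository, changing a threshold becomes a pull request, which is where the review, approval, and audit trail described above come from.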

And so we took over the alarms, and again this was a lot easier to write on a slide than it was to do.

Organizational changes

So we had to like set teams and deadlines, and you know, we worked with our leadership team to make it an organizational priority.

So we got great buy-in from engineering leadership on why we were doing it and what the outcome would be.

So they gave us unwavering support in this, which helped a lot.

And teams were also pretty enthusiastic to get things handed over, and that was a lot of work.

Because we had new standards about runbooks, about documentation, about whether these alarms should exist at all.

And there was some re-engineering that had to happen to get these services into a state where a new engineer who has no deep knowledge of the service itself can go on-call for them.

So we forced teams to actually write documentation and runbooks that they did kind of want to write, but hadn't necessarily got around to at the time.

And I have to kind of say this: we did this mostly with, like, all Dublin-based engineers.

So we were all able to kind of collaborate and to do this kind of quickly and locally.

Virtual out-of-hours on-call team

So the out-of-hours on-call team: what are its attributes, what are the things that make it better?

So these are the juicy details.

So outside of office hours, you know, outside of Dublin office hours, the virtual on-call team gets every single page.

And during office hours, the teams themselves get the pages for the stuff that they break.

This actually means that they get most of the pages.

So when developers are pushing code, when we’re making changes for infrastructure, this stuff generally happens in office hours.

So this way we do actually get the people who operate the services, build the services, experiencing most of the pain of ruining their stuff.

And so we don’t detach people away from the services that they’re building.

Also, we encourage the product teams, when they're on-call for their services during office hours, to follow the runbooks that they've written for the external on-call team.

And it's also pretty convenient that when Dublin is awake tends to be when we're going through our biggest traffic ramp every single day.

We're largely a business-facing business rather than a consumer one, and so that generally means that when Europe starts waking up in the morning, and then the East Coast of America starts waking up, and then the West Coast at the end of the day, Dublin's there for the entire time.

So like if something is gonna break, it usually breaks during office hours.

And so as a member of the virtual team, you go on-call for a week.

This week runs from Friday evening to Friday morning.

And you can like talk a lot about exactly when this kind of schedule should finish and end.

And you could debate schedules for hours.

And this works okay.

I mean, there’s probably other schedules that could work as well.

But one week was good enough to get enough context for stuff that’s going on.

And it’s also not so long that you’re gonna be taken out of action from your team for ages.

And you've got the weekend off before the week following your on-call shift, to kind of relax and get back into a normal state of mind in case you had a particularly bad shift.

So this works for us, and there’s lots of ways to do this. And so the team consists of about six or seven engineers.

And so we consider membership of this virtual team to be like a tour: you're basically committing to doing about three or four of these shifts while you're on the virtual team for six months. And then we want people to move off the team.

So you join the team, it lasts for about six months, and at the end you leave the team.
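As a rough sketch of how a weekly Friday-to-Friday rotation like this could be encoded, assuming the PagerDuty Terraform provider is in use (PagerDuty comes up later in the Q&A, but the talk doesn't say how the schedule is actually defined), something like the following would do it. The user references, start timestamps, and resource names are placeholders.

```hcl
# Sketch only: user resources, timestamps, and names are placeholders. Routing
# office-hours pages to the owning teams instead of this schedule is a separate
# concern and is omitted here.
resource "pagerduty_schedule" "out_of_hours_virtual_team" {
  name      = "Out-of-hours virtual on-call"
  time_zone = "Europe/Dublin"

  layer {
    name = "Weekly rotation"

    # Hand over on Friday evenings; one engineer per week.
    start                        = "2018-07-06T18:00:00+01:00"
    rotation_virtual_start       = "2018-07-06T18:00:00+01:00"
    rotation_turn_length_seconds = 604800 # 7 days

    # Roughly six or seven engineers rotate through, each doing three or four
    # shifts over a six-month tour before rolling off the team.
    users = [
      pagerduty_user.engineer_a.id,
      pagerduty_user.engineer_b.id,
      pagerduty_user.engineer_c.id,
    ]
  }
}
```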

And so obviously, not everything goes well, and the way we think about the duty of the on-call engineer is that you're a first responder.

You know, you're not a surgeon, you're not somebody who necessarily has a lot of experience with the systems you're dealing with.

Your job is to stop the bleeding, like perform CPR or…I’m not sure how far to go with this metaphor.

But like basically deal with the issue that can be resolved with first aid and general knowledge.

Escalation decisions

And to take the stress away, we don't want the on-call engineer to have to decide how to escalate or where to escalate to.

You know, figure out who to engage.

So we have a secondary rotation of an on-call engineering leader, and that person's job is to take the stress away from the first person.

So first the on-call engineer gets paged, and if they're not clear what the next steps are, or they're not making progress, or even just for any reason, they've got a single path of escalation.

There’s no ambiguity here, it’s engage the on-call engineering leader.

So this level two escalation takes away the stress of what to do next.

It adds in possibly a little bit of delay or latency, but we were hoping that this would work well, and it has so far.

So at that point the on-call engineering leader then figures out, "Okay, is this worth escalating for? Like, should we actually bring in the team who are responsible for the service or whatever?"

And then from that point we escalate further over to…typically through the manager of that team.

And so the manager of the team will have context about, say, maybe something that deployed that day, or maybe there's a specific person that we should call on a best-effort basis to bring in.

So this is like best effort from here, we’re kind of relying on judgment, we’re relying on people being available.

And there’s no obligation to take these calls or to be on-call, but typically we’ve generally managed to get enough people around to kind of fix the problem.

So from the product team engineering manager, it goes to the engineers, kind of a bat signal: we page everybody and just say, "Hey, if you're awake it would be cool if you could join this call."

Or whatever and so..

This escalation thing kind of hasn’t been used as much as we thought it would be, but it’s the kind of system we put in place.
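One way this chain could be wired up, again assuming the PagerDuty Terraform provider, is sketched below. In the talk the engineer engages the leader deliberately rather than waiting for a timeout, so the 30-minute delay and the engineering-leader schedule referenced here are illustrative assumptions; everything beyond the leader (the team's manager, then the bat signal to engineers) stays best effort and lives outside the tool.

```hcl
# Sketch only: the delay, num_loops, and the engineering-leader schedule are
# assumptions, not Intercom's actual setup. Escalation beyond level two is
# best effort and handled by people, not by this policy.
resource "pagerduty_escalation_policy" "out_of_hours" {
  name      = "Out-of-hours on-call"
  num_loops = 2

  # Level 1: the first responder from the virtual team's weekly rotation.
  rule {
    escalation_delay_in_minutes = 30
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.out_of_hours_virtual_team.id
    }
  }

  # Level 2: the on-call engineering leader, the single unambiguous next step.
  rule {
    escalation_delay_in_minutes = 30
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.engineering_leader.id
    }
  }
}
```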

And so we remunerate each shift, so there's a set fee you get for your week of work.

And we kind of…we don't force people to rest, but we kind of expect people to rest, especially if they were paged in the middle of the night, or their time or travel was disrupted.

We do want people to be rested for the rest of the week or during their time.

Unexpected observations

So there’s been some unexpected things that have come out of this.

So all of this kind of started working, and we paid a lot of attention to it.

And then there were some things that just didn't go as well, or went differently to the way we planned.

And so, the formal escalation chain I’ve just mentioned actually was quite uncommon.

What has happened in practice has been more informal escalations.

And so a page goes off, you know, it’s visible in Slack…a lot of these pages happen during, say, San Francisco business hours, and some of the engineers in San Francisco might just start to chip in and help out.

And equally, just because of the visibility on Slack, and because people are awake or whatever, they just kind of hop in and preempt a lot of the formal escalations through the L2 process.

And in effect, that kind of weakens the escalation process, because escalation processes are only powerful if they're used frequently.

But what this did mean is that we're building a small community of people who are engaged and helping out in these kinds of events.

And so it’s kind of unexpected but nice to see this kind of happen.

And we've built up a lot more operational knowledge, especially in our San Francisco office, than we expected. We didn't anticipate this going into it.

And teams took strong ownership of the issues. So we mandatorily open an issue at the highest severity for every single page. And we thought we might need a carrot-and-stick approach of handing alarms back to teams if they failed to take action, or if they refused to take action or whatever.

In reality, this hasn’t happened.

We open the highest-priority issue and then the teams have taken really strong ownership of addressing the root cause.

And we might chip in with advice, or reasons, or guidance, whatever.

But we just haven’t had to deploy the stick, the carrot of having an on-call team do your on-call for you is enough.

So that was like really positive and went better than we kind of planned as well.

And also there’s like some sort of social stigma about paging somebody outside of your own team.

And when it's your own team, I don't know, it's like you treat your family members differently from your friends or something. But it seems to be tolerable to page your own team for something, or to have these pages go to your own team.

And when it's someone who's remote from you, you don't really…you feel more guilty or something.

It's kind of weird, people treat people outside their team better than they do people inside the team.

So it’s maybe, it’s kind of expected.

So we had a lot of good things, which is nice: the system worked, and the number of pages dropped month over month.

So we've consistently had a few months of fewer than 10 pages.

Some of that’s related to just work that’s going on and some improvements we were making.

But the main point is we removed a lot of junk alarms, and we improved the quality of the alarms that are coming through the system.

So, just the sheer act of taking alarms, reviewing them, having people in this chain and then paying attention to the stuff that is alarming.

That in itself just reduced the amount of alarms massively.

Just the amount of alarms that existed, never mind the ones that were causing pages.

So we had good metrics, a good outcome, consistently lower pages, which is a good thing.

And we also saw the low-value paging alarms destroyed, so a lot fewer low-level symptoms getting paged on, a lot less stuff like swap on our memcaches, that kind of thing.

And we just started paging more on the high-level customer facing symptoms.

And also the rotation thing works, so we kind of worried a bit that we would form this team of a bunch of volunteers and then no one else would volunteer, and thankfully people did. But we had to take care to kind of rotate people in and out so it wasn’t a big bang rotation.

Rather than a whole B team coming in or something, we were bringing people in and phasing them out kind of constantly.

And so that actually worked reasonably well and we’ve had a lot of people interested in joining the team.

And as I mentioned earlier, the SF engineers were getting involved in a bunch of the day-to-day stuff and advancing in that.

And so we were able to establish the team, work closely with our SF staff, and get the virtual team spreading across both offices.

And that was just great for a bunch of cross-organizational work or cross-office interactions.

You know, you've got this specific virtual team, and a bunch of work that you're doing with each other.

This just builds great social connections, and builds more capacity to just kind of work in the SF office, which is really good.

There are more good things: one of the things we do is send out monthly comms to the engineering team about how things are going, what got paged, and what lessons were learned.

You know, a bunch of the stuff that's in these slides.

And we get good feedback about this stuff.

So it's nice being able to talk about this work, it's nice being able to explain to the engineering team how on-call is working out for the entire organization, and to get good positive feedback about things.

Also, we just see all these small kinds of social approval of this team and the work involved in it.

So membership of the team is referred to in performance reviews and anniversaries, and you just see it dotted around the place, so people are conscious of rewarding it, of knowing that this thing is valuable, that this function is valuable.

And when people participate in the team, it’s kind of socially backed up through these kind of public demonstrations of our gratitude for people being involved in this team.

And so that’s good to see and really positive.

And so I was able to write this into the job spec for a systems engineering role.

And it was really satisfying that I was able to say just because you’re in a systems engineering role on our team doesn’t mean that you’re doing mandatory on-call.

So, you know, at the moment I think there's one person on our systems engineering team who's part of the virtual team, and it's just nice to be able to have roles and advertising for systems engineering, a discipline which is generally aligned with being on-call.

And almost always assumed to include some on-call.

Because this opens up the role to more diverse candidates, or to people who might be interested in having on-call be less a part of their work.

So, sorry about the…we’re hiring page but this was my short story.

And I wrote a blog post about this, as I mentioned previously, so it's easily Google-able, or you can find it if you go to the blog at intercom.com.

Takeaways and recap

So, some takeaways, so I’m gonna recap some points.

Any on-call setup anywhere, I think, is open to being challenged.

On-call is kind of one of those things where you can set up a bunch of alarms and just kind of accrue a bunch of technical debt.

And the same goes for the way you go about it: who is on-call, what they're doing, what they're on-call for and everything. I think it's one of those things that can be disrupted within your own team.

So I think be radical and just kind of question the nature of your on-call setup.

As there are some really positive takeaways, some really positive improvements, you can get.

So by all means, don't just drop on-call and hurt your customers and SLAs.

I think to have confidence to do this kind of work, you need to understand who your customers are, what their pain points are, what the SLAs are for the different things you’re doing.

But the main point is, once you have that in mind and you're aware of that stuff, optimize on-call for your people.

So applying the concept of human ops: you know, your computers, they're not great, but the humans are the things that you'll build your company around, so optimize for them.

And so it’s not that we’re trying to hurt your customers and SLAs, but the humans are things that you can optimize for.

And this whole setup I described here, it’s the application of continuous improvement.

And so we have a mechanism, which is this virtual team, and we have processes that review alarms and take things through a process.

And all of this buys focus and time to build more product and to mature things.

And ultimately it's like a lever for continuous improvement for your organization.

So do not let on-call eat your engineers. Thank you.

Q&A

Any questions?

Audience member 1: You said you had two people on-call at a time. So how do you like, does that mean that one of them is primary and the other one is secondary?

Brian: Sure. So the question is, do we have two people on-call, and then what's the setup?

So the level two on-call we have is strictly a kind of management on-call.

So it's escalation only, there's no technical expectation of that second person, it's purely escalations.

And we need to figure out next steps, to take the pressure off the organizational aspects of, like, okay, will we engage further and stuff like that.

So yes, it's not like a classic primary/secondary technical on-call, it's a single primary plus an escalation on-call.

Audience member 1: Thanks.

Audience member 2: So you said your secondary on-call is like semi best effort. I’m curious what you meant by that, because in my experience best effort means you won’t get them?

Brian: So we haven’t had too many bad experiences.

So it’s best effort in that there’s no one whose job it is to be like obligated to be on-call.

So we have a high expectation or high standards of what we expect of the person who is on-call.

They need to, you know, make sure that their shifts are being covered if they need to do an errand, or if they're gonna go to the shops, they should have their laptops with them.

That kind of thing…just that they should be available and on-call within a tight enough time.

And for the rest of the folks, we kind of rely on their good judgment around being available, and we just don't force teams to do their own mandatory on-call for an escalation.

And so to a certain extent we're kind of weighing this. We've had like one or two incidents where it would have been nicer to have some other people around.

To kind of help out with things, but we haven't been hurt too badly by it.

So again, I guess it depends on like the type, the nature of the problems that are coming in.

And the kind of appetite for risk in the organization. And also, we're kind of fortunate, because we move people and teams around reasonably frequently.

So there tend to be people online, or enough people around who have some familiarity but aren't on the specific team. So, is that good?

Audience member 3: What kind of resources do you provide for the people who are on-call, so they know what to do when they get a call?

Like do they get contacts, is there like a knowledge base, or how does that work?

Brian: Every single page that comes in through PagerDuty has a runbook, or a link to a runbook, in it.

And there's supposed to be enough there to make progress. So the guidance we give to the teams is that runbooks should be written to be usable by somebody who is familiar-ish with the system, you know, somebody who works for Intercom, but not necessarily familiar with the workings of it.

And so the steps have to be pretty clear, and you need to be able to do simple enough actions before you reach the point of having to escalate.

So really we rely on that.

We rely on steps that aren't quite fully automatable, but a number of steps that allow the engineer who just got paged to get their bearings and understand what's going on with the system.

And then it gives them options, or it gives them choices of like what to go and do after that.

And so that’s about it.

Audience member 4: Why that alarm exists or why they have that alert?

Brian: Yes, having that runbook in the alarm is one of the standards I talked about for handing an alarm over.

You know, we're trying to regularly review the runbooks as well and get feedback from the engineers who are actually dealing with the runbooks when the pages do come in. So in addition to the high-priority issues that we'd open for the teams after an event, we might open some lower-priority ones like, "Hey, this runbook refers to our old wiki system," or whatever.

So we try and continue to improve those things as well.

Anyone else?

Audience member 5: I had a couple questions.

And the first one is what do you do if nobody is opting in or volunteering to do on-call?

Brian: Well, we haven’t had that problem.

And we did consider this, like, what to do if we've done all this work, we've created this team, and we don't get interest in it.

I mean, I guess the whole thing falls apart.

I wouldn’t be giving this talk.

I guess we do reward the work, and we rely on the social pressure, or social rewards, for doing this kind of work.

So like, when it's meaningful in, say, performance reviews and meaningful in promotions and stuff like that, we wanna provide the right kind of incentives and motivation for people to get involved.

But we’re fortunate, we do have like a big ownership culture in the company.

And so we do kind of hire for that sort of thing as well.

So we've been able to tap into a pool of fairly interested engineers.

Including engineers with systems or operational backgrounds who just want to get involved and do more stuff.

So, I guess if this starts falling apart, I don't know, I guess we could hire a whole team for it, but that seems against the spirit of this whole exercise so…

Wrap it up?

Thank you all.