Seattle Summit 2019 Closing Panel

Published: April 16, 2019

00:00:00

Alex: Yeah.

Jason: Cool.

Thanks for speaking and joining us.

So, I thought it was really interesting.

You know, Adam, you mentioned rolling out code and having it break everything.

Having been a developer myself before joining the evangelism team at Datadog, I did the same.

I’m curious if you’re also on call for what you deploy?

On-call procedures at Carta and Rover

Adam: So, yeah.

As of very recently, Carta has just rolled out an on-call schedule for engineers.

Luckily, you don’t have to be on call that often, but we usually have a primary on-call and a secondary on-call, who’s more of like a domain expert and that rotates.

You might only have to do it once or twice during the year or something like that.

Jason: Oh wow.

Adam: Yeah.

Jason: That’s pretty lenient, not on call so much.

Alex, how about you?

Alex: We have a very similar setup.

We have a primary and a secondary on-call.

I’ve found that it can sometimes be hard to scale that with the team as the organization grows.

You know, being on-call less often is good for, you know, developer happiness, but also being on call, you learn a lot about your service.

So, there’s some sort of trade-off to be made there, but generally, the engineers at Rover are on call for the service.

Jason: Nice.

And is that like a voluntary thing or is it just congrats, you’re a developer.

We put you in the schedule.

That’s it.

Adam: It’s more of the second.

As we’re rolling this out, we’re only rolling it out to senior engineers and above.

So hopefully, we can eventually let it scale up to the whole organization.

But yeah, it’s required but not for everybody right now.

Alex: Yeah, we expect a certain level of tenure before people are thrown into the on-call, you know, being at Rover for a number of months.

But we’re also, you know, always rethinking how best to do that as the team scales.

Creating fair on-call schedules for your teams

Jason: Yeah.

So, I’m curious, you know, when you talk about how best to do that, what does that schedule sort of look like?

I’m assuming it accounts for when you’re on vacation, the schedules adjust.

But tell me about how you do the scheduling for things like that.

Adam: Well, we’re still trying to figure it out.

But from my understanding, it’s gonna be automated, I think, probably through VictorOps, and it’s just gonna pick people.

And then you can override it if you know you’re gonna be out of town or something like that.

Alex: Yeah, we use PagerDuty which has a similar setup.

So, there’s this rotation generated, and it’s pretty easy for engineers to communicate with each other and add overrides if necessary, right? So, you know, it layers all the overrides together to tell you who is actually on call for a specific time.

And generally speaking, people are very approachable about trading on-calls.

You know, if you have something going on this weekend when you’re supposed to be on-call, you know, you need to go somewhere that weekend, you can talk to whoever is on-call.

You can swap with them.

That sort of thing.
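
The override layering Alex describes can be sketched roughly as follows. This is a minimal Python illustration, not PagerDuty's implementation: the weekly round-robin, the `Override` type, the function names, and the rule that the last matching override wins are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Override:
    start: datetime   # inclusive
    end: datetime     # exclusive
    engineer: str


def base_on_call(rotation, rotation_start, when, shift=timedelta(weeks=1)):
    """Pick the engineer from a fixed-length round-robin rotation."""
    index = ((when - rotation_start) // shift) % len(rotation)
    return rotation[index]


def who_is_on_call(rotation, rotation_start, overrides, when):
    """Layer overrides on top of the base rotation; the last matching override wins."""
    engineer = base_on_call(rotation, rotation_start, when)
    for override in overrides:
        if override.start <= when < override.end:
            engineer = override.engineer
    return engineer


# Hypothetical team: bob has the week of April 8, and carol covers his weekend.
rotation = ["alice", "bob", "carol"]
rotation_start = datetime(2019, 4, 1)
overrides = [Override(datetime(2019, 4, 13), datetime(2019, 4, 15), "carol")]

print(who_is_on_call(rotation, rotation_start, overrides, datetime(2019, 4, 10)))      # bob
print(who_is_on_call(rotation, rotation_start, overrides, datetime(2019, 4, 13, 12)))  # carol
```

Keeping the base rotation untouched and resolving overrides at lookup time means a swap never has to rewrite the generated schedule.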

Compensation for on-call duty

Jason: Nice.

So, it’s interesting that you mentioned trading because I know a lot of organizations when you’re on-call, because you’re spending that time all night, because it’s definitely more stressful, you’d end up trading that with a day off, like, you get a day off to follow that.

How do you compensate for that?

Do you do something like that?

Adam: Yeah, we’re talking about actually getting engineers a bonus, like a cash bonus or something like that when they do have to do it.

Alex: We haven’t explored anything like that.

That sounds fun though.

I would…

That’s a good idea to take back to the team.

Yeah, we generally have the on-call rotation as a week usually, and you can trade off, you know, specific days.

Yeah, I’ll say, you know, I was at Amazon for a while and the on-call rotation is a lot less stressful at Rover than it was at Amazon.

There’s very little, you know…

You’re expected to be on-call a lot.

That’s how I’d phrase it.

Jason: Cool.

Also, I wanna open it up to any questions out there if anyone has anything.

Man 1: Anybody.

The incident response process

Jason: I know Ilan’s standing by.

Cool.

So, when you’re on call, obviously, incidents happen.

Tell me a little bit about your incident process.

Datadog sends an alert through VictorOps, through PagerDuty, you get that.

What does that generally look like?

Adam: So first, it goes to our SRE team and they get woken up.

And then most of the time, that’s where it ends.

They figure it out.

They solve it.

And then it’s only in the rare case where it’s like a software engineering issue, and then the primary on-call gets woken up.

And then if for some reason, that person can’t figure out how to do it, then you contact whoever is the domain expert for that particular case.

And whoever is on call is the domain expert at that time.

And if you can’t reach anybody, then it goes to your manager and then, maybe, the director, or something like that.
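
The escalation order Adam walks through could be modeled along these lines. This is only a sketch under assumptions: the tier names, the `escalate` function, and the `page` callback (which returns `True` when a tier takes ownership of the incident) are hypothetical and are not VictorOps or Carta code.

```python
from typing import Callable, Optional, Sequence

# Hypothetical tier names, in the order described above.
ESCALATION_CHAIN = ("sre", "primary_on_call", "domain_expert", "manager", "director")


def escalate(page: Callable[[str], bool],
             chain: Sequence[str] = ESCALATION_CHAIN) -> Optional[str]:
    """Page each tier in order; return the first tier that takes ownership."""
    for tier in chain:
        if page(tier):
            return tier
    return None  # nobody could be reached


def make_pager(can_handle):
    """Build a fake paging callback for the example: only some tiers can resolve it."""
    def page(tier):
        print(f"paging {tier}")
        return tier in can_handle
    return page


# An incident that needs the domain expert passes through the first two tiers.
owner = escalate(make_pager({"domain_expert"}))
print(owner)
```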

Alex: So, we use PagerDuty.

We integrate with Datadog monitors and most of our pages are coming from monitoring.

That’s the ideal scenario, right, that we’re finding things out through our automated monitoring.

So, the page goes to Slack and PagerDuty, so the on-call gets paged.

We only have one tier of on-call right now.

So, the engineer wakes up or hopefully, it’s during the day, they’re already awake.

They’re working on something.

They will look at the page, acknowledge it, and then they’re responsible mostly for coordinating the response.

So, if it’s something that they’re familiar with, they can work on it.

But generally speaking, we want our on-call to be communicating with CX, communicating with the tech team coordinating a lot of moving parts and keeping that coordination in like a single Slack channel, those sorts of things.

And then having people who aren’t responsible for acking a page or communicating with customer service be the ones who are actually working on fixing the issue.

Jason: Yes.

So, that sounds very familiar. PagerDuty has sort of published a whole bunch of resources around this, but it sounds like you’ve adopted the incident commander, incident scribe, incident responder kind of thing.

Alex: We’re going in that direction.

I would say it’s less formalized right now, but the direction we’re going in to help us scale the on-call with the team is to make that actually more formalized.

And generally, our philosophical approach is to have, like, an incident manager who has a very well-defined but limited set of responsibilities during an incident.

What does the on-call workflow look like?

Jason: Cool.

I’m curious for people out there, how many of you are in an on-call rotation?

I think, that’s…

Well, that’s like almost everybody.

How many of you have that separation between, like you’re on call but you’re the commander and you, like, pass things off to other people that respond to?

Okay, just a few people.

Everybody else, it’s pretty much you’re on call.

You’re responsible.

You try to fix everything.

I see some heads nodding.

Cool.

So, along these lines because I think we’ve all been there, right, you get woken up in the night and because you’re on call and you’re responsible for fixing that thing and you’re maybe not the expert, I’m curious, how have you helped resolve that?

Like, do you get a runbook, a playbook essentially of things to look at?

What does that look like for your teams?

Adam: Yeah.

So ideally, you’re documenting everything, right?

So if something happens once and you need to call in an expert, you shouldn’t have to do it again because you’re writing it all down and you have a runbook for the next time it happens.

And so, we built up a number of these runbooks.

But as we roll out the on-call schedule for engineers and it goes beyond just SREs, then it’s gonna be even more important because nobody really wants to get woken up.

So now, there’s a real incentive to do it.

Jason: How do you, like, manage your repository as that knowledge builds, like…?

Adam: That’s a good question.

It should probably go in GitHub.

I don’t think it is right now.

Jason: Okay.

Adam: Yeah.

Alex: I can tell you where we are and where I’d really like us to be.

Where we are is we have, you know, a set of monitors that page and we have the ability to page specific teams even though they aren’t explicitly on call, which I think is a state that’s not sustainable long term because if someone is not officially on-call, then you can’t expect them to wake up when you page them.

So, we have some defined ownership of, you know, what areas of the app are owned by what teams.

But the situation I’d really like to be in is closer to what it sounds like Carta is moving towards, which is I think every paging monitor should have an associated runbook and that runbook should be like a link that’s in the page that you receive.

So, you just click a button and it goes, and it tells you step by step.

Here’s a…

I mean, you can use this with Datadog notebooks, right.

It’s a decent solution to this or just a free-form doc with markdown.

But you can get a series of graphs that are specifically associated with that paging monitor, and some markdown that tells you, like, “Oh, if the graph is doing this, that’s what it means.”

And then, you know, you have like a decision tree of where you go, right, escalate to this team if you get to here.

That sort of thing.

That would be my long-term vision I’d like to see with that, with where we’re going with it.

Takes a lot of time.

You gotta maintain those.

Someone has to go write those but I think that that’s one way to make the on-call scale efficiently.
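
One way to picture the runbook Alex has in mind is as a small decision tree attached to a paging monitor. The sketch below is an assumption-laden illustration: the `Step` type, the `walk` function, and the example questions and actions are invented for this example and do not reflect Datadog notebooks or Rover's actual runbooks.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Step:
    prompt: str = ""                                   # what to check, e.g. a graph
    answers: Dict[str, "Step"] = field(default_factory=dict)
    action: Optional[str] = None                       # set on leaves: what to do


def walk(step, observations):
    """Follow the responder's observations down the tree and return the final action."""
    for observation in observations:
        if step.action is not None:
            break
        step = step.answers[observation]
    return step.action


# Hypothetical runbook for a single paging monitor.
runbook = Step(
    prompt="Is the error-rate graph spiking?",
    answers={
        "yes": Step(
            prompt="Did a deploy go out in the last hour?",
            answers={
                "yes": Step(action="Roll back the deploy"),
                "no": Step(action="Escalate to the owning team"),
            },
        ),
        "no": Step(action="Resolve the page and note it for the postmortem"),
    },
)

print(walk(runbook, ["yes", "no"]))  # Escalate to the owning team
```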

Learning from your on-call experiences

Jason: Yeah.

That’s a fantastic use of notebooks.

Actually, it’s interesting because for notebooks within Datadog, a lot of times, we started using them for postmortems.

So essentially, the other side, once we come out of that incident. I’m curious, thinking about that and postmortems: you know, the incident happens, you resolve it, what does your process of learning from that incident look like?

Adam: Well, we have a strict policy around doing postmortems for any production incident.

And so, you know, you try and do the root cause analysis.

But I don’t know if any of you are familiar with 5 Whys?

But it’s this process where you just keep asking why five times until you really get to it.

It’s blameless.

You’re not trying to look at the individual or a group of people who are involved with it.

It’s more about understanding why it occurred and then, like what we could have done differently to prevent it from happening again in the future, and then just documenting that.

And so, we have a whole folder in Google Drive where we just have all of our postmortems.

Alex: Yeah, we follow the 5 Whys process as well.

I found it to be really effective if done right like any process.

We use a Google Doc for it right now.

I would like us to move towards a different tool that’s a little bit more formalized.

We have used the Datadog notebooks to do that as well.

We try to really include snapshots of dashboards whenever it’s relevant to describe it, you know, links too, so you have context later on when you go look back at it.

But the notebook thing is nice because it sort of forces you towards a narrative.

You know, so it may not be appropriate necessarily for the 5 Whys part, but for building a narrative of the incident that is like shareable even with PMs or non-technical folks.

It’s nice to have a tool to do that.

Who is involved in the postmortem process?

Jason: Yeah.

I’m curious when you say, you know, involves sharing, who’s involved in that postmortem process for you guys, and then how widely do you share it out?

Adam: So, it’s generally for us, anyone who was even tangentially involved in the incident, so somebody who discovered it, you know, people who were involved with fixing it, anybody who was affected, maybe the support people, you know, people who were involved with the long-term mitigation of the problem. We try and get everybody who potentially has something to contribute involved with the postmortem.

Alex: Yeah, and we do a similar thing.

And we also try to include at least one person who actually wasn’t involved in the incident.

Maybe someone from the SRE team or who has experience with the process because sometimes, they will ask a clarifying question or they will drive the root cause analysis towards something that is not like domain-specific, right, because if everyone in the incident is there and it’s only people from the incident, you might end up tunnel visioning a little bit and not capturing something that is relevant to other teams, so we…

Whenever possible, we try to have someone else who wasn’t directly involved as part of that.

Audience Question-and-Answer

Jason: Yeah, I think that’s a fantastic tip because, I think, so often, we get that tunnel vision of this is exactly this incident without broadening that scope of, well, there could be similar things.

Kirsten [SP], any questions out there?

Everyone seems just like really into this and…

But I’m guessing there’s some questions.

Man 2: Hey Adam, I enjoyed hearing about your feature flag.

Could you talk about how many…?

Jason: That’s a microphone, so hold it up to your face.

Man 2: That’s a microphone. [inaudible 00:12:43]. Hello.

Jason: Yeah.

Q-and-A: Feature Flags and compatibility

Man 2: Adam, I really enjoyed your talk about feature flags.

Could you tell us more about how many do you have in flight and what do you do about incompatibilities?

Adam: About compatibility?

Man 2: Yes.

Because you have, you know, old versus new, rolling-upgrade kind of things, do you have those sorts of issues?

Adam: I think, if you were to look at how many we have in flight right now, it’s probably about 50 or so.

And we haven’t had any conflicts as of right now anyway.

I mean, I suppose it could happen.

One thing that’s helped us to avoid that is, for sure, having the flags expire because if you don’t, then they just sit around forever and rot, and then you run into things like, I don’t know what this is.

Can I remove it?

So, you know, the 30-day expiration period has been a good forcing function to prevent that from happening.

You can renew it.

So, we have a little admin UI we built so you can go in and say, “I wanna refresh the expiration for another 30 days.”

You can do that in 30-day increments, but you have to do it.

The other thing we did is we built integration with Slack, so we’ll have a cron job that runs, I don’t know, every hour or something like that, that looks for flags that are about to expire and notifies Slack.

And on every flag, we set an owner, like the email of the person who created the flag, so we always have a record of who created it, when it was created, and how long until it’s gonna expire.

So, that accountability has helped us to kind of prevent things from conflicting with each other because the flags just don’t live that long.
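
The expiring-flag mechanism Adam describes might look something like the sketch below. It is not Carta's code: the `Flag` class, the `FlagExpired` exception, the three-day warning window, and the choice to count a renewal from the moment of renewal are all assumptions. Only the 30-day period, the recorded owner, the renewal step, and the periodic job that reports soon-to-expire flags come from the discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TTL = timedelta(days=30)          # expiration period mentioned in the panel
WARN_WINDOW = timedelta(days=3)   # assumed lead time for the reminder


class FlagExpired(Exception):
    """Raised when code evaluates a flag past its expiration date."""


@dataclass
class Flag:
    name: str
    owner_email: str
    created_at: datetime
    expires_at: datetime
    enabled: bool = False

    @classmethod
    def create(cls, name, owner_email, now):
        return cls(name, owner_email, created_at=now, expires_at=now + TTL)

    def renew(self, now):
        """Push the expiration out by another 30 days, counted from now (an assumption)."""
        self.expires_at = now + TTL

    def is_enabled(self, now):
        if now >= self.expires_at:
            raise FlagExpired(f"{self.name} expired; owner: {self.owner_email}")
        return self.enabled


def expiring_soon(flags, now):
    """What a periodic job could report: live flags that expire within the warning window."""
    return [f for f in flags if now < f.expires_at <= now + WARN_WINDOW]
```

A scheduled job could call `expiring_soon` and post each flag's name and `owner_email` to a chat channel, which gives the owner a chance to renew or delete the flag before it starts raising.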

Q-and-A: Why does Carta make its feature flags expire?

Alex: The expiring feature flag thing is awesome because…

So, we use a…

We have a legacy library called Gargoyle which was made a long time ago, and I’m actually really excited to try this out because we have hundreds of flags and some subset of them are in the codebase, the rest only exist in the feature flag system.

They’re not even referenced in code anywhere, so having like an automatic expiration really keeps people super accountable.

And we made a big effort to try to track down whether flags are being used, and it was pretty difficult especially when the flag names are dynamically generated.

So, having some sort of system that formalizes it and expires the flags is really cool.

Adam: And it was really unpopular when we decided to do that.

People said, “You’re gonna throw exceptions in production?”

Alex: Yep.

Adam: And the answer is, “Well yes, we are, like delete your flags.” So…

Alex: Yeah, we faced a similar thing.

It’s cool.

Man 2: Do you use Carta at Carta, I imagine, for managing your own documents?

Adam: Yeah, we do. We’re ID number one.

Man 2: Cool.

Do you ever find yourself running feature flags for yourself that your customers may or may not have?

Do you have the same experience for them as you do for you or…?

Adam: We have done feature flags where we roll out features specific for customers.

I wouldn’t say that we do it specifically for us.

Man 2: Cool.

I’m just curious.

Jason: I’m curious if you use feature flags for deployments like starting to do canaries and then how long do you keep those around, so that you can roll back if you encounter an issue?

Adam: We’re not currently doing canaries right now.

That’s something that we’re gonna try and look at doing in the very near future.

One of the goals for our quarter is to get time to production, so from when you commit to when the code is in production, down to an hour, and that’s a lofty goal for us.

Like, we have a lot to do before we get there.

But we’re gonna have to figure all these things out, like, well, what’s our orchestration process look like?

How do we shorten the release cycle?

How do we make sure that we’re shipping frequently without actually breaking things?

But having feature flags is like a really important first step in order to be able to do that.

Alex: Yeah, we’ve saved on customer impact with that.

I think, if I remember, you have like a phased rollout kind of thing where you can dial up the amount of traffic going to it in, you know, 1% increments or whatever.

We’ve used that style of feature to limit blast radius on a really risky feature, deployed behind a feature flag, and then slowly ramp it up.

That’s like…

That’s a really critical functionality to roll out dangerous changes in production.
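
A common way to implement the kind of percentage ramp Alex mentions is to hash each user into a stable bucket and compare it with the current rollout percentage. The sketch below uses that technique as an assumption; neither speaker says how their flag system assigns traffic, and the `bucket` and `in_rollout` names are invented here.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(flag_name: str, user_id: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return bucket(flag_name, user_id) < percent


# Dialing up from 1% to 5% only adds users; nobody who had the feature loses it.
users = [f"user-{i}" for i in range(1000)]
at_1 = {u for u in users if in_rollout("risky-feature", u, 1)}
at_5 = {u for u in users if in_rollout("risky-feature", u, 5)}
print(at_1 <= at_5)  # True
```

Because the bucket depends only on the flag name and user ID, a user sees consistent behavior across requests, and including the flag name keeps different flags from always exposing the same users first.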

Man 2: Cool.

Jason: I think, there’s a question back there.

Q-and-A: How can Datadog help address security-related issues?

Woman 1: Hello.

Okay.

So, from today’s presentation, I learned a lot about how to use Datadog for performance-related issues like slow queries or servers not available.

I’m working more on the security operation side.

I’m wondering if you have dealt with security-related issues using Datadog?

Alex: Yeah, I can give one specific example, actually.

Datadog has the ability to monitor…

One of the types of monitors you can use is an anomaly monitor that lets you, you know, like they have an algorithm that…

It’s a well-defined algorithm that predicts the future values for a metric or the expected values for a metric based on, you know, a six-week previous rolling window and daily trends, and you have some levers you can pull to adjust that.

And we actually monitor when we have a lot of failed login attempts.

So like, we record whenever people tried to log in to Rover, and we record, you know, we tag on whether it was successful or not.

Well, if you have a lot of unsuccessful logins and you have a metric that’s a login attempt, tagged successful or unsuccessful, and you set up a multi-monitor that uses the anomaly detection on the unsuccessful ones, then when that metric suddenly spikes and it’s way out of the anomaly detection range, that tells us that there’s some sort of security incident going on.

I can’t get into too many, like, specifics with that.

But a really common one that has been encountered by a lot of companies recently is what they would call “credential stuffing”, where they’ll find a giant database on the dark web that has username or email and password combinations and they’ll try them on a bunch of different popular services.

So, if suddenly, your login traffic is spiking, they’re typically not doing this one at a time, they’re using some sort of automated script to do it.

That’s been very useful for us and we actually find that it is usually more a signal, like we don’t get a lot of false positives on that alarm going off.
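
To make the idea concrete, here is a deliberately simple stand-in for anomaly detection on a failed-login metric. It is not Datadog's algorithm, which Alex describes as using a rolling window and daily trends; this sketch only compares the latest count with the mean and standard deviation of recent counts. The function name, the `tolerance` parameter, and the sample numbers are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, tolerance=3.0):
    """Flag `latest` if it exceeds the historical mean by more than
    `tolerance` standard deviations. `history` needs at least two points."""
    baseline = mean(history)
    spread = stdev(history)
    return latest > baseline + tolerance * spread


# Hypothetical failed-login counts per minute under normal traffic.
failed_logins = [12, 15, 11, 14, 13, 16, 12]

print(is_anomalous(failed_logins, 15))   # False: within the usual range
print(is_anomalous(failed_logins, 400))  # True: the kind of spike an automated script produces
```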

Adam: Yeah, I mean, I don’t have…

That’s really cool.

I don’t have much to add to that.

I’m not a security expert.

The one thing that I can say, I think, that’s interesting is this is all done in code, but we have the concept of a suspicious operation, which is an exception that you can throw if you see something funky going on.

So for example, if I’m trying to access your option grant but I’m not logged in as you, you have the option as an engineer to throw that exception and then we can handle that in various ways, so we can send alerts to the security channel in Slack or something like that.
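
A suspicious-operation exception of the kind Adam describes could be sketched like this. The class, the `get_option_grant` and `handle_request` functions, the dictionary shape of a grant, and the `alert` callback are all hypothetical; the panel only says that such an exception exists and that handling it can send an alert to a security channel.

```python
class SuspiciousOperation(Exception):
    """Raised when a request looks like it should never happen in normal use."""


def get_option_grant(grants, grant_id, current_user):
    """Return a grant only to its owner; anything else is treated as suspicious."""
    grant = grants[grant_id]
    if grant["owner"] != current_user:
        raise SuspiciousOperation(
            f"user {current_user!r} tried to read grant {grant_id!r}"
        )
    return grant


def handle_request(grants, grant_id, current_user, alert):
    """Catch the exception at the edge and hand it to an alerting callback."""
    try:
        return get_option_grant(grants, grant_id, current_user)
    except SuspiciousOperation as exc:
        alert(str(exc))  # e.g. post to a security channel
        return None


# Example: mallory asks for alice's grant, so the alert callback fires.
grants = {"g1": {"owner": "alice", "shares": 100}}
alerts = []
print(handle_request(grants, "g1", "alice", alerts.append))
print(handle_request(grants, "g1", "mallory", alerts.append))
print(alerts)
```

Raising a dedicated exception type keeps the detection next to the business logic while leaving the response (logging, alerting, blocking) to a single handler.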

Man 1: Cool.

Cool.

Jason: I think that’s a really interesting thing of, you know, we often think of instrumenting our code and thinking of performance, or thinking of those as business metrics, right, like how many widgets did we sell?

But actually tracking those interesting anomalous kind of things can be very useful.

Man 1: Cool.

We probably have room for one more question and then we should let folks get some nice snacks before the next workshop so they can caffeine up and sugar up.

Jason: Anyone want that last question?

Man 1: Cool.

Jason: No?

One right there.

Q-and-A: When does PagerDuty send out alerts?

Man 3: Just in the same scope with the security questions, do you set off PagerDuty alarms for those?

Alex: For that particular alarm, yes, we do.

We have…

So, you know, with Datadog, you can set up, like, a warning and an alert threshold.

So, the warning threshold will go to Slack.

This is actually a pretty common pattern we have.

Warning threshold goes to Slack, to a specific channel.

Alert threshold will alert the on-call and that particular one actually does include a runbook that tells you to, you know, reach out to security.

Another thing kind of along this point that was really useful for us, and I actually didn’t know that this existed until recently, is that the Datadog, like, alerting template language in the monitors allows you to, like, conditionally include PagerDuty aliases based on a tag value.

So, we could actually route the pages to specific teams or channels in Slack based on the values of certain tags.

And we haven’t used that for the security one but we’ve used it for some other monitors.

But that question made me think about it.

That was really useful for us as well.
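
The routing pattern Alex outlines (warnings to a Slack channel, alerts to the on-call, with the paged team chosen from a tag value) can be modeled in a few lines. This sketch expresses the logic in plain Python rather than in Datadog's template language; the function name, the channel names, the `team` tag key, and the `slack:` and `pager:` prefixes are made up for the example.

```python
def notification_targets(severity, tags, team_routes, default_pager="on-call"):
    """Return where a monitor notification should go for a severity and tag set."""
    if severity == "warning":
        return ["slack:#alerts-warning"]
    if severity == "alert":
        # Route the page by tag value, falling back to the general on-call.
        pager = team_routes.get(tags.get("team"), default_pager)
        return ["slack:#alerts", f"pager:{pager}"]
    return []


# Hypothetical mapping from a tag value to a paging alias.
team_routes = {"payments": "payments-team", "search": "search-team"}

print(notification_targets("warning", {"team": "payments"}, team_routes))
print(notification_targets("alert", {"team": "payments"}, team_routes))
print(notification_targets("alert", {"team": "unknown"}, team_routes))
```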

Jason: Well, yeah.

For those of you using Datadog, if you haven’t used that, when you’re on the monitors page, click on the template.

There’s a little question mark, plenty of cool template variables there for you.

Cool.

Thank you both for your time.

Thanks for presenting today.

Alex: Thank you.

Jason: It’s been really insightful.

Thanks.

Man 1: Oh, great.

Alex: Well, thanks.
