As Ilan said, my name is Mia Henderson, I’m an SRE at PagerDuty.
I’m going to talk today about how we deploy 30x a day safely, using Datadog within our deploy process.
I’m going to start with a very scary graph, so be prepared: if you’re an on-call person, this is going to be very frightening.
So, this is a service interruption that happened on our production infrastructure and it delayed some notifications.
It’s a very scary graph, but the worst thing about this graph is that you can see that we had advance warning of the outage, and we didn’t avoid it.
I’m going to talk about how this happened and how we prevented this from happening again by using Datadog.
Before I can talk about how that happened, I need to talk to you for a bit about how we work at PagerDuty.
PagerDuty’s engineering group consists of a bunch of product-focused teams, and some infrastructure teams that are responsible for shared infrastructure and libraries.
The product teams are divided along product functionality, and they’re each responsible for the uptime of their own services, and they’re on-call for those services.
Each team manages their own on-call rotations, and makes their own decisions about monitoring and alerting.
PagerDuty really believes that the best way to have high quality software is to make the people who write that software responsible for operating that software.
Datadog @ PagerDuty
So, we use Datadog very heavily at PagerDuty. Part of creating any new service is creating a bunch of Datadog dashboards for that service, and updating our core dashboards if the service that you’re creating happens to affect our incident pipeline.
We create a lot of monitors for each new service.
We have a lot of monitors, which are obviously all hooked up to the great PagerDuty integration with Datadog.
We store all of our monitors in a JSON format in a GitHub repository.
That repository is shared across teams, and teams own their own alerting.
If they think something should be low priority, or it shouldn’t alert at all, or the threshold should be different, they can go in there and change their own alerts.
They don’t need any approval from SRE or any other team to do that.
Monitors and environments are defined separately.
You can basically go into any environment and pull any monitors in, and if you create a new environment, you just create a single JSON file in this repository and you end up with all the default monitors that we’ve created.
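As a rough illustration of keeping monitors and environments separate, the sketch below expands one environment file against a shared set of monitor definitions. The talk does not show the internal format, so the field names, the `%{env}` placeholder, the `"default"` keyword, and the monitor names are all assumptions made for this example.

```ruby
require 'json'

# Hypothetical shared monitor definitions, keyed by monitor name.
MONITORS_JSON = <<~JSON
  {
    "high_cpu": {
      "type": "metric alert",
      "query": "avg(last_5m):avg:system.cpu.user{env:%{env}} > 90",
      "message": "CPU is high in %{env}"
    },
    "low_disk": {
      "type": "metric alert",
      "query": "avg(last_5m):avg:system.disk.in_use{env:%{env}} > 0.9",
      "message": "Disk is nearly full in %{env}"
    }
  }
JSON

# Hypothetical environment file: one small JSON document per environment.
ENVIRONMENT_JSON = <<~JSON
  { "name": "staging", "monitors": "default" }
JSON

# Expand an environment into concrete monitor definitions.
# "default" pulls in every shared monitor; a list pulls in only those named.
def expand_monitors(environment, monitors)
  env_name = environment.fetch('name')
  wanted = environment['monitors'] == 'default' ? monitors.keys : environment['monitors']
  wanted.map do |name|
    definition = monitors.fetch(name)
    {
      'name' => "[#{env_name}] #{name}",
      'type' => definition['type'],
      'query' => definition['query'].gsub('%{env}', env_name),
      'message' => definition['message'].gsub('%{env}', env_name)
    }
  end
end
```

Calling `expand_monitors(JSON.parse(ENVIRONMENT_JSON), JSON.parse(MONITORS_JSON))` yields one entry per shared monitor with the environment name filled in, which is the kind of output a sync tool could then push to Datadog.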
Syncing to Datadog
We sync our monitors with Datadog using Barkdog.
We basically have our custom internal format which gets transformed using an internal tool, and then we use Barkdog to sync with Datadog.
But on to what I actually wanted to talk about today: deploys.
So, we have a monolith, it’s called Web. And a bunch of microservices, and I’m sure you’re all pretty familiar with that infrastructure.
About a year ago, many of our projects were manually deployed using ChatOps.
They still were deployed many, many, many times per day, but, manually.
We had locks that developers had to take before doing deploys to make sure that only one developer was doing a deploy at a time.
For deploying these projects, we had as-you-deploy boards, which were basically Datadog dashboards that you’d pull up when you were doing a deploy. And you would monitor the dashboard while you were waiting for the deploy to happen.
Our deploy process for the application contains a few steps.
We build and test the application.
Then we deploy it to canary servers, which are a subset of our production servers, to ensure that it’s working.
Then we wait, we roll back the canary, and then we do a full deploy to all of our servers.
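The steps above can be sketched as a small driver that runs each stage in order and stops at the first failure. The step names and the placeholder lambdas are hypothetical; in a real pipeline each one would call the actual build or deploy tooling, and the wait length is an assumption.

```ruby
# Runs the deploy steps in order and stops at the first failure.
# Each step is a [name, callable] pair; the callable returns true on success.
def run_pipeline(steps, log: $stdout)
  steps.each do |name, step|
    log.puts "running #{name}"
    unless step.call
      log.puts "#{name} failed, stopping"
      return false
    end
  end
  true
end

# Hypothetical step list mirroring the order described above.
# The lambdas are stand-ins that always succeed.
def deploy_steps(wait_seconds: 300)
  [
    ['build_and_test', -> { true }],
    ['deploy_canary', -> { true }],
    ['wait', -> { sleep(wait_seconds); true }],
    ['rollback_canary', -> { true }],
    ['full_deploy', -> { true }]
  ]
end
```

Because a failed step returns early, nothing after it runs, so a canary problem caught before `full_deploy` never reaches the rest of the fleet.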
But manual deploys slow down deployment and they make people very sad, because they don’t like having to wait to get their code out.
So people spent a lot of time waiting to deploy and babysitting the deploys themselves.
So, we started migrating to continuous deployment.
Many of our projects got on continuous deployment, and we really wanted to do this for our monolith. But monoliths are monoliths, so everyone was very, very nervous about doing this.
They had all the reasons in the world why we shouldn’t do continuous deployment for our monolith: they wanted more test coverage, and they had lots of other reasons.
Eventually we decided—we’re just going to do it.
We don’t care whether you think that we shouldn’t do it right now, we need to get it done, and the best way to do it is to just do it.
So, we already had a mostly automated delivery that was hidden behind the ChatOps command, so we didn’t have a whole lot of work to do to actually automate the deploy.
Continuous deployment process
What do we use to do this?
Well, we use Travis CI to build and test our applications.
It’s a really great tool that we love a lot, because it involves absolutely zero SRE input.
All the developers can go and do all of their building and testing themselves and control that.
We use GoCD to do the actual deployments, because we need a secure server within our own infrastructure to do those deployments.
And for our web monolith, the deployment’s done by Capistrano, which is kind of complicated and a bit of a pain in the butt for us.
For our more modern deploys, we got rid of Capistrano—which is great.
We still use Travis CI and GoCD. But we have an internal tool for deploys that uses HashiCorp Serf for communication, so we don’t have to rely on hundreds of SSH connections working properly.
It’s a lot simpler than Capistrano, but it meets our needs for container-based deployments.
Continuous deployment is great!
Continuous deployment has been great.
Our developers really love it; they don’t spend time waiting for locks; they just merge and they go on with their day.
Our deploys per day have gone up, and engineers spend more time delivering and less time waiting.
Fixes and features get to our customers more quickly, which I’m sure all of you are very happy about.
But the developers were no longer watching the as-you-deploy dashboards, because they didn’t actually know when their deploys were going out.
It could go out right when they merge, or it could go out an hour and a half or two hours later—which brings us back to the graph.
So one day, a developer at PagerDuty merged a change to the web repo that caused an issue with processing background tasks.
The deploys are canaried, and as you can see in the graph—the canary failed.
So the developer was paged due to the canary causing issues, but by the time the alert had gotten to them and they got around to looking at it, the canary had been rolled back, so the alert had actually auto-resolved.
So the deploy progressed past canary, went out to production, and caused this.
So we lost all of our background task processing for about 10 minutes.
As always, we had a post-mortem.
We identified a number of tasks that we needed to do to prevent this from happening again.
One thing that SRE had been talking about for quite a while, at least in concept, was doing canary checks for web.
So now we actually had a concrete case of why this was needed, and a concrete example of the metrics that we should be checking.
So, after a couple weeks, we found a few hours to write these checks; and really, it only took a few hours to write these checks.
How does it work?
How does it work?
Within our build pipeline we have a canary phase, where we deploy the new version to a few servers within our fleet.
Once the canary has been out for about five minutes, we check the metrics for the canary background task servers using a Ruby script.
This is a whole bunch of Ruby code, you can look at that when the slides go up. But basically, we use the Datadog API Ruby library, which we love. And we basically create a simple script in Ruby that goes out to Datadog, checks the metric that indicated the issue during this outage, makes sure that it’s not in a failure mode, and then returns zero or one depending on whether it’s failing or not.
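A minimal sketch of such a check might look like the following. It is not the script from the slides. The metric name, the `role:canary` tag, the threshold, the "higher is worse" direction, and the environment variable names are all assumptions. It relies on the `get_points(query, from, to)` call from the dogapi gem, which returns a status code and a response body.

```ruby
# Decide whether a Datadog timeseries response indicates a healthy canary.
# `series` is the 'series' array from a metric query response; each entry
# has a 'pointlist' of [timestamp, value] pairs, where value may be nil.
# Assumption: the metric counts something bad, so higher values are worse.
def canary_healthy?(series, threshold)
  values = series.flat_map { |s| (s['pointlist'] || []).map { |_ts, v| v } }.compact
  return false if values.empty? # no data: treat as unhealthy rather than guess

  values.max <= threshold
end

# Query the last `window` seconds and return a process exit status:
# 0 when the canary looks healthy, 1 otherwise.
# `client` is anything that responds to get_points(query, from, to).
def canary_exit_status(client, query, threshold, window: 300, now: Time.now)
  status, body = client.get_points(query, now - window, now)
  return 1 unless status.to_s == '200'

  canary_healthy?(body.fetch('series', []), threshold) ? 0 : 1
end

# Example wiring (needs the dogapi gem, API keys, and network access):
#   require 'dogapi'
#   client = Dogapi::Client.new(ENV['DATADOG_API_KEY'], ENV['DATADOG_APP_KEY'])
#   exit canary_exit_status(client, 'sum:background_tasks.failed{role:canary}', 0)
```

Treating missing data and API errors as failures is a deliberate choice in this sketch: a check that passes when it cannot see the metric would not have caught an outage like the one described.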
And we integrated that into our build and deploy pipeline.
It was very easy to do.
The big problem was actually educating all of our engineers about why their builds were failing.
It doesn’t mean that your test failed; it doesn’t mean that our deploy pipeline is broken and that you should call SRE and complain about it;
it means that probably some metrics are not the way they should be; and you need to investigate why those metrics are not the way they should be.
It took a lot longer to do the education than it actually did to write the script itself.
We spent a lot of time talking to engineers for a couple weeks after doing this.
So, this check has prevented more than a few bad builds from going out, and it took only a few hours to build and a whole bunch of developer education.
So, as always, I would really love to see a lot more of these checks going on with a variety of our services, and more checks going into our monolith. But, there’s always a balance between over-monitoring and getting our teams to prioritize monitoring over other feature work.
But, of course, I’ve been pushing for it and I need to spend some time with my product manager being like: “let’s do this. I want to spend time on it.”
Adding monitoring to deploys
So, how do you add monitoring to your deploys?
First off, you need to identify the metrics that are important.
This can be hard because they can be different for a canary than they are for your actual whole service. But if you look at the monitors for a given service, it can be a good place to start.
If you don’t already canary changes—start doing it.
It can be a challenge to do because you need to make sure APIs are compatible between revisions, but, it’s a really good way to make sure things are working.
If you already canary, make sure it’s easy to identify the canary servers through service discovery, or some sort of config file, or tagging in Datadog.
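If the canary servers carry a tag in Datadog, a check can restrict its metric query to that tag. The helper below is a small sketch of building such a query; the tag names and the metric name in the usage note are hypothetical.

```ruby
# Build a Datadog metric query restricted to hosts carrying the given tags.
# Tags joined by commas inside the braces are ANDed together; with no tags
# the scope falls back to "*", which matches everything.
def scoped_query(aggregator, metric, tags)
  scope = tags.map { |key, value| "#{key}:#{value}" }.join(',')
  scope = '*' if scope.empty?
  "#{aggregator}:#{metric}{#{scope}}"
end
```

For example, `scoped_query('sum', 'background_tasks.failed', { 'role' => 'canary', 'service' => 'web' })` produces a query that only looks at the canary hosts of one service.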
Next up, write a check script.
Datadog has great libraries—Python and Ruby, a lot of languages to use—so this is actually one of the easiest things to do.
Integrate the check into your continuous deployment.
If you are worried that the check is going to cause a lot of issues, you can make it a soft fail initially so that you can actually get metrics about how often it’s going to fail your deploys, and whether that’s going to be an issue.
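One way to sketch a soft fail is a thin wrapper around the check's exit status: in soft-fail mode a failure is recorded but the deploy continues. The `record` callback is a hypothetical hook for counting failures, for instance by emitting a metric; the event names are made up for this example.

```ruby
# Wrap a check result so that, in soft-fail mode, a failure is recorded
# but the deploy is still allowed to continue.
# Returns the exit status the pipeline should act on.
def apply_soft_fail(check_status, soft_fail:, log: $stderr, record: ->(_event) {})
  return 0 if check_status.zero?

  if soft_fail
    record.call(:soft_failed)
    log.puts 'canary check failed; soft-fail mode, deploy continues'
    0
  else
    record.call(:blocked)
    log.puts 'canary check failed; deploy blocked'
    check_status
  end
end
```

Running in soft-fail mode for a while shows how often the check would have stopped a deploy, before it is switched over to blocking.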
Then you’re going to have to educate your engineers.
They’re going to complain that their builds are failing, and that their deploys aren’t going out, and they’re going to need to know why.
So, the easiest way to get this done is to just do it.
Don’t hum and ha, don’t worry about how much test coverage you need on your projects.
Make sure you just go out and just do continuous deploys, and use your existing Datadog metrics to make sure that the services are working when you canary them.
And that’s it.