
The Path to SRE (Auth0)


Published: July 17, 2019

Thank you Daniel.

Really happy to be here talking about a topic I’m fairly excited about, which is SRE.

Before we get started, who has a notion of what SRE is?

Okay, fairly good.

Who has something like SRE at the company where they work today?

Different names, same goals

So the interesting thing here is that you will probably have very different things, but you call them the same.

And if we think about SRE as a term, and how it has been Googled, we can see that the number of searches has increased a lot since, of course, Google published their SRE book.

And a lot more companies are working on figuring out how to implement this “SRE thing” where they are at.

That was our case as well.

Last year, around mid-2018, we got together for a company off-site. We are a remote company.

We get together in one place, and one of the things we talked about is, “Hey, we want to do SRE.”

But no one even understood, like, exactly what we’re going to do.

It just sounded very cool, you know, you do it because it’s cool.

Why is SRE necessary?

So we started with, why? Why would we do SRE? Why was it important for us?

In general, that’s a fairly good idea: figuring out why you’re trying to do something. And this phrase from one of our SREs answers the why.

It’s because reliability is the feature that all customers use. And that applies to basically any product.

Auth0 is no exception to that. So we are in a position where people might be authenticating with social providers. Our customers might be managing their users or using authorization, but they all depend on us providing the correct response within a reasonable timeframe.

It’s particularly important that we are up (Auth0 is up), because our product is what end users of our customers use to log in to applications.

So if there is a newspaper, and that newspaper uses Auth0 for their logins, all of the users that log in need our system to be up.

As a conclusion, if we are down, our customers are mostly down, and that’s not a very good thing. So we’re fairly critical.

And in 2018, this is a presentation that our VP of Engineering (the CTO at the time) did, and this is what he shared.

He said, “We’re all about developers. We’re all about simplicity. We’re all about even extensibility. But everything we build must be secure and must be reliable.

“That’s the way in which we get our customer’s trust. That’s the way in which we grow. That’s the way in which our business will work.”

So we said, “We’re going to make a focused investment on reliability.”

It’s fairly similar to what you hear from security teams. Like, people have security teams. That’s not really a question.

We said, “Let’s take a similar approach. Let’s put our money where our mouth is.”

“Hey,” we say, “Everything we have to do needs to be reliable. Let’s invest in that.”

Ensuring system reliability at scale

And this was particularly important on two accounts, both related to scale.

On the one hand, there’s system scale. If I show you the number of requests and logins we process per second, month over month, the curve looks like this.

It’s been going up year over year, which is very interesting as a systems challenge, but it also means our systems need to keep up with that.

But on the other hand, we have a fairly similar-ish curve for the number of people that we have in our organization.

I joined Auth0 five years ago, approximately. We were 10, and now we are 500.

So it grew.

If you can standardize practices around reliability, if you can standardize processes around reliability, if you can standardize tools around reliability, it helps you scale.

Now, we had an idea of why we wanted to do it, but we didn’t know exactly where we should start. We had the book, the book says lots of interesting things. We read a couple of blog posts, saw some videos.

We kind of had an idea, but we wanted to ask people who had been doing this for a while how they did it, and most importantly, why they decided to do it in that particular way.

What SRE looks like at different organizations

We talked to a bunch of companies, some of them you might have met from their logos.

So we talked to people at Google. Very interestingly, people at Google who are SREs will tell you that not everything in the SRE book is exactly as described. Who would have figured?

So learning that was good. Asking them why there were differences, right?

Search is not the same as ads and it’s not the same as cloud. They do it differently.

Companies like Facebook and Atlassian do SRE. Facebook calls it production engineering, but like, they have their own kind of flavors.

And then companies like Twilio don’t do SRE. And we wanted to learn how they were approaching reliability at scale. We wanted to learn about how people were organized, who they reported to, and why they reported that way.

There are companies that have SRE go all up the ladder to whoever runs technology. There are companies that have SRE within each division or product group.

We wanted to understand what the motivation for that was. We wanted to learn what style of SRE they did.

Can you actually block deployments to production as an SRE? Can you say, “Okay, this can’t go out?”

Are you more a consultant?

What can SREs do? What can’t SREs do? What did they do? And, again, always with the why. The questionnaire was basically: why, why, why?

People don’t get into the why because it’s very natural for them. That’s what they’ve been doing for (in Google’s case) 15 years, right?

They just know that’s what they do. When you ask them, why does this happen? That’s when you get to the root of the answer.

We wanted to know who the sponsors were. These are the people that are going to say, “We’re going to put more heads into SRE, or we’re not going to put more heads into SRE.”

These are the people that decide what the SRE priorities are going to be.

Many companies have trouble recruiting SREs.

But at the same time, a lot of them strategically keep a low count of SREs compared to their general engineering population, because scarcity is a benefit.

People fight for their service.

Implementing an SRE team at Auth0

We had a notion of what we wanted to do now, but before deciding exactly what it was going to look like, we wanted the people who would be on this team (we were thinking a five- or six-person team back in the day to start with) to put the final touches on it and define the implementation details.

So we started focusing on the who; who would be the first batch of SREs at the company?

And in order to do that, one very important question to answer is, where do we want to be on the SRE spectrum?

When we talk about SREs, we hear a lot of things, but in general we can talk about people who come from a mostly systems engineering background, maybe infrastructure, and are very good at coding and automation; and then we have the other end of the spectrum.

These are the people that are very good at product engineering, software developers that started to understand how to build reliable systems at scale.

We wanted to be on the latter side, and the reason why is that we expected to work a lot with our product engineering teams.

The good news for us, and for all of you doing something similar at your company, is that you already have the usual suspects: the people who join incidents when it’s not their turn to join them, the people who comment on RFCs just because there was an RFC and they could contribute an idea that would make the solution more reliable.

They are already eager about doing these things. So if you make this their full-time job, they will be extremely happy.

But it’s not only important that they are excited. They also have to have a particular set of qualities, and the first one of them is they need to be teachers.

Why?

Because we want to scale, and the only way to scale is to make sure that whatever they do, whoever they work with, understands what they did.

They can transfer knowledge, and at the same time, it’s very important that they’re good teachers.

No one wants that bad teacher.

“Hey, I got an SRE. They came to my team and I had a very nasty experience with them.”

You won’t get pulled in as an SRE team again.

They need to be advocates for two things. They need to be advocates for reliability as a concept and they need to be advocates for SRE, the brand.

The SRE team brand is very important, because people need to be aware that it exists, what it does, and what it doesn’t do, in order for your SRE team to be effective.

They need to be great problem solvers.

As an SRE team, you’ll get all sorts of things that you need to work on, a wide variety. You will get low-level issues related to CPU or an event loop being blocked. You will get latency issues, and you’ll get more human things, like: how do we do incident response? How do we learn from incidents?

These people need to be flexible, and they basically need to learn how to learn, not necessarily have all the answers.

And finally, if they know the system, it’s great. The reason why it’s great is because it gives them credibility and they can hit the ground running fairly fast.

Whenever they go interact with other teams, they get very positive feedback. It’s good.

Another thing we did, we were fairly lucky, we got someone with experience. A person that had been an SRE at Google for over 10 years.

But they weren’t dogmatic about Google, and, “Oh, because we did this at Google, we should be doing that here.”

That was very important. They were very humble. And when we brought them in, we could talk to them to say, “Hey, what do you think about this idea?”

And we would get into the weeds of it. Hey, why are we thinking about it like that? Where are the tradeoffs?

So they were very good for bouncing ideas around. And they also served as a mentor for the other SREs on the team that were used to these practices, but fairly new to the whole “being an SRE thing.”

How your SRE team can get buy-in

What we’re doing now is, we are starting to specialize.

A lot of our services run Node.js, and what we did fairly recently was hire a person that, in general, is very good as an SRE. Very good.

But at the same time, they are very knowledgeable about building Node systems and libraries. They’ve been working on Node core. They’ve built lots of libraries for it. They understand the internals, and as our team continues to grow and take on more challenges, these specialized positions start to become more important.

After all of this, we had the people, and immediately we identified that as we were trying to roll this out, as we were trying to socialize it with people first before making anything official, we saw some fear.

People were afraid and that’s normal.

First of all, they don’t understand what it is.

We talked about it, but also they fear that they will be forced into doing things. They fear that they won’t have ownership of what they are building, because in a lot of cases like that’s what some of the literature says: “SRE operates your systems, SRE runs your production checklist, SRE does not let you push to production if you’re over budget.”

We went ahead. We did an all hands meeting, and we said, “This is what we do.”

This is probably the only slide with that much text.

So we identify, develop, refine, and disseminate libraries, services, practices and processes. Basically, we’re just one more internal tools team, internal services team, platform team, right? We provide a service, and that service is reliability.

SRE also does not do these things, and again, when you’re talking about a strategy, when you’re talking about something long term, not only talk about what you’re going to do, talk about what you’re not going to do. Be explicit about that, because these were the fears people have.

So we said we’re not going to force ourselves on other teams. This meant that teams are the experts at what they’re running.

That’s basically rule number one. And they engage with SRE if they deem it necessary.

This means two things.

If SRE has a suggestion, they can make a suggestion. They can’t just change your code or infrastructure through a pull request; they have neither the permissions nor the “authority.”

And at the same time, any interaction with SRE needs to be bidirectional. So if you’re a team and you want SRE’s involvement, you need someone from your team to be there and work with them. It needs to be important for both sides.

Another very important one and people were somewhat not that happy about this one, interestingly, is that SRE is not a NOC.

SRE, in our case, did not do incident response for all services. What SRE does do is work on incident management and incident response processes.

Everyone in SRE is a very competent incident commander, and SRE trains people in incident response practices, and they train people to be very good incident commanders.

The SRE involvement spectrum

With all of this in mind, we came up with what we referred to as the involvement spectrum.

Okay, one sec, I’m going to go over to that side, because those people feel lonely.

Okay, good.

So involvement spectrum.

You could cater to all of these types of SRE flavors.

The first one was basically office hours. People joined our Zoom meetings, because we don’t have any offices.

They would show up and ask questions about whatever had come up during the week, things like that.

But at the same time, this is something we only did at the beginning and we think it wasn’t very successful, because we were missing offices.

It’s not the same if you just have a calendar appointment somewhere, in a calendar that someone might check, versus a glass-door meeting room where you can see all of the SREs sitting there and you can go ask them questions.

So this was useful, especially at the beginning.

But the key things were consultancy and embedding. And they were fairly similar.

The main difference was the length of these. So consultancy under a week, embedding between a week and two months, something like that.

You basically went to Jira, created a ticket, and said, “This is what I need SRE’s help for. This is the scope of the work. This is how long we estimate it will take, and these are the outcomes. This is what success looks like.”

That was the first part.

The most important thing there was we forced people to pick from a dropdown someone on their team that would be allocated full time while the SRE was working on something to pair with them.

So this goes back to the education thing. You always have someone on the team that knows what they’re doing. Very important thing.

But you are also getting the team invested in what the SRE is helping to build, because they are actually putting a person to work on that.

It’s not, “Oh, this is important, hand waving, hand waving,” and then it’s just, “Oh, the SRE will do it, right?”

So you don’t use, for example, SREs for Node version migrations. That’s probably not a good way of spending two people’s time.

You could contact SRE directly, that was very interesting.

So SRE had some services of its own, and you saw them in the spectrum before.

Let’s say you’re having issues with rate limiting, or you’re having issues with feature flags: platform-level services.

You page them. Also, if you have your own incident that’s not going very well, you can page them to get help.

You could also go to our Slack channel. You could ping us via Jira, very important things.

And we had office hours as I talked about first.

I’m going back that way. Like the people with the lights are hating me right now.

Marketing and branding your SRE team

So one very important thing is creating that brand, making sure that people are aware that SRE exists.

First of all, what they do and what they don’t. And also, that they trust you.

That’s very, very important. If they don’t trust you, they won’t call you even when they presumably need your help.

And whenever you think of a brand, the first thing, most important thing, logo. Most important thing ever, because without the logo, again, it’s not cool. Remember that part.

So this was a very cool logo, because if you look at it, it’s a wolf. But then the two faces are wolves. It’s very complicated. Our SRE manager likes doing logos, which is interesting, but we did use it a lot. We used it for conference documents. We used it for presentations.

I think someone talked about doing t-shirts. And I don’t know where that went, because I didn’t get one.

But it’s really good. It works. People recognize it, we have a Slack emoji for it. So it’s positive.

We did office hours, again, key at the beginning, because of the confusion. Not so effective towards the end.

The thing that worked very well was brown bags.

Send the message to get everyone’s attention. Say, “Hey, we’re going to talk about some specific topics. You might be interested in them.”

They will come.

This is where we got a lot of things done. We talked about incident response. We talked about writing postmortems. We talked about distributed tracing. We talked about rate limits.

Each topic we talked about, people were like, okay, these people might know what they’re talking about. This was good. That’s where you create that brand recognition, that trust.

Another thing we did was what we call investigations. So whenever we saw something weird or nasty, like a cURL lag or a C++ plugin library memory leak, we’d go in and write it down. You can see very weird and cryptic codes that make you look smart. That’s very important.

And this is what we did. It worked.

But the key, jokes aside, was flexibility. At the beginning, people had no idea what to use us for. And a lot of the people that were part of this team were fairly senior engineers on other systems. So they just came with anything. Someone even came to, like, order lunch. That’s not what we do, but might help.

Being flexible did help. Saying, “Hey, I will help you.”

We went along with those requests, and then after we had built up some credibility and gained some trust, we would tell them this is not actually what we do, but they would go away happy and then know what to expect from us.

A very important thing here as part of flexibility was incident management.

So we would join incidents even if we weren’t paged and just sit there. Being there when everything’s on fire is powerful, because people think, oh, they have nothing better…no, just kidding.

You can help. That’s when you help. That’s when you can either say, “Hey, this might work.”

That’s when you can take over incident command if the person who was commanding the incident is actually the best suited to fix things, and you can help after that to write the postmortem, because we also took over, let’s say, the postmortem review guidelines.

So if we were there during the incident, we could also provide fairly good guidance for how to write the postmortem.

And that, again, very important for gaining trust. I remember one of the last big incidents we had, November last year. The incident had nothing to do with us. Like everything has to do with us, but…nothing to do with us.

We showed up. The first IC was a… IC, incident commander, not individual contributor. The first IC was a senior SRE.

The second IC, I did that. And that is what we needed to do.

Like, that was the important thing for Auth0, and at the end of the day, what we do as SREs is minimize risk, period. If it’s already on fire, there’s not much risk left to minimize. So just put it out.

All of this is useful, but unless you execute, you will get nowhere.

Like, we could have done all of the other things, but if we don’t show results, that doesn’t help.

Because again, what we’re building is trust. People need to want to call us. That’s what we need to get to.

They need to be, like, we need SREs, and we need to be, like, we have no more…give us budget.

Tracking SLOs and SLAs for your SRE team

Talking about budgets, SLOs.

So we started introducing this. We had this for a lot of services.

We had SLAs, especially for the external services, so those already had SLOs.

But a lot of the internal services did not have this. We helped teams introduce them, period. Just having them is very good, because you can start using the term in conversations and it means something.
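To make that concrete, here is a minimal sketch of what an SLO definition and its error budget math can look like; the service name, target, and window below are illustrative placeholders, not our actual numbers.

```typescript
// Hypothetical SLO definition for an internal service (illustrative values only).
interface Slo {
  service: string;
  sli: "availability" | "latency_p99_ms";
  target: number;     // e.g. 0.999 means 99.9% of requests succeed
  windowDays: number; // rolling window the target is measured over
}

const userSearchSlo: Slo = {
  service: "internal-user-search",
  sli: "availability",
  target: 0.999,
  windowDays: 30,
};

// Error budget: the amount of "badness" the SLO allows before it is breached.
// At 99.9% over 30 days, that is ~0.1% of requests, or about 43 minutes of full downtime.
function errorBudget(slo: Slo): { fraction: number; downtimeMinutes: number } {
  const fraction = 1 - slo.target;
  return { fraction, downtimeMinutes: fraction * slo.windowDays * 24 * 60 };
}

// Burn check for a weekly reliability review: how much budget is left?
// 1.0 means untouched, 0 means fully spent, negative means the SLO was missed.
function budgetRemaining(slo: Slo, badEvents: number, totalEvents: number): number {
  const allowedBad = (1 - slo.target) * totalEvents;
  return 1 - badEvents / allowedBad;
}

console.log(errorBudget(userSearchSlo));                     // ~0.001 fraction, ~43.2 minutes
console.log(budgetRemaining(userSearchSlo, 120, 1_000_000)); // ~0.88 of the budget still left
```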

Another good thing is we introduced something called reliability reviews, which are operational reviews or operational retrospectives. But when you call them reliability reviews, it has two Rs, it’s R2, and it’s a lot cooler.

Again, remember, cool.

But jokes aside, again, we automated all of this, so we reduced toil. We actually took the format for our reliability review from Atlassian, which is a customer of ours.

We would list the incidents or the alerts we got during the week, the services and their SLOs, the toil work, and a couple of other things: investigations, action items.

Once we have this automated, we get beautiful tables like this.

So as a director, I do lots of tables. Spreadsheets are my favorite tool.

Green is good in general. If you had seen any red, it would have been bad.

But this is what we just go over every week. And even if things are green, we look at, “Hey, why were we close to that number?”

We’re very clear. Like, there are also some notes, like, “Hey, we could get even better at measuring this if we did these other things.” It’s very interesting.

And another powerful one is the one about alert fatigue.

Whenever we get an alert, we talk about it a lot.

We say, “Hey, did we need that alert? Should that have woken me up?”

That’s the first question. If the answer is yes, then we leave it.

But if the answer is no, then we should either reduce the priority or maybe tune the alert. Just reviewing these things reduces the pager load. But also going over the SLOs and talking about them every week, especially for things that have product managers, and showing these to product managers, is very important.
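As a rough sketch of that weekly decision, assuming a hypothetical alert record shape (this is not our actual tooling, just enough structure to show the logic), the triage boils down to something like this:

```typescript
// Hypothetical shape of an alert pulled from the week's pager history.
interface PagedAlert {
  name: string;
  firedAt: Date;
  wasActionable: boolean;         // did a human actually need to do something?
  thresholdTooSensitive: boolean; // did it fire on a blip that self-recovered?
}

type ReviewAction = "keep" | "lower-priority" | "tune-threshold";

// First question: should this have woken someone up? If yes, keep it.
// If no, either lower the priority (ticket instead of page) or tune the alert.
// Which of the two applies is a judgment call in the review; the rule below
// is just one possible heuristic for the sketch.
function reviewAlert(alert: PagedAlert): ReviewAction {
  if (alert.wasActionable) return "keep";
  return alert.thresholdTooSensitive ? "tune-threshold" : "lower-priority";
}

// Weekly summary: how much pager load are we keeping versus cutting?
function weeklyAlertReview(alerts: PagedAlert[]): Record<ReviewAction, number> {
  const summary: Record<ReviewAction, number> = {
    keep: 0,
    "lower-priority": 0,
    "tune-threshold": 0,
  };
  for (const alert of alerts) {
    summary[reviewAlert(alert)] += 1;
  }
  return summary;
}
```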

We did incident response. We worked with our customer success team and we worked with our infrastructure team on coming up with a new version of incident response, better definitions for severities, better ways of paging people.

We introduced new roles, so we didn’t have what’s known as a scribe role before. That’s the person that basically takes notes, and it’s very important, because otherwise no one knows what the hell happened and there’s no way to remember after all the adrenaline has gone down.

We ran the trainings. So we ran the training for everyone who was on call at the time. And whenever new engineers join, we kind of bunch them up and have them go through the incident response training.

We introduced distributed tracing across the org. We put it in a lot of the major systems. It’s working. It’s useful for detecting spikes or weird things in calls.

Before, we would have to take log IDs or trace IDs from five different log systems, have all the screens open at the same time, and just figure out, hey, what’s going on here? So this is very useful.
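As a hedged illustration of what that wiring can look like in a Node service, here is a minimal OpenTelemetry setup; OpenTelemetry is today’s standard tracing SDK rather than necessarily what we used, and the service name and collector URL are placeholders.

```typescript
// tracing.ts: loaded before the rest of the app so auto-instrumentation can
// patch http, express, database drivers, and so on.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "user-management-api", // placeholder name
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces", // placeholder collector endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// ---- service code (separate file) -----------------------------------------
// A manual span around a hot path, so the same trace ID follows the request
// across services instead of being stitched together from five log systems.
import { trace } from "@opentelemetry/api";

export async function lookupUser(tenantId: string, userId: string): Promise<void> {
  const tracer = trace.getTracer("user-management");
  await tracer.startActiveSpan("lookupUser", async (span) => {
    try {
      span.setAttribute("tenant.id", tenantId);
      span.setAttribute("user.id", userId);
      // ... the actual lookup would go here ...
    } finally {
      span.end();
    }
  });
}
```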

We rebuilt our whole rate limiting stack, making it faster, more reliable, and easier to maintain.

And I think there was another benefit. I don’t remember. But trust me, this was huge, because everyone got the benefits.

Remember, everyone uses rate limits. Everyone was a customer of the service. Everything we did here…very important, very valuable.
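I won’t go into the internals of the new stack here, but as a minimal sketch of the general idea, a platform rate-limiting service is typically built around a token bucket per tenant, something like this (the class, limits, and in-memory storage are illustrative only):

```typescript
// Minimal token-bucket rate limiter (illustrative; a real platform service
// would keep buckets in shared storage such as Redis, not process memory).
interface Bucket {
  tokens: number;
  lastRefillMs: number;
}

class TokenBucketLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private capacity: number,        // max burst, e.g. 10 requests
    private refillPerSecond: number, // sustained rate, e.g. 5 requests/sec
  ) {}

  // Returns true if the request is allowed, false if it should be limited.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const bucket = this.buckets.get(key) ?? {
      tokens: this.capacity,
      lastRefillMs: nowMs,
    };

    // Refill tokens for the time elapsed since the last check.
    const elapsedSec = (nowMs - bucket.lastRefillMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSecond);
    bucket.lastRefillMs = nowMs;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;

    this.buckets.set(key, bucket);
    return allowed;
  }
}

// Usage: one bucket per tenant, 10-request burst, 5 requests/sec sustained.
const limiter = new TokenBucketLimiter(10, 5);
if (!limiter.allow("tenant-123")) {
  // respond with HTTP 429 Too Many Requests
}
```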

We worked with a couple of teams, the release engineering team, the infrastructure team to reduce deploy time. We made deploys a lot faster. I think I have some numbers later.

Most importantly, we built a library that takes our standards for how we define infrastructure in AWS and runs remote deployments, so that’s more software, created by this SRE team, that all the teams that use those standards can benefit from.

And finally, we worked on a bunch of complex issues. If people had a memory leak that they couldn’t figure out, they would probably call us: weird exceptions, high latency.

What else happened?

Oh, yes, retry storms and those things where caches go haywire. So we had many of those. And again, those create that trust.

The state of SRE at Auth0 today

Question is, okay, it’s been a year.

Where are we today after all this?

This is what our work looks like.

It’s a very simple picture, but we have our platform group. SRE is a part of that.

We have other teams there that run, for example, our infrastructure, our networking, our internal services, our tooling.

And then we have other groups like IAM and developer experience that have product engineering teams, and SRE collaborates with all of them.

These are some of the numbers.

So adoption of our tools is happening organically, and this is another thing: we try not to force things on people. This is not just an SRE thing. This is a general Auth0 thing.

If things come organically, we prefer that. Otherwise, we might generate some motivation, but 5 out of 11 teams doing this is very good.

We are deploying five times more often with, I think it’s more than 10x faster deploys, which is huge if you think about the importance of being able to roll back fast enough, people not just staring at a screen saying, “I’m deploying,” and stuff like that.

80% of the services have tracing, which means that we’re past that 50% mark where you start getting incremental benefits from it. So you have a lot of things that have tracing enabled. You can use it to debug many system parts.

We solved five very complex issues that were causing what we call micro outages. So for like 10 seconds, a couple of processes would just not respond. That would result in either high latency or errors, and you just call the SREs and say, “Hey, what’s going on here?”

In a couple of cases it was a memory leak. In one case we had a pipe that was closing randomly, and that was generating errors because of how we did logging. So very interesting.
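As one hedged sketch of how you can catch this class of micro outage in a Node process, you can watch event loop delay with the built-in perf_hooks histogram; the thresholds below are arbitrary illustrative values, not what we actually alert on.

```typescript
// Detects event-loop stalls ("micro outages") in a Node.js process using the
// built-in perf_hooks event loop delay histogram (available since Node 11.10).
import { monitorEventLoopDelay } from "perf_hooks";

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

// Arbitrary illustrative threshold: flag any window where the worst observed
// event-loop delay exceeded one second, i.e. the process stopped responding.
const STALL_THRESHOLD_MS = 1000;
const CHECK_INTERVAL_MS = 10_000;

setInterval(() => {
  const maxDelayMs = histogram.max / 1e6;        // histogram values are in nanoseconds
  const p99DelayMs = histogram.percentile(99) / 1e6;

  if (maxDelayMs > STALL_THRESHOLD_MS) {
    // In practice this would emit a metric or a structured log for the team.
    console.warn(
      `event loop stalled: max=${maxDelayMs.toFixed(0)}ms p99=${p99DelayMs.toFixed(0)}ms`
    );
  }

  histogram.reset(); // start a fresh window
}, CHECK_INTERVAL_MS).unref();
```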

We helped bring the reliability of our user management API to over four nines, which was important again as part of the customer experience, and we reduced the rate limiting p99 latency by two to three times, depending on who you ask.

The future of SRE at Auth0

Overall, we are very happy.

We are entering Q3 2019, so the first-year anniversary, with a lot more things being asked of SRE than what SRE can actually do in a quarter.

And that is, kind of, our definition for success.

We are trying to figure out what we are going to do. So there’s a lot about that sponsorship aspect where we need to prioritize that on the engineering leadership level, make some tradeoffs. But we’re also starting to think about how we are going to grow this.

Disclaimer: this is a work-in-progress. It doesn’t mean it will happen.

Some people might not even like this, but our notion is that we are going to create more affinity between SRE teams and specific parts of the org, which is useful both for trust and for improving productivity, because once you get that knowledge in, it’s good to keep it.

So SRE PR, platform reliability, is the SRE team that would work with the platform group, making our storage better, making our deploys better, collaborating with all of the other platform teams.

AR stands for application reliability. That’s the team that would work with the product engineering teams in our IAM domain, making our authentication pipeline faster and better, working on their code bases.

And the same thing applies to DX, which gets an AR team as well. Then there’s OX: first of all, again, it sounds cool, but it stands for operational excellence. That’s the team that would work on things like observability and incident response, and eventually things like chaos engineering, at least pushing and evangelizing it across the org, resilience engineering, all of those things.

Again, disclaimer, this is not final. It will likely change. This is just the best thing we have as a vision today.

With all of that being said, I really appreciate you all being here on time. I did not have the confidence that these were going to be my final slides, so I have not uploaded them yet.

I will be sharing the link to them on Twitter, and I hope you enjoy it. Thank you.