I am Cory, and I work for Stripe. There’s actually a slide about that, so I won’t talk about it anymore. This talk is called “Building a Culture of Observability at Stripe.” I actually gave this talk at Monitorama in June, but I have updated it with all kinds of cool numbers. Also, at Monitorama, I wasn’t really talking about what we used. I mentioned Datadog briefly, but that wasn’t really the focus of the talk.
And so Ilan asked me if I could give this talk again and put all the Datadog stuff back into it. So I went over the slides last night and thought of some even cooler stuff, but I didn’t want to make him update the slides. So I’m gonna ask you to use your imagination.
So, again, Cory Watson. I go by gphat on the internet, Twitter, Github all that other stuff. I joined Stripe in August of 2015, so I’ve been there for just over…or just getting close to a year and a half now. Previously, I worked at a place called Keen IO that did analytics, also a Datadog customer. And before that I worked at Twitter, where I worked on observability there as well.
I’m a generalist by trade, which means I know a whole lot about a lot of things, but very little about any of them deeply. And, luckily, I have a lot of awesome people that work with me who know a lot more about this stuff and can make it actually work, whereas I just sort of guess.
So what we’re talking about again is creating a culture of observability. You could more generally apply this to say, like, if you want to put something like Datadog into your organization, how can you pull that off? Like, it’s one thing to say we’re gonna pay money to have Datadog and we’re gonna use it but it’s another thing to get all the folks that you work with to adopt and use this stuff in practice.
And so I want to tell you a little story about how Stripe worked when I joined and what we’ve done since then. When I joined, Stripe had some visibility but not really enough.
I tell a funny story that’s been romanticized a bit where I say, like, you know, on the first day when you join a company now they have you ship some code, right? They like get you in a room with all the other new hires and you write something you deploy it.
So they said, “All right, guys. We’re gonna write this and then you’re gonna deploy it to the website. You just type these commands.”
And so I raise my hand, I’m like, “Hey, so when I deploy it to the site, you know, Stripe processes billions of dollars in credit cards. How do I make sure that I don’t make it so people can’t do that? Like, if I ship it and I break everything, am I gonna know?” And they were like, “Oh, don’t worry about it. It’s just the site.” I’m like, “Oh, that’s funny. I know I’m new and stuff, but, no, tell me, really, how do I know?” And they’re like, “No, don’t worry about it, you just ship it.” I’m like, “Cool. I know what I do at Stripe now.”
That’s not really how it went. In my interviews at Stripe, they knew my past work and they knew that this was something I could help with. So it was kind of always in the cards, but Stripe needed this. And what they really needed was clear ownership of what observability was going to be at Stripe. There was no one doing any proactive work on it; it was all very reactive. We’ll talk more about that in a moment though.
Also, a lot of broken windows. This is an idiom I really like to use: when a window gets broken, people break everything after that because no one’s taking care of it. There was also a lack of confidence in the tooling, which we’ll talk a little bit more about as well.
A vision for the future also is pretty important. Like, people need to know that it’s going to get better at some point. And lastly, it was all very reactive work. These were things that like when there was an incident we’d make a dashboard for that so that the next time it ever happened, we would be sure and be able to find it, but very little proactive work was being done.
How to get people to invest in observability
So you are here today because you know that the type of stuff we’re talking about is important. You’re here because you care about the observability of your systems and being able to react to problems and hopefully solve them in a reasonable amount of time, or even preventing problems from happening at all. So the question is how can we get others to agree that this is important, and more importantly, invest their time into doing this work? Because it’s one thing, again, to have these tools, but it’s really important to get people to actually invest in them so that things get better instead of you just toiling away by yourself.
Facts about Stripe
So there was a cool thing I saw on Twitter one day where someone said, “It’s really hard for you to give a talk about your organization and who you are if you don’t give us some facts about your organization and what you’re dealing with.” So organizational facts about Stripe. There are about 550 people at Stripe.
I assume everyone knows what Stripe does. We process credit card transactions and make that easy on the internet, and we do other cool things like Atlas that let you basically start a company through the internet without the traditional hassle of all that paperwork. So 550 employees, about 100 percent growth in the last year.
So I’ve been there a year and a half, and I feel like I’ve been there a long time compared to… Well, you know, the new people come in and they think I’ve been there forever, which seems weird to me.
We have about 30 different engineering teams. I don’t have a lot of scope on things outside of engineering teams. We run about 230 services.
Stripe is mostly a monolith, but we have a lot of supporting services running in the background. A lot of our fraud detection and stuff like that is happening in other places. Thousands of AWS-based hosts. We’re mostly a Ruby shop. We’ve got some JVM stuff, especially on the data side of the house, as you can imagine. And tons of open source stuff. Like, all the open source things that everybody runs, we probably run all of those, too.
The size of the observability team is kind of a hand-wavy number, but there are five of us full-time. We also have an intern who’s been with us for a double internship, and we also have one team member on loan. So depending on the day of the week, sometimes we have more or fewer people.
The Meaning of ‘observability’ at Stripe
And lastly, what does observability at Stripe mean? We’re gonna talk more later about what observability means in general.
I enjoyed Alexis’ explanation and I have a textbook version, but observability at Stripe is: we are responsible for all the Datadog stuff and the integrations therein. We also work with Splunk, Sentry, PagerDuty, and all the supporting libraries, like all that you use in your runtimes to instrument your code and such. We work on those internally. And also the core dashboards that kind of define whether or not Stripe is working. We mostly own those and try to steward them and make sure they’re in good shape.
So let’s say that you are here and you are adopting Datadog for the first time or you’re thinking about getting Datadog, or even really anything that’s Datadog shaped. This could probably be applied to many things you might want to do in your organization, but we’re here today to talk about Datadog, so let’s stay there.
Key points of this talk
Where does one actually begin to make a change in an organization? How does one do this? These are the keys that I want to, kind of, leave you with today. If you get nothing else, this summary slide is pretty great.
Care about users
The first thing, and I can’t come up with a better way to say it, is that you genuinely have to give a shit about the people that you’re working with and their happiness. Especially when you’re in observability, like, we’re the people that page you. They have every reason to not like us, right? Like, we’re sitting here waking them up in the middle of the night if we ever have false positives or false negatives. There are really big consequences. And so every single day, when we’re working with the other engineers at Stripe, we have to really, really care about making them better, making them quicker, and making them more effective at their jobs.
Follow up on feedback
Also, follow up on feedback. Like, if people come to you and they say, “Hey, this thing that you made, like, it’s pretty great, but it would be really good if it did this one more thing.” Or if they tell you that they hate it. Follow up on that, and don’t just take that feedback and say, “Yes, thank you for that information.”
Attribute it back, and make sure that when you do the thing, whether it’s good or bad, even if you don’t do it, you at least follow up and say, “You know what, we just couldn’t make it to that.” Being accountable like that is very helpful.
Trend toward a better future
People understanding that you’re trending toward a better future is really important. Like, just sort of rearranging the deck chairs on the Titanic as it were is not really a state anyone wants to be in. So knowing that you’re trending towards some awesome thing in the future is very helpful toward keeping people’s spirits up.
And lastly, you can say you’re doing these things, but you have to actually measure. Like, the work that we’re doing is measuring, you know, these little computer things that we’re trying to get to do work for us. But measuring yourself, and the progress your team is making toward that future you think you have coming, is very important, and it keeps you from getting caught up in, you know, sort of spinning in place and not accomplishing anything.
Starting over with Datadog
So for us the act of joining up with Datadog was mostly just starting completely over. We basically had to burn everything to the ground. I didn’t spend a lot of time on this when I spoke at Monitorama because I didn’t want to tell anyone you should burn everything to the ground because that’s probably bad advice.
Spend time with tools before getting rid of them
But if you’re going to do this, I highly recommend you do two things first. First, spend time with the thing that you’re getting rid of because your users have been spending time with the thing that you’re getting rid of. They know what’s good about it and they know what’s bad about it.
And if you know what’s good and bad about it, it’s much easier for you to be an effective salesperson and get them to want to use the new thing. Because if they say, “Man, I really love that, you know, thing that Grafana does,” or something like that, you’re prepared; you can say, “Datadog does that, too.”
Or, you can counter with, “Well, yeah but what about that one thing you don’t like? Because we’ve solved that problem here.” I’m not picking on Grafana, it’s a great tool but you know, all of them are great in their own way and bad in some other ways.
Improve your current tools and systems
Also, we’re gonna be looking to improve the systems that you have. Datadog is not a tool that you, like, replace everything in your organization with. It’s very tightly integrated with many other services. Alexis mentioned PagerDuty earlier, which is a great example. Like, you don’t have to get rid of that. In fact, you’re gonna integrate more tightly with it and probably even improve it.
I know that our Nagios alerts just sent, like, really annoying text with no actionable stuff in them. But now our Datadog alerts through PagerDuty send cool images that help you get a little bit of context before you actually respond to the page.
Replace tools and systems if they can’t be improved
Lastly, if there are systems you just can’t improve, you can rip them out and completely replace them with Datadog.
We’re still in the process of getting rid of Nagios. It’s, sort of, like a disease that we just can’t completely rid ourselves of. I’m gonna pause to take a drink while everybody giggles about that one. So we’re still working on that.
Leverage past knowledge
But more importantly, these past systems have their own failure cases and their own success cases. I joke that it’s kind of like archaeology, but it’s a social archaeology: the people working in your organization have used a lot of these tools and they have a ton of information they can help you with. And so before you embark on this great journey to replace everything with Datadog, or whatever it is, be sure to seek those people out, because they’re gonna appreciate being consulted, which we’ll talk more about in a moment. And also, you’re gonna learn a ton about what not to screw up the next time you try to replace something.
So why Datadog? Why should you use Datadog for this stuff? Well, I’m biased: I’ve been a Datadog user for many years and I quite like it. I’m also biased because of its general-purpose design.
The observability stack at Twitter was not opinionated; it was basically very similar to Datadog in that it had a simple interface for getting metrics in, and then you built dashboards and monitors with a rule-based language that could be whatever you wanted them to be. So since Twitter observability worked that way, and that’s where I sort of earned my wings, Datadog was very appealing to me.
The velocity of Datadog
Also, the velocity. Like, there are all these Datadog engineers who are here, or who are representing all the engineers and other people, support people and product managers and such, who are still back at the office, and they provide me with a ton of velocity.
They make me look good, because I’m just, you know, a team of like six or seven people doing this work, and then there are hundreds of people back here doing this work to make the observability stuff better for me. And that’s awesome. It’s like the build versus buy thing, right? Like, this is really helping us in terms of buy.
But at the end of the day, I feel really great about it because even if Datadog misses something, like they didn’t build that feature I wanted or they didn’t add that thing that I wanted to do, almost all the tools that I ever need from them are open source, which means I can go in and add that feature, or make whatever changes I need to make it better.
Helpful and friendly
Lastly, they’re super, super friendly people. I highly encourage you to run up and talk to many of them. This will be the first time I’ve seen many of them in person, although I know many of them through Slack or Twitter or any of the other ways I can find to bug them and get them to make screenboards and timeboards not different.
The friendly and helpful staff is a huge thing for me. Like, I recently spent 45 minutes on the phone with another company, not really a competitor, trying to get them to understand that there was a CSS bug in their product. Datadog has never made me do that. They just believe me when I say that something’s not right, and, you know, they trust me as a user, which I really appreciate.
Importance of empathy and respect
So let’s step away from talking nice about Datadog for a moment and go back to talking about how you can make change in your organization. The biggest thing, I think, other than my overview slide, if we get into individual slides, is empathy and respect.
This is something our industry is not generally known for. I’m sure you’re all the most empathetic and respectful people in technology here today, but generally we have this kind of cold, dismissive stereotype about technology folks.
People are not generally evil
Well, I’m here to tell you today that you may think the people you work with don’t care about monitoring, or uptime, or stability, or any of the other things that wake you or them up in the middle of the night. But the truth is they do. They’re not trying to be evil, they’re just busy.
People are busy and stressed
They’ve got their own responsibilities and things that have been asked of them. They’ve got deadlines, they’ve got project and product managers, and just plain managers, standing over them and asking them to finish up all these features. And they don’t have time to do all the things that you need them to do: to instrument all the code points, set up all the monitors, and be sure that they’re not verbose and annoying.
They’re also pretty stressed out a lot of the time. And they’re doing the best with what they have available. They have a few minutes out of the day that they can take to set up monitoring, or build that dashboard, or refine whatever that thing is that you’ve got in your monitoring systems. And they’re doing the best that they can.
Being a hater is lazy
But they’re definitely not lazy, and you being a hater about it is also extremely lazy. At a lot of tech conferences, I feel like people just stand up on stage and tell you… It’s very easy to be negative, I guess, is what I’m saying. I don’t want to call out all tech conferences, but I highly recommend that you embrace the positives much more than the negatives when you’re trying to get people to change, because you catch a lot more flies with sugar than with vinegar. Is that the saying? I should know that.
Help people be better at their jobs
Anyway, the goal at the end of the day, your job in working with these tools and Datadog’s job for us, is to make us more powerful and better at our jobs.
Replaced existing system
So we’ve decided we’re gonna do this replacement. We’re approaching this with empathy and respect and trying to move these systems. Asking people to overcome their momentum is hard. We’re asking people who come in every day and who are used to looking at certain dashboards or working a certain way to somehow now work in a different way.
Knowing that you’re asking people to do this upfront is important. You’re asking a lot of people. It’s like someone saying, you know, when you wake up tomorrow you’ve got to suddenly start brushing your teeth with the opposite hand. You’re not gonna like that and you’re probably gonna get toothpaste all over your shirt. I do that when I brush my teeth with my normal hand much less the opposite hand, but yet people still let me run their operations. I don’t know why.
Sometimes you just have to declare bankruptcy, technical bankruptcy, on these things. We had to do that with StatsD. You know, everybody I’m sure has worked with Graphite in some capacity and the whole dotted naming thing.
Tags help cure ops headaches
Like, that doesn’t translate into Datadog. Datadog doesn’t support wildcards. They technically could, maybe, but it would probably not be awesome for you, because then you’d keep perpetuating silly Graphite names instead of using tags. Also, getting rid of this stuff has saved us a lot of ops headaches.
This was a really big pro we could throw to people: we’re not gonna have to deal with the huge numbers of Graphite boxes and the gigantic StatsD drop rate. UDP for metrics is convenient, but technically not always sound. We were sometimes dropping 50 percent of our metrics, and people didn’t know, because StatsD doesn’t deal well with that.
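To make the tags point concrete, here’s a minimal sketch (the metric name and segment meanings are made up for illustration) of how a Graphite-style dotted name, which bakes dimensions into the name, maps onto the one-metric-name-plus-tags style that Datadog uses:

```python
def dotted_to_tagged(dotted, fields):
    """Split a Graphite-style dotted metric name into a base metric
    name plus tag strings, given the meaning of each leading segment.
    Purely illustrative; real naming schemes vary wildly."""
    parts = dotted.split(".")
    # Pair each leading segment with its assumed dimension name.
    tags = [f"{key}:{value}" for key, value in zip(fields, parts)]
    # Whatever remains is the actual metric name.
    name = ".".join(parts[len(fields):])
    return name, tags

# Hypothetical dotted name whose leading segments mean
# service, region, and host:
name, tags = dotted_to_tagged(
    "web.us-east-1.host42.requests.count",
    ["service", "region", "host"],
)
# name == "requests.count"
# tags == ["service:web", "region:us-east-1", "host:host42"]
```

The win is that every host no longer mints its own metric name, so you aggregate or slice by any tag instead of wildcarding over name segments.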
We’re still in the process
Lastly, we’re still in the process of ripping this stuff out. I don’t want to stand up on stage and be like, “Hey, we just totally swapped it out and everything’s gorgeous and beautiful.” That’s not true at all.
Graphite, StatsD, yeah, all those things are still running at Stripe. They’re just not doing anything important. In fact, someone mentioned, like, I think a week ago, they’re like, “Hey, I tried to go to Grafana and it just doesn’t work anymore.” And I’m like, “Shucks. We’re not gonna fix it; we’re just gonna leave it there, because the metrics didn’t work for you anyway.” So we’re still there.
So this is probably one of those things… I didn’t know that this was a thing until after I had been doing it for a little while. But there is a Japanese concept called “nemawashi.” The pretty explanation of it is that if you were going to move a tree from one place to another, you dig a little bit of the dirt around the tree, go get some of the new dirt from the new location, and put it around the roots, giving the tree a chance to taste its new environment and learn what the future is going to be like, before you just rip it out of the ground rudely and stick it in a new place.
This works really well with people, not necessarily digging holes and sticking them in. That would be weird. But start small. You’re a great guinea pig: use yourself, learn these things, then start to lay a foundation by reaching out to people across your organization and just showing them what you’re thinking of doing. Each one of these is a learning opportunity. Like, how can you take the feedback that you’re getting from them and make your next pitch even better?
If you actually go to Wikipedia and read about this, it even discusses that Japanese managers are offended if they come to a meeting and you introduce a concept that they didn’t know was going to happen. And you can imagine how that would feel even in your organization: if you show up at that all-hands meeting and someone tells you some big change happened and you didn’t even know, you kinda feel like a jerk. Whereas if you knew, you feel a little smug, like, “I knew that was gonna happen already. I’m hip. People tell me stuff.”
But in the end, asking how you can improve is my favorite part of this. Like, not only are you seeding this change but you’re also telling people that, “Hey, this change is coming. And can you help me pitch it to the next person?” This requires a little bit of humility, but it’s very powerful.
Lastly, this is gonna give you a lot of opportunities to engage with dissent. You’re gonna find people who don’t really want to change. Those people are not your enemies. They are the people you have the most to learn from. When you find people who are dismissive and difficult about what you’re trying to do, those people are going to give you a ton of great information that you can apply either to the other people who are difficult, or to people who are kind of on the fence, like, you’ll bring them over.
I have, sort of, a background in customer service, and I’ll tell you that most people in a customer service situation want less to reconcile the situation than you think you have to give. You don’t have to offer them so much. Like, they don’t necessarily want it for free, they just want a small discount, just for their time, as a recognition.
And so it’s usually fine to engage people that are not totally hip to this. This joke has been up here for a while, but in the end, you know, there’s always whiskey if you can’t figure out how to get them to go with you. Some people are not going to bend and you’ll just have to, sort of, move the ship of progress past them and maybe have a drink.
Identify power users
A big way to get change in your organization is to find people who are power users. Find that person and that other team who really, really likes the idea of what you’re doing with this new product or you know, Datadog or what have you.
Find those people, talk to them and say, “What can I give you to make this easier for you?” Use those people as levers to move the weight of your whole organization.
We had some folks on our operations teams, and I just went and talked to them one day. Not only did they learn how this worked, they redesigned whole systems to facilitate less of a batch-oriented monitoring approach and more of a real-time approach. They also taught everybody else on the team. So my team didn’t have to go and teach this team how to use these tools; they did it for us, and it was awesome. And we get to watch them grow now.
This is not about my success or about my team’s success, this is about Stripe’s success and about this change in culture. And so by doing this, we’ve empowered all these other people to do the work for us. You could also call that delegation I suppose if you wanted to be more crass about it.
Finding the value
Lastly, in this space, I wanna talk about value. You are doing this for a reason, right? You’ve chosen to switch to Datadog or to get rid of some old system. To do so, you need to be adding some type of value. We don’t make change without adding some type of value, or else why are we doing it?
What is it you’re actually trying to improve? Are you trying to improve mean time to detection? Are you trying to decrease the number of incidents? Are you trying to improve mean time to remediation?
There’s all kinds of cool metrics you could be improving, but, A, what is it you’re improving? B, how can you measure it? And then, C, constantly think about whether or not this is the best way to do it. Not necessarily, like, should I be switching to Datadog? You don’t have to think about that every day. Some decisions you should live with for a little while. But is this individual approach I’m taking the best way you could be approaching this problem? Don’t be afraid to shift gears if you have to.
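As a rough illustration of measuring those, here’s a sketch (all the incident timestamps are made up) of computing mean time to detection and mean time to remediation from incident records:

```python
from datetime import datetime, timedelta

def mean_delta(incidents, start_key, end_key):
    """Average the gap between two timestamps across incidents."""
    gaps = [i[end_key] - i[start_key] for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)

# Hypothetical incident log: when the problem started, when a
# monitor detected it, and when it was remediated.
incidents = [
    {"started":  datetime(2016, 1, 1, 2, 0),
     "detected": datetime(2016, 1, 1, 2, 12),
     "resolved": datetime(2016, 1, 1, 3, 0)},
    {"started":  datetime(2016, 2, 1, 9, 0),
     "detected": datetime(2016, 2, 1, 9, 4),
     "resolved": datetime(2016, 2, 1, 9, 30)},
]

mttd = mean_delta(incidents, "started", "detected")  # 8 minutes
mttr = mean_delta(incidents, "started", "resolved")  # 45 minutes
```

The point isn’t the arithmetic; it’s that once these numbers exist, you can watch them trend as you roll out the new tooling and know whether the change is actually adding value.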
What is observability? Why do we want it?
So Alexis talked about this a bit ago. The word “observability” gets used a lot. Maybe just by me. I don’t know. I feel like I see it a lot but I think I’m just tuned to it.
Why do we want observability and what does it really mean? It’s not just a replacement for the word “monitoring” and it’s not just about metrics.
So observability is actually a real thing. If any of you in here are electrical engineers or studied engineering, you probably, at some point, touched control theory. If you know a lot about control theory, please wait until later to correct me for all the things that I’m saying incorrectly about it. But the gist is that observability in control theory is about measuring how well the internal states of a system are working by measuring its external outputs.
You can’t see into, say, the engine running in your car to know whether or not it’s working, but you can look at the output. Like, I’m sure many of you have, at some point, in your lives been in some vehicle or something that you press the gas to go and the thing didn’t do the thing it was supposed to, and you could tell by the external output that the system was not working the way that you expected. So we need to figure out how to replicate that type of thing in the work that we’re doing every day.
And so the systems that we work on, the services, the products, they output work. If the internal state of the system goes bad because the database is broken, or a machine died, or AWS us-east-1 is busted, or if DNS stops working, to quote current events, then you need to know. And the way that we know is by adding sensors.
So this is the part where anyone who’s studied control theory maybe should cover their ears until later, but we’re gonna generalize this for use with software engineering. This is a feedback loop. So we’ve got a reference coming in that’s saying “This is a thing I want to happen.”
We’ve got, in this case… You could generalize this chart a lot, but we’ve got a programmer who is supposed to take this idea and turn it into some system. So we all write code that then is supposed to output work on the far side.
So we need to add sensors here, such that when these sensors notice things, the programmer is notified, such that they can then improve the system.
This is what we do every day. Like, scaling is all about sensors, feedback loop, and improving the system. This is an extremely powerful chart that you should like burn into your head and use with almost anything. I’m actually gonna use it for jokes later. It’s great. I love it.
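The chart’s loop, generalized for software, can be sketched like this (every name here is made up for illustration; the real loop involves humans and dashboards, not a function call):

```python
def feedback_loop(reference, system, sensor, notify):
    """One pass of the generalized control loop from the chart:
    run the system, measure its external output with a sensor,
    and notify the programmer when the output drifts from the
    reference (the thing we wanted to happen)."""
    output = system()
    reading = sensor(output)
    if reading != reference:
        notify(f"output {reading!r} != reference {reference!r}")
    return reading

# Toy example: the "system" is supposed to return 200 OK.
alerts = []
feedback_loop(
    reference=200,
    system=lambda: {"status": 503},    # broken internal state
    sensor=lambda out: out["status"],  # measure the external output
    notify=alerts.append,              # "page" the programmer
)
# alerts now holds one message describing the drift
```

The programmer then uses the notification to improve the system, which is the arrow that closes the loop.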
Flat org work ethic
So, now that we’ve talked a little bit about what observability is… I should really have transitional slides, but I had to squeeze this down to fit today. Many of you may work in an organization like mine that’s very flat. Stripe has no titles. We are basically just a bunch of people who sort of collectively try to get work done.
That can be a big challenge sometimes. How do you make change as a new person in an organization like that? Now, if you’re, like, the director of monitoring or observability or something, you can just sort of waddle into things and be like, “We’re doing it this way.” That’s great. I don’t have that. It would be cool if people just let me be the king of the world and decide these things, but otherwise it’s a challenge.
Starting is hardest part
So the hardest part of that, I think, is really just getting started. You have to come in every day and start plugging away at this work. So do that. Come in every day with this goal in mind, “What am I going to do today to push this rock a little further along?”
Be willing to do the work
There is a preposterous line of yaks between you and the other side of this. The only way that’s going to get better is if you actually shave them. After I gave this talk at Monitorama, someone came up to me and said, “You just described a tesseract yak, which is like an infinitely dimensional yak.” And I was like, “Okay, I’m gonna use that. Thank you.” The line of yaks is long, and you’re gonna have to shave them all. That’s the only way to get the work done. It may not be you specifically, it may be your whole team, but the yaks must be shaved.
“Stigmergy” is something that I found on Wikipedia at one point. You can look it up on Wikipedia and become an expert in it, like I have now. The idea of stigmergy is: how do systems that don’t have orchestration get orchestrated?
Like, how does a group of ants know how to function as a colony when there’s not really any one ant going around telling each other ant what to do? Stigmergy is about when one ant comes back with, like, you know, a big grain of sugar on his back, and the other ants are like, “Whoa, that guy’s got cool stuff. I’m gonna follow him. I’m gonna check his chemical trail and see what he’s doing and try to do those same things.”
It’s this idea that if you demonstrate work in a group of people, other people will probably start working with you. And so if you come in every day and you keep plugging away at it, other people will show up. We sometimes call this grind or hustle; same kind of thing.
Strike when opportunities arise
So lastly, strike when opportunities arise. Some of the worst, most stressful, scary times in a place are when something is broken, but those are also the times when you have the most leverage. Because if you can demonstrate value at a time when something is broken, you have just totally one-upped all of the other tools sitting around that you’re trying to replace, or what have you.
And this is really good stuff. This is also good for your own stock within the company, or what have you. Like, this is promoting you and showing that you and/or your team have the ability to make this work even better.
Promoting the work of the team
So on that note, engineers, I have found, are not typically all that excited about advertising their work.
I’m not really sure why. We tend to have a lot of hubris, and think that we can do anything, yet writing up a big blog post that describes all the work that we’ve done is sometimes not our favorite thing. But people don’t know you’re doing work if you don’t talk about it.
You can do so without sounding like a pompous ass, and one of the best ways to do that is to promote the work of the team instead of yourself. This is not just you going out and doing this work, this is a collective team. As I mentioned earlier, I’ve got a team of six people who work with me on this every day, and it’s only through our combined efforts that we’re able to do this.
More so, though, I mentioned leveraging other teams and finding power users. Promote their accomplishments. Every time I see an email at Stripe where someone says, “Look at this cool thing we shipped,” and there’s a Datadog chart in it, I know that my team helped make that possible. And that promotion is a great way to get other people to look at the success we have created in other teams: these are our use cases, these are our success stories.
Ask to help so you can learn
Also, if you see someone doing something, like, say, they’re embarking on a cool reorg of that service that everybody knows is terrible, and they’re gonna start over again, this is a great opportunity for you to sort of sidle up beside them and say, “Hey, can I help you with this? Can I help you instrument this? Can I help you with the code points? Can I help you set up the monitoring for this?” or something like that. And then use that as an opportunity to learn. The more you help other people around you, and the more they see you as a helper, the more you’re going to learn from them, and the better your results are going to be in the future.
Team branding (‘Observa Bees’)
Last, we have a cute thing in observability at Stripe. At some point we came up with “observa bees” as a cute pun on observability, and so we have a brand now. We actually have little emojis, like, I don’t know, five or six different bee emojis that we slap on almost everything that we do. We use them in the signatures of emails, and people call us “Observa Bees.”
They join our Slack channel and use a bee emoji before they say something, with a little wave. I’m not sure where it came from, but I love it because it’s just something that people can seize on, a way that they can address us.
Strive to be helpful
We also try to be extremely helpful. Datadog has set a great example of having awesome customer support. We try to do the same thing. How can we help the other engineers go that extra mile? When they ask us a question, we don’t just answer it. We answer it and then we partner up with them to make sure that the answer that we gave worked, and that we’ve seen it through all the way to completion.
I think when I pitch working in observability at Stripe to people (always be recruiting), we always try to throw in the whole polyglot thing. We probably touch every code base within Stripe. We write every programming language. This team has the opportunity to touch anything at Stripe. And that’s awesome.
Not everybody gets that possibility. You often get siloed, and you work in, like, your little part of databases, or query parsing, or whatever it is. We try to work everywhere, and we make that part of our brand.
Make it easy and good
So the tools that you’re gonna build for people, and the stuff that you’re gonna do with Datadog in your organization, it’s your job to make it easy and good.
I like to use email as an example of making things easy but not good. How many of you have received email today? You don’t have to raise your hands because I know you all did. How many of you were excited about the email you got today? None of you were.
Yet, if you received a package, like if there was a box sitting on your doorstep this morning, you were probably excited, because that’s a thing that’s both easy and good. The thing that comes to you is usually something that you asked to have come to you, right?
So it’s very easy to make easy things, but it’s very hard to make easy, good things, whatever it is that you’re doing, process or engineering work. So it’s important that you make it easy or even automatic to do things right, and extremely hard to do wrong. If you don’t want people to build monitors that trigger constantly and wake them up in the middle of the night, you have to make it difficult for them to create those types of monitors, which may mean that you create monitors for them that follow good standards, or that you’ve lent them or enforced some kind of standards, or what have you.
Quality is really, really, really important for this stuff. It is extremely important that you do not wake people up at 3:00 in the morning unnecessarily with a bunch of trash that you’ve generated and then be like, “Sorry, guys, you know, we just did that. Our bad.” You have to follow up on that stuff.
So I’m gonna shift now into talking some more about specific things that we do with Datadog where we have tried to make some of these things that I’ve described to you possible. So this is the show and tell part.
“Automated monitors” is our sort of brand name for monitors that we have created. We use a feature of Datadog called “multi-alerts.” So when you make an alert with Datadog, you’re often just saying, “If this threshold is crossed for this metric.” But there’s also a powerful feature where you can say, “If this threshold is crossed for this metric, for any specific tag therein.”
So let’s say, for example, you’ve got these common problems like disk space or swap space utilization on a UNIX box. Because each host is tagged with the name of the team that owns it, we can set up monitors automatically that detect when these things happen, with the host’s team as one of the tags we alert on. And then, downstream, we will create an alert for you automatically anytime this happens. We will notify you; you don’t have to set it up, because I don’t have to go to every team and ask, “Did you make that monitor to make sure that swap space doesn’t run out on your box?” That’s something no one ever remembers to do until three in the morning when all the swap space is gone.
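As a concrete sketch, a multi-alert is just a monitor query with a `by {…}` clause, so the threshold is evaluated separately per tag combination. The metric name, tag keys, and threshold below are illustrative assumptions, not Stripe’s actual monitors.

```python
def multi_alert_query(metric, group_by, threshold, window="last_5m"):
    """Build a Datadog-style multi-alert monitor query. The
    `by {...}` clause makes the monitor evaluate the threshold
    separately for every combination of those tag values, so one
    definition covers every host and can notify the owning team."""
    groups = ",".join(group_by)
    return f"avg({window}):avg:{metric}{{*}} by {{{groups}}} > {threshold}"

# Hypothetical swap-usage monitor, grouped by host and owning team:
query = multi_alert_query("system.swap.pct_used", ["host", "team"], 0.9)
print(query)
# avg(last_5m):avg:system.swap.pct_used{*} by {host,team} > 0.9
```

One definition like this replaces a hand-made swap monitor on every team.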
So we’ve done this for people. But we have to be very, very careful, because we’ve now basically snuck in an alert that’s gonna page you, perhaps at 3:00 in the morning, and you didn’t even know it existed. You have no state on it. You didn’t make it. You don’t know what to do with it. It’s very, very important that anything you create that notifies somebody be actionable. Just complaining at people at 3:00 in the morning is bad. I must have some scar tissue around 3:00 in the morning, because that’s what I always cite as the worst time, but I don’t think anybody likes getting paged, period, much less at 3:00 in the morning.
So when you show people that something is broken, even if they didn’t ask you to, and you show them how they can fix it easily, they care significantly more than if you just say, “Hey, guess what? You’re out of swap space,” and then run away. They want that help.
Example: automatic ticket creation
So this is an example of a ticket that we generate automatically. We only do this type of automated monitoring for things that are not going to page you right now. We only do it for things that are perfectly fine to deal with during business hours.
So this is JIRA. The important part of this is that the reporter is Botty McBottface. That’s probably the best feature of all of this. It seems like we all got really lazy with naming after Boaty McBoatface, and now everything is Something McSomethingface.
So over here, we’ve labeled every single ticket that we generate with a couple of things, like “Datadog” and “ticket maker,” the name of the thing that creates it. But we’ve also got the name, or the number, of the monitor. This is really important, because sometimes somebody will do something that causes us to accidentally generate, like, 50 tickets out of nowhere. I need to be able to quickly say, “All right, find every open instance of a ticket created by this monitor so that we can close it and apologize to the users.”
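That bulk-cleanup query can be sketched as a bit of JQL construction. The label names here (“datadog”, “ticket-maker”, and the “monitor-<id>” convention) are hypothetical, not necessarily the exact labels Stripe uses.

```python
def open_tickets_for_monitor(monitor_id):
    """Build a JIRA JQL query that finds every still-open ticket
    the bot created for one monitor, so an accidental burst of
    tickets can be found and bulk-closed."""
    labels = ["datadog", "ticket-maker", f"monitor-{monitor_id}"]
    clauses = [f'labels = "{label}"' for label in labels]
    return " AND ".join(clauses) + " AND status != Closed"

jql = open_tickets_for_monitor(447391)
print(jql)
# labels = "datadog" AND labels = "ticket-maker" AND labels = "monitor-447391" AND status != Closed
```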
Here, we are pointing out that if this monitor is bothering you, if this is not something you want from us, please give us feedback. I have more slides on this in a moment. We’re also taking advantage of Datadog’s included charts to automatically show them what’s going on in the ticket, which is very helpful. I forget how many of these we have; it’s something like seven or eight. We monitor swap space, services that are in restart loops, disk space, cron job failures, all kinds of things. Running out of inodes is another one.
The other common thread to all of these is that they all have simple solutions. We know how to fix this. Like, if you’ve ever had a box run out of inodes, you know how to fix it. You’ve probably forgotten by now and have to be reminded sometimes, but we have solutions for that, too, because we made things called “investigation dashboards.” When you get linked or pinged by one of these tickets, we always have a link that brings you to a screenboard that we’ve made. Screenboards have to be used for this because the widgets are cooler; we can’t use all of these features in timeboards.
But, here, we’re giving people, like, “This is an automated message.” We’re telling them that. We’re saying, “It was created for you by observability. You can learn more about it, we have a whole wiki page, and also, please, if you have feedback, tell us.”
This is everywhere. We put this repeatedly. It’s also… Oh, don’t press that button. I don’t know which one it was, actually. You know, we’re also telling them… We’re beating them around the head and shoulders, like, “Please give us feedback if you’ve got stuff.”
But this one is for daemontools, which we use to keep services running. We detect restarts. So here, the service was restarting, like, 800 times a minute. That’s probably not desirable, and so we’ve shown them very clearly where the problem occurred. We’re giving them guidance, like, mouse over this, because everybody knows you can mouse over the charts in Datadog and you get the tags as an overlay.
So we’re basically walking them through how to use this, and then when we’re done, we’re actually giving them a runbook down here. This is how you solve the problem. This is curated by us.
We take feedback all the time: if you think you’ve got a better way to solve this problem, please let us know. Well, you all could, too, but I mean the users that get alerted.
And then lastly, we get to the feedback section. So this is just a Google form; there’s no cool whiz-bang technology behind this. I just go and read them every couple of days. So we ask people…
This is a great example by Julia Evans, b0rk on Twitter. You may know her. She filled this out because we had a service that was continually restarting and we notified her.
And so we ask, like, “Was this helpful? Did we just give you a bunch of junk?” And so she was nice enough to give us a five. She said it was great; it took her a minute or two to find it, but, hey, everything was right there. So we asked for suggestions on how to improve. And then lastly, since we’ve got you here, is there anything else we can do? Is there any other way that observability can improve? “Keep being amazing.” I’ll take it.
So for this, I wanna give a shout-out to a friend of mine, Kelly at Simple. He showed me a cool dashboard that they had built where they took every single incident and charted, like, how many of them were happening, and by what team, or whatever.
For this dashboard, we scrape all of the data out of PagerDuty every single day, bring it into Redshift, and then use a product called Looker, which is a commercial product, to basically do BI on it.
This was a dashboard built by Steven, our intern on the observability team. It specifically has the ability to choose which escalation policy things went to, the date range, the urgency, and what the SLA was supposed to be. And we’ve got cool charts here that show the number of incidents by day.
This is my favorite one, though: the incidents by origin. This big one here is Datadog monitor 447391. I don’t know which one that is, but 32 percent of all pages are coming from that one monitor. This helps engineering managers know how to address their on-call toil. Like, is this something we could improve? Is this monitor being too loud? Is this something we can make better? So this is all about value: how can we both bring down how much we’re bothering you and also create more value in your team?
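The “incidents by origin” chart boils down to a group-by over the scraped incident records. A minimal sketch, with hypothetical monitor IDs chosen so one monitor accounts for 32 percent of pages:

```python
from collections import Counter

def pages_by_origin(incidents):
    """Count incidents per originating monitor and return each
    monitor's percentage share of total pages, noisiest first."""
    counts = Counter(incident["origin"] for incident in incidents)
    total = sum(counts.values())
    return {origin: round(100 * n / total) for origin, n in counts.most_common()}

# A hypothetical day of pages: one monitor dominates the on-call load.
incidents = (
    [{"origin": "monitor-447391"}] * 32
    + [{"origin": "monitor-120"}] * 28
    + [{"origin": "monitor-88"}] * 25
    + [{"origin": "monitor-12"}] * 15
)
print(pages_by_origin(incidents))
# {'monitor-447391': 32, 'monitor-120': 28, 'monitor-88': 25, 'monitor-12': 15}
```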
This slide is kind of small. It’s mostly here because, I think, this is a problem in the whole industry. Many, many tools let you build all these alerts and monitors and put them in, and they’re great, but there are very poor ownership stories.
I don’t know to whom this monitor belongs. We often have to sort of hack this in, whether it be Nagios or Datadog or what have you. When the company is small and there are, like, a hundred of you, you all kind of know each other, and you know who’s responsible for what. And if it breaks, there are probably 70 people who know how to fix it.
Nowadays, though, we can’t do that anymore. We’re specialized. There are people who only understand some systems at Stripe, and therefore ownership is important. You don’t want to page the wrong person. And so we’re still working out how to solve this problem, but I just wanted to throw it up here because not everything’s rosy. Ownership is very difficult for us.
Did creating a culture of observability work?
So did creating a culture of observability work? I mean, obviously I’m not up here asking you to hire me, so I’m still employed, which means, I think, I’m doing an okay job.
Yes, we have totally… It was very weird, for me… I gave this talk first to Stripe people and it was, kind of, weird. I was like, “Yes, it worked.” No one argued with me. So, I think, we’re doing okay.
For some teams, man, it’s been great. We have changed some teams; some teams are shadows of their former selves. Strong, strong champions, and huge improvements to their confidence, because that’s a big thing you get from this: you’re confident that your services are working well and that they’re not crappy.
Other teams, mostly the same. Not every team totally bites on this. And lastly, there are some teams that are like, “Wait, what is the observability team?” These are really rare, but those are the ones I’m the most interested in. How can we get over and help those people? But they’re very rare.
As far as usage goes, I’ve updated this slide very particularly to reflect changes since I gave this talk the first time in June. Greater than 100% growth in every single measure about, like, the nouns we’ve created in Datadog.
So, 450 dashboards. That’s up 100 percent over June, so it was already big then. There were 339 dashboards then; now there are 450.
The notebooks feature is probably…we don’t count notebooks. That’s probably gonna help us because I know I have Cory testing and Cory testing 2. One of which is a timeboard and one of which is a screenboard, for reasons that frustrate me. But I told all the Datadog people that I would often complain about screenboards and timeboards, so this is also a joke.
We also have 400 different monitors. This is awesome. Like, this is 400 things that are monitoring whether or not Stripe is functioning and working well. In the old system, there were fewer than a hundred, and no one even trusted them. Often the way people knew that something was broken was that they just watched all the time, or a customer told us, or we used the tried-and-true system of being told via Twitter that our stuff was broken.
Seven thousand five hundred eighty-seven metrics doesn’t mean much, because we have a lot of tags now. So how many metrics do we really have? I mean, I know we have one metric whose cardinality is something like 100,000 over an hour, so that one by itself is that many time series. It’s not really a comparison.
Back when I gave this talk originally, I was so proud: we had a 2.5 percent response rate to the automated monitor tickets we had created, all of the responses were positive, and the average was 4.5. I’m sorry to say that’s not true anymore, but mostly because people say, “This cron job that failed, I didn’t put it on this box.” And I’m like, “Yeah, but you own the box.” There’s some friction internally around who we should really be telling that a problem is happening, and that seems to be the source of the less-than-stellar feedback we’ve gotten on some.
But it’s still not really bad. They still appreciate being told. They just wish we’d tell somebody else. I do, too. I wish I didn’t get paged.
Problems, things that we’re having trouble with, though. I did a really poor job when I designed our naming scheme: I trended toward really general names with a lot of tags to make them more specific. Well, that really quickly gets you into tag cardinality problems.
So for those not familiar, each unique combination of tags and name generates one time series, kind of, on the back end. And so when you say, “I want to sum them,” you’re asking to sum a lot of potential time series under the hood. Datadog seems to be technically keeping up with us, because as we’ve increased it, their combination of caching and performance improvements has made it mostly okay.
But some of our most core metrics, like how many API requests Stripe is handling, are also the ones where we have the most desire for people to put more tags on; therefore, we get the most cardinality; therefore, they perform slowly. This is something that I wish I had done better in my naming. We’re starting to shift now to where the service is in the name instead of being a tag, which significantly drops the cardinality, plus some other stuff we’re doing with aggregation that I’ll talk about later.
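To make the cardinality point concrete: the number of back-end series behind one metric name is the product of the distinct values of each tag on it, so moving the service out of the tags and into the metric name shrinks what any one query has to touch. The tag counts below are made-up illustrative numbers.

```python
def series_count(tag_values):
    """Each unique combination of tag values behind a single
    metric name produces one back-end time series."""
    n = 1
    for values in tag_values.values():
        n *= len(values)
    return n

# One general name ("requests.count") with service as a tag:
general = series_count({
    "service": ["api", "webhooks", "dashboard"],
    "endpoint": [f"ep{i}" for i in range(50)],
    "host": [f"host{i}" for i in range(200)],
})

# Service folded into the name ("api.requests.count", ...): each
# name now carries only its own endpoints and hosts.
per_service = series_count({
    "endpoint": [f"ep{i}" for i in range(50)],
    "host": [f"host{i}" for i in range(200)],
})

print(general, per_service)  # 30000 10000
```

A query scoped to one service now scans a third of the series it used to.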
For an individual service owner, knowing what metrics are available to them is tricky. They can go to the metric summary, not the Explorer, and look for metrics, but it’s very difficult to know which ones are for me. If I use a common framework at Stripe, how do I know what metrics are available to me? I want there to be, like, a service catalog where I know what my service is and therefore what metrics are relevant to it.
We still use Splunk internally, and there are still a lot of questions around “should I be logging or should I be metric-ing?” We’re gonna make that worse when we suddenly add spans and traces to this whole thing soon. But I have plans for that, which maybe will be a future talk.
And lastly screenboards and timeboards. That’s annoying to both me and our users.
But lastly, one of the things we have trouble with is when to use a service check, an event, or a metric, and which widgets those things are available in, and what features are available in a monitor. This is something you have to be careful with. In some cases we almost just emit all of them, because we’re not really sure which one we’re going to be using, but they are very handy. So when I say primitive features, I mean the primitives that Datadog supports: service checks, events, and metrics.
So, things that we’re trying to adjust: where we were able to change the culture, what are we trying to do now? Where can we be better? We worked very tactically for the first six months. It was just: replace everything, and that was the measure of success. Did we replace it? Are people using Datadog and not the other tools?
We’re now shifting to work on much more strategic things. There was a post on the Stripe engineering blog last week about an open source project called Veneur, which we have used to replace the DogStatsD that comes with the Datadog agent; we effectively run our own DogStatsD, Veneur, on every box.
We have central boxes to which it then sends metrics and generates global percentiles and sets. This is something that was missing for us, because previously with StatsD, you send all your metrics to one box, and therefore you get a global percentile for, say, the timing of a function or an API call. We now get that, but we also get host-local metrics, which is very nice.
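Why central aggregation matters for percentiles: you can’t average per-host p99s and recover the real global p99. A toy sketch with exact nearest-rank percentiles (real Veneur forwards approximate sketches rather than raw samples):

```python
def percentile(samples, p):
    """Exact nearest-rank percentile over a list of samples."""
    ordered = sorted(samples)
    k = int(round(p / 100 * (len(ordered) - 1)))
    return ordered[k]

# Two hosts with very different latency profiles (milliseconds).
fast_host = [10] * 100
slow_host = [10] * 50 + [500] * 50

# Averaging host-local percentiles hides the slow host's tail:
avg_of_p99s = (percentile(fast_host, 99) + percentile(slow_host, 99)) / 2

# Merging all samples centrally, as Veneur's global boxes do,
# yields the true fleet-wide percentile:
global_p99 = percentile(fast_host + slow_host, 99)

print(avg_of_p99s, global_p99)  # 255.0 500
```

Half the fleet’s requests hit the slow host, so the real p99 is 500ms; the averaged number dramatically understates it.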
So a big shout-out to Remi at Datadog specifically. I don’t think he’s here today, but he was very helpful in getting this to work, because we kind of bombed Datadog with, like, 45-megabyte POST bodies for a couple of days while we were trying to figure this out. So, sorry about that, ops team.
There’s one thing I didn’t put on the slide, which is that we have an open source repository of our own checks. So there’s the agent, which comes with all the checks in the main line, but if you go to stripe/datadog/checks, we have a whole repository full of open source checks that we either felt weren’t really great for inclusion into the main line because they were a little weird, or were works in progress, or something like that. We recently finished a Splunk integration that lets you monitor your search heads, your license master, all that other type of stuff that is very helpful for us.
I’ve been telling you all throughout this talk that you should have good metrics on how your stuff works. We totally don’t. We managed to get by just on goodwill for a long time, and my stunning good looks, I guess. But now we’re actually having to prove it. Like, Datadog is a thing and it costs us money; are we making effective use of it, are our incidents easier to solve, and stuff like that.
Lastly, monitoring is still hard. It’s really tricky for someone who doesn’t steep themselves in Datadog every day to write a good monitor. That’s not because Datadog’s monitoring stuff is bad; it’s because it’s very powerful. Explaining to someone when to use an average, a sum, a count, a rate, or “at least once,” and settings like “window must be full before evaluating,” is hard. These are very powerful tools that are very difficult for your average user to understand, and they get frustrated very quickly. And if they create shitty monitors, they just turn them off. And that’s not what we want.
So, in summary, start small with these changes. You’re not gonna just come in and whirlwind-change everything. It’s gonna take time. You’re eroding something away, so be like water.
Seek feedback often, and specifically remember where that feedback came from so that you can circle back and show them that you actually listened. You don’t have to do what they say but you do have to follow up with them.
Think on your value: what is it that you are trying to provide with the tools that you’re selling to these people? And then be sure to measure the success that you’re hopefully having; be sure to measure the effectiveness of these changes and that they’re actually doing what you said they would do.
And lastly, enjoy making a change like this. Making this change for an organization like Stripe, and working with a group of people who are so happy to work with us on these changes, has probably been the most fun I’ve had in my 20 years in this industry.
It wasn’t always easy; sometimes it was contentious. But, by and large, they were accepting and happy about it. The last year and a half has been the most fun I’ve had in my entire career. I think that’s it.
So I want to say thank you to the observability team at Stripe: [inaudible 00:46:27] is actually here today, plus Andreas, Josh, Kieran, Chris, Stephen. And all of Stripe, because everyone has been so helpful in all of this work. Also, a big thank you to Datadog, because we wouldn’t be able to do this if they weren’t here building all these awesome tools and helping us through all this work.
If you’d like to learn more about some of the stuff that I do, you can go to… Here’s where I’m advertising a bit. I’m gonna keep that brand up. Engage with my brand at onemogin.com.
There’s an observability section at the top if you only care about that, but that’s all I seem to write about anymore. You can see me on Twitter at @gphat. But be aware that, I think, weird Twitter is hilarious and I retweet a lot of weird stuff.
Lastly, my GitHub, github.com/gphat. Also, at /stripe, you can find some of the open source stuff that we do for this type of work.
And if you are ever interested in talking about any of this, or, you know, if you want to join the observability team at Stripe… I shouldn’t be recruiting at Datadog’s event, but I’m told I must always be recruiting. So you can always reach out to me with questions or whatever at firstname.lastname@example.org.
And lastly, the joke slide: “Questions, to feed into my feedback loop of making this talk better.” That was supposed to be funnier than it apparently is. But if you give me feedback, it goes to me, and then I put it into the slides, and then this talk gets better.
Audience member: [Inaudible 00:48:14]
Cory: So, how did we get to 450 checks, and who helps make them? Some of this was stuff that was already there; we replaced a lot of Nagios checks. Stripe’s got a bunch of very experienced people who had these things in place. Others we found opportunistically, like the cron job failures: those went to email lists and people ignored them, so ticketing was a much better solution. We also have rate limiting built into our ticket maker, such that if it fires a bunch, you don’t get inundated. But the majority of them originally came from us hand-converting them over from Nagios. Then later, we started to add a bunch of supplemental ones. Again, go to the incident review meetings that your organization probably has and ask what you could do to help prevent those things; a lot of them came from that. Most of them came from the collective wisdom of all the smart people around Stripe, and many from the observability team itself just having a cool idea. The lion’s share of them, though, to be clear, are the ones that teams have made for their own services. So we represent a small portion of that. [Inaudible 00:49:14]
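The rate limiting mentioned in that answer can be sketched as a per-monitor sliding window. The limit and window values are made up; this is an illustration of the idea, not Stripe’s actual ticket maker.

```python
class TicketRateLimiter:
    """Allow at most `limit` tickets per monitor per sliding
    window, so a misfiring monitor can't inundate a team."""
    def __init__(self, limit=3, window_s=3600):
        self.limit = limit
        self.window_s = window_s
        self.fired = {}  # monitor_id -> recent firing timestamps

    def allow(self, monitor_id, now):
        # Keep only firings still inside the window, then decide.
        recent = [t for t in self.fired.get(monitor_id, [])
                  if now - t < self.window_s]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.fired[monitor_id] = recent
        return allowed

limiter = TicketRateLimiter(limit=3, window_s=3600)
# Five firings in quick succession: only the first three file tickets.
results = [limiter.allow("monitor-447391", now=100.0 + i) for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Once the window slides past the earlier firings, tickets flow again.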
Audience member: Hey, you said that there are 550 employees at Stripe. How many of them are engineers or technical, kind of like your users?
Cory: I’m not allowed to tell you. They told me I couldn’t give that. I don’t know the number but they told me I couldn’t say that number. So 550 was what was on the press page. I can’t say percent either. I heard that.
Audience member: You have a team of five or six people dedicated to observability. At what point in Stripe’s growth, either by headcount or some other measure, did Stripe decide it needed that sort of dedicated team? That’s the first question. And the second is, there’s a Datadog API to send custom metrics through StatsD so they eventually show up in Datadog, and you sort of have to educate the application engineers to make correct usage of it. How does the observability team work with the engineers? Is your team responsible for educating the engineers to make correct usage of it, or what kinds of things do you do if they want better visibility into their applications? I just want to understand where the division of responsibilities is at Stripe.
Cory: Sure. It’s a very good question with a weird answer. So the first question was, how did Stripe choose how many people would be on that team? I’d like to tell you that there was some sort of analysis and cool math, but it was basically just how many people decided they would like to be on the team. Stripe’s very fluid about teams: if you’re interested in joining a team, as long as you’re not leaving your other team in the lurch, you can usually swap teams and join it. Originally, the observability team was just a gleam in my eye, and then I managed to find a few other people who were interested. And so we operated as a team of three up until probably the second quarter, and then we’ve added a few people since.
And so now, we’re sort of growing proportionally to the infrastructure group. We are one of the teams that, I think, is staffed at a higher level. Maybe I’m biased, but I just think this stuff is so cool that it’s easy to recruit for. But there’s no particular reason we necessarily have seven people.
But that leads me to the answer to your second question, which is, we have to get people to instrument their code and to know how to make good and effective use of these tools. The first part is that we don’t have people use the StatsD libraries directly. We have a library called “metrics” that is part of the core framework used within all of Stripe’s products, and we have equivalents in other programming languages as well. So basically, we create the libraries that our engineers use, and we have tailored them specifically to the use cases of our customers (not Stripe’s customers, but other engineers at Stripe) and what they’re used to using. So in many cases, we are automatically instrumenting things for them and they don’t even know.
Now, how do we educate them and make them better at it? This is both a good and a bad answer. It’s bad because it doesn’t scale well to have to go into every team and teach, but that’s effectively what we do. One of the things that I like to do, that I really push within the team, and that has attracted some people to the team: I think most of us are familiar by now with the SRE model, where you tend to take an SRE and embed them into another team to help with their operational function. That’s what we do with observability folks. We try to get them to go work in another team for a few weeks and help out with that problem.
Like, right now we’re paired up with our risk team, and they’re getting rid of all their old Nagios stuff, converting it over, and instrumenting code where possible. So basically, we do one-on-one partnerships with other teams to try to improve the state of the art. I hope that got all the different steps.
Ilan: Sure, just last question.
Audience member: You showed the survey. That’s one way of collecting feedback to make sure that your team is doing well and that you’re improving. What other measures are you going to use when you report on the success of the observability team?
Cory: So the aforementioned survey is obviously helpful, but only really for automated monitoring. Stripe has a culture of what’s called 360-degree feedback, where you solicit feedback from everybody around you, so the collective feeling of how these individuals are doing sort of stacks up. Just yesterday, there was a discussion in a channel about Stripe adopting some new thing, and someone (I totally didn’t bribe them) specifically mentioned that the observability team has been very successful bringing this sort of change into Stripe, so what can we learn from them? So, sort of, tapping into the general tremors of the company.
And then lastly, for things that are more quantitative: I think that our mission is basically to reduce mean time to detection. How quickly can we bring down the time it takes to know that something has occurred? I can’t necessarily help you with mean time to remediation, because I don’t know that your service is tooled well enough to deal with the problems that have come about, but I can help you with mean time to detection. And so that’s probably the best measure, I think, we will have of our effectiveness across the organization.
And then, of course, we have a significant impact on the number of incidents, and on how bad those incidents were. We have a level system internally: level 1, level 2, level 3. When those happen, we’re not directly responsible, but we can certainly help prevent them in the future.
So if we’re good at keeping like five nines or whatever then hopefully that’s a measure of success for us. I hope that covered all the things.
Ilan: Well, thank you very much, Cory. That was fantastic.