Creating context with service maps (Datadog + Airbnb)


Published: July 12, 2018

Connections matter

Ashley: I’m here to tell you today about a way that Datadog is going to help you manage complexity in your services in a very particular way: visually. Because at Datadog, we believe a picture is worth 1,000 words. A smart visualization can help you grok a complex system in just a second. Things like the Host Map, which many of you here are familiar with. It shows you where your hosts and containers are deployed in a second. You can see what’s going on. And like many things Datadog does, it just works. Now, I look at this and I love it. It’s great, it looks pretty, but I want something a little bit more because I always want more. So, what do I want? Connections. I wanna see how things fit together. Because in this world of fast-changing dependencies, where we’re moving from monoliths, to services, to microservices, your dependencies are changing out from under you so quickly. Your colleague launches a service, and you’re really lucky if they tell you that it now exists. So, how do we manage this?

We do things like this, ye olde service map. Many of you in this room are familiar, we’ve all whiteboarded and Visio’d this. But they take a lot of time, and they’re very quickly out of date. You all can relate to showing up to a job and being handed an out-of-date service diagram, and it’s like, “Eh, it’s close enough.” But we still do it. Why? Why do we spend time creating these over and over again? Because they’re useful, because connections matter. It’s essential to how we run our teams and how we run our businesses. So, if this doesn’t work, what does something that does work look like? Well, it’s gonna do a few things for us.

Understanding your application

First, it’s gonna say, what does my application look like now? It’s gonna give me a snapshot in time of what my application looks like. Next, it will tell me, how does my service fit in the big picture? It’ll provide context, context for how I fit with my team, with my customers, with my business. And what’s more, it’s gonna tell me what the state of my services is at a glance. I wanna be able to tell, is it healthy? Is it fast enough? And if something’s alerting, what does it mean for the system as a whole?

Service Maps

So, if only something like this existed, which is why today, I’m really excited to announce to you Service Maps. This is Datadog’s very particular solution to this problem. It gives you the thousand-foot view of what’s deployed and the connections between them. And then it goes straight to a detail view in a click, where you can see something happening very narrow. And it provides you service status in context, so I can see every service and the monitors attached to it and also the absence of a monitor. So, I can see where I’m missing monitoring and probably should add it. Now, I told you that a picture is worth 1,000 words. I think a demo is worth 1,000 more, so let me show you how it works.

Service Map demo

So, we’re gonna see here this is the thousand-foot view. We’ve intelligently grouped services, so you can see things that are connected. At the top bar I can filter down quickly by a web server, DB, a cache, or anything I’ve defined custom. I can filter by environment, so I can look at production, staging, or test. And very quickly, I’m like cool, thousand-foot view, I want to zoom down and look at a particular part of the system. So, I’m able to zoom in and take a look, and while I’m here, I notice that there are colors, so I can see where my monitors are. I mouse over and I can see a live data flow of my master database and its dependencies. I click in and now, I see the first-order relations of this database. I mouse over and I see monitor stats, and I can click, view, and take a gentle stroll through all of my services one by one. And that’s great, but sometimes I don’t wanna explore, I wanna look. I know exactly what I’m looking for. In which case I can go over to this handy little search bar, type in the service I’m looking for, in this case Mindy, and as I mouse over, I can see just the service I care about.

And from there, I see top level statistics. I can mouse out. And I mentioned that you’re able to see service status in context. What does that mean? So, as I’ve been going through this, I’ve noticed some red and I see…let me check out my database here. Cassandra is in an alert state. I click in, what does it mean? Well, I see some green, I see some unmonitored things, and I see that one of my web servers is not very happy. So, I know I need to go check on that. So, I’ve seen service status connected with its context, and that there is a quick overview of some of the key features of the Service Map that we’re showing to you today. I think that the best person to speak to what’s happening here is one of our customers, but first a question. How many people here have stayed in an Airbnb in the last year? All right, that is a lot of people, and I’d like to welcome to the stage, Willie Yao, engineering manager for observability, who’s gonna tell us about why his team’s excited. Welcome.

The Service Map at Airbnb

Willie: Thank you, Ashley. Hi, everyone. My name is Willie, I’m the engineering manager for the observability team at Airbnb. I’m very excited to be here today. Airbnb is a global travel community that offers magical end-to-end trips including where you stay, what you do, and the people that you meet. My team’s mission at Airbnb is to ensure that our software engineers have the monitoring and introspection tools for them to successfully develop and operate their services. With a large team of engineers developing hundreds of services, maintaining high performance and availability at our scale is a really difficult challenge. I’m gonna tell you why we’re so excited for Datadog’s new Service Maps, and how it helps us with that challenge.

1,000-foot view of distributed systems

The ability for our engineers to effectively introspect across our distributed systems is a super important lever for us, enabling us to quickly develop new products to provide better features, to provide magical travel experiences for millions of guests and hosts. Once we’re past 100 services though, most Service Maps become difficult to interpret due to the sheer volume of information. With Datadog’s visual clustering, we have an opportunity to more easily introspect on the areas that we care about and visually see the relevant dependencies.

Real-time root cause analysis

When an issue occurs, being able to overlay error rates and latency on top of your topology becomes critical to determining where in a distributed system the root cause lies. The time to resolution saved here can be the difference between Airbnb guests struggling to contact their host and a really magical guest check-in experience.
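[Editor's sketch, not part of the talk.] One way to picture "overlaying error rates on top of your topology" is the minimal Python sketch below. It is not Datadog's implementation: the service names, the error rates, the 5% threshold, and the `likely_root_causes` helper are all hypothetical. The heuristic it illustrates is simple: a service that is failing while all of its own dependencies look healthy is a reasonable first place to look.

```python
from typing import Dict, List


def likely_root_causes(
    deps: Dict[str, List[str]],
    error_rate: Dict[str, float],
    threshold: float = 0.05,  # assumed cutoff for "unhealthy"
) -> List[str]:
    """Return unhealthy services whose direct dependencies are all healthy."""
    unhealthy = {svc for svc, rate in error_rate.items() if rate > threshold}
    return sorted(
        svc
        for svc in unhealthy
        if not any(dep in unhealthy for dep in deps.get(svc, []))
    )


# Hypothetical topology: each service maps to the services it calls.
deps = {
    "web": ["api"],
    "api": ["cassandra", "cache"],
    "cassandra": [],
    "cache": [],
}

# Hypothetical error rates (fraction of failed requests per service).
error_rate = {"web": 0.20, "api": 0.18, "cassandra": 0.30, "cache": 0.001}

candidates = likely_root_causes(deps, error_rate)
print(candidates)
```

Here `web` and `api` are failing too, but each calls something that is also failing, so only the deepest failing service is reported as a candidate.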

Quickly onboard new engineers

Another problem that Service Maps address is that architecture diagrams are rarely up to date. One interesting side effect of this is that it becomes really difficult for us to onboard our new hires. With an always up-to-date Service Map, we’re able to quickly help new hires educate themselves about the dependencies that are relevant for their new job. Having this reference speeds up onboarding, allowing us to more quickly add new members to our team.

Architectural decision making

Finally, there’s often a difference between the architecture that you’ve designed and the one that you’re actually running in production. I’m sure this is something that many of us have learned the hard way. Well, when Datadog showed us their Service Maps, we discovered many unexpected dependencies that we hadn’t quite designed into our system, and we were really, really excited to find that out proactively rather than in the midst of an incident. And so Service Maps provides us with the first step in helping us understand what our architecture actually looks like, which in turn helps us make better long-term architectural decisions. So, that’s it about Service Maps. It gives us faster introspection and real-time root cause analysis, lets us onboard new engineers more quickly, and helps us make informed long-term architectural decisions. And with this, let me turn it back to Ashley.
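[Editor's sketch, not part of the talk.] The gap between the designed architecture and the one running in production can be treated as a set difference over dependency edges. The Python sketch below is a minimal illustration under that assumption. The edge lists and the `unexpected_dependencies` helper are hypothetical and do not reflect any Datadog or Airbnb API.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee)


def unexpected_dependencies(designed: Set[Edge], observed: Set[Edge]) -> List[Edge]:
    """Edges seen in production that were never part of the design."""
    return sorted(observed - designed)


# Hypothetical edges from an architecture diagram.
designed = {("web", "api"), ("api", "db")}

# Hypothetical edges observed from live traffic.
observed = {("web", "api"), ("api", "db"), ("web", "db")}

unexpected = unexpected_dependencies(designed, observed)
print(unexpected)
```

The reverse difference, `designed - observed`, would list planned dependencies that never show up in traffic, which can be just as informative.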

Recap

Ashley: Thanks Willie, that was amazing. I can really see how you and your team are gonna be using the Service Maps, and I hope all of you here today can see how your teams might be using them. Willie’s giving a talk later today, so I encourage you all to check it out and learn more about how Airbnb powers their magical experiences. So, just to recap, Service Maps are providing you a global view of your deployed services and the connections between them. It just works, there’s no additional setup and they’re integrated with monitors you already have.