And lo there was monitoring (James Turnbull, Empatico)
Published: July 12, 2018
Good morning. I’m James Turnbull. I’m the CTO of Empatico. We build educational software for elementary school students. And before that, I was at Kickstarter and Puppet and Docker. So if you use Puppet or Docker, I’m partly responsible for some of that, and I’m terribly sorry.
I’m super excited to be here. I think the Datadog team has done an amazing job of putting this together. I remember sort of the early days of few people in the office and piles of stuff everywhere. I’m also really excited because Croatia beat England last night, which always makes an Australian very happy with anyone who beats England at sport, even if it’s not us. Let’s see what we got. I appear not to be clicking. So here we go. All right, cool.
A brief history of monitoring
This morning, I wanna tell you a bit of a story. So, this story’s a bit about where I come from in the monitoring world, a bit about the history of monitoring, a little bit of a call to action about observability. And for many of you, some of the story I’m gonna tell you, it’s history, it’s a place you’ve been. But for others and particularly people outside of this room, it’s a journey they’re still going through. Oops.
So, at the start of my career, I was a data center operator. And that’s a very long time ago, which is why I have no hair and gray in my beard. I’ve managed to keep it off my neck though, and I still don’t run Gentoo. But back in the day, I was the shift operator, and I worked with four or five other folks in a data center for a co-lo site. And we managed about 40 mainframes. And our monitoring system was largely us. It was a human-based monitoring system, and we worked with checklists and runbooks.
And some of you may be old enough to remember checklists and runbooks, but we would start a shift with ticking off items, you know, this bit of production is run. This job is run. This backup’s done. And that was the check system, literally manual, paper-based checks. And with 40 mainframes and four or five operators a shift, that scaled pretty well.
But the world changed significantly. And in the next few jobs I had, client-server appeared. And we started to see a significant increase of scale and complexity in the applications we ran.
And the second thing that happened was that back in the day…some of you may actually not remember this at all, but back in the day, IT wasn’t necessarily a critical component in a business. But very quickly, it became apparent that things like email, payment systems, IVRs, that companies couldn’t function without them. And as a result, IT became a mandatory requirement.
So, needless to say, with additional complexity, additional scale, and high impact, things started to go wrong. And around this time, we started to see a significant increase in the number of outages, the complexity of those outages, and the time it took to recover. And what emerged out of this was what I would describe as check-based monitoring. And some of you will recognize the “saint” reference in there, but Nagios is obviously the classic example of this, and sadly remains the classic example. But, for those of you familiar with it, check-based monitoring is periodic checking of state, like once every 1, 10, 30, 60 seconds: essentially either a binary “it’s working or it’s not working” or some kind of threshold.
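To make the idea of a check concrete, here is a minimal sketch of a threshold check in Python. It is an illustration, not something shown in the talk: the function name `threshold_check`, the disk-usage example, and the threshold values are all assumptions. The numeric states follow the Nagios plugin convention of 0 for OK, 1 for WARNING, and 2 for CRITICAL.

```python
from enum import IntEnum


class State(IntEnum):
    # Numeric values follow the Nagios plugin exit-code convention.
    OK = 0
    WARNING = 1
    CRITICAL = 2


def threshold_check(value: float, warn: float, crit: float) -> State:
    """Map one sampled value onto a check state (hypothetical helper)."""
    if value >= crit:
        return State.CRITICAL
    if value >= warn:
        return State.WARNING
    return State.OK


if __name__ == "__main__":
    # Hypothetical disk-usage percentage, sampled once per check interval.
    print(threshold_check(91.0, warn=80.0, crit=90.0).name)
```

Everything the check knows about the system is collapsed into that one state value, which is the limitation the following paragraphs describe.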
But this kind of monitoring is pretty fragile. And most of you in this room are familiar with the fragility of it, and a lot of you will probably have moved on to the next generation. But a significant number of our peers in our community are still stuck in this place. And I’m gonna quickly step through what I perceive as, I guess, the problems.
So, limited scale. If you would like to pick a fight with me later about whether Nagios scales or not you are most welcome to. I’ll happily block you on Twitter. But largely speaking, if you wanna scale most check-based monitoring systems, you are building a distributed system with its own inherent problems and things. And if you are spending more time maintaining your monitoring distributed system than you are your actual distributed system, you have a big problem.
It also has limited resolution, limited granularity. So, from the perspective of a check, a lot can happen in the period of time between checks, particularly if you have to extend that period of time out to handle that sort of scale. It’s particularly hard to see things like trends. If something happens in seconds 0 to 29 but not second 30, when the check happens, you don’t see it.
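The resolution problem can be sketched with a small simulation. This is an illustrative example under assumed numbers, not data from the talk: a hypothetical latency series has a spike between seconds 5 and 25, and a check that runs at seconds 30, 60, and 90 never samples it. The helper names `build_series` and `check_samples` are made up for this sketch.

```python
def build_series(total_seconds: int, spike_start: int, spike_end: int,
                 baseline: float = 100.0, spike: float = 900.0) -> list[float]:
    """Return one latency value (ms) per second, with a spike in a given window."""
    return [
        spike if spike_start <= second <= spike_end else baseline
        for second in range(total_seconds)
    ]


def check_samples(series: list[float], interval: int) -> list[float]:
    """Return only the values seen by a check that runs every `interval` seconds."""
    # The first check runs at second `interval`, then every `interval` seconds after.
    return series[interval::interval]


if __name__ == "__main__":
    # Hypothetical 91-second window (seconds 0 to 90), spike in seconds 5 to 25.
    series = build_series(total_seconds=91, spike_start=5, spike_end=25)
    seen = check_samples(series, interval=30)
    print("worst latency that actually occurred:", max(series))
    print("worst latency the check observed:", max(seen))
```

Shortening the interval narrows the blind spot but multiplies the number of checks to run, which is the scaling trade-off described above.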
Almost all check-based monitoring was originally reactive, and it was reactive in two different ways. One was, it was often implemented after the application went to production, which is not awesome, or it was implemented organically when something went wrong. “Oh, something is broken. We should write a check for that.” As a result, everybody’s monitoring system is both the same and weirdly unique in ways that you really don’t wanna have memories of.
Almost all check-based monitoring…because of where we came from, most of the people that originally built the systems were operations and infrastructure people. It’s very infrastructure-centric, and it’s very single data source-centric. So by that, I mean that it’s generally based on the outputs of that binary, “it’s working, it’s not working,” or the outputs of that threshold.
Up until actually, not that long ago, a significant number of check-based systems did not collect any performance data. In fact, some of them actively threw it away—Nagios. And until sort of people, particularly the folks at Etsy, started talking about things like StatsD and Graphite, there was not a lot of metrics gathered. There was not a lot of data other than that binary state. And there’s not a lot we can do with that binary state.
And more importantly, that infrastructure-centric nature of things meant that we were monitoring widgets and not outcomes. I remember that early in my career, I would have a conversation with business people, and I would say, “the widget broke.” And they’re pretty smart folks. They can context-shift and work out what you mean. But ultimately, they were like, “Well, I don’t really care about widgets. I care about outcomes.” Whereas, if you’re thinking about the world in a non-infrastructure-centric way, then you are thinking about things like the invoicing system failed. And by the way, that means that a bunch of our customers probably missed their last invoice, and we might get an influx of calls to the customer center. That sort of stuff is actually useful information to pass on to a business person. The widget is broken, is not useful information.
The current state of monitoring
So the world has grown since then. Most of you are probably familiar with some of these things that we now look after. Some of them for good, and some of them for evil. But the world has distinctly not gotten simpler, and a second thing has happened to IT. Where previously, IT was mandatory, it’s now a business differentiator. Being better at product, being better at IT, being better at shipping product, having higher availability and better performance actually matters from a competition standpoint. There are organizations who have failed as organizations because their IT has failed.
The rise of complexity, scale, and impact
So, we’ve seen the world increase in complexity. We’ve seen the world increase in scale. Those client-server environments are now containers or serverless functions or virtual machines or spin-on-demand cloud instances. The impact has gotten a lot more significant. But largely speaking, monitoring hasn’t actually changed fundamentally.
There are a lot of people in this room who are probably looking into this and going, “Thank God, I don’t live in that world anymore.” But outside of this room, surveys, talking to a bunch of folks have indicated that for the vast majority of shops, the core of their monitoring system is still what I would describe as check-based monitoring with a variety of different tools, but something that resembles a Nagios-like thing at the center of it. And that’s a pretty frightening thought given how things have changed outside of that world. And it does disturb me that the parts of our industry that have advanced really quickly have not sort of allowed us to catch up a little bit.
So we’ve established that maybe we haven’t done so well in the past. And I think we need to think about what I would describe as a new start. And I think we need to think about that in two different ways. The first one is that we need to think about what we’re doing now. We need to think about evaluating our existing systems. We need to think about how they’re built, what information we’re collecting, what we’re watching, and what we’re alerting on. But more importantly, we need to think about a bunch of new requirements. This is not just about patching old systems. This is about thinking about the world in a new way.
And I’ve chosen the word observability. It’s a little bit…I don’t like definitions like this. I think this is very fluid at the moment. But currently, everyone’s settled on the word observability, much like they settled on devops, and I hope it doesn’t end the same way. But this is the sort of paradigm that I think we need to be at. For many of you, this is a world you live in right now. But for others, it isn’t.
So observability is an umbrella term. It embraces more than monitoring. It’s not a perfect definition. I think it encompasses a bunch of different functions. I look at observability as including things like monitoring and testing and, to some extent, sort of process-based things; changes, tracing, logging, all come underneath the observability banner.
A call to action
And my call to action here really is that in order to get to this place—I’m gonna talk a bit about how we get there—we also need to consider that as part of that journey, we need to be sharing that information. We need to be spreading that knowledge around to people. We need to be helping our peers in the industry actually take a step forward, and move into this new world.
And there are two reasons I’ve expanded that. One, it’s important to give back. I’m heavily involved in open source. I write books. I try to leave every README I find slightly better than I found it.
And the second reason is at some point in time, you may end up working at one of these companies. And it’s much more pleasant to go into a company that thinks about the world in a better way than one that thinks about it in an existing, slightly backwards, slightly uncomfortable way.
Requirements, not tools
So, when I tell people to think about the new world and how to communicate that to people, I start by thinking about base requirements again. And the first base requirement I always tell people is it’s not about the tools. You can choose whatever tool you like. I think the Datadog team would love you to choose their product and that’s awesome. And I think it’s a pretty cool tool. But whatever you choose, the problem is always when you start a solution with a tool.
The number of architecture conversations I’ve started with people and they go, “We chose—insert name of tool here—and now we’re building our requirements.” I’m like, “Well, okay. What is the logic of choosing the tool first?” “I read this blog post.” Or, “This guy on Reddit said that this thing was really cool and hence, Mongo.” So this is not a great way of thinking about the world, and it’s not a great way of building product.
Know who the customer is
However, a good way to think about building product is to think about who the customer is. And in the observability world, the customer is no longer you. Your monitoring systems still exist under there. It still does stuff, but it is no longer the realm of you being the pure customer of that system. The customer of your own observability system is application developers, security people, DBAs, the business, and sometimes even external customers. So if you think about their business requirements, if you think about their needs as a first step, it’s a great way to actually sort of rebuild our view of the world because…sorry, I’m having a bit of clicker problems here. Okay, cool.
Because in my mind, observability concerns are actually business concerns. Observability is a business system. It happens to have technology underpinnings, but the real reason it exists is to provide feedback on the nature of your systems and the nature of the systems that run your business.
And so, I always start with gathering requirements: talk to the business, talk to application developers, and say what do you care about? What do you measure? What are the things that you’re accountable for? When something goes wrong, how do you measure that? What is the indicator there? And I build the set of requirements based on that. And I document the fact that all the things that…I have to be able to produce metrics that these people can consume in such a way that makes it easy for them to understand what’s happening, and saves a step in that whole context-shifting process.
Make monitoring a first-class citizen
Also, as part of this, the observability world is proactive and not reactive, which means that for most of you hopefully now, you’ve transitioned so that monitoring systems are built as part of systems development, as part of building products. So you help application developers instrument things. You provide them with self-service tools that allow them to add applications and services, or microservices or whatever, to your monitoring environment without your involvement. You provide input and advice and architecture from day one, from when they start building the system. Monitoring is a first-class citizen in the design.
You can’t solve all of the problems. Your monitoring system will still evolve reactively. Things will still go wrong, particularly in a distributed systems world. But you can significantly reduce the blast radius, as people would describe it, or the impact of those problems by understanding how the system works and ensuring that the key inputs and outputs of that system, the business metrics that function in your organization actually are addressed from day one, instead of you having to go back and do software archaeology on a bunch of code to find out what’s happening.
Observe systems over components
And well and truly are we past the day where the cattle versus pets argument exists. It’s done, right? Most of your components are disposable, whether that’s a serverless function or a container. A large number of things in modern environments are essentially fire-and-forget, or are things whose lifespan is significantly shorter than that of your traditional assets. So you need to focus on observing systems.
Think about components as interchangeable. Stop thinking about the world in terms of CPU, memory, and disk, but think about the world in terms of this is a system that runs a business process. I may need to be able to drill down into that system, I may need to still collect some of that data, but it’s not the primary reason I monitor it. This also encourages people to monitor end-to-end. It really matters, the customer experience matters to our end users. You need to be looking at things as a whole system instead of just the components you manage.
Broaden your data sources
You’re gonna have to use more data sources. The observability world is definitely one in which you need to combine a bunch of information. And the basics of this are things like, your monitoring system still exists, you still have checks. You’re definitely gonna include metrics data. You need to think about logs, either adding logs as diagnostic context or counting log events to provide metric data. You need to think about traces. You need to think about the metrics that come off of your processes, your continuous integration, your testing. You need to think about your configuration management tools and the data they export. You need to think about the context around that data, the business telling you what these systems and their components do.
Because that rich data means deep context. If you are gonna be dealing with complex systems, you need as much information as possible. And on day one, when you see a fault, you need to be collecting information that says, I detect latency has suffered on this particular system. Okay, I know about this system. It belongs to these people over here. I can drill down into it. I can see its pieces. I can add that context to rich data. Tools tell you…tools give you data that tells you information. Data with context actually gives you answers.
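Two of the ideas above, counting log events to provide metric data and attaching context to that data, can be sketched together in a few lines of Python. This is an illustration under stated assumptions rather than anything shown in the talk: the log format, the service names, the `OWNERS` mapping, and the helper names `error_counts` and `with_context` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical log format: "<timestamp> <LEVEL> <service> <message>"
LINE_PATTERN = re.compile(r"^\S+ (?P<level>[A-Z]+) (?P<service>\S+) ")

# Hypothetical ownership data; in practice this context comes from the business.
OWNERS = {"invoicing": "billing-team", "checkout": "payments-team"}

SAMPLE_LINES = [
    "2018-07-12T09:00:01Z INFO invoicing invoice 1001 sent",
    "2018-07-12T09:00:02Z ERROR invoicing invoice 1002 failed",
    "2018-07-12T09:00:03Z ERROR invoicing invoice 1003 failed",
    "2018-07-12T09:00:04Z ERROR checkout card declined upstream",
]


def error_counts(lines):
    """Count ERROR log events per service, turning log lines into a metric."""
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("service")] += 1
    return counts


def with_context(counts, owners):
    """Attach the owning team to each per-service error count."""
    return [
        {"service": service, "errors": n, "owner": owners.get(service, "unknown")}
        for service, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    for row in with_context(error_counts(SAMPLE_LINES), OWNERS):
        print(row)
```

The count alone says that something failed; the attached owner says which system it was and who to talk to, which is the difference between data and data with context.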
So, to sort of finish up a little a bit, I look at the monitoring world as being, it’s still a cool thing. It’s still tech we should work on, but it is merely a component of what we do for a living. And it’s a component that we need to look at and reevaluate and refactor, but it’s really a symptoms-based detection system. And we are no longer in a world where that is sufficient. We’re now in a world where you need highly granular insights into the systems that you manage together with business context, together with understanding of service-level objectives and SLAs, and the needs of the business. Inside this observability world, it’s about understanding.
And my ask for you this morning is this: publish that blog post, open source that code, share that information, tell that story, give a conference talk, help the rest of our community take that next step into a better place. Not only is it great for you professionally, but it’s great for your organization, and it’s great for the community of monitoring people as a whole. And we can later say that we’re no longer in a position where there are folks out there who say, “monitoring sucks.” We can actually live in Jason Dixon’s wonderful world where monitoring is all about love and happiness and hugs. I’m possibly not there quite yet, because I’m very cynical and Australian. But it is important to understand that this is a step forward we all need to take. And I very much appreciate your time listening this morning, and thank you so much.