Monitoring Strategies

Published: October 9, 2015

Introduction

This is pretty cool. There’s a lot of you out here. I think the room fits like 260, 250, so it’s pretty awesome. My name is Matt Williams. I’m the evangelist at Datadog. And just kinda curious, how many of you have heard of Datadog? Let’s start there. Okay, good. Having the huge booth with big monoliths and lights probably helps. How many of you have gotten a demo? Okay, cool, yeah. How many of you use Datadog? Awesome, awesome. What’s next from there? I can’t think of any other questions.

So this session is monitoring strategies, finding signal in the noise. Very cool title. I think the session is pretty cool as well. But for those of you who might not be familiar with what we do, this is not gonna be a sales pitch. But I do wanna say what we are. We're a SaaS-based monitoring platform. Basically the idea is you load up an agent on each host and then we start giving you data about what's going on within your environment.

And this talk is about how to deal with all these metrics that you're collecting. You're potentially collecting a lot of metrics. We see this because we're collecting a lot of metrics for all of our customers. And we used to use a figure in our talks, maybe four months ago, that said we were bringing in about two billion data points every day from all of our customers. Huge, huge number. And I checked last night, I asked some of the developers what's the current number, and we looked it up. And right now the number is somewhere around five or six million data points every minute. Last night I probably wasn't in the right state to do a lot of math, but it comes out to about seven or so billion data points a day, which is huge, amazing.

Now, you aren’t dealing with seven billion data points every day. But you are dealing with perhaps thousands or tens of thousands. And knowing what to look at and making sure that you don’t get hammered with alerts that don’t make any sense or don’t mean anything is the goal of this session. I wanna make sure that you maybe get down to just the few major alerts that really make sense for your environment. Now, I can’t give you, “Turn on this alert, and this alert, and this alert, and this alert. And you’re good, and you’re done. Your ops job is finished. You can move on to something else.” There’s no way to do that because your environment is totally different than my environment, is totally different from the guy or girl next to you. So there’s no way to come up with what is the one thing to do. But at least I can give you some guidance as to what to look at.

Collecting data

So the way I like to start off is collecting data is cheap. So, grab all of it. Not having it when you need it is gonna be really expensive. Now, this is, I don’t wanna say controversial, but I guess there’s two major camps. There’s the folks who say, “Collect everything,” and then there’s the other ones who say, “Well, only collect what you need.” And then when you’re done with that, stop collecting that and continue on. And then when there’s a problem, start collecting that thing and then stop.

Because having the storage to collect all this stuff is expensive, potentially expensive. But we think the other way. We think you should collect everything. And again, not having it when you need it is really expensive. Because chances are, if you turn on monitoring for whatever has the problem today, you probably missed the thing that caused it right at the beginning, maybe two days ago or two months ago. You want to be able to roll back in time over the previous two months to find the original offense that snowballed into this disaster. And if you don't collect, you're not gonna have that.

So I originally had a picture in here. You probably know what the picture is because you've seen it on the Internet. Instrument all the things. It had that character with the yellow…everybody knows that. Apparently there's a copyright on that and I shouldn't have included it. The AWS folks said, "Oh, do you have rights for that?" I said, "No, I don't have rights." So I'm back to just text. But you can imagine.

More containers, more velocity of change

And the reason why this is complicated…you know, we like to think of it as operational complexity. The number of metrics that we're having to collect and process is only going up. And this is becoming more and more true as we start using Docker as well. A lot of us are using Docker in our environments, and Docker is just one example. But we looked at all of our customers, and we looked at who are the ones who are actively using Docker. And then out of those, how many containers are they running and how many hosts do they have. And we've come up with an average number of containers per host.

I don’t know if this is totally accurate because a lot of our customers just started kinda playing around with it. And they may have one host and it’s got two or three containers just to see what’s going on. And then we’ve got other customers that have thousands of hosts, each with thousands of containers, ending up with hundreds of thousands of containers on certain days. And so coming up with one number for an average number is kinda tough. But right now the average is two. Which is actually…when I asked about that, I found out about this a couple of days ago, because last…we did this kinda study a year ago, and the number was five. So I thought, “Wait, are we using fewer containers?” No, the reason is basically we’ve got far more customers that are using Docker. But a lot of them are still just kind of in the evaluation stage. So that’s what kinda brought the number down. We’re gonna have a blog post about this pretty soon, about who’s using Docker, how are they using it, what are they monitoring.

So the result is you’ll end up with n times as many hosts that you need to manage. And this affects pretty much the entire ops flow, from provisioning, and configuration, and orchestration, all the way down to what we care about at Datadog, which is monitoring. And that complexity just increases as a number of things change, number of things that you need to measure increases and velocity of change.

In this velocity, there are a number of things to measure. Basically, for every virtual machine that you have, every instance that you have, you probably have about 10 metrics that you're looking at just on that virtual machine. Those are probably 10 metrics that are coming from CloudWatch. And then there's the operating system. There's some sort of operating system that's running on top of the virtual machine, which is probably gonna be Linux. It could be Windows, but it's probably gonna be Linux.

And you've probably got a hundred or so metrics there. And then you've got n containers, so you've probably got another hundred or so times n. So you end up with about 110 plus 100 times n per virtual machine. And okay, with our average of two, multiply times two. So 100 virtual machines, if we've got an average of two, ends up with about 200 containers. Really hard math here. And if we have 110 metrics per host with just the virtual machine, we end up with about 310 metrics per host.

Now, usually, I've incorporated some of these slides in some other decks before. And usually when I get to this slide, there's always somebody that sends me an email saying, "Your math is wrong." No, no, no, it's not wrong. Because 100 times n, with n equal to two, is 200. And 200 plus 110 is 310. Just in case there's one of you who's gonna send me an email, no, my math is correct. It's even easier to do with two versus five.

Okay, so if we have that many metrics, if we’ve got 100 virtual machines, now we’re dealing with about 31,000 or more metrics overall that we’re having to process through. So things just get really, really complicated. And that’s what we wanna talk about. I keep pressing this thing.
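The arithmetic above is simple enough to sketch in a few lines of Python. The per-layer counts (10 CloudWatch metrics, 100 OS metrics, 100 per container) are the rough figures from the talk, not exact numbers:

```python
# Rough per-host metric count, using the talk's approximate figures:
# ~10 CloudWatch metrics per instance, ~100 OS-level metrics,
# and ~100 metrics per container, times n containers per host.

def metrics_per_host(containers_per_host: int) -> int:
    cloudwatch = 10
    os_level = 100
    per_container = 100
    return cloudwatch + os_level + per_container * containers_per_host

# With the current average of 2 containers per host:
assert metrics_per_host(2) == 310
# Across 100 virtual machines:
assert 100 * metrics_per_host(2) == 31000
```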

So velocity of change is another factor. And so, physical boxes might have a lifetime of months or years. I know there’s probably some server room that still has IBM OS/2 box running somewhere. And then, virtual machines or instances probably have lifetimes of maybe months or days, maybe hours or minutes when you realize you did something stupid and shut it down. But containers often have lifetimes of minutes, maybe hours, maybe days. But usually by then you’re gonna cycle it and bring it back up again.

Tagging

So how can you deal with all this stuff? Well, the first thing, the easiest thing that we like to talk about, is tagging. Tags definitely make it easier to manage the number of machines. Tags allow you to basically make subsets of all your machines. You've got 1,000 boxes, you've got 10,000 boxes, and you wanna create a subset so you can ask questions, because a lot of your job is about asking these questions. I wanna know which containers have a resident set size greater than or less than a gigabyte.

But I only care about the machines that are running the image web and are running in region us-west-2, and are on any of the AZs. And they’re running on instance size c3.xl. And so, I can start incorporating tags. So I might assign a tag to every single container and every host, saying, “All the things that are running the image “web” are gonna have the tag ‘image:web’.” All the things that are in us-west-2 have that tag. All the AZs have their tags. And all the instances that are c3.xls have that particular tag as well. Cool.

So now I can change the query really easily because I’ve just created this small subset. I’ve gone down from 10,000 down to maybe 10 machines. And I can change that query to say, “Hey, just give me all the machines that are using one and a half times the average resident set size.” So tagging, by assigning these tags… And what’s really great about this is that as I add more machines, as I add more containers, I don’t have to change the query. Because the query already defines the tags. And so all I have to do is make sure that every container that gets added has the same…the right tags for image “web” and us-west-2. And if I’m dealing with the machines and containers that I’m bringing up on Amazon, a lot of those tags get put in there automatically because we do that on the console, on the AWS console, and then they get inherited in Datadog.
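To make the subsetting idea concrete, here is a minimal sketch in Python. The host names and tag values are hypothetical, and in Datadog the filtering would be expressed in a monitor or dashboard query rather than in code; the point is only that a fixed tag query keeps matching as new hosts arrive with the right tags:

```python
# Hypothetical fleet: each host carries a set of tags. Filtering by a
# fixed set of wanted tags selects the subset, no matter how many
# hosts are added later, as long as new hosts get the right tags.

hosts = [
    {"name": "web-1", "tags": {"image:web", "region:us-west-2", "instance-type:c3.xlarge"}},
    {"name": "web-2", "tags": {"image:web", "region:us-east-1", "instance-type:c3.xlarge"}},
    {"name": "db-1",  "tags": {"image:postgres", "region:us-west-2", "instance-type:r3.large"}},
]

wanted = {"image:web", "region:us-west-2", "instance-type:c3.xlarge"}

# A host matches when the wanted tags are a subset of its tags.
subset = [h["name"] for h in hosts if wanted <= h["tags"]]
print(subset)  # only web-1 carries all three tags
```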

So some of the tags that we could be using include “demo:nginx” or “demo:docker”. I like to do a “demo:matt-demo”. That way I could see my demo machines rather than somebody else’s demo machines. Or a platform like AWS because some of us have to run on HP Cloud, or Azure, or Digital Ocean or somebody else. But most of us are gonna be at least mostly on AWS because, hey, we’re here at re:Invent.

Grouping metrics beyond tags

So, tags are great. Tags are awesome. But they only help with part of the problem. They only help with subsetting the number of machines that we're looking at. They don't really help with the idea of, "I've got 10,000 metrics on every single machine, which ones do I look at?" And so, we looked at a lot of articles on the web and books out there. And there's one really great one from Brendan Gregg. I never remember the title. It's Systems Performance: Enterprise and the Cloud. Really great book, talks about utilization, saturation, and errors as a way of grouping different metrics. There's some other folks who've talked about other ways to make sure you create alerts based on metrics that actually mean something. We've kinda looked at all these different articles, all these different books, and put them all together to come up with kind of a simplified way of looking at all of your metrics.

And so, that’s what we got here. Okay. So we’ve got three groups of things. There are the work metrics, the resource metrics, and the events. So, work metrics are things like throughput. How much work is getting through the system? How many requests per second, queries processed per second? Things that happen per second, success rate. How many of those are successful? Errors, how many of them aren’t successful? Performance, overall performance of that machine, of that box, of that application. Work metrics are the things that are the most important. Everything else just adds context.

So, the resource metrics add context to those work metrics. Resource metrics include utilization. CPU utilization is one of those things that every vendor says, “Okay, here’s CPU utilization.” Even the simplest top looks at CPU utilization. And it’s pretty much the most worthless metric to look at because it doesn’t tell you anything. It adds context to the work metrics, but it doesn’t tell you anything on its own. If I see 90% CPU utilization on my box, so? Maybe it’s supposed to be at 90% every day at 4 p.m. And then at 5 p.m. it drops down to 0 or 10, and that’s just normal. So utilization on its own doesn’t tell me anything. I need to use that in context with throughput, with success, with performance.

Same thing with saturation. Saturation is talking about queues. So if your resource, whatever that thing is, is oversaturated, there’s probably gonna be a queue that gets built up. And ideally, you want that queue to be always at zero. Sometimes it might creep up higher than 0, 1, 2, 3, 4, or 10,000. But you really want it to be at zero. But on its own, that queue length, that saturation number doesn’t really tell you much. It’s not until you put it in context with throughput, with success, and with performance. The same thing with some errors on the resource side and availability.

And then there’s the events. Events on their own don’t tell you anything. Hey, there was a code change. Hey, there was a new version of the application that was pushed out to the servers. So what? Yeah, okay, that’s great. We’re moving. We’re progressing. We’re making changes and we’re getting better. But it doesn’t tell you anything. It’s only in support of some…requests per second went down to…all of a sudden went from 100 requests per second down to 0 in 1 second. And there just happened to be a code change two seconds before that. Huh! That becomes a valuable bit of information in the context of that work metric.

Alerting in context

So, given that grouping of things, we like to say, "Alert liberally, page judiciously." I actually said that right. Often I have to say, "Ju-ji-ji." Anyway, judiciously. So the idea here is we should alert on everything. Not everything, but a lot of stuff. Everything that's memorable, anything that you want to be able to document. Because your alerting platform becomes kind of the documentation for your overall system. You get to say, "Okay, I don't wanna look at huge dashboards to understand my performance over the last 10 days or 10 months. I can look at my list of alerts each day and now I can say, okay, here's how things have gone." But when I say "alert", I'm not necessarily talking about notifications. Notifications are something I can add on to an alert. But an alert on its own could just be the documentation, just a record of what's going on.

Alerts could also include notifications. So some of those notifications might be just maybe a low-level email or adding to a chat, to a Slack room or HipChat, just saying, “Hey, I deployed version two, I deployed version three,” whatever it is. But then there’s really important stuff. Those key metrics, that percent number of requests processed per second drops down to zero in one second. This really steep drop. That’s something I wanna get alerted as a page.

Page on symptoms

Now, when I say page, there are some people probably in the room that don't know what a pager is. So, there were these little devices that people would carry, often doctors. The people I knew who carried pagers tended to be the drug dealers in my school. But there were also all sorts of other people that would carry pagers. And then you'd get a page with some number on it, and then you'd run to a payphone and… Well, payphones. I don't know how to explain payphones either. But anyway, pagers have pretty much gone away. I think most of the pager networks are gone. There are still some around hospitals, but they're pretty much all gone. So, when I say page, I'm talking about getting a text message or maybe some sort of high-priority email, prioritized from the sender or whatever. So, page judiciously on just the things that really matter.

And you should be paging on symptoms. What are those key work metrics that…requests per second, dropped connections, all those key things…and be paging based on that type of stuff. So, page on symptoms, not on the causes.

So I created a nice little graphic about this. Page on symptoms. And the symptoms are the work metrics. Investigate using diagnostics. And the diagnostics include the work metrics and resource metrics and events. So you look at those work metrics and say, “Okay, I understand what’s going on, but let me use other work metrics, other resource metrics, and other events to add context to the overall picture so I understand the full story.”

But, of course, this is…what I’m looking at right now is focused on one application, maybe NGINX, or Postgres, or Redis, or Varnish, or something else. But these applications don’t run in a vacuum. They work together. They depend on each other. So if I’ve got a Postgres server…or, sorry, an NGINX server that is serving out my web application, but some of that data comes from a database, and that database is Postgres, and maybe some of it is cached, maybe in Redis or something else. Now I’ve got dependencies, that NGINX server has dependencies on all these other things.

And so, we like to think that you should be looking at those work metrics, looking at the resource metrics to get context, and looking at the events, again, to add context. And as you start you realize, “Oh, actually there’s no problem on NGINX. Let’s dig a little deeper and look at Postgres. Work metrics, resource events, all is good there. Let’s dig deeper. Okay, maybe it’s Docker, the Docker host. Okay, work metrics, resources, events. Okay, everything is good there. Let’s go deeper. Maybe host OS, Linux. Look at work metrics and so forth.” So you keep kind of going through this process and cycling down, digging deeper and deeper.

And so, again, that grouping of metrics. We've got work metrics. Work metrics include throughput, success, performance, key errors. Resource metrics include utilization, so CPU utilization, disk I/O, other types of things like that. And saturation, like queue length, my database queue length for example. Not so much in NGINX. I mean, NGINX is designed to process all those requests as they come in, so there doesn't ever get to be much of a queue. You can set up a queue in NGINX, but usually it's never gonna be much more than one or two. So it could be the NGINX queue. And then events, which are code changes, alerts, scaling events, and so forth.
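As a quick cheat sheet, that grouping can be written down as a simple lookup. The metric names here are just the examples from the talk, not an exhaustive list:

```python
# Grouping from the talk: page on work metrics (the symptoms); use
# resource metrics and events as diagnostics that add context.
METRIC_GROUPS = {
    "work": ["throughput", "success", "performance", "key_errors"],
    "resource": ["utilization", "saturation", "errors", "availability"],
    "events": ["code_changes", "alerts", "scaling_events"],
}

def is_pageworthy(group: str) -> bool:
    """Only symptoms (work metrics) should be able to page someone."""
    return group == "work"

assert is_pageworthy("work")
assert not is_pageworthy("resource")
assert not is_pageworthy("events")
```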

NGINX metrics

So let's look at some examples. The first thing is NGINX. I've done a bunch of NGINX-related demos. NGINX is always the first thing that comes to mind in my demos because I spend a lot of time with NGINX. And so, the work metrics that we deal with with NGINX are gonna be requests per second. How many people are requesting my page or my site every second? And if this number goes up, well, maybe that's good, maybe it's bad, I don't know. But if it drops, maybe it's good or bad. It depends on your context. But a pretty drastic drop is probably bad. It doesn't necessarily mean there's a problem on NGINX, though. It could mean that there's a problem somewhere further up. Somewhere between the web server and the outside world, some router, some networking box somewhere.

There's dropped connections. When the NGINX server just can't handle anything else, it drops connections. Generally bad. You need to come up with the number that makes sense for you. Is anything more than zero bad, or anything more than 5%, 10%, depending on your environment? Work metrics continue with request time. How long does it take to process every single request on average? We don't have to look at each individual request, but on average, what's the request time?
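To put a number on dropped connections with open source NGINX, one minimal approach is to subtract handled connections from accepted connections on the stub_status page. The sample text below is a made-up stub_status response for illustration; in practice you'd fetch the real page over HTTP:

```python
# Parse an NGINX stub_status response. The third line holds three
# counters: accepts, handled, requests. Connections that were
# accepted but never handled were dropped.
import re

stub_status = """Active connections: 291
server accepts handled requests
 16630948 16630940 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

accepts, handled, requests = map(int, re.findall(r"\d+", stub_status.splitlines()[2]))
dropped = accepts - handled
print(dropped)  # 8 connections were accepted but never handled
```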

And then, server error rate. How many 500s, 400s are coming up from the request? So, as my NGINX is processing requests, it gets really, really busy, can’t handle much more. It starts hitting 500 errors. Maybe the end user sees a white page, maybe they see an error, it depends. And then there’s the resource metrics at context. So it could be accepted connections, active, idle. Sometimes you’ll see them as reading, writing, or waiting. Actually, it’s the other way around. Waiting, reading, and writing. And then there’s events. NGINX got started. Cool.

Redis metrics

Let's look at another example, Redis. Redis is an in-memory database. The work metrics might be latency, ops per second, and hit rate. Resource metrics: how much memory is being used, and how fragmented is the memory? It's an all in-memory database. And so, there's the size of the database, but that size might not be the size taken up in memory. Because if the memory is fragmented, if the fragmentation ratio is anything greater than one, then chances are it's all over the place and taking up a lot more space. As that Redis database takes up more and more space and fills up all the memory, you need to get rid of some of the keys in that database. So you have a strategy for evicting keys. So adding that metric about evicted keys helps add context. On its own, it could be normal. It could be normal that 10,000 keys just got dropped out. But you only know in context with the work metrics.
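A hedged sketch of computing that fragmentation ratio from Redis INFO fields: the ratio is resident set size over logical used memory. The INFO text below is a fabricated sample; in practice you'd read it via redis-cli or a client library (Redis also reports `mem_fragmentation_ratio` directly):

```python
# Fabricated sample of a few fields from Redis INFO output.
info_text = """used_memory:1048576
used_memory_rss:1572864
evicted_keys:10000
"""

# Turn "key:value" lines into a dict, then compute the ratio.
info = dict(line.split(":") for line in info_text.strip().splitlines())
frag_ratio = int(info["used_memory_rss"]) / int(info["used_memory"])
print(frag_ratio)  # 1.5, and anything well above 1 suggests fragmentation
```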

Varnish metrics

Another one is Varnish. Some of us use it for caching. And work metrics might be number of client requests, backend failed connections. Resource metrics, number of sessions, recently used, new objects, and backend connections. Now, these three examples actually come from blog posts that we’ve got on our website. So you go to datadog.com, or datadoghq.com. I think it’s datadog.com as well. But you go to datadoghq.com/blog, you’ll see a couple…about three blog posts for each one of these, for NGINX, for Redis, for Varnish, also for Dynamo, for…there’s a few others.

Where we go through what are the metrics to look at, how do you look at them, and how do you collect those metrics. Because some of them aren't as easy as others; some of them don't provide the metrics out of the box, and you have to do a little bit of extra work to get them.

Monitoring in context in Datadog

So I wanna switch into kind of a demo, except it’s not a live demo because we’ve all experienced the network. It’s pretty good. But sometimes it goes away and I didn’t want that to happen while I’m up here. So I’ve got a bunch of screenshots. First one we’re gonna talk about is NGINX, and NGINX serving static files. And so, I’ve got this kind of test environment where I’ve got an NGINX server and it’s serving out four files. Four super exciting files. They happen to be randomly generated, so dd to generate random numbers in a file. Some of them are 64K, 128K and so forth.

And then, I jumped around between using Siege, and wrk, and Tsung, and some others to just try to pound on that server randomly choosing different URLs. And those key metrics that we have from the article are dropped connections. Actually, I didn’t experience any dropped connections with my little application, so that’s an empty graph right there. But requests per second, we see that increased at some point towards the right. And then, all of a sudden, it dropped. Now, normally that would, again, probably be a bad thing. Or it would just be that the Amazon, the AWS network, failed on me for a brief second and I came back up to start pounding again.

Server error rate and request processing time. With server error rate, I don't think you should care about the scalar value, how many 500 or 400 errors, but rather what percentage of overall connections result in a 500 or 400 error. And then, request processing time, how long does it take to process a request?
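The rate-versus-count point is worth one tiny function. The sample numbers match the demo later in the talk, where fifty 500s sounds scary until you see it's 10% of requests:

```python
# Error *rate* rather than error count: a raw count of 5xx errors
# only means something relative to total requests served.

def server_error_rate(errors_5xx: int, total_requests: int) -> float:
    """Percentage of requests that resulted in a 5xx error."""
    if total_requests == 0:
        return 0.0
    return 100.0 * errors_5xx / total_requests

print(server_error_rate(50, 500))  # 10.0, i.e. fifty errors is 10% here
```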

Now, those last two don't come for free. If you're using NGINX Plus, then that's built in: you turn on the status page in NGINX. But if you're using NGINX open source, you turn on stub_status and you get a small subset of metrics. And for server error rate and request processing time, you really have to shove that information into a log and then do some sort of log parsing. And so, if you wanted to do that, you would open up your NGINX configuration file, create a log format, and say, "I wanna put in my status and my request time." And then have some sort of log parsing solution.
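A minimal sketch of what that configuration change might look like. The format name `timed`, the log path, and the exact field order are illustrative choices, not the only way to do it; the important parts are the `$status` and `$request_time` variables that the log parser will pick up:

```nginx
# Log response status and request processing time so a log parser
# can compute error rate and latency later.
http {
    log_format timed '$remote_addr [$time_local] "$request" '
                     '$status $body_bytes_sent $request_time';
    access_log /var/log/nginx/access_timed.log timed;

    # ...server blocks as before...
}
```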

So that could be Dogstream, which is part of the Datadog Agent. But we actually prefer that people use Sumo Logic, or Splunk, or there’s a bunch of others out there. Dogstream is great, but it’s kinda hard to work with. So the other guys do a really good job at processing those logs. So once I’ve got that… oh, and then some other metrics add value, add context, could be the connections.

So in the case of I’ve got the open source NGINX, I have NGINX waiting, number of requests waiting, or number of connections waiting, number of connections reading, and then writing. And then I’ve got the 400 errors and 500 errors, the actual numbers rather than the percentage. And then below that, events. Events that are related to NGINX. In this case, I happen to use Ansible for deploying my application, which I think would get me shot inside the company because some of them like Chef. But I was using Ansible, and so I add that for additional context to a lot of my graphs.

And so here’s what one of our dashboards could look like. These are totally customizable. And so we’re looking at number of connections either dropped, wait, read, or write, request per second, connections on each web server, my 500 and 400 errors, my average response time. And I’m also looking at CPU. I don’t have to only look at work metrics. I’m looking at work metrics and the resource metrics. And some of them are resource metrics not only of this, for NGINX, but resource metrics further down the line. And if we were able to look at the entire dashboard, this might go on for a while. Actually, for this demo one, it’s right here. But for a lot of the ones that we use internally, they fit on a really big screen, vertical screen, and it just goes on, and on, and on. And so, okay, cool.

So we’re looking at the dashboard, and as we move on in time, we see… I’m starting to pound on the server a bit, adding more and more connections, trying to increase the number of request per second. And everything is going along, it’s processing, it’s cool. Everything is great. And then…and then…and then I start getting 500 errors. So I got a little yellow bar way on the other side that starts popping up. I’ve started feeding too many requests at my NGINX box, and I started getting 500 errors.

Okay, that's bad. Except, before, I was only looking at the total number of 500 errors, not in context with the total number of connections. So I overlaid that percentage. What percentage of all my connections are resulting in 500 errors? I see, okay, it's about 10%. At first I was really, really scared because I saw this big number, fifty 500 errors. Oh, it's only 10%. Maybe 10% is acceptable, I don't know. Maybe 8%, maybe 5%. Depends on your environment. Maybe I just don't care about my users. Eh, 500. They're cool, they'll press refresh. Except if they press refresh, it's probably gonna get even worse. And so they press refresh, refresh, refresh, like what you do with the elevator button, press, "Du-du-du-du-du." And it just gets worse, and worse, and worse. And we go from 10% to 80% pretty quickly. And it continues.

Monitoring is an iterative process

And so, time to tweak my files. I was initially using that standard configuration file that you find on like 80% of the web, and that configuration file was written by a moron that doesn't know anything, and so it performs really poorly. So I change it a bit and I start pounding on it. I'm getting a little bit better performance. Okay, things are better, so I'm getting there. But I need to keep working on it. And finally, I maybe recompile NGINX to start using threading, turn on the aio threads option, configure thread pools, and a lot of other things I can do to optimize my NGINX server. And I'm going from…what was I before? About 100 or so requests per second with a response time in the 20 seconds or so. And I go to about 400 requests per second, sometimes higher, and down to 2 seconds, just with a little bit of tweaking of that configuration file. So, that's cool. I can keep watching this, make sure everything is going well.

And then all of a sudden, something happens. I don’t know. Maybe something, there’s a vertical line. So something just happened. And things keep going, processing along. Oh, I just started getting 500 errors. And so, there I’ve got…I’m looking at my work metrics. I see a 500 error start popping up. Oh, no. It’s only 0.3%, but that’s potentially gonna turn into something much bigger. And I start looking around at other work metrics. I don’t really see anything that jumps out at me, but I do see that resource metric. I do see that vertical line. In Datadog, those vertical lines represent events. So I can pop open the events just by clicking on one of those vertical lines.

I can see, okay, Ansible just ran. Somebody just did an upgrade against my server and changed a bunch of stuff. It changed a configuration file. What did they do? Who upgraded my server? What did they do? So I go take a look at it and find out what they’ve done, and turns out they didn’t understand what aio threads meant and got rid of it. And I got back my old server, and pretty soon I start seeing more and more of these 500 errors. And that kinda sucks.

So at least now I understand a little bit more about what are the metrics that really matter, and what is kind of a normal number. And so I go back to the original…the good configuration. And I can start looking at that and saying, "Okay, well, if I see more than 10% or more than 5% 500 errors, then I wanna trigger an alert that fires a page. If I see more than 2%, then maybe I fire an alert that's only a notification. If I see more than just one, maybe I fire an alert that doesn't actually send anything. It just records it within my alerts list. Cool, okay." And I continue doing that with other metrics.
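Those tiers can be sketched as a tiny classifier. The thresholds (5% to page, 2% to notify, a single error to record) are just the examples from the talk and absolutely need tuning for your own environment:

```python
# Tiered alerting from the talk: record-only, notify, and page,
# keyed off the 5xx error percentage. Thresholds are illustrative.

def alert_level(error_pct: float, error_count: int) -> str:
    if error_pct > 5.0:
        return "page"    # wake someone up
    if error_pct > 2.0:
        return "notify"  # low-priority email or chat message
    if error_count >= 1:
        return "record"  # just log it in the alerts list
    return "none"

assert alert_level(10.0, 50) == "page"
assert alert_level(3.0, 12) == "notify"
assert alert_level(0.5, 2) == "record"
assert alert_level(0.0, 0) == "none"
```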

And now this is…this is not…you don’t go to this step and then you’re done because you’re gonna keep realizing, “Oh, okay, that wasn’t quite good enough. I need to change those values a little bit more.” And so, this is an iterative process that you go through. Now, if we were just talking about Datadog, then the outlier detection is gonna be really helpful in this, or anomaly detection. But not really the right place to talk about that. We can talk about it in the booth.

So next thing is NGINX and Postgres. So I've got another NGINX server and I loaded up kind of a sample database. I grabbed the Dell DVD Store database. I know people have played around with that. You can download this thing, run some shell scripts, and it generates a database of random content about a DVD store where people are purchasing DVDs. Yes, people purchased DVDs in the Blockbuster days. I guess this was really…well, actually, Blockbuster was rental. Anyway, so NGINX and Postgres. So this thing is running this query. It's doing a join, grabbing lots of rows about purchases of DVDs, or maybe they're rentals of DVDs, and then displaying them on a page.

And the first version does it really naively. But we might look at key metrics. We haven't written up this article yet, so we're still playing around with what the key metrics might be. Maybe it's commits and rollbacks, connections, and percent disk usage. Percent disk usage normally fits in as a resource metric, but we pull it into the key metrics because if you run out of disk space on a database, you're kinda hosed. So you wanna make sure that doesn't happen, and we elevate it to a key metric. Rows returned and fetched are also really key things to watch.
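In Postgres, the commit and rollback counters come from the `pg_stat_database` statistics view; here's a minimal sketch of turning raw counters like those into the kind of derived numbers you'd actually alert on (the function names and sample figures are mine, just for illustration):

```python
def rollback_ratio(xact_commit, xact_rollback):
    """Fraction of transactions that rolled back. In Postgres these
    counters come from the pg_stat_database view; here we just take
    them as plain numbers."""
    total = xact_commit + xact_rollback
    return xact_rollback / total if total else 0.0


def disk_usage_pct(used_bytes, total_bytes):
    """Percent disk usage, elevated to a key metric because a full
    disk takes the whole database down."""
    return 100.0 * used_bytes / total_bytes


# A rollback ratio creeping up usually means application errors or
# deadlocks, even while raw throughput still looks healthy.
print(rollback_ratio(9_900, 100))        # prints: 0.01
print(disk_usage_pct(850e9, 1000e9))     # prints: 85.0
```

Ratios like these are often more alertable than the raw counters, because "1% of transactions rolling back" means roughly the same thing at any traffic level.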

This happens to be one of the ways we look at Postgres inside the company—our CTO looks at a really big dashboard every day and this is the top right corner of it. It includes a lot of other stuff elsewhere. And I blurred out a lot of stuff here, but we can see how we look at Postgres initially. Just to get a high level…how healthy are things generally within our environment? If you load up the standard dashboard, you’ll get something like this.

But I wanna go back to my NGINX example. So I start pounding the server, and I load up my NGINX dashboard. I'm a web guy, so I'm looking at NGINX first. I wanna see how my NGINX server is performing, so I start looking at the work metrics to get a better idea of what's going on. Oh, and here's what my SQL, my Postgres database, is doing: basically a join of all my products and order lines on prod_id. Okay, cool. So now I'm looking at my NGINX server and pounding it. Everything is going kinda well. Requests per second isn't that high. It's kinda slow, and CPU is going up as I work through it.

Oh, CPU is already at 40%. I wonder what else is going on. I realize that I'm just passing too many rows to the web server. So let me filter that down, drop it to only 5,000 rows. And I'll do this by saying WHERE prod_id is in an array of values, this huge array of 5,000 numbers, and only show me those rows. Okay, I think that's better, but it actually makes performance a little worse. So here I switched over to Postgres, and I can see that CPU is going up to like 80%. So, okay, this is not working out. Now I'm at 100%. And as soon as I get to 100%, I switch back over to NGINX. Normally if I was doing this, I'd have one application, and I'd just put NGINX and Postgres all on one dashboard, but we've got limited screen space here.

But now I'm looking at NGINX and I see, ugh, I'm getting 500 errors because my database is completely saturated. It's not responding, and we're getting 500 errors on the NGINX server. And so I realize, "Oh, there's actually a better way of doing that query. Maybe that's the problem." So I update the query. I happened to find a blog post on the Datadog site that talks about how we changed the ARRAY form, doing a "WHERE prod_id is in an array," to VALUES, and it resulted in a huge performance gain. In fact, I think we saw a 100x performance improvement. And so, okay, that's better. I see my NGINX server is starting to use less CPU. My Postgres server will still churn for a while, but it finally calms down and gets better.
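To make the ARRAY-versus-VALUES change concrete, here's a sketch that builds both query shapes as strings. The table and column names are illustrative (I'm assuming an `orderlines` table with a `prod_id` column, roughly matching the DVD Store schema), and in real code you'd use parameterized queries rather than string formatting:

```python
def any_array_query(ids):
    """The slow form: the planner treats ANY(ARRAY[...]) as a filter,
    re-checking the whole array for every candidate row."""
    id_list = ", ".join(str(i) for i in ids)
    return f"SELECT * FROM orderlines WHERE prod_id = ANY(ARRAY[{id_list}])"


def any_values_query(ids):
    """The fast form: ANY(VALUES ...) gives the planner a row set it
    can hash and join against instead of scanning repeatedly."""
    rows = ", ".join(f"({i})" for i in ids)
    return f"SELECT * FROM orderlines WHERE prod_id = ANY(VALUES {rows})"


print(any_values_query([1, 2, 3]))
# prints: SELECT * FROM orderlines WHERE prod_id = ANY(VALUES (1), (2), (3))
```

The SQL text barely changes, which is why the fix is easy to miss: the two forms are semantically equivalent but get very different query plans once the ID list is thousands of entries long.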

The process of building this out reminds me: you wanna start monitoring really early on, as you first bring up that server. Because you're gonna be going through this cycle of deploying something, making tweaks, finding silly things that you've done, and continuing to make tweaks. And you wanna be monitoring the entire time, because you wanna make sure that the improvements you're making are actually improving things. Otherwise, you don't know what's going on.

Using experience to customize your alerts

Okay, so at this point I can use my experience to figure out acceptable levels for the work metrics. What are acceptable levels for rows fetched? What are acceptable levels for connections, or for a change in the number of connections if it drops precipitously, or for inserts and updates? Same thing with CPU usage or memory usage: what are acceptable values? And then you constantly go through that cycle: now that you know the key metrics to look at, what are the values to alert on? Create alerts that are just a record when, say, CPU goes over a certain low number. As it gets higher, create an alert as a notification, adding it to a chat room. And as it gets even higher, an alert as a page, because things are breaking.

So at this point, let's switch back over to that list of three things. Work metrics: throughput, success, error, and performance. Resource metrics: utilization, saturation, error, and availability. And events: code changes, alerts, and scaling events. The work metrics are what you page on, because they're the key metrics that you need to focus on. Everything else just adds context to those work metrics. Investigate using the diagnostics: the work metrics, resource metrics, and events.

And then, this is an iterative process. You start at the top, whatever the top means for you. It might be NGINX, it might be Postgres, it might be something else. Look at the work metrics, look at the resources and the events, and dig deeper. Dig deeper, dig deeper, until you finally understand the full story of what's going on. And so, now I'm about 15 minutes before the end of the session.

So at this point, I would open up for questions.
