Monitoring Busy Systems at Liberty Mutual

Published: July 12, 2018

Cloudy with a chance of crashes

How are you doing, everybody?

My name’s Robert Desjarlais. I’m a systems architect at Liberty Mutual.

We’ll go into that in a little bit.

Today I’m going to talk to you about monitoring busy systems, or as I’d like to call it, “Cloudy with a Chance of Crashes.”

So crashes are a normal thing on these systems in the cloud, right?

They’re things we have to prepare for and expect.

And so if you can set up your systems to allow for it, it’s not a headache, it’s not a nightmare, it’s just something you planned and allowed for.

And today we’re gonna talk about some things that you might not have thought about in other conversations.

So here we go.

So I’m the old man in this one, guys, right?

So the old man yelling at the cloud.

So Grandpa Simpson is my hero.

The truth of metrics

So first of all, if you take nothing else away from this talk today, please stop using load average as the metric that governs how you run your systems, okay?

It’s a derivative of a derivative, okay?

It doesn’t mean anything, all right?

It is extremely difficult to use correctly and be truthful.

And that’s the whole point of the talk, right?

Knowing the truth of the metrics is power, and it lets you access the resources that are available to these systems today.

This is focused mainly on Linux systems; Windows doesn’t suffer from a lot of these problems.

But on Linux systems, the configurations that come out of the box limit what you can get access to in all of the hardware you buy, okay?

So if you can tune your systems, if you can configure them to make access to the hardware you’ve paid for, you can get a lot more out of them.

So the guys that were just on were talking about stranded capacity, and about configuring and implementing their setups to access it, at the app level.

I’m talking about down at the system level, right?

So the CPUs, the networks, the disks, all these things have incredible capacity to do work.

So how come we can have all these problems?

How come you have downtime if the CPUs are so capable?

Nine times out of ten, there’s a configuration in your setup that’s constraining your access to the resources you bought.

And if you can get rid of those configuration bottlenecks, your system will flow much more smoothly and basically push all the problems out to the developers.

Who am I?

So just a quick “Who am I?” right?

So I’m a systems architect, I run the policy management system at Liberty Mutual.

So I’m the systems lead for that, amongst many.

I support the infrastructure guys that do all the work there.

When I started doing this job back in 2008, we had about 80% to 85% uptime.

So the customers were pretty upset with us, the business was pretty upset with us.

Last year, we managed to get five nines.

Some of that’s a little bit of luck, I’ll admit to that.

But also, there was a ton of hard work done by a whole bunch of people to get a configuration in place to sustain the business through all of the ups and downs that we go through.

And just a kind of fun fact about me, something that you don’t get to see every day, is that I was the guy that named the “POWER7” servers from IBM.

Right, so we sat in a meeting and they were trying to come up with a name for them and I said, “Will you please just call them this?” and they did.

So I got to name a whole product line from IBM.

About Liberty Mutual

A little bit about Liberty Mutual, right?

Oh, there we go.

It was founded in 1905, so it’s a pretty old company.

Yes, we have tech debt.

We’re the third largest property and casualty insurer in the U.S.

We have about 40,000 employees worldwide, and the system I support supports about 15,000 active users.

We have about eight million customers in the system and about…estimates vary between 50,000 and 80,000 agents that also access the system.

It’s a really busy system, okay?

So it’s the system that the call center folks use to answer questions about your policy, it’s the system that the agents use to actually sell you policies.

So we have about 17 data centers worldwide, and there’s about 17,000 servers spread across all.

The bulk of them are in the three data centers in the U.S.

We have one in Portsmouth, New Hampshire, we have one in Kansas City, and one in Redmond, Washington.

Most years, for the past 20 years, we’ve been growing the server counts in our data centers by between 15% and 20% per year.

The last year was the first time we actually flipped the metric over.

We actually started to put enough stuff in the cloud to come down by 7%.

So we are serious about getting out to the cloud, we see it as an incredibly empowering tool for us to grow and scale our business.

Know what’s normal

Back to the presentation.

So the slowest component of any system governs what that system can do, right?

So that’s just the definition of a “bottleneck.”

And so if you can hunt through the metrics in the systems that you have, if you can produce the metrics that visualize what’s happening, you can find the thing that’s constraining you, you can take that limitation out, and allow the system to flow better, get more work done with less stuff, lowering your costs.

That’s the gist of the presentation, okay?

But in order to actually make use of this stuff, you have to have a load in the system to actually rationalize what you’re seeing.

So you have to watch the system when it’s busy.

You have to know what “normal” is to understand the abnormal, okay?

Tuning your system

What you’re trying to do with your configurations is just allow the app to have access to sufficient resources to carry itself through any troubles or issues that the infrastructure presents to it so that you can survive that.

Maybe take a blip, maybe harm a couple of customers, but the system will carry on, and the system will keep delivering the service.

So tuning is a shock absorber that allows you to take on any of these problems.

Now this one’s an eye chart, but it’s probably the most useful single slide I’ve ever come across in my life, and I share it with everybody when I do talks on these things.

This is produced by a guy by the name of Brendan Gregg.

He’s the Netflix guy.

So “What would Netflix do?” right?

It’s just another…this is just the most powerful example yet of that.

But in it, it’s got just about every command that tells you what’s going on in your system.

And I’ll go into a couple of these in some detail later.

Do yourself a favor, steal this slide.

Sit down with it for an hour or two per week, maybe per month, and go through one or two of the commands, read the man pages for them, and you will be shocked at what you can find in these systems.

And you will be surprised at how much you can configure into the system and find problems with your configurations so that you can really keep the systems up and running, really keep things flowing for your business, really help improve and satisfy your customers.

But it’s got everything there, right?

You can find out how much your GPU’s doing, how much your IO’s doing, CPUs, memory, you name it.

It’s really shocking what you find in there.

Defining the problem

So to use that slide and to use the tools that are in that slide, it really helps to just come up with a one-sentence definition of the problem.

You’re gonna have to experiment with this on every problem that you work on, but that sentence almost always tells you where to look.

How do you know you have a problem?

“I’m having trouble talking to this web server.”

Okay, that sounds like a network problem, right?

Or, “The system just stopped working, the app server crashed.”

Okay, so now you know that there’s something probably in the CPU or the memory, right?

Something in the app stopped working because it lost access to a critical resource at a critical time, and couldn’t carry on, all right?

So you’re gonna have to hunt through with different tools and utilities to find the faulty problem, right, the faulty component.

You’re gonna have to hunt through victim metrics, right?

Things that don’t tell you what went wrong, but things that tell you what didn’t go wrong.

So you’ll have to kind of sift through them, find the one thing that tells you the truth about what happened, and then you’ll have root cause, and then you can start to take action to remediate.

Troubleshooting tools

And if you don’t have instrumentation, you’re just guessing, okay?

So that’s why it’s so great to have tools like Datadog.

With the integrations that it has, it can start to very quickly get you a lot of value out of a little bit of effort.

And you can really visualize what’s going on in your systems, you can really visualize what’s going on in your apps.

So vmstat, we’ll start with vmstat.

It’s the gold standard, it’s the tool that everyone uses to visualize the performance of their systems.

What does vmstat do?

vmstat gives you some CPU metrics, it gives you some memory metrics, it gives you the health at a very fundamental level of what’s going on in your system.

Misleading metrics

Except it doesn’t.

It kind of lies to you a little bit.

So the metric most people use is the utilization metric, right?

The user space and the system space and the IO wait times.

What do those metrics mean?

Who knows?

CPU utilization

Well I do, because I looked, and I really investigated it.

As it turns out, it doesn’t mean anything, okay?

It’s what the CP—it’s what the kernel, it’s what the operating system thinks it used in the last second.

And it’s the number of jiffies that was used.

Now who knows what a jiffy is here?

Yeah, nobody.

Okay, one guy, right?

It’s a hundredth of a millisecond.

Or a hundredth of a second.

So ten milliseconds equals one jiffy.

Excuse me.

So it’s the number of jiffies that were used in the last second, okay?
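
To make the jiffy accounting concrete, here is a minimal sketch of how utilization percentages can be derived from the cumulative tick counters on the aggregate `cpu` line of `/proc/stat`, which is where Linux exposes them. The helper names and the sample numbers are invented for illustration, and the sketch assumes the common setting of 100 ticks per second; the field order follows the proc(5) man page.

```python
# Sketch: turn two samples of the aggregate "cpu" line from /proc/stat
# into utilization percentages. The sample values below are invented.

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return a dict of cumulative tick counts from a 'cpu ...' line."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))


def utilization(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * ticks / total for name, ticks in deltas.items()}


# Two hypothetical samples taken one second apart on a 4-CPU box
# (assuming 100 ticks per second, 4 CPUs account for 400 ticks in total).
sample_1 = parse_cpu_line("cpu 1000 0 500 8000 100 0 0 0 0 0")
sample_2 = parse_cpu_line("cpu 1100 0 540 8250 110 0 0 0 0 0")

pct = utilization(sample_1, sample_2)
print(f"user {pct['user']:.1f}%  system {pct['system']:.1f}%  "
      f"idle {pct['idle']:.1f}%  iowait {pct['iowait']:.1f}%")
```

The percentages only say how the kernel classified each tick; they say nothing about how many instructions the hardware actually retired during those ticks.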

But it’s not what I always thought of as the CPU utilization, right?

If I’ve got a four gigahertz processor that can do three instructions per clock, I’m thinking that that’s 12 billion things I can do in one second.

But when you’re looking at vmstat, it doesn’t tell you that, it’s not anything like that.

And so it is a useful metric.

If you’ve used it for years, it will guide you through some of the things, but it doesn’t actually tell you how busy all the hardware that you have really is.

This chart illustrates right here, this tells you all the different execution units that are available on a given CPU.

It’s how you get the three instructions for every clock tick, right, so each column here is a clock tick, and each row is an execution unit, right?

So you have a lot of resources here that are doing things.

And where you have a filled in block, that’s where a command’s actually working, a micro-op is actually working on the CPU. And when there’s one that’s a miss, nothing happened.

The more you can fill these up, the less money you have to spend, the more bang you get for your buck, right?

That’s on the developers, right?

The developers of the apps have to get the code so that it uses the resources you give them.

But the more that you tune the system, the more you can deliver into these pipes, and the more you can fill these blocks in.

So the only component that vmstat actually reports on is the floating-point processor, right?

Because that was the only thing that the guys at Lawrence Livermore Labs and at Sandia Labs, that was the only thing that they cared about back in 1985, right?

And this was the metric that they used to visualize what was going on on their huge clusters of computers as they did their nuclear bomb simulations, okay?

Most apps today care about the load store unit, right?

Databases, web servers, application servers, right?

The load store unit is the thing that you want to be monitoring.

That’s the thing that’s gonna tell you how busy your system really is.

That’s the current modern bottleneck on most of these CPUs.

That’s the thing that gets you access to your network cards, gets the memory in from the main memory to the caches and buffers, okay?

I haven’t yet found a good metric, and if anybody’s got one, please share it, about how to visualize what’s going on with the load store unit.

It’s a really difficult thing to monitor.

The act of monitoring disturbs the measurement, so you get the Heisenberg effect on that.

So that’s why it’s difficult to monitor that.

And so just a quick visualization on some CPU metrics on a system back at my shop.

And so in the conversations that we have with the business folks, and in the past, they were always looking at this idle.

They were always looking to say, “The system is 95% idle here, there’s nothing wrong.” Or, “It’s 50% idle right now, so it’s 50% busy.”

But the problem isn’t in that metric, it’s this one over here.

When the system’s 50% busy, there’s 12 threads waiting for a CPU to give them dispatch.

It’s a bottleneck.

It’s right there on the screen.
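
A rough sketch of checking the run queue alongside idle time: Linux publishes the number of runnable threads as `procs_running` in `/proc/stat`, and comparing it with the CPU count shows whether threads are queuing for a processor. The sample text, the function names, and the 4-CPU figure are assumptions for illustration; note that `procs_running` also counts threads that are currently on a CPU, not only the ones waiting.

```python
# Sketch: compare runnable threads with the number of CPUs.
# SAMPLE_PROC_STAT is invented; on a live Linux box the same text
# would come from reading /proc/stat.

SAMPLE_PROC_STAT = """\
cpu 1100 0 540 8250 110 0 0 0 0 0
ctxt 123456
procs_running 12
procs_blocked 0
"""


def runnable_threads(proc_stat_text):
    """Return the procs_running value from /proc/stat-style text."""
    for line in proc_stat_text.splitlines():
        if line.startswith("procs_running"):
            return int(line.split()[1])
    raise ValueError("procs_running not found")


def is_cpu_starved(runnable, cpu_count):
    """True when more threads want a CPU than there are CPUs available."""
    return runnable > cpu_count


runnable = runnable_threads(SAMPLE_PROC_STAT)
# cpu_count=4 is a hypothetical value, not a figure from the talk.
print(runnable, is_cpu_starved(runnable, cpu_count=4))
```

A box can report plenty of idle time and still fail this check, which is the situation described on the slide.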

Now when I started out back ten years ago, the numbers that I was seeing here were 300s, 400s, 500s, because it was a JVM and it didn’t make use of the floating points at all.

The app, just itself, didn’t make use of floating points.

There wasn’t much algorithm there, it was just moving graphs of data around.

So the utilization was really low, but they were starved for CPU.

The point is, don’t trust the first glance of the metric.

Don’t think you just know.

If you dig into it a little bit, you can find gold.

And that gold can save your company a bunch of money, and do you some pretty good things for your career.

Memory utilization

So memory utilization there, I like the Datadog screens for this, this is a really great screen for that.

And that transition was a little quick, but it just lets you kind of, at a 10,000-foot view, just visualize the memory utilization going on here.

Okay, again, this is another one of those situations where there’s a configuration here that’s causing us problems, a configuration that’s causing us headaches.

Most of you probably haven’t changed this setting on any of your Linux boxes or even your Mac laptops.

It’s called “swappiness,” all right?

Now when I was going to school for computer science, the way the instructor taught me about swap pages and things like that is when the system’s really busy, you start to swap out in memory, and the system just keeps going on, and it acts like a shock absorber for the memory system.

But the configuration that comes out of the box on most of the modern operating systems doesn’t talk about that.

They have the swappiness value set to 60.

So what that tells the kernel to do is just start paging whenever you have free time.

So your JVMs will page out in the middle of the day, right when things are really busy.

If it’s a page that hasn’t been used in a while, it just sticks it in a swap file on you.

And then when it goes to access that page again, instead of taking 70 nanoseconds, it takes 7 milliseconds, 10 milliseconds.

So you get these mystifying disruptions in service, so apps just stop working.

Why did it stop working?

Well, who knows?

It’s very difficult to see after the fact.

But if you watch this stuff and you see this is a system swap utilization, this guy’s using a couple hundred megabytes of swap, but he’s got gigabytes of free memory.

What the heck?

That guy crashed just after I took this picture.

So we understood what happened there, and were able to fix it because we just happened to be watching at the time.

But you have to be watching the regular metrics so that you can see when there’s something abnormal that happens.

Datadog is a great tool that helps you do that.
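
The "swap in use while memory is free" condition can be spotted with a small check over `/proc/meminfo`. This is only a sketch: the sample text, the function names, and the 1 GB threshold are assumptions chosen for illustration, and the swappiness value itself lives in `/proc/sys/vm/swappiness` on Linux.

```python
# Sketch: flag a host that is using swap even though plenty of memory
# is available. SAMPLE_MEMINFO is invented; on a live Linux box the
# same text would come from reading /proc/meminfo.

SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         4096000 kB
MemAvailable:    6144000 kB
SwapTotal:       2097152 kB
SwapFree:        1892352 kB
"""


def parse_meminfo(text):
    """Map /proc/meminfo keys to their numeric values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def swapping_with_free_memory(meminfo, min_available_kb=1024 * 1024):
    """True when swap is in use while at least min_available_kb is available."""
    return swap_used_kb(meminfo) > 0 and meminfo["MemAvailable"] >= min_available_kb


info = parse_meminfo(SAMPLE_MEMINFO)
print(swap_used_kb(info) // 1024, "MB of swap in use;",
      "suspicious" if swapping_with_free_memory(info) else "looks normal")
```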

Disk performance

Another one that’s really common, at least in my shop, and I imagine in your shops too, is your disk performance.

You think that these things have this incredible capacity.

They have 100-gigabyte-per-second access paths between your server and your disks.

But nobody ever talks about the queue depth.

Now what the heck is queue depth?

Well, it’s a limiting control, it’s a control built into the operating system that says, “This is the number of IOs that I’ll allow you to do to a disk.”

So it sounds like a lot, 32 concurrent things on a one-millisecond response time.

Hey, you can get a lot of work done.

But if you’ve got a busy app server, if you’ve got a busy database server, and you’ve put one disk into that system of one terabyte in size, you just constrained the heck out of your database and its ability to do work.
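
The ceiling a queue depth imposes can be estimated with Little's law: sustained throughput is at most the number of concurrent requests divided by the time each one takes. The sketch below uses the talk's figures of 32 concurrent IOs and a one-millisecond response time; the 100,000 IOPS target and the 5 ms slow case are hypothetical numbers, and the function names are made up for illustration.

```python
# Sketch: upper bound on IOPS for one disk given its queue depth,
# and how many disks a target workload would need.

import math


def max_iops(queue_depth, service_time_ms):
    """Little's law: throughput = concurrency / latency."""
    return queue_depth * 1000 / service_time_ms


def disks_needed(target_iops, queue_depth, service_time_ms):
    """Smallest number of disks whose combined ceiling covers the target."""
    return math.ceil(target_iops / max_iops(queue_depth, service_time_ms))


print(max_iops(32, 1.0))              # ceiling for one disk at 1 ms per IO
print(max_iops(32, 5.0))              # same disk when IOs slow to 5 ms
print(disks_needed(100_000, 32, 1.0)) # hypothetical 100,000 IOPS workload
```

The ceiling drops quickly as response time grows, which is why a single large disk behind a busy database runs out of concurrency long before it runs out of space.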

This is a difficult one to monitor too.

iostat will only help you infer it, there’s not any real monitoring that will tell you besides going in with taps, right?

So the only real way to monitor this is with taps, or digging into the actual communications infrastructure between your server and the storage.
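
As a sketch of the inference that is possible from the host side: `/proc/diskstats` carries, for each device, the number of IOs currently in flight and a weighted count of milliseconds spent doing IO. Dividing the change in that weighted counter by the elapsed time gives an average queue size over the interval. The sample lines and function names are invented, and the column layout is assumed to follow the kernel's iostats documentation.

```python
# Sketch: estimate the average number of outstanding IOs for a device
# from two /proc/diskstats samples. The sample lines are invented.


def parse_diskstats_line(line):
    """Pick out the queue-related counters from one /proc/diskstats line."""
    parts = line.split()
    stats = [int(p) for p in parts[3:14]]  # the 11 classic per-device counters
    return {
        "device": parts[2],
        "in_flight": stats[8],        # IOs currently in progress
        "io_ms": stats[9],            # ms during which any IO was active
        "weighted_io_ms": stats[10],  # ms weighted by number of IOs in flight
    }


def average_queue_size(before, after, interval_ms):
    """Average outstanding IOs over the interval between two samples."""
    return (after["weighted_io_ms"] - before["weighted_io_ms"]) / interval_ms


# Two hypothetical samples of the same device taken 1000 ms apart.
first = parse_diskstats_line("8 0 sda 1000 0 8000 500 2000 0 16000 1500 3 1200 2000")
second = parse_diskstats_line("8 0 sda 1400 0 9600 700 2900 0 23200 2400 30 2200 30000")

queue = average_queue_size(first, second, interval_ms=1000)
print(second["device"], "average queue size:", queue)
```

An average that sits close to the configured queue depth suggests requests are waiting in the operating system before they ever reach the storage.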

This is the thing that controls the performance of your databases.

If you can monitor this, you can unleash the power of your database.

You don’t need as many CPUs, you don’t need as much memory, you don’t need as much horsepower.

If you can unlock the disks, you can dramatically increase the capacity of your database server, okay?

Anybody here want to do that, right?

Save a bunch of money for your business.

Another one that I like to use, so this is just an example of one of our ZooKeeper setups.

Again, the system is running clean, we’re running fine for hours.

And then all of a sudden, it started taking 50 milliseconds to get an IO done, okay?

It was an election, it was starting to write a bunch of stuff to the logs, fell over and caused an election, and went to the next node, so the system kept going, because we allowed for it to do that, but it was a disruption, right?

There were some clients that were affected by this.

And Datadog does a nice job here, but the point I’m trying to make here is like watch your scales, the scales sometimes change on these graphs, and sometimes they can be deceptive, so.

But this was the one right here that actually caused our app to stop working.

Just there, sailing along fine, and then boom.

You’ve gotta monitor the normal to know the abnormal.

Another really great command that is out there, that will really help you understand what’s going on in your environment, is lsblk.

Just tells you what’s running where, right?

Anybody else heard of this one?

A couple, good.

So this one’s a really great one, it helps you quickly analyze what’s going on with your disk, and it tells you if you’ve got your queue depth stuff laid out right.

If you’ve got two or three or four busy file systems on the same disk, you’re probably not gonna be happy, right?

You’re gonna probably be getting paged at 3:00 a.m. when the backups are running and the apps are trying to run, okay?

Segregating your file systems between the disks, blocking and tackling is crucial.

And why?

Because you only get 32 concurrent IOs per disk.
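
One way to automate that layout check is to read the JSON form of the listing (`lsblk --json`) and count mounted file systems per physical disk. The sketch below works on an invented sample document; the helper names are hypothetical, and it accepts both the `mountpoint` and `mountpoints` keys because the key name differs between lsblk versions.

```python
# Sketch: list the mounted file systems that share each physical disk,
# using lsblk-style JSON. SAMPLE_LSBLK_JSON is invented for illustration.

import json

SAMPLE_LSBLK_JSON = """
{"blockdevices": [
  {"name": "sda", "type": "disk", "mountpoint": null, "children": [
    {"name": "sda1", "type": "part", "mountpoint": "/"},
    {"name": "sda2", "type": "part", "mountpoint": "/var/lib/db"},
    {"name": "sda3", "type": "part", "mountpoint": "/backup"}
  ]},
  {"name": "sdb", "type": "disk", "mountpoint": "/data"}
]}
"""


def _mountpoints(node):
    """Collect every mount point on a device and on its children."""
    found = set()
    if node.get("mountpoint"):
        found.add(node["mountpoint"])
    for mount in node.get("mountpoints") or []:
        if mount:
            found.add(mount)
    for child in node.get("children", []):
        found |= _mountpoints(child)
    return found


def filesystems_per_disk(lsblk_json):
    """Map each physical disk name to the sorted mount points it backs."""
    tree = json.loads(lsblk_json)
    return {
        device["name"]: sorted(_mountpoints(device))
        for device in tree.get("blockdevices", [])
        if device.get("type") == "disk"
    }


layout = filesystems_per_disk(SAMPLE_LSBLK_JSON)
crowded = [disk for disk, mounts in layout.items() if len(mounts) > 1]
print(layout)
print("disks shared by several file systems:", crowded)
```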

Network performance

This next one is actually my favorite one, and it was solely because I dug through that slide from Mr. Gregg.

ss, socket statistics, okay?

So network monitoring is hard.

It’s hard in a lot of different ways.

You have to get the camera in the right place to take the data in and get the data ingested into the tool to visualize what’s happening.

And then you have to understand all the realities of that, and you have to actually capture the data.

And then you get all this payload, you have to be careful with it, right?

If it’s got credit card data or personal information in it, it’s a huge security vulnerability, it’s a huge nightmare.

Then, once you’ve gone over all of those hurdles, now you’ve got all this data that you have to sift through.

And all you want to know is if you lost any packets.

Did something go wrong on the network?

Just tell me.

Well, this tool does a pretty nice job.

It doesn’t just tell you, but it does a pretty nice job.

It gives you pretty much everything you want to know about what’s going on in your network from the perspective of the system.

So you don’t have to worry about the camera, you don’t have to worry about, “Where did you get this data from?” because it’s the server itself.

It gives you the roundtrip time between the end points.

It tells you which process is actually doing the communication.

So this is ours, again, our ZooKeeper infrastructure, right?

And really importantly, it tells you if you had any re-transmits.

So the notation here is, the zero is in the one second where the tool was running, and the two is in the life of the TCP session.

Now, who cares, right?

Re-transmits are no big deal, they happen.

And at this rate, I don’t care.

But when this number shows 1,200s, 1,500s, when it’s incrementing 10 and 20 and 30 per minute, that’s bad.
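
A minimal sketch of tracking that counter: pull the `retrans:` pair out of a line of `ss -ti` output and work out how fast the lifetime total is climbing between two readings. The sample line is invented but follows the field style ss prints, the function names are hypothetical, and a missing field is treated as zero on the assumption that ss leaves it out when nothing has been retransmitted.

```python
# Sketch: read the retransmit counters from one line of `ss -ti` output
# and compute how quickly the lifetime total grows.

import re

RETRANS = re.compile(r"\bretrans:(\d+)/(\d+)")

# Invented sample of the per-socket detail line.
SAMPLE_SS_LINE = (
    "cubic wscale:7,7 rto:204 rtt:0.5/0.25 mss:1448 cwnd:10 "
    "bytes_acked:1234 bytes_received:98765 retrans:0/2 rcv_space:29200"
)


def parse_retrans(line):
    """Return (current, lifetime_total) retransmit counts for a socket."""
    match = RETRANS.search(line)
    if match is None:
        return (0, 0)  # assumption: field absent means no retransmits
    return int(match.group(1)), int(match.group(2))


def retransmits_per_minute(total_before, total_after, minutes):
    """Growth rate of the lifetime retransmit counter."""
    return (total_after - total_before) / minutes


print(parse_retrans(SAMPLE_SS_LINE))
# Hypothetical readings three minutes apart on an unhealthy session.
print(retransmits_per_minute(1200, 1260, minutes=3))
```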

Every time this metric increments you have to go through TCP slow start, okay?

That means you’re getting knocked down to zero for bandwidth, and then you slowly ramp up, ramp up, ramp up.

You have this huge pipe, but you don’t have access to it when you have a re-transmit.

Now why did this box have re-transmits?

Was there something wrong with the network card?

Was there something wrong with the switch?

No, the network guys dug through, everything was fine and healthy.

Here, we starved the box.

We starved the box for memory.

“The box had gigabytes of free memory, Rob.

It’s not starved for memory.”

But I only told the operating system that it has 84 kilobytes of memory to use for TCP sessions.

This is a ZooKeeper server, it’s got hundreds of sessions.

And we’re splitting up 100K of data, 100K of memory for TCP buffers?

Why?

That’s the default.

Why would you ever change that?

Well, if you change it, your system stays up.

If you change it, the work just flows.

You don’t get paged at night.

And I think we can all agree that’s really important.
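
To see what a box is currently configured with, the TCP buffer limits can be read from `/proc/sys/net/ipv4/tcp_rmem` (receive) and `/proc/sys/net/ipv4/tcp_wmem` (send); each holds three byte counts: minimum, default, and maximum. The sketch below parses that format from an invented sample and estimates the throughput ceiling a buffer implies for a given round-trip time. The function names and the 1 ms round trip are assumptions, and the bound ignores protocol overhead, so treat it as a rough upper limit only.

```python
# Sketch: parse a tcp_rmem-style triplet and estimate the throughput
# ceiling implied by a buffer size. SAMPLE_TCP_RMEM is invented.

SAMPLE_TCP_RMEM = "4096\t87380\t6291456"


def parse_tcp_mem_triplet(text):
    """Return the min, default and max buffer sizes in bytes."""
    minimum, default, maximum = (int(value) for value in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def throughput_ceiling_bytes_per_s(buffer_bytes, rtt_ms):
    """At most one buffer's worth of data can be in flight per round trip."""
    return buffer_bytes * 1000 / rtt_ms


limits = parse_tcp_mem_triplet(SAMPLE_TCP_RMEM)
print(f"default receive buffer: {limits['default'] / 1024:.1f} KiB")
# rtt_ms=1.0 is a hypothetical round-trip time.
print(throughput_ceiling_bytes_per_s(limits["default"], rtt_ms=1.0), "bytes/s at most")
```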

This is some other data here that’s really useful.

It just tells you the computed bandwidth between the end points, right?

It’s not necessarily what’s actually flowing through there, but it tells you what the two end points think that they can do with each other, right?

So if you know you’ve tuned this end and you’re still getting low numbers for the bandwidth, it’s probably the other guy’s fault, right?

What else?

It tells you the number of bytes that you received in the session, and so what we do is we capture this every five minutes and we throw it up as a custom metric, right?

Incredibly powerful way to use Datadog’s custom metric facility.
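
A sketch of that capture-and-submit loop, under stated assumptions: it extracts `bytes_received` from a line of `ss -ti` output and formats it as a DogStatsD gauge datagram, which a local Datadog Agent normally accepts over UDP on port 8125. The metric name, the tag, the sample line, and the function names are all hypothetical; the talk does not say how the metric was actually submitted.

```python
# Sketch: turn the bytes_received counter from `ss -ti` into a
# DogStatsD gauge datagram. Metric and tag names are made up.

import re
import socket

BYTES_RECEIVED = re.compile(r"\bbytes_received:(\d+)")

# Invented sample of the per-socket detail line.
SAMPLE_SS_LINE = (
    "cubic wscale:7,7 rto:204 rtt:0.5/0.25 mss:1448 cwnd:10 "
    "bytes_acked:1234 bytes_received:98765 retrans:0/2 rcv_space:29200"
)


def bytes_received(line):
    """Return the bytes_received counter, or None if the field is missing."""
    match = BYTES_RECEIVED.search(line)
    return int(match.group(1)) if match else None


def dogstatsd_gauge(name, value, tags=()):
    """Format a gauge in the DogStatsD wire format: name:value|g|#tag1,tag2."""
    payload = f"{name}:{value}|g"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local agent (not called in this sketch)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


datagram = dogstatsd_gauge(
    "zookeeper.tcp.bytes_received",      # hypothetical metric name
    bytes_received(SAMPLE_SS_LINE),
    tags=["peer:10.0.0.5"],              # hypothetical tag
)
print(datagram)
```

Running something like this from a scheduler every five minutes would match the cadence described in the talk; `send_metric(datagram)` is left uncalled here so the sketch has no side effects.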

Recap

So this is just a recap.

Again, the most useful slide in all the world, okay?

Be sure to come up with that one-sentence definition of the problem.

If you can define it in one sentence, if you can distill the problem down to one phrase, you have a very good chance of tackling the problem.

Oftentimes people are running around trying to understand the impacts, and talk to the customer about, “What’s going on?” and, “Can you just give me a one-sentence phrase about what’s wrong, so I can start digging into what’s wrong?”

If you can do this, you can fix your problem.

You’ve gotta hunt through the victim metrics.

If you don’t have instrumentation, you’re just guessing.

Go optimize your systems, they will set you free.

And look at Brendan Gregg’s website, it will really help you out.

That’s my presentation.

Q&A

If there’s any questions please let me know… I’m sorry?

Audience member: Could you go back to the last slide?

Robert: Oh sure.

Question.

Audience member: One quick one.

So a lot of those defaults were set probably some time ago—

Robert: In the era of the 486.

Audience member: Yeah.

So basically a lot of them were set…oh.

A lot of the defaults were set back more around protecting user space, things like that.

So it’s protecting the core of the system from the user.

So have there been any good studies to come up with more aligned defaults for…like, for example, you mentioned ZooKeeper.

ZooKeeper could have these type of settings that would help performance.

And rather than having to spend the next month of Sundays going through for each of those, basically thinking more like a one-pager or a research document, or a GitHub repo or something that people have…

Robert: Yeah.

So that was why I cited Brendan’s page here.

Brendan has some really excellent stuff about what the Netflix guys do for their servers and how they tune them.

They basically give all their nodes 16 megabytes, not gigabytes, megabytes of TCP memory, and that allows all their movies to stream, right?

So that’s how they’re supporting all the video streams that they’re delivering.

There’s a whole host of settings that control that stuff, there’s not just one.

But they’re in Brendan’s blog, and he’s got a pretty good one.

But your mileage may vary, right?

So to your point, the configurations were put in by systems admins way back in the day when they had their 486 and they had 30 people using it and they had 25 different apps on there.

So they didn’t want any one thing to steal all the resources away and hog all the resources.

But now, most of the time, you just want the app to have free run of the box.

So that’s what the real change has to be in the Linux community.

The Windows guys did this because the Exchange guys forced them to do it, right?

The Exchange guys were getting pinned every day with network problems and disk problems, so they just went in and tuned it to support an Exchange server, which for you and me is pretty good.

So the Windows guys just have this problem a lot less because of that.