Monitoring In Motion


Published: April 21, 2016

Introduction

Good morning, or good afternoon. My name is Ilan Rabinovitch.

We are in the Monitoring in Motion session.

We’re gonna get to chat today a little bit about monitoring containers and ECS and Amazon and what that sort of dynamic infrastructure does to your ability to operate and monitor at scale as things are moving around underneath your feet.

So just to give a quick background on myself: again, my name is Ilan.

I’m the Director of Technical Community at Datadog, where I get to work with folks like yourselves here in the audience, capture your interesting monitoring stories, share them with other customers, and help establish best practices.

I also have the fantastic opportunity of working with our open source community around all of our open source integrations for the Datadog platform.

So if you have something that we do not yet monitor in our 150-plus integrations, drop me a line.

I’d love to chat with you about how we can get that going in a pull request.

My background tends to be in running large-scale operations teams, usually around infrastructure, automation, and tooling, building monitoring systems like Datadog before tools like it existed, and not nearly as well as they do now.

My hobbies tend to be running open source conferences, like SCALE, Texas Linux Fest, and a couple of DevOps events in L.A., in Silicon Valley, what have you.

So hopefully, I will get an opportunity to see you guys at some of those.

And then, of course, Datadog, my employer; I’ll drop in a quick pitch.

But really, if you want to learn about us, you should drop by the exhibit hall floor.

We’ve got some amazing demos. I will try to keep the Datadog-specific content to a minimum.

We’ll try to talk about the hard container facts here.

What is Datadog

But for background, what Datadog is: we’re a SaaS-based infrastructure and application monitoring platform.

We tend to focus on monitoring modern infrastructure, so things like the cloud, containers, all the technology that you are using as you’re building your microservices, places where things are moving quite a bit.

We work fantastically in your on-prem environments as well, but honestly, we are here at re:Invent and we all like to scale our infrastructure with code rather than with people moving servers and drives. So we are going to focus on that here, and that’s where a lot of the power of Datadog comes in.

To give you a sense of our scale, we process nearly a trillion metrics a day.

So that’s, you know, you can imagine that’s customers both large and small.

And then really, where, I think, we excel is some of the intelligent alerting that we offer, whether that be the fact that we can include a full runbook in those alerts that we send you, rather than you trying to figure out at 3:00 in the morning where that wiki page is, or things like outlier detection where you send us a bunch of data from hundreds of hosts and we tell you which one’s, you know, misbehaving or acting weird in the cloud.

But again, I will try to go through this, the Datadog commercial pretty quickly here.

So our goal is to help you monitor everything at all levels of your stack, let you make intelligent decisions about your own infrastructure and applications, and let you go to your boss or your colleague on another engineering team armed with graphs.

It’s much easier to prove your point with a graph and an alert than it is to do with your opinion.

So we want to help you solve business problems like this, right?

What’s going on with my environment?

Why is this host getting more traffic or this host getting less traffic?

Maybe it’s a load balancing problem.

Quick overview of this talk

A quick overview of the plan for the day, or at least for the next 45 minutes or so: we’ll start off with a quick introduction on why we are all containerizing and why that’s exciting to me, and hopefully to all of you folks joining me here during the lunch hour.

We’ll then dive a little bit into the how of collecting some of this data and how you’re going to pull metrics from Docker, as well as through ECS, and then drop into some theory: some of the best practices that Datadog’s picked up from working with a lot of our customers, as well as from other leaders in the industry, on how to avoid things like pager fatigue and make sure that you’re only alerting your teams on the leading indicators, the things that are most important to you and your customers.

And then finally, we’ll try to fit it all together and dive into how we would plug this in with ECS and Docker, and send you home with some tools you might be able to use when you get back to your desks.

So why containerization?

Just by a quick show of hands, how many of you guys are using Docker today?

So that’s… I’ll assume that, depending on the height of your hand in the air, that’s whether you’re a dabbler or an adopter.

But that’s probably a good 75%, 80% of the audience here. It sounds like at the very least you run Docker on your laptop somewhere, but many of you are likely using ECS as well.

So clearly, there’s a lot of interest there.

Docker adoption

Let’s dive a little bit into why this is interesting to us at Datadog.

In Q4 of last year, we ran a study on Docker adoption, so a lot of the stats that I want to talk about in this talk come from that study.

Link will be in the slide deck if you want to pick this up later. No need to take pictures of the screen.

And so you can dive into that there. Being a SaaS-based monitoring company, we have a broad view across thousands upon thousands of customers of what technology they’re monitoring and how they’re monitoring it.

We can come back and see industry trends in real time.

And so with Docker, what we saw is just this amazing adoption curve.

So clearly, as a SaaS-based monitoring vendor, it’s important to us to be able to monitor the technology that our customers are interested in.

And over the last year that we ran the study, we saw a 5x increase in that adoption.

The big uptick was, of course, around the 1.0 release, when Docker started to stabilize.

But 5x in one year, I mean, when was the last time you saw a technology see that type of adoption?

So this curve’s amazing.

I mean, the other thing is folks are going from this dabbling phase in the middle almost immediately up the adoption curve, where they’re running it in production and across their entire environments.

And very few people are abandoning it, which is, again, something that we don’t tend to see with a lot of other technologies.

Containers have really become sort of the de facto standard at this point for building, shipping, and distributing complex, distributed applications.

To take a slightly different view on this: we went from zero of the hosts we monitor running Docker to over 6% of the hosts we monitor for our customers having Docker or some form of containerization on them, and it’s not stopping anytime soon.

So that’s why it’s important to me as a vendor. But why is it important to all of us together as users?

Well, how many of you have run into this type of situation with dependency hell in some part of your environment?

Maybe you have packages at your operating-system level that are conflicting.

Why containers?

I mean, I know I started dealing with stuff like this back in the late ’90s, early 2000s with RPMs on Red Hat. I then graduated to dealing with things like Python and Ruby, using RVMs and virtualenvs, and Java application containers with EARs and WARs and Maven.

And the reality is no matter what technology has come out the door claiming that it’s going to solve these dependency problems for me, it hasn’t.

In some environment, I run into some conflict and I spend way more time fighting this than I would ever like.

So that’s a serious problem. We want to try to avoid that.

Containers are one way that we go about doing that.

So let’s talk a little bit about how. Right now, rather than using something like the deployment tooling that I’ve used in the past, or configuration management tooling to wrangle all the libraries and all the different tools around these containers or around the applications that we’re interested in deploying, we build a unified pipeline.

So we take both the compiled output of our applications, along with the shared libraries and all the system-level packages that might be important, package them up, put them into a single Docker container, and we get that one artifact. It makes it much easier to get things out there: very consistent deployments across our environments, not a lot of questions around what’s in staging versus what’s in production.

It’s a very clear deployment of that one binary.

And so our application infrastructure starts to take a little bit of a shift.

Over the last 15 years, maybe it used to be that in your on-prem environment, you had one or more applications running on top of some sort of application stack. Maybe it’s your Java world, or some sort of dispatcher like Gunicorn or Unicorn for your Python and Ruby apps, running on your OS, and you likely had multiple things running there because you wanted to make the most use of this expensive bare metal that was sitting in your data centers.

Avoid dependency hell

Over the last five to eight years or so, both in Amazon and in our own data centers, with things like ESX and KVM and all the amazing virtualization tech out there, we thought we’d make this easier.

We’ll deploy one app per VM and we’ll make many of those VMs.

We’ll slice our hosts down.

We’ll get better density in our environment, better usage, better management of our costs and we’ll go there.

But really, that just exacerbated the problem.

Right now, rather than having one host for every piece of hardware, we’ve multiplied that several times over.

That’s that many more places where we have to worry about those dependencies and those deployments, security updates, all these other things that we’re talking about.

Single artifact deployments

So now, today, as we’re talking about containers, what we’re really looking at is these very small deployment units, things that are just enough operating system.

If you can get away with it, you’re using things like Alpine where there’s almost no operating system at all.

All it is is the base libraries that you need for running your service.

So containers let us take these kinds of messy moving trucks, where we’re strapping things to the top of our cars and shoving things in, trying to figure out how to make use of all of our resources, and move to something a little bit more standardized: environments where we know exactly how much capacity we have.

We’re allocating that capacity efficiently so that we’re able to make use of all of our resources, manage our costs, etc., and so that any crane can pick up one of these shipping containers, drop it on a truck, drop it on a boat, and it looks the same whether you’re on Amazon, on-prem, what have you.

Quick, low-cost provisioning

The last sort of selling point of containers for us, in addition to that, is, of course, the speed.

How long does it take to spin up a VM?

Maybe that’s in minutes, maybe that’s in tens of minutes.

How long did it take you to bake those images and those AMIs and manage that? It’s a bit of time.

And so the speed of Docker makes things like blue/green deployments a lot easier.

Maybe you’ve got two versions of your application, you quickly deploy version 2, start to shut down your old containers.

If there is an issue, it’s very quick. Again, within seconds, you can flip right back. There’s not a lot of spin up or boot up time in there.

You’re not letting your instances run for weeks because you’re afraid to shut them down, because it’s going to take, again, maybe dozens of minutes to get back up and running.
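
To make that flip concrete, here’s a rough sketch with the AWS CLI; the cluster, service, and task definition names are hypothetical, and it assumes you’ve already registered revision 2 of the task definition:

    # Point the service at the new revision; ECS starts the new
    # containers and drains the old ones.
    aws ecs update-service --cluster prod --service web --task-definition web:2

    # If version 2 misbehaves, flipping back is just as fast:
    aws ecs update-service --cluster prod --service web --task-definition web:1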

Docker challenges

And so to back that up, some of the data from our study shows that the median container lives about three days right now, and I would imagine the next time we run the study, it’s going to be even shorter than that, whereas VMs, on the other hand, live about 12 days.

And these tend to be a lot of cloud users, folks who are already using things like auto-scaling groups.

They’re already using things like AMIs. They’re comfortable treating their infrastructure as cattle rather than pets: not naming them, not treating them as friends and family, not trying to keep them around for years.

But they’re still, again, 12 days versus 3 days. That’s a lot of churn.

Managing container churn

So how do we manage that type of churn? Well, Amazon offers a great service for this: Elastic Container Service.

We saw a little bit about this in the keynote today.

It lets us automatically manage and schedule these tasks. We’re not logging into every host that we want to deploy to and running the containers manually with docker run and what have you.

That’s a little painful, especially if you’re talking about doing things dynamically.

Your fingers don’t move that fast. I guarantee it.

But it’s also about ensuring that those tasks keep running. Maybe you hook these things up with something like an auto-scaling group, where instances are coming and going, and ECS makes sure that we always have the right number of containers.

It’s important that those containers we’re scheduling are working.

And so that’s where ECS comes in.

And finally, things like port management start to be a little bit of a pain with Docker.

You don’t actually know what port some of your services are going to come up on, and so ECS lets us hook that up with ELB and handle that fairly smoothly.

You just ask Amazon and your services will always be there for you.

Defining normal for Docker

As we start to talk about some of the challenges here, the goal of our monitoring is to help us define normal and alert on it when it changes, or come back and see how our systems have been performing.

And how do you define normal when normal is different from one minute to the next or one second to the next?

It’s a bit of a challenge. Again, containers are moving between hosts.

Things are changing ports.

It feels like you’re standing on quicksand.

It is quicksand, but it turns out it’s actually pretty stable. You just need to know how to look at it.

Tracking containers

So, some of the other challenges: adding up the numbers. We’ve now gone from maybe having one instance to having, say, four containers per instance.

Docker itself, just in 1.9, which is what comes with the ECS-optimized AMI, is going to give you just over 220 metrics per container.

These are things that you do want to keep track of: things like your memory, CPU, network, what have you.

We’ll dive into that a little bit later.

That’s a lot of data to keep track of.

CloudWatch adds another six metrics, give or take, again, depending on the number of services and hosts you’re running.

Those are about four per cluster that are going to tell you about the cluster health as a whole, and another two per task that you’re running to tell you how that’s performing.

You’re gonna start to throw in your OS metrics.

On any given instance, you’re likely monitoring about 100 metrics from the operating system, on average.

Your applications likely have about 50 metrics each.

So you start to add that all together as you’re running four containers on a host at a time, which is what we’ve been seeing in our study.

Again, it adds up fast, turns into a bit of metrics overload.

So where do you keep all of that, right?

A lot of your legacy monitoring tooling is likely not built for this type of capacity or this type of change.

You start to feel like maybe you’re looking at a relic of the past, and you wonder whether it fits into what we’re trying to do today in the cloud with this dynamic infrastructure.

Getting the bigger picture

Another challenge is that a lot of the tools that we’ve been using up until now tend to be host-centric.

With things like ECS abstracting the host away from us so that we can focus on what we care about most which is the service we’re offering our customers, it’s a little challenging to think about this.

So this is the model…the picture you’re seeing here is a model of the solar system.

This is back when we thought the sun revolved around the Earth, along with all the other planets in the universe.

It’s kind of crazy. Look at all these lines.

This is pretty hard math to do.

When we flip things around and think about services, which is, again, what we really care about, what we’re offering our customers, it starts to look a lot neater, right?

The math around this is much, much simpler.

The other thing we want to start to look at is how to avoid gaps in our graph.

You move from one host to the next, you want to see a consistent line across any graph that you’re tracking.

In this case, we’re looking at application latency.

But you want to be able to look at that across your entire infrastructure, whether or not hosts are coming and going.

Now, this is a challenge you already had when you were using auto-scaling groups or really any dynamic infrastructure.

But again, what containers have done is just thrown this into overdrive.

The rate of change is just leveling up all the time, so we want to be able to avoid gaps.

How do we do that?

Monitoring infrastructure in layers

It’s important to start thinking about our applications and our infrastructure as layers, and looking at different tools to monitor each of those layers.

There’s a little bit of overlap between each of these layers.

But if you look at that from this perspective, you can find the right tools to look at it from each point and then you can use something like Datadog to aggregate all that data together.

So at the bottom there, we have something like CloudWatch or some other type of infrastructure monitoring tool at the lowest level, letting us know how our hypervisor is performing and what the hypervisor is seeing about our instances.

You go up a level and you have, again, infrastructure monitoring.

This is where things like your Datadog and some of the other tools out there come into play. They’ll tell you about those individual containers or those individual VMs, and give you things like application metrics: requests per second, or really any other type of work output that you’re producing.

And at the highest level, what you’re looking at is APM; those might be profiling tools or external synthetic monitoring.

Tagging

And now, you can’t be successful at knowing what’s going on here unless you have all of these layers, but those layers together provide a very powerful picture about what’s going on in your environment.

You plug those all together using tags; tags are really the most important piece of all of this as we start monitoring our systems, regardless of what tools you’re using.

You want to make strong use of tags within both your containers and your infrastructure.

Use labels as part of Docker and tags as part of ECS; you want to tag your AWS instances, tags all the way down your infrastructure, so that you can very clearly slice and dice those metrics and know what you’re looking at.

Tags could be anything from instance types to what availability zone they’re running in, to what version of the application is running, all of those bits.

And together, they let you put together queries that ring true about your environment regardless of how many hosts or containers you’re running on a given day.

So, for example, you want to be able to ask your monitoring tooling questions like this.

Business questions, again: show me all my containers, alert me any time a container running a particular image, across a given region, across all availability zones, crosses some sort of threshold on a metric.

In this case, we’re looking at the web container across U.S. regions, across all of our availability zones, where the average memory usage is 1.5 times the average running on a particular instance.

All of these things that are sort of in bold and underlined, those are tags.

These are the types of things that you can use to query your environment, to ask it questions, rather than asking whether a particular host has something listening on port 80 today.

You’re going to ask it how is that application performing.
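
As a rough illustration, a tag-scoped query in that spirit might look like the line below. The metric and tag names here are hypothetical, but notice the scope is expressed entirely in tags, never in hostnames:

    avg:docker.mem.rss{image_name:web, region:us-east-1} by {availability-zone}

Any container that carries those tags gets picked up automatically, no matter which host it lands on.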

So that’s a quick overview of some of the challenges of container-based monitoring.

Next, we’re going to dive into that theory we were talking about earlier.

Monitoring 101

Monitoring 101, again, this is sort of the theory and lessons that we’ve picked up from working with a number of our customers over the last few years as they monitor their infrastructure and as we monitor ours.

We all know that metrics are important.

We all know monitoring’s important, the same way we wouldn’t drive down the highway with our headlights off and the wipers off in the middle of the night.

We wouldn’t want to ship a service to production without monitoring-driven development.

Forget about test-driven development.

Monitoring-driven development: start monitoring these things in staging and in development long before they hit production, because otherwise you won’t know what they look like when they get there.

And when you end up in that situation, you get this, right?

You’re going to hit something along the road, and you’re going to do that even if you prepared.

But if you prepared, you’re going to have that data that you need, because collecting that data is cheap when you do it up front, but it’s quite expensive to try to collect it and manifest it out of thin air later.

Haven’t you guys ever done a post-mortem where you said to your boss, “You know, I don’t know why that happened but the action item is going to be to add more monitoring so I can tell you next time?”

Okay, there’s people hesitating to raise their hands.

I think they’re afraid to be seen on camera.

But the point is, you never want to go to your customer and say, “You know, we’re going to have this incident twice so that I can tell you why it happened,” and prevent it the third time.

You want it to never happen.

But if it’s going to happen, have it happen once and never again.

So again, collecting that data is really cheap when you have it, super expensive when you don’t.

It’s going to cost you in those post-mortems.

Instrument all the things

And so we say instrument all the things.

Again, earlier, we talked about the volume of this, and clearly, there’s a lot of data.

But you gotta weigh that cost.

We’re in the land of Amazon and cloud services.

You get to scale your storage and your compute layers with a credit card.

That credit card transaction, I guarantee you, will be cheaper than not having that data, even if you only ever look at 50% of it, or 20% of it, when that one incident kicks off.

So as we’re looking at all that data, again, there’s this firehose of data coming in, we really need some modern methodologies for looking at it.

We can’t grab our 1960s NASA antennas and use them to figure out what’s going on here.

We put together a bit of a guide on this; I’m going to give the TL;DR edition of it here, but you can find the long version up on our site.

It’s a fantastic article. I encourage you all to read it and take a look at how it might fit into your environment, whether you’re a Datadog customer or not.

I hope you are, and if you are, I’d love to chat with you afterwards as well.

So the short of it is…the biggest piece of monitoring 101 for us is the idea of categorizing your metrics.

Really, we see your metrics falling into three major areas, the first being your work metrics.

For these, think about your application or your environment as a factory; let’s say we’re making cars.

Work metrics

Tesla started doing those preorders recently.

They’re trying to turn them out as fast as they can.

All those orders came through.

What’s the throughput?

The throughput is how many of those cars are coming off of the assembly line at any given time so that we can sell them.

Success and error rates are really more about the quality of the cars coming out: how many of those are missing hubcaps, or have cracked windshields, or don’t have wheels at all?

How successful is our assembly line in outputting this?

And then performance: when one of those requests comes in, how quickly can we turn that car around?

Resource metrics

Resource metrics are all the pieces that go into that, so maybe the rubber that goes into the tires and the number of tires that are available for the car.

And really, you want these because they tell you how much slack you have in your pipeline, how many more cars could you sell, how many more workers could you have assembling things right now?

And that’s where utilization and saturation come in.

They’re two sides of the same coin.

And then, again, error rates and availability: how much of these resources do you have available, and how many errors are you hitting as you access them?

Event metrics

And the final area is events; events are things that provide context.

They’re more qualitative, so it’s going to be things like, hey, you just started a huge sale that lets your customers preorder $1 billion worth of a car that you haven’t started making yet.

You’ve got a big backlog.

You’re gonna want to think about that.

Code changes or formula changes: maybe you’ve changed what goes into the rubber that goes into those tires.

Maybe that’s why things are taking a little bit longer to assemble.

Alerts, things that have notified you; maybe you bought a Super Bowl ad and now you’ve got to track that.

But these are all the things that tell you why your metric dipped or changed at a given point in time.

Examples

NGINX work metrics

So apply this to some examples.

NGINX is one of the most popular things we see running in containers, and so we’ll use it as a quick example here.

So what are some things that could be considered work metrics in the case of a web application like NGINX?

Requests per second, of course, that’s that throughput we were talking about earlier.

Dropped connections, you know, might be that error rate.

Request time, that’s going to be that performance we were talking about earlier, some of the latency that your customers are experiencing. And then, again, back to error rates: maybe 200s versus 500s and 400s.

That’s based on what your application does, but these are some examples.
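
Most of those work metrics come straight out of NGINX’s stub_status module. As a minimal sketch, assuming you’re free to pick the port and access rules, exposing it looks something like:

    server {
        listen 8090;
        location /nginx_status {
            stub_status on;     # active connections, accepts, handled, requests
            access_log  off;
            allow 127.0.0.1;    # only the local monitoring agent may scrape it
            deny  all;
        }
    }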

NGINX resource metrics

Resources, on this side, are the pieces that go into that.

In the case of an API, you’re not using rubber or tires or hubcaps, but you are using things like disk.

You are looking at things like memory.

You’re looking at things like CPU or the depth of the queues that you’re interacting with.

NGINX event metrics

And then, of course, events. These are pretty standard, right? You’re doing deployments all the time, whether those are the blue/green type deployments we were talking about earlier or maybe something a little bit more complex.

But the idea is these are things that you’ve done to your environment, whether that’s, you know, upgrading NGINX, restarting it to reload a new config, deploying some new code on the backend, a big web ad campaign that you’ve started, things like that.

So these are the things that are going to help you understand why your environment has changed.

When to let a sleeping engineer lie?

So which of these do you think we page you on?

When do you let a sleeping engineer lie?

How many of you here are on-call?

Okay, again, pretty much all of the audience. I’ll let you avoid turning your heads.

So when do we want to wake up in the middle of the night?

We care about these leading indicators, right?

How many of you had a CEO call you and say, “You know, you’re using too much CPU. I know the API is returning perfectly, but you’re using too much CPU?”

Okay, maybe the AWS credit card bill was a little high last month or something, but the reality is they care about the business you’re delivering to the customers.

And so really, what you want to do is look at those work metrics.

These are the symptoms of what your customers are experiencing: those APIs that are failing to return or returning slowly.

And you want to use those in conjunction with the events and resource metrics that we were talking about to do some of that investigation.

So really, you want to take a look at something like this.

Every layer in your stack has a work metric; it doesn’t matter if you’re a DBA.

All the way at the bottom, you have a work metric.

You’re providing something to some customer in your environment, whether it’s an app developer or a business user.

There is a work metric there.

If you don’t know what it is, find it.

That’s what you’re being rated on.

And when you come into a situation where you have an incident where you’ve been paged at 3:00 in the morning, you’re trying to figure out what’s going on, you’re going to start at the first work metric, the first symptom that a customer has noticed.

API requests are slow, what’s going on?

You start diving in. Maybe you’re running out of threads in your app here, you don’t have enough containers running, what’s going on?

You’re going to work your way all the way down the stack till you find out maybe you have a slow SQL query, I don’t know. Not to blame the database; I spent my time as a DBA as well, it’s just an example.

So again, really, the key thing is to find those work metrics all the way down and then work your way through the resource and events that tie to them.

Collecting metrics

We’re now sort of getting towards the latter part of the presentation.

We said we’d start with why containers are important.

We talked about that a bit.

We said we’d talk about some of the challenges we have there, as well as some theory about how to monitor them.

So now, we’re going to start to dive in on maybe how to get the stuff out of ECS and Docker.

So the ECS metrics are clearly, again, things like your cluster health, while the Docker metrics and your application metrics might be the things that are more specific to your environment.

And then you’re going to want to start figuring out the resource metrics versus the work metrics, and what we’re going to alert on.

So in the world of containers, really, it’s not all that different from the NGINX application that we were looking at before.

You’re going to look at things like CPU, memory, I/O, network traffic, etc.

These are utilization. Again, these are your resource metrics.

You’re going to look at things like, again, saturation, swap.

All of these things are key to knowing whether or not you have enough resources to do the job that your customers are asking you to do.

Docker and ECS metrics

And then events: again, these are mostly going to come from Docker and the ECS APIs.

The ECS APIs offer you deployment events that you can pull into tools like Datadog or into your logging and event management tooling.

But other events might be auto-scaling or other changes to the underlying instances.

We’ve all fallen in love with our spot instances and our auto-scaling groups, ensuring that we always have the right amount of the thing we want when we want it.

We’re not giving that up just because we’re using containers, so there are going to be some events around those instances changing quite a bit as well.

So we go back to this diagram that we were looking at before where we were looking at the layers.

And so most of the things that we just talked about there in terms of resources and events, they’re going to come from this lower level part here.

You’re getting a lot of that from CloudWatch.

Again, CloudWatch is going to give you a good sense of how many of these resources you reserved, how much you’re using, and how much is just sitting out there unreserved.

You want to get familiar with these.

These are going to tell you how much slack you have in your environment to deploy more containers.
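
As a sketch, you can pull those reservation numbers straight out of the CloudWatch API. The metric and dimension names below come from the AWS/ECS namespace; the cluster name and time window are placeholders:

    aws cloudwatch get-metric-statistics \
        --namespace AWS/ECS --metric-name CPUReservation \
        --dimensions Name=ClusterName,Value=my-cluster \
        --start-time 2016-04-21T00:00:00Z --end-time 2016-04-21T01:00:00Z \
        --period 300 --statistics Average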

Moving up a layer in the stack, again, we’re talking about our infrastructure tooling.

We’re going to start talking about things like the file system space we have available (disk usage is not one of the metrics CloudWatch returns for you on ECS), but also a lot of the work metrics we were talking about earlier: the number of queries you’re returning, things of that nature.

So how do we get at those metrics?

Well, luckily, Docker offers a myriad of ways to get at these.

The most popular three are the ones listed up here.

We have pseudo-files and we’ll talk about those in a second.

Those give you a point-in-time view into things like CPU, memory, and some of the I/O metrics, and the number of stats available there is growing over time with each release.

If you’re using ECS with the ECS-optimized instances, most of the caveats up here honestly don’t matter to you.

They’re running a modern enough version of Docker that you just don’t care.

You can get all of this from any one of them: the pseudo-files, the stats command, and then, of course, the Docker API, the stats API.

Pseudo-files

So what are pseudo-files?

Pseudo-files are a way to get at container metrics via what seemingly feel like files on your file system.

These are under sysfs. If you’ve been using Linux for a while, you know what sysfs looks like.

Basically, these paths depend on your operating system.

Whether you’re running Ubuntu versus Red Hat versus Amazon Linux determines where these are mounted.

But you’re going to be able to pull up metrics on a per-container basis, with the last part of the path here being the container ID that you fill in.

Fairly straightforward, fairly quick to get going, but you’re probably not sitting there tailing and catting files all day.

But to give you some examples of things you can do: you just cat these out, and you’ll start to get things like how much time your containers have been spending in CPU user space, how much time you’re spending executing system calls, what have you.

There’s lots of great data here across a number of different files.

The CPU accounting (cpuacct) and CPU stat files are two very useful ones to take a look at.

Throttling comes into play when maybe you’ve scheduled too much on that particular host and you find yourself up against the limits that you might have set.

So again, just because we’ve packed everything so efficiently into those containers on our host, like the shipping containers we were talking about earlier, doesn’t mean you don’t have to worry about resources anymore.
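
As a quick sketch of what that looks like in practice (mount points vary by distro; Amazon Linux has historically mounted cgroups under /cgroup rather than /sys/fs/cgroup):

    # Full, untruncated ID of some running container:
    CONTAINER_ID=$(docker ps -q --no-trunc | head -n 1)

    # CPU time consumed by that container, split into 'user' and 'system',
    # in USER_HZ ticks (usually 1/100ths of a second):
    cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat

    # Memory breakdown for the same container: cache, rss, swap, and so on.
    cat /sys/fs/cgroup/memory/docker/$CONTAINER_ID/memory.stat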

Docker API and STATS command

There’s the Docker API.

This is going to give you some fairly detailed streaming metrics over HTTP.

You can access those over a port or over a socket.

So in this example here, I’m showing you how to do it with the Unix socket.

That lets you avoid exposing that port anywhere, and it’s going to give you a ton of data.

You’ll see the numbers run off the screen here; no matter how small I made the font and how many times I cut up the image, I couldn’t make it all show up there. So go back and run it on your instances.

Again, you’re going to be able to pull these down both at a higher level but also on a per container level.

You’ll notice again here, this is the container ID at the end there, and these are streaming in.

So you do this, you run the curl command, and you’re just going to constantly be getting a stream of data coming through.

It’s a bit of a firehose to drink from.
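
If you just want one snapshot rather than the stream, the same endpoint takes a stream=0 parameter. A sketch with curl (7.40 or newer for --unix-socket), with the container ID left as a placeholder:

    curl --unix-socket /var/run/docker.sock \
        "http://localhost/containers/<container_id>/stats?stream=0"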

And then finally, you have the docker stats command.

It’s similar to top or ps, if you’re used to those.

You just pop in the container ID, you’re going to get a very quick summary, human-readable format, very useful in a quick troubleshooting sense.

Side car containers, agents, and daemons

But really, as I was saying before, you don’t want to be sitting around all day tailing and catting and curling things, right?

We’ve got better things to do.

We want to go home. We wanna be with our families.

We want to write some code for once rather than staring at CLIs all day.

So we’re going to pull in things like sidecar containers to monitor this for us.

So this is an example of what a Datadog deployment on top of ECS might look like.

Really, it’s the same in any containerized environment.

You deploy Datadog as a container next to your applications, just as you would any other application in your ECS environment. Your monitoring tool then pulls up those metrics via the stats API and spits them back out into your dashboards and your alerts and what have you.

At that level, it’s able to pick up metrics from all of your apps and the bits underlying them.

But ideally, we want to schedule these in the same way that we would any other task in our environment, right?

We just talked about how you don’t want to run all your app containers manually on every host in your environment.

Why would you do that with your monitoring?

So the challenge here is that, at least in ECS at the moment, there’s not a good way to say I want a particular container to run on every single host in my environment at least once, in the way you’d have some of your other daemons running, whether that’s cron or something else in your environment.

And so there’s a couple of options.

We can bake it into our images. That sort of feels like it’s going back a step.

We just talked about how sort of painful and heavy that was.

We can install it at provision time.

You’re back to the manual piece, especially if you’re using something like auto-scaling groups.

The hosts are coming and going.

IAM privileges

Or the third option, which is a little bit of a hack but has been quite successful for our customers: automate this with user-data scripts and launch configs.

So what’s that look like, right?

We’re all familiar with IAM, and if you’re not… well, I hope you are.

Please don’t be using your root accounts for anything.

But some of the newer features in IAM let you delegate privileges to your underlying Amazon instances, right?

They can pick up keys on the fly to make API calls for very scoped use.

In the case here, what we’re going to do is create some policies that let our instances schedule tasks and launch our container. All the orange bits here are variables that you’re going to want to fill in for your particular environment.

But we’re going to go ahead and create a role, attach a policy to it.

You’ve got a policy on the right hand side here.

What you’re saying is that your instance can assume a role to quickly register a container instance, deregister it, and submit some stats about it.

This is so that you can automate all of this, and it gets you two things.

One, don’t put keys on those hosts.

At some point, that key will get compromised.

You don’t want to have to rotate it and figure out how to go around re-baking keys all over the place.

And two, it’s going to let you automate all of this, so it’s quite helpful.
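
As a rough sketch of what that policy document can look like, with the action list borrowed from the standard ECS container instance role (scope it down further for your own environment):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "ecs:RegisterContainerInstance",
            "ecs:DeregisterContainerInstance",
            "ecs:DiscoverPollEndpoint",
            "ecs:Poll",
            "ecs:StartTask",
            "ecs:StartTelemetrySession",
            "ecs:Submit*"
          ],
          "Resource": "*"
        }
      ]
    }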

We’re going to grab, again, a user-data script that takes advantage of that.

Again, in this case, we’re looking at the Datadog agent as the Docker container we’re pulling down and launching, as the task that we’re launching.

But really, this can work with any container: any daemonized container that you want, any container that represents a daemon or a set of services that you want present on every host.

You can just change a couple of variables here, and again, a link to this is in the notes as well.

But you’ll see what this does: it spins up quickly when your instance fires up, and you can kick it off with something like rc.local.

The first thing it does is register that instance with ECS as an available ECS instance, and then run your container on it.
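
A simplified sketch of that user-data script, launching the agent directly via Docker rather than as an ECS task; the cluster name and API key are placeholders, and the image and volume mounts follow the Datadog agent’s documentation of the time:

    #!/bin/bash
    # Join this instance to the ECS cluster:
    echo "ECS_CLUSTER=my-cluster" >> /etc/ecs/ecs.config

    # Wait until the local ECS agent is answering before launching anything:
    until curl -s http://localhost:51678/v1/metadata >/dev/null; do sleep 2; done

    # Run the monitoring agent container on this host; swap in whatever
    # daemon-style container you want present on every instance:
    docker run -d --name dd-agent \
        -v /var/run/docker.sock:/var/run/docker.sock:ro \
        -v /proc/:/host/proc/:ro \
        -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
        -e API_KEY=<your-api-key> \
        datadog/docker-dd-agent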

Autoscaling

And then, of course, auto-scaling: we talked about how much we love auto-scaling and how important it is to us. What you can quickly do is take those instance profiles and those user-data scripts, drop them into a launch config, and use them for your auto-scaling just like you would anywhere else.

But aren’t we still missing a layer, right?

In the last couple minutes here, we talked about how to monitor the operating system.

We talked about how to monitor the cluster.

We talk about how to monitor some of your containers.

We still have some open questions.

Open questions

We crossed off sort of the top two: where is my container running, and how much of my cluster’s capacity am I using?

We have a couple others in there that we’ve answered as well, but we still don’t know things like what ports your app’s running on.

If you have something like an NGINX status page, how do you get to it to pull those metrics?

The port changes every time you launch it.

We still don’t know the throughput of your particular app.

We have the throughput of the cluster.

I can tell you how many network bits came out the front of a particular instance or a cluster of instances, but I haven’t told you how to get that from your individual app.

And then we want to get things like response times across versions, not just across the app as a whole or the error rates, what have you.

Service discovery

So we still have a bunch of open questions here.

So there’s a couple different ways we can go about it.

This is an approach that we’ve been quite successful with, and something that some of our customers are doing today.

At the top here, we have things like our individual applications, maybe a web app that you wrote or an off-the-shelf component, something like Postgres or some other data store.

We’ve got all these different things running here and they’re changing ports all the time.

We already have the agent container collecting the host-level metrics.

But where are we going to get the ports and all the other data about what’s going on there?

So you can pull some of that stuff from the Docker API.

You can know when a new container launched because you’re gonna watch those events.

You can get that from the ECS deployment events as well, at a high level that’s there.

ECS and CloudWatch, again, offer some bits of metadata and tags.

But really, what we’ve started to do is use service discovery tools like etcd or Consul (some folks are using ZooKeeper; etcd and Consul are the two that we support, but there are more coming every day), and store both configuration snippets and templates in there, but also pick up when those services are running, to let us do auto-configuration and identify, again, what port those things are running on at any given point in time.
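
To make that concrete, here’s a sketch of storing a check template in etcd with etcdctl. The key layout and the %%host%%/%%port%% template variables follow the Datadog agent’s service discovery convention; treat the exact paths as assumptions for your own setup:

    etcdctl set /datadog/check_configs/nginx/check_names '["nginx"]'
    etcdctl set /datadog/check_configs/nginx/init_configs '[{}]'
    etcdctl set /datadog/check_configs/nginx/instances \
        '[{"nginx_status_url": "http://%%host%%:%%port%%/nginx_status"}]'

The agent watches those keys and fills in the host and port from the Docker API whenever a container running that image starts, so the moving ports stop mattering.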

Custom metrics

And then, of course, the most important piece, or maybe not the most important but something I think is quite important and hope you’re doing as well: custom metrics, because you know your applications way better than I do.

You might tell me that it’s a web app, and I’ll tell you, yes, okay, 500s versus 200s are important to me.

But which of those APIs is a check-out on your shopping cart or the thing that actually puts money in your bank account?

I don’t know that, you know that.

So what we’ve seen our customers doing, and what we encourage folks to do, is to start finding those key transactions and instrumenting your applications around them.

We offer a bunch of SDKs for doing this, but there’s also open source and standard tooling out there, whether it’s via JMX or WMI, what have you.

You know, those are going to be more synchronous calls, something where somebody has to connect in to you to get them.

But you can use things like Etsy’s StatsD, which is asynchronous; you’re going to be able to submit those metrics over time and send them to us or to some other tooling as they’re being generated from those transactions.

Again, you know your transactions best.

You really want to do this with, again, something that’s as asynchronous as possible, something that’s fire-and-forget.

StatsD is fantastic for this: you just send UDP.

Usually, there’s a collector locally on the host, again, whether that’s Datadog or not, and it sends that along to some sort of graphing backend over a more reliable protocol like TCP.

The reason those async protocols are important is that you don’t want your applications blocking on some backend service that’s waiting to tell you whether or not it received those metrics.
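
As a tiny illustration of how fire-and-forget this is, bash can emit a StatsD counter with nothing but its built-in /dev/udp; the metric name and DogStatsD-style tag here are made up, and 8125 is the conventional local StatsD port:

    # One UDP datagram, no response expected, so the app never blocks:
    echo -n "shop.checkout.completed:1|c|#version:2.1" > /dev/udp/127.0.0.1/8125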

Stay out of the doghouse

In general, I mean, we’re coming towards the end here, I want to make sure I leave a little bit of time for questions.

But my goal here is really to help you guys figure out how to stay out of the doghouse even as you’re going into containers.

And so whether it’s here up front or maybe later on down at the Datadog booth, I’d love to chat with you, ask a bit more about what you’re using to monitor ECS and containers, what types of things you’re containerizing, and just learn about some of the tools that you’re using.