Advances in container monitoring
Published: September 28, 2017
Michael: All right. So, good afternoon. I’m a Product Manager here, and I’m here to talk to you about new tools for understanding your infrastructure and for real-time debugging. I’m gonna introduce two new products and both of them will give you visibility and help to understand the most granular, numerous, and ephemeral parts of your deployment. So, I’m excited to announce Live Containers and Live Processes which is in beta. I’d like to start with Live Containers, which will give you insight into the status and performance of all of your containers in real time.
So, container adoption is exploding. The benefits of running containerized infrastructure are clear. Containers allow teams to run applications in isolation. They are highly portable and they are lightweight. They’re easy to stop and start and run on any host across your infrastructure without special considerations, allowing for just-in-time provisioning. This enables teams to move at higher velocity and make better use of resources.
With that in mind, it is not surprising that in our recent study, we found that Docker adoption was up 40% year over year among you guys. We also found that Docker runs on 15% of all hosts running Datadog, and it is not just that more and more organizations are adopting Docker, but those that do adopt ramp up very fast. The average company quintuples its Docker usage within nine months.
For a sense of scale, we have customers from the beta for this product, who we have helped to monitor tens of clusters comprised of thousands of hosts running tens of thousands of concurrent containers. Because of the portable nature of containers, they can churn very quickly, and some companies need to monitor hundreds of thousands of containers per day, which is a lot to monitor.
As adoption is becoming standard, we are seeing the industry mature and enter a new phase. Containerized applications are deployed to production. This is happening now. This is not a science experiment, and it’s become vital for our monitoring methods to give us complete visibility into these deployments.
Almost half of Dockerized organizations are making use of orchestrators to manage enormous deployments. With the use of an orchestrator, containers can be scheduled to run on hosts where they will make the best use of resources and provide the best experience to customers without micromanagement by engineers.
The other thing we’re seeing is the migration of existing applications to Dockerized or containerized workloads. A lot of the initial container adoption was microservices, was new, cloud-native applications that were built for this architecture, but companies now want to see the benefits in monoliths, for portability and CI, without necessarily spending the time right now to chop them up into microservices.
Product Goals and Inspiration
So, the goals of this product were to treat containers as the primary objects of the system, to give you visibility into your inventory at any grain or by any metadata, to provide transparency into your container deployment, and to give you a method for real-time debugging. Oh, sorry. With all this in mind, we look to industry for inspiration and to build a new product to help monitor this environment.
Htop, which provides visibility into the proc file system, and its container sibling, ctop, give administrators and engineers quick answers to how their systems are performing. Kubectl, the command line interface for interacting with the Kubernetes API, allows users to inventory their clusters at any grain. Powerful though they are, these are host- and cluster-level tools. So, we have combined the value that we saw in each of these tools to give you a distributed service for monitoring your system.
As a quick overview, here, you can see all of the containers across my deployment. Each container is decorated based on the orchestrator integrations. In this case, it’s just Kubernetes, but there are others. And we provide a summary of metrics for the containers on the table, and surface some popular tags. So, I’d like to step aside for a moment to allow Alan Scherger to talk about monitoring containers at HomeAway and about his experience during the beta.
Alan: Thanks, Michael. All right. So, I’m gonna give a little talk called, “Little PIDs Made of Ticky-Tacky,” and they all look just the same. So, who am I? I’m Alan Scherger. I’m a Senior Janitor at HomeAway. I’m a Mesos Marathon junky. So, long before this Kubernetes phase came out, we’ve been looking at Mesos Marathon. And I’m trying to become a Golang developer. If you’re trying to become a Golang developer, there’s an awesome book at gopl.io. I highly recommend you pick it up and read it. So, I’m gonna give you an overview of what we’re gonna talk about today.
First off is the history of HomeAway and where we came with containers. It’s primarily a complete lie but there are some truths in it. There is gonna be some clever metric tooling that we’ve built. So, Galen, who’s here, I’ll have you, yes, stand up later, and Chris Barry, who’s also here, did an awesome job with that. And finally, we’ll get into exploring metrics and how to use this new process agent.
A “History” Lesson
So, history lesson. Again, a complete lie, but probably a familiar story for all of you. You came up with this great idea called HomeAway and you’ve got one server and it’s running HomeAway and that’s awesome. And then you’ve got your friends using it and now you have this terrible idea of adding reviews. And so now, you have two hosts to manage. An arbitrary amount of time later, and, boom, now you’ve got multiple hosts running way more apps than they ever should have, and you’ve got a NetScaler that’s load balancing them. Another round of funding later, and some more terrible ideas, and now, you’ve got more hosts.
Multiple Data Centers
Who knows how long later, and eventually, you’ve got multi-data-center problems, right? And at this point in time, managing hosts, you know, like bespoke little pets, that doesn’t make sense. We’ve gotta start moving towards a cattle model. We wrote a Rails app that essentially does nothing, which is the beautiful part of it, and it’s the glue that holds everything together; it’s a central database. And from that, before microservices were even a thing, we built microservices that would ask that database, “Hey, should I be doing any work, and if so, what work should that be?” And it was a simple workflow engine with state. And it was cool and it worked for a while.
But now, we want all of these things. So, predominantly, being programmatic routing and cloud, I wanna be able to bring up and tear down instances whenever I want. I want that to actually be cleaned up throughout all my infrastructure. I wanna actually be able to do global deployment pipelines, so that I can just pick a region and actually be there or pick a cloud and actually be there.
By the way, I just spent millions on a data center, so it would be great if we could also just make that look like a cloud. And streams. I would like all of the streams. I’d like to stream my data to more streams, have those streams stream on streams, and all of it be logged and have metrics. That would be great.
Yeah, so, if you’re an ops guy, you’re like, “You’re crazy.” But we can start to tear this problem apart, right? So, I don’t have the budget to go hire 100 ops engineers to build these bespoke little pets. But what I can start to take advantage of are schedulers, and schedulers are gonna be the future, guys. I don’t know what to tell you.
Inception of Schedulers
So, back in 2009, Ben wrote this awesome paper with some friends about Mesos and he turned it into a real product. It’s awesome, it actually works, and it’s very simple. Docker was then actually released after Mesos was born, and eventually, Mesos allowed Docker to become a first-class citizen. Kubernetes was then born. Out of Kubernetes, of course, Docker needed to compete, and so Docker Swarm was born. And most recently, you have HashiCorp’s Nomad scheduler which is really cool.
I don’t know if any of you guys attended HashiConf, but the fact that you can run Nomad on your CLI as a Golang app in dev mode and be able to actually schedule things to that host is pretty nifty, it’s pretty awesome for development. It’s pretty great.
So, naturally, we’re a pretty conservative company and we picked Mesos because it has been around the longest. Kubernetes was just getting its feet off the ground. Nomad didn’t even exist. And we have a lot of it now. And it’s pretty awesome.
So, the core problem that you’ll see with a Mesos cluster is that it’s got a core dependency on ZooKeeper. Knock on wood, that has actually not been a problem for us. So, Netflix released a sidecar called “Exhibitor.” And by the use of Exhibitor, we’re able to keep ZooKeeper clusters up and running. And aside from one Marathon upgrade, we haven’t had any issues of losing apps or needing to completely rebuild the cluster from scratch. That said, we’ve taken a very conservative approach on how we roll out our Mesos clusters. And so we have a whole orchestration platform on top of them. So, for the most part, it’s actually completely transparent to you as a developer that you’re even deploying to Mesos, other than that, through the grapevine and through having to debug, you have probably found out that it’s a Mesos/Marathon cluster and how to debug it, as far as getting access to your logs and metrics.
Discovering Metrics Problem
But this came to an interesting point when we were doing this POC of moving to a scheduler because, now, instead of having all these pets run on a specific host, where that host has been my host or it’s been co-tenanted with a bunch of friends, my app is gonna be moving around to a bunch of different hosts all of the time and we need to get metrics off the host. And so, naturally, you’re gonna pick Datadog because you’re all here at the summit, and why would you not pick the Datadog agent? And even if you’re not gonna pick Datadog as someone to pay money to, you really should just use their agent. It’s a pretty great agent. I’m certain that Datadog Agent 6 is gonna be even more awesome. It’s a great pattern.
But containers are ephemeral. So, if you start to look at the Datadog config, you’re gonna notice that all these configuration files are YAML files that live on a host. So, now, I have some questions: “How do I actually get an app that’s moving around different hosts to have its metrics picked up by a Datadog Agent, when the agent is constantly gonna have to be getting new configuration on every check run to figure out which app is running where, what kind of app it is, and how to get metrics out of it?” And that brings us to the discovering metrics problem.
And so, today, probably the easiest solution that you can use is either the Autodiscovery features or, honestly, having your apps just dump their data into DogStatsD. However, both of these either didn’t exist or were fairly immature, and that’s something I get to say a lot now. And that meant we needed to roll our own solution, and that’s where, working with Chris, we’ve rolled two pieces of code.
Check Generator Tool
So, one of them is a check-generator. So, the way this works is it uses another tool called the “Consul2LocalServiceRegistry” tool. And that actually talks to Consul. So, I’ll walk you through this without talking too much.
The service check that Chris wrote generates more service checks and it does that by talking to the local registry. And it’s able to do that because of something we’ve defined as a metric family. So, I might have five apps that are all different and owned by different teams, but under the hood, they’re all nginx. So, I might create the family called “Nginx,” but according to the developers, they’re all different applications.
Similarly, you might have the same thing with Kafka clusters. There might be 10 teams that have 10 different Kafka clusters, but each one wants to know that they are their own thing. And when we want to instrument them, we’ll use a metric family of Kafka-broker to be able to actually distinguish that they’re different applications, but they all get their metrics collected the same way.
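As a rough illustration of the metric family idea described here, the sketch below groups apps by a declared family so that each family can share one collection method. It is not HomeAway’s actual tool; the app names, team names, and field names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical registry entries: each app declares its metric family.
# App names, teams, and field names are invented for illustration.
SERVICES = [
    {"app": "search-frontend", "team": "search", "family": "nginx"},
    {"app": "booking-proxy", "team": "booking", "family": "nginx"},
    {"app": "events-kafka", "team": "data", "family": "kafka-broker"},
    {"app": "audit-kafka", "team": "security", "family": "kafka-broker"},
]


def group_by_family(services):
    """Map each metric family to the distinct apps collected the same way."""
    families = defaultdict(list)
    for svc in services:
        families[svc["family"]].append(svc["app"])
    return dict(families)


if __name__ == "__main__":
    for family, apps in group_by_family(SERVICES).items():
        print(family, "->", apps)
```

Each app keeps its own identity for its owning team, while the family decides how its metrics get collected.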
Collecting Dynamic Checks
Similarly with Dropwizard. We have all of these Dropwizard apps. Dropwizard apps expose their metrics in a similar way, so we’ll have a Dropwizard metric family to do that. And the way that we’re able to then take these families, which are essentially different ways of collecting checks or different ways of instantiating these dynamic checks, is a very agnostic one, which is through this Consul2LocalServiceRegistry.
So, we happened to pick Consul, internally, for service discovery, but this pattern will work if you’re using etcd, ZooKeeper, even a file on the disk. However you want, this tool will display a mapping to the Datadog service check to be able to generate all of those generic checks. And, I think, you can see that… That was a bad idea.
All right. So, the way that this works is it exposes our apps. Bar and Foo are of metric family type codahale. And if you wanna get information about Bar or Foo, you can go to their address; you know their app name and what environment they’re in, because we might be co-hosting environments on the same hardware, so I wanna be able to actually know that Bar, in stage, has these metrics. And we’re able to expose the fact that this is actually the metric address and port that we’ll be listening on for this app, because many times, the metric address and port will be different from the actual port that’s serving the application. So, it’s a pretty cool pattern. We can talk about it more in depth later, but that’s one of the big things that got us off to the races with containers at HomeAway, and using Datadog.
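To make the registry-to-check mapping a little more concrete, here is a minimal sketch that turns registry entries (app, environment, family, metric address and port) into one check configuration per family. The `init_config` and `instances` keys follow the general layout of Datadog check configuration files; everything else, including the field names, the addresses, and the `/metrics` path, is an assumption for illustration rather than the real tool’s output.

```python
import json

# Hypothetical local registry contents; field names and values are assumptions.
REGISTRY = [
    {"app": "bar", "env": "stage", "family": "codahale",
     "metric_address": "10.0.0.12", "metric_port": 8081},
    {"app": "foo", "env": "prod", "family": "codahale",
     "metric_address": "10.0.0.15", "metric_port": 9091},
]


def build_check_configs(registry):
    """Return one config per metric family, with one instance per app."""
    configs = {}
    for entry in registry:
        config = configs.setdefault(
            entry["family"], {"init_config": {}, "instances": []}
        )
        config["instances"].append({
            # The metrics endpoint uses the metric address and port,
            # which may differ from the port serving the application.
            "url": "http://{}:{}/metrics".format(
                entry["metric_address"], entry["metric_port"]
            ),
            "tags": ["app:" + entry["app"], "env:" + entry["env"]],
        })
    return configs


if __name__ == "__main__":
    print(json.dumps(build_check_configs(REGISTRY), indent=2))
```

Regenerating this mapping on every run is what lets the checks follow apps as they move between hosts.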
And then the next part was “Visualizing & Monitoring Metrics,” which Galen is gonna give a talk on, and he’ll be available. Galen, I don’t know if you wanna stand up so people can come find you. But essentially, Galen and team wrote a bundle that we’re able to put into Dropwizard apps to dynamically build dashboards, as well as expose some Java annotations so that things can be monitored without a lot of work, which brings us to this product release, which is “What about the things we aren’t collecting metrics on?”
So, there are a lot of things that we just don’t have checks for, but it would still be nice to know how insane are they running. And that brings us to Process Agent, which is essentially a global, searchable process table across your entire infrastructure, which, for someone who’s running multiple Mesos clusters or multiple Kubernetes clusters, multiple of anything, I now have one giant process table that I’m able to search things on. And lucky for you, the installation is literally as easy as three steps.
Upgrade your Datadog agent, tell them you want that Process Agent enabled, and restart Datadog. And now, with the global release, we’ll actually be able to see this in the dashboards. And once you start doing this, you can start exploring. It essentially takes the process table of every app and dumps it into Datadog for you. So, right away, I was actually able to save HomeAway 124 gigs of RAM on two boxes because, as we started to roll this out, these two PIDs started to rise to the top as the biggest consumers of RAM, and they turned out to be processes that had run away.
The other problem that I have to solve a lot, which, of course, we don’t instrument with Datadog at all is “Security Agent X is doing something weird.” And is it doing something weird across everywhere or just on a few hosts? So, right away, what you’re gonna be able to see is some awesome functionalities. So, what’s blurred out is me being able to search on all of the process tables up at number one.
Number two, which is, I think, a not terribly advertised feature is… By the way, you have this many processes running. So, when a security team comes to me and says, “Hey, I think we’re running, you know, we’re definitely running Agent Foo Bar,” and I go to Datadog now, I can search for Foo Bar and be like, “You’re definitely not because it’s only installed on three boxes.” I can know that from that information right there.
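The idea of one searchable process table across every host can be sketched in a few lines. This is only a toy model with hypothetical hosts and processes, not how the Process Agent or Datadog’s backend is implemented.

```python
# Hypothetical per-host process tables, as an agent might report them.
PROCESS_TABLES = {
    "host-a": [{"pid": 101, "cmd": "/opt/agent-foobar --daemon", "rss_mb": 120}],
    "host-b": [{"pid": 77, "cmd": "nginx: master process", "rss_mb": 30}],
    "host-c": [
        {"pid": 12, "cmd": "/opt/agent-foobar --daemon", "rss_mb": 95},
        {"pid": 13, "cmd": "java -jar app.jar", "rss_mb": 2048},
    ],
}


def search_processes(tables, needle):
    """Return (host, process) pairs whose command line contains needle."""
    needle = needle.lower()
    return [
        (host, proc)
        for host, procs in tables.items()
        for proc in procs
        if needle in proc["cmd"].lower()
    ]


def hosts_running(tables, needle):
    """Return the sorted hosts with at least one matching process."""
    return sorted({host for host, _ in search_processes(tables, needle)})


if __name__ == "__main__":
    matches = hosts_running(PROCESS_TABLES, "foobar")
    print("{} of {} hosts run it: {}".format(
        len(matches), len(PROCESS_TABLES), matches))
```

Comparing the number of matching hosts with the total is the kind of check that answers “is this agent really installed everywhere?”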
The third thing is being able to group based on either users or availability zones. It’s awesome, and within those, being able to actually filter on those groups is even more powerful.
The next thing is graphs. So, there are a lot of things that are actually being graphed out of this agent. They’ve got an awesome availability of tooling from there, and there are more graphs. So, there are actually roll-up graphs that do heat maps, essentially, of what the values are within things. And, of course, when I can actually click on that heat map, wink-wink, I’ll actually be able to identify the host that is the outlier, and maybe that will be a feature someday, which brings us to the next set of features, which is the containerization and being able to visualize the containers.
And that’s really critical because when we start to look at the metrics of things that are running on these Mesos clusters, you start to see things that are just like, “That app is just burning fire.” Like, but, seriously, what are you doing? And it’s the fact that you are a log generator generating logs of logs, for logs, by logs, shipping logs. And so maybe that’s okay, you setting that machine on fire is completely okay and not interesting, and I can quickly visualize and search on that.
And the next thing is, actually, visualizing the madness across, again, all of your infrastructure. There is no multi-Mesos cluster roll up of metrics. There is no multi-Kubernetes roll up of metrics. Now, I can do it and I can do it at the process layer, and I know that it’s working, and I don’t care about your new technology. It’s there, and I can group it up by availability zone and know that we’re running stuff and you’re up and it’s not doing anything but burning money, and why? Needless to say, this tool is pretty awesome and I’m excited for its release and that you guys can get to use it. So, thanks, Michael.
Michael: No, thank you. All right. So, I’m good, right? No, thank you very much. So, some of this has been covered now, but I do wanna walk through a couple of those individual features as well.
Tags are a part of the heart and soul of monitoring with Datadog. Each container inherits the tags from the host where it has been spun up. And additionally, we attach integration-specific tags to the containers themselves. Here, to reduce system noise, I have filtered down to a single namespace. You can also do this to look at a single cluster, availability zone, provisioning role or whatever. I also wanna point out that all of the metrics on this page are shown out of the provisioned limits, where they have been set. So, this allows you to quickly highlight poorly provisioned containers and better use your resources.
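Showing usage out of the provisioned limit is simple arithmetic, sketched below with hypothetical containers. The thresholds for “poorly provisioned” are arbitrary assumptions for the example, not values the product uses.

```python
# Hypothetical containers; a limit of None means no limit has been set.
CONTAINERS = [
    {"name": "web", "mem_mb": 900, "mem_limit_mb": 1024},
    {"name": "cache", "mem_mb": 100, "mem_limit_mb": 2048},
    {"name": "batch", "mem_mb": 300, "mem_limit_mb": None},
]


def percent_of_limit(usage, limit):
    """Return usage as a percentage of the limit, or None if no limit is set."""
    if not limit:
        return None
    return 100.0 * usage / limit


def poorly_provisioned(containers, low=10.0, high=85.0):
    """Flag containers far below or close to their limit (thresholds assumed)."""
    flagged = []
    for c in containers:
        pct = percent_of_limit(c["mem_mb"], c["mem_limit_mb"])
        if pct is not None and (pct < low or pct > high):
            flagged.append((c["name"], round(pct, 1)))
    return flagged


if __name__ == "__main__":
    for name, pct in poorly_provisioned(CONTAINERS):
        print("{}: {}% of memory limit".format(name, pct))
```

Here `web` is close to its limit, `cache` is heavily over-provisioned, and `batch` is skipped because it has no limit to compare against.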
Based on tagging, we have also implemented pivot tables for the container table, allowing you to understand your infrastructure at any grain. Is this playing? Okay. Yup, okay. So, instead of listing my containers, I might want to get a list of my deployments, or I might want to group by Docker version and see where I haven’t updated, or compare performance between versions. Now, I can pivot by host and see where containers are orchestrated.
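Pivoting a container list by a tag amounts to a group-by with per-group aggregates. The sketch below shows the idea with hypothetical containers and tag keys; it is not the product’s query engine.

```python
from collections import defaultdict

# Hypothetical containers with tags; names and values are invented.
CONTAINERS = [
    {"name": "web-1", "cpu_pct": 12.5,
     "tags": {"docker_version": "17.06", "host": "node-1"}},
    {"name": "web-2", "cpu_pct": 7.5,
     "tags": {"docker_version": "17.06", "host": "node-2"}},
    {"name": "db-1", "cpu_pct": 30.0,
     "tags": {"docker_version": "1.12", "host": "node-1"}},
]


def pivot(containers, tag):
    """Group containers by one tag, counting them and summing their CPU."""
    groups = defaultdict(lambda: {"count": 0, "cpu_pct": 0.0})
    for c in containers:
        key = c["tags"].get(tag, "untagged")
        groups[key]["count"] += 1
        groups[key]["cpu_pct"] += c["cpu_pct"]
    return dict(groups)


if __name__ == "__main__":
    for tag in ("docker_version", "host"):
        print(tag, pivot(CONTAINERS, tag))
```

Swapping the tag key is all it takes to move from a per-version view to a per-host view of the same inventory.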
Container Monitoring Features
Or instead of containers, I can also pivot by pods. And so now, I can see where each of my pods is living and then drop down and inspect any of the pod-host pairs, and then drill further down into the containers inside and see their local history. So, the tools that I talked about earlier, particularly htop and ctop, are extremely useful for real-time debugging and we’ve designed this product with that workflow in mind. You’ve already seen the query and response interface with Alan’s stuff and what I just showed you, and summary and inspection graphs for local and global context.
Finally, all metrics in Live Containers are reported every two seconds. This is especially important for highly volatile metrics, which characterizes a lot of important container metrics like CPU and transmit operations. So, here, you can see on my right, yeah, that far side, the 10-second resolution, and then closer to me, the 2-second resolution, and you can see how, at the coarser grain, it does round out those CPU spikes. Oh, here we go, bigger.
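The way a coarser resolution rounds out spikes can be seen with a small worked example. It assumes the 10-second points are plain averages of five 2-second samples, which is an assumption made for illustration, and the CPU values are invented.

```python
def downsample(samples, factor):
    """Average consecutive groups of `factor` samples into coarser points."""
    points = []
    for i in range(0, len(samples), factor):
        window = samples[i:i + factor]
        points.append(sum(window) / len(window))
    return points


# Hypothetical CPU percentages sampled every 2 seconds (20 seconds in total).
cpu_2s = [5, 5, 95, 5, 5, 5, 5, 5, 90, 10]

# Five 2-second samples make up one 10-second point.
cpu_10s = downsample(cpu_2s, 5)

if __name__ == "__main__":
    print("peak at 2s resolution: ", max(cpu_2s))
    print("peak at 10s resolution:", max(cpu_10s))
```

With these numbers the 2-second series peaks at 95, while both 10-second points average out to 23, so the spikes disappear entirely at the coarser grain.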
So, I have shown you a new product in Datadog, which can help you understand your container inventory at any grain, and with the ability to drill down into the finest details. This will give you better visibility into the provisioning of resources and new tools for real-time debugging. We are happy to announce that general availability is today for all users. Live Container collection is enabled by default on the most recent agent, 5.17.2, and there’s no configuration necessary. We have enabled the feature across all of the accounts and it can be found nested in your inventory menu.
Live Processes Overview
So, I mentioned it earlier, and Alan gave you a very good overview, but I also wanted to spend a moment to talk about Live Processes, which was developed in the same spirit as what you just saw. Processes are what are doing the actual work of your application; they are what is literally being contained within containers. And by monitoring all the processes across your deployment, we go further in providing you with that fine granularity. By enriching the data we already expose in Live Containers, we can peek into each container. Containers are often running just a single process, but in our experience, that is not always true, even where you expect it to be.
Moreover, migrations of existing applications to containerized environments often start with a lift-and-shift operation before the months and years of work are spent breaking older applications into microservices. We have seen individual containers running more than 30 processes, which, from the process side, we know is almost half a virtual machine. So here, you can see that we enrich the container data and allow you to inspect an individual container, in this case, I think, it’s a Mongo sidecar, and see the process tree inside. You can also…let’s see.
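A process tree inside a container can be rebuilt from nothing more than each process’s PID and parent PID. The sketch below uses a hypothetical process list for a single container and is only meant to show the shape of the data.

```python
from collections import defaultdict

# Hypothetical processes in one container: (pid, parent pid, command).
PROCESSES = [
    (1, 0, "supervisord"),
    (10, 1, "mongod"),
    (11, 1, "sidecar"),
    (20, 11, "sidecar-worker"),
]


def render_tree(processes, root_ppid=0):
    """Return indented lines showing each process under its parent."""
    children = defaultdict(list)
    for pid, ppid, cmd in processes:
        children[ppid].append((pid, cmd))

    lines = []

    def walk(ppid, depth):
        # Visit children in PID order, indenting two spaces per level.
        for pid, cmd in sorted(children[ppid]):
            lines.append("  " * depth + "{} {}".format(pid, cmd))
            walk(pid, depth + 1)

    walk(root_ppid, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(PROCESSES)))
```

Even a container that is expected to run a single process can turn out to have a tree like this one.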
Yeah, the application of Live Processes is not limited to containerized environments. We expose every process on every host across your infrastructure and include their arguments and other metadata. Like Live Containers, this takes advantage of the full tagging capabilities of Datadog, and you can do all those pivots and filters and searches as well.
So, for this, it’s available to anybody in public beta starting today. And, please, email me there and I’ll enable it for you. I’ll leave that up for a little bit just so you can see the email address and whatever. Great.