Installing Datadog on Mesos with DC/OS
Apache Mesos, as its developers tout, “abstracts the entire data center into a single pool of computing resources, simplifying running distributed systems at scale.” And that scale can be tremendous, as proven by prominent adopters such as Twitter, Airbnb, Netflix, and Apple (for Siri).
But Mesos is not the simplest system to deploy, as even its developers acknowledge. As a bare-bones kernel, it requires engineers to find and configure their own compatible solutions for service discovery, load balancing, monitoring, and other key tasks. That could mean more tinkering than you’re comfortable with, or a larger team than you have budget for.
To the rescue comes DC/OS (for “Datacenter Operating System”). As the name suggests, it’s a full-fledged operating system for Mesos clusters, bundled with compatible technologies to handle all the above tasks and more. Plus, it’s easy to install and configure, even for small teams.
Autodesk, the developer of entertainment and design software such as AutoCAD, wrote here about how DC/OS greatly sped up their rollout of a complete Mesos solution. In their case, just a single engineer was able to quickly deploy key cloud services across multiple regions, while reducing costs, data center resources, and build times.
Datadog + DC/OS
Datadog is a popular choice for monitoring Mesos clusters because it’s easy to deploy at scale, and it’s designed to collect and aggregate data from distributed infrastructure components.
As we’ll show here, DC/OS makes it even easier to deploy the Datadog Agent across your Mesos cluster, and out-of-the-box integrations for Mesos, Docker, and related services let you view your cluster’s many metrics at whatever level of granularity you need.
Installing Datadog: Agent nodes first
Mesos clusters are composed of master and agent nodes (both of which are explained here, along with other key Mesos concepts). We’ll start by installing Datadog on the agent nodes.
As you can see below, the DC/OS package universe already includes Datadog, greatly simplifying the installation process.
Just log into your master node via the DC/OS web UI and click on the Universe tab. Find the “datadog” package and click the Install button to roll out the Datadog Agent across all agent nodes.
Aside from the default values provided, the installer needs only your Datadog API key and the number of agent nodes in your cluster (which you can determine by clicking the Nodes tab on the left side of the DC/OS web UI).
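If you prefer the command line to the web UI, roughly the same install can be sketched with the DC/OS CLI. This is a minimal sketch, not the package's documented procedure: the option names (`api_key` and `instances` under a `datadog` key), the file name, and the placeholder values are all assumptions, so check them against the output of `dcos package describe datadog --config` before relying on it.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute your own API key and agent-node count.
DD_API_KEY="${DD_API_KEY:-your_api_key_here}"
AGENT_NODE_COUNT="${AGENT_NODE_COUNT:-4}"
OPTIONS_FILE="${OPTIONS_FILE:-datadog-options.json}"

# Assumption: the package reads these option names. Verify with
# `dcos package describe datadog --config`.
cat > "$OPTIONS_FILE" <<EOF
{
  "datadog": {
    "api_key": "${DD_API_KEY}",
    "instances": ${AGENT_NODE_COUNT}
  }
}
EOF

# Only attempt the install when the DC/OS CLI is available on this machine.
if command -v dcos >/dev/null 2>&1; then
  dcos package install datadog --options="$OPTIONS_FILE" --yes
else
  echo "dcos CLI not found; wrote $OPTIONS_FILE only"
fi
```

Keeping the options in a file makes the install repeatable, which helps if you rebuild test clusters often.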
Afterward, you can inspect the DC/OS service view under the Services tab and confirm that Datadog is running across all agent nodes. You’re now collecting metrics from both Mesos and Docker, which we’ll explore in a moment. But first, let’s expand our monitoring coverage by installing the Datadog Agent on our master nodes as well.
Now for the master nodes
The Datadog package in the DC/OS universe installs the Datadog Agent within a standard Mesos framework. In the Mesos scheme, only agent nodes may execute frameworks (see here for more detail). Therefore, to monitor the cluster’s master nodes, you’ll need to install Datadog as a standalone Docker container on each.
To do so, run the following command, which takes care of both installation and initialization:
docker run -d --name dd-agent \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v /proc/:/host/proc/:ro \
  -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro \
  -e API_KEY=$YOUR_API_KEY \
  -e MESOS_MASTER=yes \
  -e MARATHON_URL=http://leader.mesos:8080 \
  -e SD_BACKEND=docker \
  datadog/docker-dd-agent:latest
That’s it. If the command returns no error, Datadog is now running on that master node. To confirm, run the command below and check that the Agent’s container is listed with a status of “Up”:
docker ps -a | grep dd-agent
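For a check that is easier to script, the sketch below reads the container's status directly and reports whether it starts with "Up". The `is_up` helper is a hypothetical name introduced for illustration; `--filter` and `--format` are standard `docker ps` options, and the container name `dd-agent` matches the `docker run` command above.

```shell
#!/bin/sh
# Report whether a status string (the STATUS column of `docker ps`)
# describes a running container.
is_up() {
  case "$1" in
    Up*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Look up the status of the container named dd-agent.
# Empty output means no such container exists on this host.
if command -v docker >/dev/null 2>&1; then
  status=$(docker ps -a --filter "name=dd-agent" --format '{{.Status}}' | head -n 1)
  if is_up "$status"; then
    echo "dd-agent is running: $status"
  else
    echo "dd-agent is not running (status: ${status:-not found})"
  fi
fi
```

Because `docker ps -a` also lists exited containers, checking the status string is a little stricter than checking for a grep match alone.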
Note that we’ve passed four environment variables in the example above: a Datadog API key, a flag indicating that the node is a master, the URL to the Marathon leader (which is usually the same as shown above), and SD_BACKEND=docker, which enables the Autodiscovery feature described later in this post. The MESOS_MASTER=yes option tells Datadog to look for metrics from the specialized set of services that run on a master node: Mesos and Docker (same as on agent nodes), plus Marathon and ZooKeeper. Marathon is the Mesos framework responsible for launching long-running processes across agent nodes; ZooKeeper is a distributed coordination service that handles leader elections on your DC/OS cluster.
With the Datadog Agent now installed on all Mesos nodes, you can start monitoring all your cluster metrics.
Bring up your template dashboards
In your list of available dashboards, you’ll see an out-of-the-box dashboard built specifically for monitoring Mesos clusters (if you don’t, just click the Install Integration button in the Mesos integration tile). You’ll also see resource metrics from your containers populating the pre-built Docker dashboard, and metrics from your Mesos masters flowing into the ZooKeeper dashboard. Once you enable additional integrations (such as NGINX, Apache, or MySQL), you will have access to dashboards specific to those technologies as well.
The Mesos dashboard, as shown below, gives you a handy, high-level view of all that data center abstraction we mentioned above. You can see at a glance, for example, that our cluster has three master nodes and four agent nodes, and what the aggregate totals are for CPU, disk usage, RAM, etc.
The Docker dashboard provides a slightly more granular view, breaking down resource consumption by Docker images and containers.
Enable additional integrations
Once you’ve set up Datadog on your cluster, you’ll want to start collecting metrics from the services and applications running on it. Datadog has built-in integrations with more than 200 technologies, so you can start monitoring your entire stack quickly.
Automated monitoring with Autodiscovery
When you install the Datadog package on DC/OS, or use the instructions above for your master nodes, Autodiscovery is enabled by default. This feature automatically detects which containerized services are running on which nodes, and configures the Datadog Agent on each node to collect metrics from those services.
For certain services, such as Redis, Elasticsearch, and Apache server (httpd), the Datadog Agent comes with auto-configuration templates that will work in most cases. (See the Autodiscovery docs for the full list.) If you’re running one of those services, you should start seeing metrics in Datadog within minutes.
Monitoring additional services with custom configs
For services that demand a bit more specificity in their configuration (for instance, a database that requires a password to access metrics), you can create monitoring configuration templates that Datadog will apply whenever it detects the service running on a node in your cluster. For production use, you will likely want to use a key-value store such as etcd or Consul to manage these templates for you. To demonstrate the basics of Autodiscovery, though, here we’ll walk through a simple example using configuration templates stored in a directory on the host that is mounted by the Datadog Agent container.
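As a brief aside on the key-value option, here is a rough sketch of what storing a template in etcd might look like. It assumes the Agent reads templates from keys under `/datadog/check_configs/<image>/` and that your `etcdctl` speaks the v2 API (the `set` subcommand). The key layout, the `%%host%%` template variable, and the placeholder password are assumptions to verify against the Autodiscovery docs for your Agent version.

```shell
#!/bin/sh
# Assumed key prefix for Autodiscovery templates; "mysql" is the Docker image name.
PREFIX="/datadog/check_configs/mysql"

# Each value is a JSON list; entries at the same index belong together.
CHECK_NAMES='["mysql"]'
INIT_CONFIGS='[{}]'
INSTANCES='[{"server": "%%host%%", "user": "datadog", "pass": "password_for_datadog_user_in_db"}]'

# Write the template only when etcdctl (v2 API) is installed.
if command -v etcdctl >/dev/null 2>&1; then
  etcdctl set "$PREFIX/check_names"  "$CHECK_NAMES"
  etcdctl set "$PREFIX/init_configs" "$INIT_CONFIGS"
  etcdctl set "$PREFIX/instances"    "$INSTANCES"
else
  echo "etcdctl not found; would set keys under $PREFIX"
fi
```

The Agent container would also need to be pointed at the store, which the Docker image handles through environment variables; the Autodiscovery docs list the exact names.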
Create a config template
On your agent nodes, create a directory to house your custom configs and create a YAML file for the service you want to monitor (drawing on the example YAML templates that ship with the Datadog Agent). In this example we’ll set up Autodiscovery for MySQL, using credentials for a datadog user that has the necessary permissions to access database metrics:
mkdir -p /opt/dd-agent-conf.d/auto_conf/
touch /opt/dd-agent-conf.d/auto_conf/mysql.yaml
Open your newly created mysql.yaml file and paste in the basic config template from the Datadog Agent, then add a docker_images section at the top to tell Datadog which Docker images this template applies to:
docker_images:
  - mysql

init_config:

instances:
  - server: 172.17.0.1
    user: datadog
    pass: password_for_datadog_user_in_db
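The datadog database user referenced in the template has to exist before the check can connect. A minimal sketch of creating it follows, assuming a MySQL 5.7 server and root access through the `mysql` client. The grants shown are the ones commonly recommended for the Datadog MySQL check, but treat the exact list, the `'%'` host wildcard, the file name, and the placeholder password as assumptions to adapt.

```shell
#!/bin/sh
SQL_FILE="${SQL_FILE:-create-datadog-user.sql}"

# Write the statements to a file so they can be reviewed before running.
# The '%' host lets the Agent connect from its container; tighten it if you can.
cat > "$SQL_FILE" <<'EOF'
CREATE USER 'datadog'@'%' IDENTIFIED BY 'password_for_datadog_user_in_db';
GRANT REPLICATION CLIENT ON *.* TO 'datadog'@'%';
GRANT PROCESS ON *.* TO 'datadog'@'%';
GRANT SELECT ON performance_schema.* TO 'datadog'@'%';
EOF

echo "Wrote $SQL_FILE; review it, then run: mysql -u root -p < $SQL_FILE"
```

The script deliberately stops short of executing the SQL, since creating database users is worth a manual review.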
Apply the config template
In order to apply your newly created config template, you need to mount the config files on the DC/OS agent node into the Datadog container running on that host.
You can do that easily in the DC/OS web UI by adding a new volume to the Datadog service. From the Services list, select Datadog, and then click the Edit button to modify the service parameters. Under “Volumes,” add a volume with the following parameters:
- Container path:
- Host path:
- Mode: Read Only
Deploy the changes to the Datadog service. You can then verify that the configuration is correct by running the Datadog info command on your agent node. For MySQL the command and output should look something like the snippet below:
[agent-01]# docker ps
CONTAINER ID   IMAGE                               COMMAND                  CREATED          STATUS          PORTS                                                                  NAMES
7bc168abea39   datadog/docker-dd-agent:11.0.5112   "/entrypoint.sh super"   10 minutes ago   Up 10 minutes   7777/tcp, 8126/tcp, 0.0.0.0:8125->8125/udp, 0.0.0.0:9001->9001/tcp   mesos-xxx
22bb4ef7de7e   mysql:5.7.12                        "docker-entrypoint.sh"   22 minutes ago   Up 22 minutes   0.0.0.0:3306->3306/tcp                                                 mesos-xxx

[agent-01]# docker exec 7bc168abea39 service datadog-agent info
...
    mysql
    -----
      - instance #0 [OK]
      - Collected 58 metrics, 0 events & 1 service check
      - Dependencies:
          - pymysql: 0.6.6.None

    mesos_slave
    -----------
      - instance #0 [OK]
      - Collected 34 metrics, 0 events & 1 service check

    ntp
    ---
      - Collected 0 metrics, 0 events & 0 service checks

    disk
    ----
      - instance #0 [OK]
      - Collected 40 metrics, 0 events & 0 service checks

    docker_daemon
    -------------
      - instance #0 [OK]
      - Collected 64 metrics, 0 events & 1 service check
...
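On a node running many checks, scanning that output by eye gets tedious. The sketch below filters it down to the checks whose instance status is anything other than [OK]. The `failed_checks` function is a hypothetical helper, and it assumes the output keeps the layout shown above (a check name, a line of dashes beneath it, then "instance #N [STATUS]" lines).

```shell
#!/bin/sh
# Print the name of every check that has an instance whose status is not [OK].
# Reads the output of `service datadog-agent info` on stdin.
failed_checks() {
  awk '
    # A line made only of dashes underlines the check name on the previous line.
    /^[ \t]*-+[ \t]*$/ { name = prev }
    # An instance line without [OK] marks the current check as failing.
    /instance #[0-9]+ \[/ && !/\[OK\]/ { print name }
    # Remember this line, trimmed, in case the next line is an underline.
    { gsub(/^[ \t]+|[ \t]+$/, ""); prev = $0 }
  '
}

# Usage on an agent node (the container ID is the one from the sample output):
#   docker exec 7bc168abea39 service datadog-agent info | failed_checks
```

No output means every instance reported [OK]; checks such as ntp that print no instance line are simply skipped.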
Now that you’re collecting metrics from your containerized services along with the underlying infrastructure, you can set sophisticated alerts, correlate metrics between systems, and build comprehensive dashboards that bring together data from all your core systems. You can also conduct open-ended exploration of your metrics and infrastructure to surface new insights.
Survey your infrastructure
The Host Map in Datadog lets you explore your cluster from both high and low altitudes, and quickly determine which services are running, whether resource consumption is within expected limits, and whether any segment of your infrastructure is overloaded.
For example, you can view all the nodes in the cluster, both agents and masters. (The latter group, as you can see in the screenshot, are running the extra, masterly services: Marathon and ZooKeeper.)
If you want to focus on only the master nodes, you can take advantage of Datadog’s extensive tagging support and filter only for
By default, Host Map nodes are color-coded by CPU usage. So if a single master node in your cluster began to consume excessive CPU cycles, you would see it painted red, and you might then look for a hardware failure or rogue process on that node. (And if many master nodes crept into the red, you might suspect that the cluster is under-provisioned and consider deploying more nodes.)
You can also color (or even size) the nodes by any available metric, whether from Mesos or from a Mesos-related component like Marathon or ZooKeeper.
Get some fresh air
Datadog alerts can also watch your cluster for you while you do things other than squint into bright screens. You can set a trigger on any metric, using fixed or dynamic thresholds, and receive instant notification via email or through third-party services like PagerDuty, Slack, and HipChat.
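Alerts can be created in the Datadog UI, but as an illustration of a fixed-threshold trigger, here is a hedged sketch that defines one through the Datadog HTTP API. The monitor name, the 90 percent threshold, the choice of `system.cpu.user` as the metric, and the `DD_API_KEY` / `DD_APP_KEY` variable names are assumptions made for the example; adjust the query to whichever metric matters for your cluster.

```shell
#!/bin/sh
PAYLOAD_FILE="${PAYLOAD_FILE:-cpu-monitor.json}"

# Monitor definition: alert when a host's 5-minute average CPU exceeds 90%.
cat > "$PAYLOAD_FILE" <<'EOF'
{
  "name": "High CPU on a cluster node",
  "type": "metric alert",
  "query": "avg(last_5m):avg:system.cpu.user{*} by {host} > 90",
  "message": "CPU usage is above 90% on {{host.name}}."
}
EOF

# Call the API only when both keys and curl are available.
if [ -n "${DD_API_KEY:-}" ] && [ -n "${DD_APP_KEY:-}" ] && command -v curl >/dev/null 2>&1; then
  curl -sS -X POST "https://api.datadoghq.com/api/v1/monitor" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    -d @"$PAYLOAD_FILE"
else
  echo "Keys or curl missing; wrote $PAYLOAD_FILE without calling the API"
fi
```

Keeping monitor definitions in files like this lets you version them alongside the rest of your cluster configuration.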
Time to get your hands dirty
We’ve only just touched on the many capabilities of Datadog for DC/OS cluster monitoring, but it’s more fun anyway to try them out for yourself.
As a first step, sign up here for a free 14-day Datadog trial.
Then, if you’re not using DC/OS yet, the getting started guide here explains how to spin up a simple test cluster in just 30 minutes or so on AWS. If you want to get started even faster, head here to try spinning up a pre-configured cluster from an AWS CloudFormation template.
Once your cluster is up, pop Datadog on your master and agent nodes and start exploring the rich stream of cluster metrics now flowing into your Datadog account.