
How to monitor Elasticsearch with Datadog

Published: September 26, 2016

This post is part 3 of a 4-part series on monitoring Elasticsearch performance. Part 1 provides an overview of Elasticsearch and its key performance metrics, Part 2 explains how to collect these metrics, and Part 4 describes how to solve five common Elasticsearch problems.

If you’ve read our post on collecting Elasticsearch metrics, you already know that the Elasticsearch APIs offer a quick snapshot of performance metrics at any given moment. To truly understand performance, however, you need to track these metrics over time and monitor them in context with the rest of your infrastructure.

[Image: Datadog's out-of-the-box Elasticsearch dashboard]

This post will show you how to set up Datadog to automatically collect the key metrics discussed in Part 1 of this series. We’ll also show you how to set alerts and use tags to effectively monitor your clusters by focusing on the metrics that matter most to you.

Set up Datadog to fetch Elasticsearch metrics

Datadog’s integration enables you to automatically collect, tag, and graph all of the performance metrics covered in Part 1, and correlate that data with the rest of your infrastructure.

Install the Datadog Agent

The Datadog Agent is open source software that collects and reports metrics from each of your nodes, so you can view and monitor them in one place. Installing the Agent usually only takes a single command. View installation instructions for various platforms here. You can also install the Agent automatically with configuration management tools like Chef or Puppet.

Configure the Agent

After you have installed the Agent, it’s time to create your integration configuration file. In your Agent configuration directory, you should see a sample Elasticsearch config file named elastic.yaml.example. Make a copy of the file in the same directory and save it as elastic.yaml.

Modify elastic.yaml with your instance URL, and set pshard_stats to true if you wish to collect metrics specific to your primary shards, which are prefixed with elasticsearch.primaries. For example, elasticsearch.primaries.docs.count tells you the document count across all primary shards, whereas elasticsearch.docs.count is the total document count across all primary and replica shards. In the example configuration file below, we’ve indicated that we want to collect primary shard metrics. We have also added a custom tag, elasticsearch-role:data-node, to indicate that this is a data node.

# elastic.yaml

instances:
  - url: http://localhost:9200
    # username: username
    # password: password
    # cluster_stats: false
    pshard_stats: true
    # pending_task_stats: true
    # ssl_verify: false
    # ssl_cert: /path/to/cert.pem
    # ssl_key: /path/to/cert.key
    tags:
      - 'elasticsearch-role:data-node'
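To see where the primaries-only metrics come from, here is a minimal sketch of the distinction Elasticsearch itself draws in its `_stats` API output, which underlies the difference between `elasticsearch.primaries.docs.count` and `elasticsearch.docs.count`. The response below is a trimmed, invented example; a real query would fetch `<your-es-url>/_stats` over HTTP.

```python
# Sketch: primary-shard doc counts vs. totals, as reported by
# Elasticsearch's /_stats endpoint. Sample numbers are invented
# for illustration only.
sample_stats = {
    "_all": {
        "primaries": {"docs": {"count": 1000}},   # primary shards only
        "total": {"docs": {"count": 2000}},       # primaries + replicas
    }
}

primary_docs = sample_stats["_all"]["primaries"]["docs"]["count"]
total_docs = sample_stats["_all"]["total"]["docs"]["count"]

# With one replica per primary, the total is roughly double
# the primary-only count.
replica_docs = total_docs - primary_docs
print(primary_docs, total_docs, replica_docs)
```

In a healthy cluster with one replica configured, a large divergence from this 2:1 ratio can hint at unassigned replica shards.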

Save your changes, then verify that the integration is properly configured by restarting the Agent and running the Datadog info command. If everything is working properly, you should see an elastic section in the output, similar to the following:

    elastic
    -------
      - instance #0 [OK]
      - Collected 142 metrics, 0 events & 3 service checks

The last step is to navigate to Elasticsearch’s integration tile in the Datadog App and click on the Install Integration button under the “Configuration” tab. Once the Agent is up and running, you should see your hosts reporting metrics in Datadog, as shown below:

[Image: Elasticsearch integration tile]

Dig into the metrics!

Once the Agent is configured on your nodes, you should see an Elasticsearch overview screenboard among your list of available dashboards.

Datadog’s out-of-the-box dashboard displays many of the key performance metrics presented in Part 1 and is a great starting point to gain more visibility into your clusters. You may want to clone and customize it by adding system-level metrics from your nodes, like I/O utilization, CPU, and memory usage, as well as metrics from other elements of your infrastructure.

Tag your metrics

In addition to any tags assigned or inherited from your nodes’ other integrations (e.g. Chef role, AWS availability-zone, etc.), the Agent will automatically tag your Elasticsearch metrics with host and url. Starting in Agent 5.9.0, Datadog also tags your Elasticsearch metrics with cluster_name and node_name, which are pulled from cluster.name and node.name in the node’s Elasticsearch configuration file (located in elasticsearch/config). (Note: If you do not provide a cluster.name, it will default to elasticsearch.)

You can also add your own custom tags in the elastic.yaml file, such as the node type and environment, in order to slice and dice your metrics and alert on them accordingly.

For example, if your cluster includes dedicated master, data, and client nodes, you may want to create an elasticsearch-role tag for each type of node in the elastic.yaml configuration file. You can then use these tags in Datadog to view and alert on metrics from only one type of node at a time.
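As a sketch of that approach, the config on a dedicated master node might carry its own role tag (the `elasticsearch-role` tag name and its values are just the convention used in the example above, not anything built into Datadog):

```yaml
# elastic.yaml on a dedicated master node (sketch)
instances:
  - url: http://localhost:9200
    tags:
      - 'elasticsearch-role:master-node'
```

Repeating this pattern on data and client nodes, with the matching tag value, lets you scope any graph or alert to a single node type.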

Tag, you’re (alerting) it

Now that you’ve finished tagging your nodes, you can set up smarter, targeted Elasticsearch alerts to watch over your metrics and notify the appropriate people when issues arise. In the screenshot below, we set up an alert to notify team members when any data node (tagged with elasticsearch-role:data-node in this case) starts running out of disk space. The elasticsearch-role tag is quite useful for this alert—we can exclude dedicated master-eligible nodes, which don’t store any data.
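As a rough sketch of what such an alert looks like under the hood, the function below builds the kind of monitor definition you could submit to Datadog's monitor API. The query uses standard Datadog monitor syntax; the 85 percent threshold, the monitor name, and the message text are illustrative assumptions, and the `elasticsearch-role:data-node` tag is the custom tag from the configuration example above.

```python
def disk_space_monitor(role="elasticsearch-role:data-node", threshold=0.85):
    """Build a Datadog metric-alert definition for nodes low on disk.

    The query averages disk usage over the last 5 minutes, scoped by the
    custom role tag so that master-eligible nodes (which store no data)
    are excluded. Threshold and message are illustrative assumptions.
    """
    query = (
        "avg(last_5m):avg:system.disk.in_use{%s} by {host} > %s"
        % (role, threshold)
    )
    return {
        "type": "metric alert",
        "query": query,
        "name": "Data node running out of disk space",
        # {{host.name}} is a Datadog message template variable.
        "message": "Disk usage is above %d%% on {{host.name}}."
        % int(threshold * 100),
    }

monitor = disk_space_monitor()
print(monitor["query"])
```

Submitting this payload (for example, via the official `datadog` Python client's `api.Monitor.create`) would create the same alert shown in the screenshot, triggering per host only for hosts carrying the data-node tag.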

[Image: Disk space monitor for data nodes]

Other useful alert triggers include long garbage collection times and search latency thresholds. You might also want to set up an Elasticsearch integration status check in Datadog to notify you if the Agent on any of your master-eligible nodes has been unable to connect to Elasticsearch in the past five minutes, as shown below:

[Image: Elasticsearch status check monitor]

Start monitoring Elasticsearch with Datadog

In this post, we’ve walked through how to use Datadog to collect, visualize, and alert on your Elasticsearch metrics. If you’ve followed along with your Datadog account, you should now have greater visibility into the state of your clusters and be better prepared to address potential issues. The next part in this series describes how to solve five common Elasticsearch scaling and performance issues.

If you don’t yet have a Datadog account, you can start monitoring Elasticsearch right away with a free trial.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.

Want to write articles like this one? Our team is hiring!