Monitor Amazon Elasticsearch Service with Datadog

Emily Chang

Editor’s note: Amazon Web Services uses the term “master” to describe its architecture and certain metric names. Datadog does not use this term. In this blog post, we use “primary” instead, except where we must reference a specific metric name for the sake of clarity.

Amazon Elasticsearch Service is a managed service that helps users configure, deploy, and maintain their Elasticsearch clusters in AWS cloud environments.

With Datadog’s Amazon Elasticsearch Service integration, you can automatically collect key metrics, visualize them in an out-of-the-box dashboard like the one shown below, and get alerts that notify you of resource shortages or performance issues. You can also connect the Datadog Agent to your hosted Elasticsearch cluster to pull in additional metrics beyond those provided by AWS CloudWatch.

Datadog’s out-of-the-box Amazon Elasticsearch Service dashboard

Amazon Elasticsearch Service overview

Elasticsearch is a distributed document store that is designed for horizontal scalability and availability. Amazon Elasticsearch Service (Amazon ES) allows users to provision and configure AWS resources (such as EC2 instances and EBS volumes) to support an Elasticsearch cluster. Amazon ES resources are oriented around the concept of the domain, which can be interpreted as an Elasticsearch cluster.

The service sets certain limitations and default settings to help simplify the process of managing your clusters. It is also designed with a number of convenient features, some of which are highlighted below.

Automated data backups

As mentioned in our Elasticsearch monitoring guide, it is a good idea to regularly back up your clusters with the snapshot and restore module. Amazon ES automates this process by taking daily snapshots of your cluster and storing them for two weeks in S3.

Availability

Amazon ES helps you capitalize on Elasticsearch’s distributed design through the “zone awareness” feature. If you enable zone awareness on a domain, the service will automatically distribute the instances across multiple availability zones. The primary node will also assign replica shards across instances and zones as much as possible. If the service detects that a node has failed, it will automatically provision a new instance and restore the data from a backup snapshot if possible.

Monitoring

Amazon ES exposes metrics through CloudWatch, so that you can monitor the performance of your clusters and make adjustments as needed. Datadog’s integration enables you to correlate those performance metrics with metrics from other parts of your infrastructure.

Amazon ES metrics to (Cloud)Watch

Datadog’s Amazon Elasticsearch Service integration enables you to collect, visualize, and alert on key metrics, including:

cluster status (green, yellow, or red)
minimum amount of free storage space on a single data node
maximum JVM heap usage on a single node across a cluster/domain
maximum CPU utilization across your data nodes or dedicated primary nodes
read/write latency, throughput, and IOPS (I/O operations per second) on EBS volumes

Amazon ES read and write latency metrics

As you can see in the screenshot above, each Amazon ES metric is tagged in Datadog with its domain name. Other tags inherited from Amazon ES include elasticsearch_version and region, as well as two tags that take a true/false value: dedicated_master_enabled and zone_awareness_enabled.

Using these tags, you can slice and dice data to view specific aspects of your domains, such as max CPU utilization across dedicated primary nodes, or read/write throughput broken down by domain. The integration will also pull in custom tags created in CloudWatch, as well as any custom tags that you apply through the Datadog AWS integration tile.
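To make the idea of slicing by tag concrete, here is a minimal Python sketch that groups metric points by one tag and filters on another. It does not call the Datadog API. The points structure, the max_by_tag helper, the sample values, and the domainname tag key are illustrative assumptions; region and dedicated_master_enabled are the tag names mentioned above.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical shape for an already-fetched metric point: (tags, value).
Point = Tuple[Dict[str, str], float]


def max_by_tag(
    points: Iterable[Point],
    group_key: str,
    where: Optional[Dict[str, str]] = None,
) -> Dict[str, float]:
    """Return the maximum value per distinct value of group_key.

    Only points whose tags match every key/value pair in `where` are kept.
    Points that lack the group_key tag are skipped.
    """
    where = where or {}
    result: Dict[str, float] = {}
    for tags, value in points:
        if any(tags.get(k) != v for k, v in where.items()):
            continue
        group = tags.get(group_key)
        if group is None:
            continue
        result[group] = max(value, result.get(group, value))
    return result


# Made-up CPU utilization samples; "domainname" is an assumed tag key.
cpu_points = [
    ({"domainname": "logs", "region": "us-east-1",
      "dedicated_master_enabled": "true"}, 41.0),
    ({"domainname": "logs", "region": "us-east-1",
      "dedicated_master_enabled": "true"}, 67.5),
    ({"domainname": "search", "region": "us-west-2",
      "dedicated_master_enabled": "false"}, 88.0),
]

by_domain = max_by_tag(cpu_points, "domainname")
dedicated_only = max_by_tag(
    cpu_points, "domainname", where={"dedicated_master_enabled": "true"}
)
```

In Datadog itself, this kind of grouping and filtering is done in the graph or monitor query rather than in your own code; the sketch only illustrates what the tags make possible.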

Get notified about potential issues

When you run Elasticsearch in production, you may eventually run into performance and scaling issues. Setting up Datadog alerts can help you respond to performance degradations and resource shortages before they become more serious. For example, you may want to get notified when:

maximum CPU utilization is consistently high on any node (remedial action: upgrade to a larger instance type or add more instances to distribute the workload)
minimum free storage space on any data node dips below an acceptable threshold (remedial action: increase the EBS volume size or the number of data nodes in your cluster)
85 percent of heap is consistently in use on any Amazon ES instance (remedial action: add more instances or upgrade to instances with more memory)
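The three conditions above can be sketched as plain Python checks over a window of recent samples. This is only an illustration of the logic, not how Datadog monitors are implemented. The function names, the CPU threshold (90 percent), and the storage threshold (20,000 MB) are assumptions chosen for the example; only the 85 percent heap figure comes from the list above.

```python
from typing import List, Sequence


def consistently_at_or_above(samples: Sequence[float], threshold: float) -> bool:
    """True when there is at least one sample and none is below the threshold."""
    return len(samples) > 0 and all(s >= threshold for s in samples)


def dipped_below(samples: Sequence[float], threshold: float) -> bool:
    """True when any sample in the window is below the threshold."""
    return any(s < threshold for s in samples)


def conditions_to_alert_on(
    cpu_max_pct: Sequence[float],
    free_storage_min_mb: Sequence[float],
    heap_max_pct: Sequence[float],
    cpu_threshold_pct: float = 90.0,       # assumed value, tune for your workload
    storage_threshold_mb: float = 20000.0,  # assumed value, tune for your volumes
    heap_threshold_pct: float = 85.0,       # figure taken from the list above
) -> List[str]:
    """Return the names of the conditions that hold for the given windows."""
    triggered = []
    if consistently_at_or_above(cpu_max_pct, cpu_threshold_pct):
        triggered.append("cpu")
    if dipped_below(free_storage_min_mb, storage_threshold_mb):
        triggered.append("storage")
    if consistently_at_or_above(heap_max_pct, heap_threshold_pct):
        triggered.append("heap")
    return triggered
```

Requiring every sample in the window to breach the threshold is one simple way to express "consistently"; a real alert might instead average over the window or tolerate a few outliers.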

Increased visibility with the Datadog Agent

Although Amazon ES exposes basic metrics through CloudWatch, you can access even more Elasticsearch metrics (including garbage collection frequency, refresh latency, and flush latency) with the Datadog Agent.

Because Amazon ES does not allow you to directly install the Agent on any of your nodes, you will need to install the Agent on another host and point it at your Amazon ES domain endpoint (remember to start the endpoint with http://). You can find out the name of your endpoint by navigating to your domain in the AWS console. Also, make sure that your cluster has an access policy that makes it accessible to the Agent.
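Since it is easy to paste the endpoint without its scheme, a small helper can normalize it before it goes into the Agent configuration. The sketch below uses only the Python standard library and makes no network calls. The agent_url and cluster_health_url helpers are hypothetical names introduced for this example; /_cluster/health is Elasticsearch's cluster health API path, which you could request from the Agent's host to confirm that the access policy lets it through.

```python
from urllib.parse import urlsplit


def agent_url(endpoint: str, scheme: str = "http") -> str:
    """Return the endpoint with a scheme prefix and no trailing slash."""
    cleaned = endpoint.strip().rstrip("/")
    if "://" not in cleaned:
        cleaned = f"{scheme}://{cleaned}"
    if not urlsplit(cleaned).hostname:
        raise ValueError(f"no hostname found in endpoint: {endpoint!r}")
    return cleaned


def cluster_health_url(endpoint: str) -> str:
    """Build the URL of the cluster health API for the given endpoint."""
    return agent_url(endpoint) + "/_cluster/health"


# Placeholder endpoint in the same shape as the example in this post.
example = agent_url("search-domainname-domainid.us-east-1.es.amazonaws.com")
```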

In the Agent’s elastic.yaml configuration file, make sure to set cluster_stats to true and pending_task_stats to false (Amazon ES does not provide access to the Pending Tasks API). Here’s an example of what your configuration file might look like:

init_config:

instances:
  - url: http://<YOUR_AWS_ES_ENDPOINT> # e.g. http://search-domainname-domainid.us-east-1.es.amazonaws.com
    cluster_stats: true
    pending_task_stats: false
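If you generate or review this configuration programmatically, a quick sanity check on the two settings can catch mistakes before the Agent restarts. The sketch below assumes the instance entry has already been loaded into a Python dictionary (for example with a YAML parser of your choice); check_instance is a hypothetical helper, not part of the Datadog Agent.

```python
from typing import Any, Dict, List


def check_instance(instance: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one Elasticsearch check instance."""
    problems = []
    if not instance.get("url"):
        problems.append("url is missing")
    # Amazon ES needs cluster_stats enabled, as described above.
    if instance.get("cluster_stats") is not True:
        problems.append("cluster_stats should be true")
    # Amazon ES does not provide access to the Pending Tasks API.
    if instance.get("pending_task_stats") is not False:
        problems.append("pending_task_stats should be false")
    return problems


good = {
    "url": "http://search-domainname-domainid.us-east-1.es.amazonaws.com",
    "cluster_stats": True,
    "pending_task_stats": False,
}
```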

Get started

If you’re already using Datadog’s main AWS integration, you can start monitoring Amazon Elasticsearch Service by checking the “ES” box under “Limit metric collection” in the AWS integration tile and granting your Datadog role/user the required permissions.

If you’re not yet using Datadog, you can sign up for a free trial.
