
Monitoring your EKS cluster with Datadog

Author Maxim Brown
Technical Content Writer

Published: April 4, 2019

In Parts 1 and 2 of this series, we saw that key EKS metrics come from several sources, and can be broken down into the following main types:

  • Cluster state metrics from the Kubernetes API
  • Node and container resource metrics from your nodes’ kubelets
  • AWS service metrics from AWS CloudWatch

In this post, we’ll explore how Datadog’s integrations with Kubernetes, Docker, and AWS will let you track the full range of EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like Autodiscovery to track services across containers; sophisticated graphing and alerting options; and full support for AWS services.

In this post, we will cover:

  • Deploying the Datadog Agent to your EKS cluster
  • Autoscaling your EKS cluster with Datadog metrics
  • Enabling Datadog’s AWS integrations
  • Using the full power of Datadog to visualize, analyze, and alert on your EKS data

If you don’t already have a Datadog account but want to follow along and start monitoring your EKS cluster, sign up for a free trial.

Deploy the Agent to your EKS cluster

The Datadog Agent is open source software that collects and forwards metrics, logs, and traces from each of your nodes and the containers running on them.

Once you deploy the Agent, you will have immediate access to the full range of Kubernetes cluster state and resource metrics discussed in Part 1. The Agent will also begin reporting additional system-level metrics from your nodes and containers.

We will go over how to deploy the Datadog Cluster Agent and node-based Datadog Agents across your EKS cluster. Note that the following instructions are tailored to monitoring Amazon EKS pods running on EC2 instances. See our documentation for instructions on deploying the Datadog Agent to any Amazon EKS pods running on AWS Fargate.

The Cluster Agent communicates with the Kubernetes API servers to collect cluster-level information, while the node-based Agents report data from each node’s kubelet.

Datadog Cluster Agent diagram

While it is possible to deploy the Datadog Agent without the Cluster Agent, using the Cluster Agent is recommended as it offers several benefits, particularly for large-scale EKS clusters:

  • It reduces overall load on the Kubernetes API by using a single Cluster Agent as a proxy for querying cluster-level metrics.
  • It provides additional security because only one Agent needs the permissions required to access the API server.
  • It lets you automatically scale your pods using any metric that is collected by Datadog.

You can read more about the Datadog Cluster Agent here.

Before turning to the Agent, however, make sure that you’ve deployed kube-state-metrics. Recall that kube-state-metrics is an add-on service that generates cluster state metrics and exposes them to the Metrics API. After you install the service, Datadog will be able to aggregate these metrics along with other resource and application data.

Deploying the Datadog Cluster Agent

The Datadog Cluster Agent runs on a single node and serves as a proxy between the API servers and the rest of the node-based Agents in your cluster. It also makes it possible to configure Kubernetes’s Horizontal Pod Autoscaling to use any metric that Datadog collects (more on this below).

There are several steps needed to prepare your cluster for the Agent. These involve providing the appropriate permissions to the Cluster Agent and to the node-based Agents so each can access the information it needs. First, we need to configure RBAC permissions, and then create and deploy the Cluster Agent and node-based Agent manifests.

Configure RBAC permissions for the Cluster Agent and node-based Agents

EKS uses AWS IAM for user authentication and access to the cluster, but it relies on Kubernetes role-based access control (RBAC) to authorize calls by those users to the Kubernetes API. So, for both the Cluster Agent and the node-based Agents, we’ll need to set up a service account, a ClusterRole with the necessary RBAC permissions, and then a ClusterRoleBinding that links them so that the service account can use those permissions.

First, create the Cluster Agent’s RBAC file, cluster-agent-rbac.yaml:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: datadog-cluster-agent
rules:
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "autoscaling"
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  resourceNames:
  - datadogtoken             # Kubernetes event collection state
  - datadog-leader-election  # Leader election token
  verbs:
  - get
  - update
- apiGroups:                 # To create the leader election token
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - get
  - update
- nonResourceURLs:
  - "/version"
  - "/healthz"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-agent
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: default
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-cluster-agent
  namespace: default

Next, create the node-based Agent’s RBAC file, datadog-rbac.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
- apiGroups:      # This is required by the agent to query the Kubelet API.
  - ""
  resources:
  - nodes/metrics
  - nodes/spec
  - nodes/proxy   # Required to get /pods
  - nodes/stats # Required to get /stats/summary
  verbs:
  - get
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
- kind: ServiceAccount
  name: datadog-agent
  namespace: default

Deploy both of them:

$ kubectl apply -f /path/to/cluster-agent-rbac.yaml
$ kubectl apply -f /path/to/datadog-rbac.yaml

Confirm you have both ClusterRoles:

# Get a list of all ClusterRoles
$ kubectl get clusterrole
NAME                                                                   AGE
[...]
datadog-agent                                                          1h
datadog-cluster-agent                                                  1h
[...]

Secure communication between node-based Agents and the Cluster Agent

The next step is to ensure that the Cluster Agent and node-based Agents can securely communicate with each other. The best way to do this is by creating a Kubernetes secret. To generate a token to include in the secret, run the following:

$ echo -n '<32_CHARACTER_LONG_STRING>' | base64

Copy the resulting string. Then, create a file, dca-secret.yaml, with the following:

apiVersion: v1
kind: Secret
metadata:
  name: datadog-auth-token
type: Opaque
data:
  token: <TOKEN>

Replace <TOKEN> with the string from the previous step. Then create the secret:

$ kubectl apply -f /path/to/dca-secret.yaml

You can confirm that the secret was created with the following:

# Get a list of all secrets
$ kubectl get secret
NAME                        TYPE                                  DATA      AGE
datadog-auth-token          Opaque                                1         21h

Now that we have a secret in Kubernetes, we can include it in our Cluster Agent and node-based Agent manifests so that they can securely communicate with each other.

Create and deploy the Cluster Agent manifest

To deploy the Cluster Agent, create a manifest, datadog-cluster-agent.yaml, which creates the Datadog Cluster Agent Deployment and Service, links them to the Cluster Agent service account we deployed above, and points to the newly created secret:

datadog-cluster-agent.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: default
spec:
  selector:
      matchLabels:
        app: datadog-cluster-agent
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
      name: datadog-cluster-agent
    spec:
      serviceAccountName: datadog-cluster-agent
      containers:
      - image: public.ecr.aws/datadog/cluster-agent:latest
        imagePullPolicy: Always
        name: datadog-cluster-agent
        env:
          - name: DD_API_KEY
            value: <YOUR_API_KEY>
          - name: DD_APP_KEY		# Optional
            value: <YOUR_APP_KEY>	# Optional
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: DD_LEADER_ELECTION
            value: "true"
          - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
            value: "true"
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-auth-token
                key: token
---
apiVersion: v1
kind: Service
metadata:
  name: datadog-cluster-agent
  labels:
    app: datadog-cluster-agent
spec:
  ports:
  - port: 5005 # Has to be the same as the one exposed in the DCA. Default is 5005.
    protocol: TCP
  selector:
    app: datadog-cluster-agent

Make sure to insert your Datadog API key as indicated in the manifest above. You can also include a Datadog app key. Both keys are accessible here in your Datadog account. Providing an app key is necessary if you want to autoscale your EKS applications based on any metric you’re collecting with Datadog. See below for more information on this.

The final environment variable—DD_CLUSTER_AGENT_AUTH_TOKEN—points the Cluster Agent to the datadog-auth-token secret we just created.

Note that the Datadog Cluster Agent is configured as a Deployment and Service, rather than as a DaemonSet, because we’re not installing it on every node. Deploy the Cluster Agent and Service:

$ kubectl apply -f /path/to/datadog-cluster-agent.yaml

Deploy the node-based Agent DaemonSet

The final step is to deploy the node-based Agents as a DaemonSet. We use a DaemonSet here because, unlike the Cluster Agent, we want to deploy the node-based Agent to all of our nodes, including new ones as they are launched. (You can also use nodeSelectors to install it only on a specified subset of nodes.)
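If, for example, you only wanted the Agent on a labeled subset of nodes, a minimal sketch (assuming a hypothetical node label nodegroup: monitored that you have applied to those nodes) would add a nodeSelector to the DaemonSet’s pod template:

spec:
  template:
    spec:
      # Schedule the Agent only on nodes carrying this (example) label
      nodeSelector:
        nodegroup: monitored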

Create a datadog-agent.yaml manifest file (making sure to fill in your Datadog API key):

datadog-agent.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  selector:
      matchLabels:
        app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      serviceAccountName: datadog-agent
      containers:
      - image: public.ecr.aws/datadog/agent:latest
        imagePullPolicy: Always
        name: datadog-agent
        ports:
          - containerPort: 8125
            # hostPort: 8125
            name: dogstatsdport
            protocol: UDP
          - containerPort: 8126
            # hostPort: 8126
            name: traceport
            protocol: TCP
        env:
          - name: DD_API_KEY
            value: "<YOUR_API_KEY>"
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: KUBERNETES
            value: "true"
          - name: DD_KUBERNETES_KUBELET_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: DD_CLUSTER_AGENT_ENABLED
            value: "true"
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-auth-token
                key: token
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
        livenessProbe:
          exec:
            command:
            - ./probe.sh
          initialDelaySeconds: 15
          periodSeconds: 5
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: dockersocket
        - hostPath:
            path: /proc
          name: procdir
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroups

Deploy the node-based Agent:

$ kubectl apply -f /path/to/datadog-agent.yaml

You can use the following kubectl command to verify that your Cluster Agent and node-based Agent pods are running (the -o wide flag includes more details in the output):

# Get running pods with "agent" in their name
$ kubectl get pods -o wide | grep agent
NAME                                    READY     STATUS    RESTARTS   AGE       IP                NODE
datadog-agent-44x9q                     1/1       Running   0          9d        192.168.124.130   ip-192-168-124-130.us-west-2.compute.internal
datadog-agent-8ngmb                     1/1       Running   0          9d        192.168.165.188   ip-192-168-165-188.us-west-2.compute.internal
datadog-agent-gfl98                     1/1       Running   0          9d        192.168.193.120   ip-192-168-193-120.us-west-2.compute.internal
datadog-agent-hbxqh                     1/1       Running   0          9d        192.168.182.134   ip-192-168-182-134.us-west-2.compute.internal
datadog-agent-hqbj2                     1/1       Running   0          9d        192.168.112.185   ip-192-168-112-185.us-west-2.compute.internal
datadog-agent-knd2j                     1/1       Running   0          9d        192.168.195.211   ip-192-168-195-211.us-west-2.compute.internal
datadog-cluster-agent-f797cfb54-b5qs8   1/1       Running   0          7s        192.168.248.206   ip-192-168-195-211.us-west-2.compute.internal

You should see that each node in your cluster is running a datadog-agent replica, and that one node is also running a datadog-cluster-agent pod.

Now that the Agent has been deployed to your cluster, you should see information from your EKS infrastructure automatically flowing into Datadog. This includes system-level metrics from your nodes and containers as well as metrics from Datadog’s Kubernetes and Docker integrations, which you can immediately visualize in a customizable, out-of-the-box dashboard:

EKS monitoring with Datadog Kubernetes dashboard
Visualize your Kubernetes and Docker metrics together in Datadog.

Autoscale your EKS cluster with Datadog metrics

Out of the box, Kubernetes’s Horizontal Pod Autoscaler (HPA) can autoscale a controller’s replica count based on a targeted level of CPU utilization averaged across that controller’s pods. (Note that this cannot apply to a DaemonSet, as a DaemonSet automatically launches a pod on each available node.) As of version 1.10, Kubernetes also supports autoscaling based on custom metrics so long as you have configured an External Metrics Provider resource that can communicate with the Kubernetes API server. The Datadog Cluster Agent can act as an External Metrics Provider, meaning that if you are using the Cluster Agent to monitor your EKS infrastructure, you can deploy an HPA that will autoscale your pods based on any metric collected by Datadog.

In the screenshot below, we’ve deployed an HPA that will monitor requests per second to pods running NGINX across our cluster, averaged by pod. If this metric crosses a threshold, it will spin up new NGINX pods in our cluster until the average requests per second per pod falls below the threshold, or the number of running pods meets the maximum limit we defined in the HPA manifest.
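As a rough sketch of what such an HPA manifest could look like (assuming an existing NGINX Deployment named nginx and the nginx.net.request_per_s metric from Datadog’s NGINX integration; the replica bounds and target value here are illustrative), you could apply something like:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  minReplicas: 1
  maxReplicas: 5                       # maximum number of NGINX pods the HPA may run
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx                        # assumes a Deployment named nginx
  metrics:
  - type: External
    external:
      metricName: nginx.net.request_per_s
      metricSelector:
        matchLabels:
          kube_container_name: nginx
      targetAverageValue: 9            # target average requests per second per pod

The Cluster Agent’s External Metrics Provider (enabled with DD_EXTERNAL_METRICS_PROVIDER_ENABLED in the manifest above) is what serves these metric values to the Kubernetes API for the HPA to act on.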

EKS monitoring with Datadog HPA scaling
Our HPA scales up NGINX pods in response to spikes in requests per second.

Deploying HPAs can help your cluster automatically respond to dynamic workloads by spinning up new pods, for example, to add resource capacity or to distribute requests. Find more information about deploying an HPA using Datadog metrics in our blog post.


Enable Datadog’s AWS integrations

So far, we have covered how to use Datadog to monitor Kubernetes and Docker. But as we discussed in Part 1, that’s only part of the EKS story; you will also want to monitor the performance and health of the various infrastructure components in your cluster that are provisioned from AWS services, such as EBS volumes, ELB load balancers, and others.

Datadog’s AWS integration pulls in CloudWatch metrics and events so that you can visualize and alert on them from a central platform, even if you don’t install the Datadog Agent on your nodes. Datadog automatically collects any tags that you add in AWS as well as metadata about each AWS component. For example, your EC2 metrics will be tagged with the instance type, availability zone, Auto Scaling group, etc. Having these tags in Datadog lets you easily find and drill down to the specific cluster components you are interested in.

Get access to CloudWatch

Datadog needs read-only access to your AWS account in order to query CloudWatch metrics. To do this, create a new role in the AWS IAM Console and attach a policy that has the required permissions to query the CloudWatch API for metrics. You may also need to grant additional permissions to access data from any AWS services you want to monitor. See Datadog’s documentation for detailed instructions on this process.

Configure Datadog’s AWS integration

Once you’ve created the required role, go to Datadog’s AWS integration tile. Enter your AWS account ID and the name of the role you created in the previous step. Under “Limit metric collection,” check off the AWS services you want to monitor with Datadog. For our EKS cluster, we want to make sure to collect at least EC2 metrics. Here, you can see that we will also collect metrics from our EBS volumes and Elastic Load Balancers.

EKS monitoring with Datadog AWS integration

Once you’ve enabled the integrations for the services you want to monitor, Datadog will begin pulling in those metrics so you can view, correlate, and alert on them along with all your other data. Datadog includes customizable, out-of-the-box dashboards for many AWS services, and you can easily create your own dashboards to focus on the metrics that are most important to your organization.

EKS monitoring with Datadog ELB dashboard
Datadog's out-of-the-box AWS ELB dashboard.

Datadog will also import AWS event information for certain services. Events include, for example, scheduled maintenance for EC2 instances, or Auto-Scaling triggers for starting or terminating instances. Viewing these alongside Kubernetes events can give you a better picture of what is going on with your cluster’s infrastructure.

EKS monitoring with Datadog event stream

The full power of Datadog

With the Datadog Agent and AWS integrations working, you now have access to all of the metrics covered in Part 1 of this series. Now, we’ll go over how to use Datadog to get full visibility into your EKS cluster and the applications and services running on it. This includes:

  • Using tags to filter and organize your data
  • Visualizing your EKS cluster
  • Autodiscovering the services running on your cluster
  • Collecting processes, logs, traces, and custom metrics
  • Alerting on your metrics, logs, and APM data

The beauty of tags

Datadog automatically imports metadata from Kubernetes, Docker, AWS services, and other technologies, then creates tags from that metadata. This makes it easier to visualize and alert on your metrics, traces, and logs at a more granular level. For example, you can filter and view your resources by Kubernetes Deployment (kube_deployment) or Service (kube_service), or by Docker image (image_name). Datadog also automatically pulls in any host tags from your EC2 instances (both those attached by AWS and any custom tags), so you can view your nodes by availability zone or by EC2 instance type.

In your node-based Datadog Agent manifest, you can add custom host-level tags with the environment variable DD_TAGS followed by key:value pairs separated by spaces. For example, below we’ve added two tags that will be applied to all nodes in your cluster that the Agent is deployed on:

[...]
env:
  - name: DD_TAGS
    value: owner:maxim role:eks-demo
[...]

You can also import Kubernetes pod labels as tags. This lets you pull pod-level metadata that you define in your manifests into Datadog as tags. For example, you may label certain pods related to a specific application and then filter down in Datadog to visualize the infrastructure for that application.
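As a minimal sketch (assuming a hypothetical pod label named app that you want to surface in Datadog as a kube_app tag), you could map labels to tags by adding the Agent’s DD_KUBERNETES_POD_LABELS_AS_TAGS environment variable to the node-based Agent manifest:

env:
  [...]
  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    # Maps the pod label "app" to the Datadog tag "kube_app" (label and tag names are examples)
    value: '{"app":"kube_app"}'

Redeploy the DaemonSet for the change to take effect.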

Visualize your EKS cluster

Once the Datadog Agent has been deployed to your cluster, you should be able to see information about your EKS infrastructure flowing into Datadog. You should also be able to quickly drill down into specific sets of containers by using tags to sort and filter by pod, deployment, service, and more.

The host map gives you a high-level view of your nodes. You can group and filter your nodes by host-level tags, such as AWS availability zone or instance types, and the host map will fill them in based on the real-time value of any metric being collected on those nodes. This lets you see at a glance if, for example, one of your nodes is using higher levels of CPU than others, or if a particular zone is getting more requests than others.

Datadog’s container map view provides similar insights into your container fleet. Datadog updates the map every few seconds to reflect changes, such as containers being launched or terminated.

EKS monitoring with Datadog container map
Use Datadog's container map to organize your container fleet and surface possible problem spots.

Similarly, Datadog’s Live Container view gives you real-time insight into the status and performance of your containers, updated every two seconds. You can also sort your containers by resource usage to quickly surface resource-heavy containers.

EKS monitoring with Datadog Live Container View

Tags let you filter for specific sets of containers—for example, Kubernetes pods, deployments, or services—to get more granular insights into your EKS cluster. Below, we have drilled down to a container by a host tag (owner:maxim) and then by the service it’s running (kube_service:redis). You can also get additional context by looking at the other tags from different sources that Datadog has automatically applied to the container.

EKS monitoring with Datadog container tags

Autodiscover your EKS cluster

Monitoring dynamic infrastructure can be difficult as containers churn—along with the applications and services running on them. The Datadog Agent’s Autodiscovery feature solves this problem by continuously listening for Docker events related to the creation and destruction of containers. When new containers are launched, the Agent identifies if they are running a service Datadog should be monitoring. If so, the Agent then automatically configures and runs the appropriate check.

Autodiscovery is active by default. When you deploy a service (e.g., Redis) to your cluster, you can include pod annotations that will provide the information Autodiscovery needs to detect and monitor that service on any containers that are launched with a specific name. These annotations all begin with the following format:

ad.datadoghq.com/<CONTAINER_IDENTIFIER>

The container identifier tells Datadog what to look for in the names of new containers. So, let’s say we want the Datadog Agent to automatically detect whenever a container is running Redis, and configure a check to start collecting Redis metrics from that container. We can add the following annotations to our Redis deployment manifest:

annotations:
  ad.datadoghq.com/redis.check_names: '["redisdb"]'
  ad.datadoghq.com/redis.init_configs: '[{}]'
  ad.datadoghq.com/redis.instances: '[{"host": "%%host%%","port":"6379"}]'

These annotations ensure that Datadog will recognize any new containers that have the container identifier redis set as their name in the deployment manifest. Datadog will then enable its Redis monitoring check (redisdb) and query port 6379 of that container’s local host IP for metrics.

For information that may change often in a containerized environment, like host IPs and container ports, it’s helpful to use template variables so that the Agent can dynamically detect and communicate this information. In this case, the template variable, %%host%%, will auto-detect the host IP. Note that if a container is named something other than redis, Autodiscovery will not include it in this check, even if it is built from the same image. See the documentation for more information on configuring Autodiscovery.
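To show where these annotations live, here is a minimal sketch of a Redis Deployment manifest (the image tag and labels are illustrative); note that the container’s name matches the redis identifier used in the annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:
        ad.datadoghq.com/redis.check_names: '["redisdb"]'
        ad.datadoghq.com/redis.init_configs: '[{}]'
        ad.datadoghq.com/redis.instances: '[{"host": "%%host%%","port":"6379"}]'
    spec:
      containers:
      - name: redis              # must match the <CONTAINER_IDENTIFIER> in the annotations above
        image: redis:latest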

Processes and logs and traces and custom metrics (oh my)

Datadog’s Agent will automatically collect metrics from your nodes and containers. To get even more insight into your cluster, you can also have Datadog collect process-level data from your containers, as well as logs, request traces, and custom metrics from the applications on your cluster.

These features are not configured by default, but you can easily enable them by adding a few more configurations to your Datadog Agent manifest (not the Cluster Agent manifest, if you are using the Cluster Agent). We will cover how to collect:

  • Process-level data from your containers
  • Logs from your containers and applications
  • Request traces with Datadog APM
  • Custom metrics via DogStatsD

Monitor your processes in real time

With Datadog’s Process Monitoring, you can get real-time, granular insight into the individual processes running on your hosts and containers. To enable it, you’ll need to make a few updates to your node-based Datadog Agent manifest (datadog-agent.yaml).

Under env, include:

env:
  [...]
  - name: DD_PROCESS_AGENT_ENABLED
    value: "true"

Then, add the required volume mount and volume:

volumeMounts:
  [...]
  - name: passwd
    mountPath: /etc/passwd
    readOnly: true
  [...]
volumes:
  [...]
  - hostPath:
      path: /etc/passwd
    name: passwd

Deploy the changes:

$ kubectl apply -f /path/to/datadog-agent.yaml

With Process Monitoring enabled, you can navigate to Datadog’s Live Process view to visualize all running processes in your cluster. You can use tags to easily filter, search, and drill down to see the exact data you need. For example, in the screenshot below, we’re looking at process information for the dd-agent Kubernetes service across any nodes or pods that are in the maxim-eks environment.

EKS monitoring with Datadog Live Process Monitoring

See our documentation for more details on using Live Process Monitoring.

Collect and analyze EKS logs

Datadog can automatically collect logs for Docker, many AWS services, and other technologies you may be running on your EKS cluster. Logs can be invaluable for troubleshooting problems, identifying errors, and giving you greater insight into the behavior of your infrastructure and applications.

In order to enable log collection from your containers, add the following environment variables:

env:
  [...]
  - name: DD_LOGS_ENABLED
    value: "true"
  - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
    value: "true"

Then, add the following to volumeMounts and volumes:

volumeMounts:
  [...]
  - name: pointerdir
    mountPath: /opt/datadog-agent/run
  [...]
volumes:
  [...]
  - hostPath:
      path: /opt/datadog-agent/run
    name: pointerdir

Deploy (or redeploy) the manifest:

$ kubectl apply -f /path/to/datadog-agent.yaml

With log collection enabled, you should start seeing logs flowing into the Log Explorer.

EKS monitoring with Datadog Log Explorer

With Datadog’s log integrations, you can automatically ingest, process, and parse logs from a variety of technologies for analysis and visualization. It’s particularly useful to ensure that your logs have a source tag and a service tag attached. The source sets the context for the log, letting you pivot from metrics to related logs. It also tells Datadog which log processing pipeline to use to properly parse key attributes from your logs, such as the timestamp and the severity.

Likewise, if you have Datadog APM enabled, the service tag lets you pivot seamlessly from logs to application-level metrics and request traces from the same service, for more detailed troubleshooting. As of the Datadog Agent version 6.8, Datadog will attempt to automatically generate these tags for your logs from the image name. For example, logs from our Redis containers will be tagged source:redis and service:redis. You can also provide custom values by including the following Kubernetes annotation in the manifest for the service you are deploying to your cluster:

annotations:
  ad.datadoghq.com/<CONTAINER_IDENTIFIER>.logs: '[{"source":"<SOURCE>","service":"<SERVICE>"}]'

For example, let’s say our application uses a service, redis-cache. When we deploy Redis to our cluster, we can tell Datadog to ingest Redis logs from pods running that service using the following annotation:

ad.datadoghq.com/redis.logs: '[{"source": "redis","service":"redis-cache"}]'

This tells Datadog’s Autodiscovery to look for containers identified by redis and tag logs coming from them with source:redis and service:redis-cache.

EKS monitoring with Datadog Redis logs
Datadog's log pipelines can parse logs based on their source.

Track application performance

Datadog APM traces individual requests as they propagate across your nodes, containers, and services. You can then use Datadog to visualize the full lifespan of these requests, end to end. This gives you deep visibility into your services, endpoints, and database queries and quickly surfaces errors and latency issues.

Datadog APM includes support for auto-instrumenting applications; consult the documentation for supported languages and details on how to get started.

First, enable tracing in the Datadog Agent by adding the following environment variable to your Datadog Agent manifest:

env:
  [...]
  - name: DD_APM_ENABLED
    value: "true"

Then, uncomment the hostPort for the Trace Agent so that your manifest includes:

ports:
  - containerPort: 8126
    hostPort: 8126
    name: traceport
    protocol: TCP

Apply the changes:

$ kubectl apply -f /path/to/datadog-agent.yaml

Next, we have to ensure that our application containers send traces only to the Datadog Agent instance running on the same node. This means configuring the application’s Deployment manifest to provide the host node’s IP as an environment variable using Kubernetes’s Downward API. We can do this with the DATADOG_TRACE_AGENT_HOSTNAME environment variable, which tells the Datadog tracer in your instrumented application which host to send traces to.

In the Deployment manifest for your application containers, ensure you have the following in your container specs:

spec:
      containers:
      - name: <CONTAINER_NAME>
        image: <CONTAINER_IMAGE>:<TAG>
        env:
          - name: DATADOG_TRACE_AGENT_HOSTNAME
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP

When you deploy your instrumented application, it will automatically begin sending traces to Datadog. From the APM tab of your Datadog account, you can see a breakdown of key performance indicators for each of your instrumented services, with information about request throughput, latency, and errors.

EKS monitoring with Datadog APM service overview

Dive into an individual trace to inspect a flame graph that breaks that traced request down into spans. Spans represent each individual database query, function call, or operation executed over the lifetime of the request. If you select a span, you can view system metrics as well as relevant logs from the host that executed that span of work, scoped to the same timeframe.

EKS monitoring with Datadog request trace flame graph

Report custom metrics

In addition to the metrics that you get through Datadog’s integrations, you can send custom metrics from your applications running on your EKS cluster to Datadog using the DogStatsD protocol. Datadog provides or supports a number of libraries you can use to emit custom metrics from your applications.

The Agent DaemonSet manifest in our deployment steps above includes a hostPort, 8125, which is the default port that DogStatsD listens on. Uncomment that line so that your manifest has the following:

ports:
  - containerPort: 8125
    hostPort: 8125
    name: dogstatsdport
    protocol: UDP

Deploy or redeploy the Agent:

$ kubectl apply -f /path/to/datadog-agent.yaml

You can now instrument your applications to send custom metrics on port 8125 of the node they are running on.

Be alert(ed)

Datadog provides a number of powerful alerts so that you can detect possible issues before they cause serious problems for your infrastructure and its users, all without needing to constantly monitor your cluster. These alerts can apply to any of the metrics, logs, or APM data that Datadog collects. This means that you can set alerts not just on the EKS cluster itself but also on the applications and services running on it. Using tags, you can set different alerts that are targeted to specific resources.

As an example, below we’re setting a threshold alert that monitors a Kubernetes metric, CPU requests, measured per node. We’re also limiting this alert to nodes in our cluster that are tagged as being maxim-eks-workers and that are part of the maxim-eks environment. We know that the instance type for these nodes has a maximum CPU capacity of two vCPUs, or two cores, so we can set the alert thresholds to notify us when the total CPU requests on any single node begin to approach that limit. If we get alerted, this will give us enough time to see if we need to spin up more nodes to avoid problems launching new pods.

EKS monitoring with Datadog threshold alert

Datadog includes a number of checks based on Kubernetes indicators, such as node status, which you can also use to define alerts. So for example, below we’re looking at the same group of nodes as before, but this time we’re setting an alert to notify us when any of the nodes in that group fails the check three consecutive times (to give EKS a chance to reload the nodes if possible). The alert will automatically resolve if the same node passes the check two consecutive times.

EKS monitoring with Datadog service check alert

Add smarts to your alerts

In addition to threshold alerts tied to specific metrics, you can also create machine-learning-driven alerts. For example, forecasting tracks metric trends in order to predict and reveal possible future problems. You can create a forecast alert to predict when, based on historical data, available disk space on a PersistentVolume will fall below a certain threshold, providing ample time to add resources before you actually run out of disk space.

Datadog’s Watchdog automatically detects anomalies in your application performance metrics without any manual configuration, surfacing abnormal behavior in services across your infrastructure.

Datadog alerts integrate with notification services like PagerDuty and Slack, letting you easily notify the right teams. You can read more about how to use Datadog’s alerts in our documentation.

Dive deeper into your EKS cluster

In this post, we’ve gone over how to use Datadog to gain deep visibility into your EKS cluster and the applications running on it. The Datadog Agent aggregates Kubernetes state metrics and host- and container-level resource metrics from all of the nodes in your cluster. Enabling Datadog’s AWS integrations lets you pull in CloudWatch metrics and events across your AWS services. These together with Datadog’s Autodiscovery allow you to monitor, visualize, and alert on all of the key EKS metrics as well as logs, distributed request traces, and other data across a dynamic, containerized environment.

If you don’t yet have a Datadog account, you can sign up for a free trial and start monitoring your EKS clusters today.

Acknowledgment

We wish to thank our friends at AWS for their technical review of this series.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.