Monitoring your EKS cluster with Datadog

Maxim Brown

In Parts 1 and 2 of this series, we saw that key EKS metrics come from several sources, and can be broken down into the following main types:

Cluster state metrics from the Kubernetes API
Node and container resource metrics from your nodes' kubelets
AWS service metrics from AWS CloudWatch

In this post, we'll explore how Datadog's integrations with Kubernetes, Docker, and AWS will let you track the full range of EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like Autodiscovery to track services across containers; sophisticated graphing and alerting options; and full support for AWS services.

Specifically, we will cover:

Installing the Datadog Agent across your cluster to track host- and container-level resource metrics
Enabling Datadog's AWS integration to collect CloudWatch metrics and events
Using Datadog's full feature set to get end-to-end visibility into your EKS infrastructure and hosted applications and services

If you don't already have a Datadog account but want to follow along and start monitoring your EKS cluster, sign up for a free trial.

Deploy the Agent to your EKS cluster

The Datadog Agent is open source software that collects and forwards metrics, logs, and traces from each of your nodes and the containers running on them.

Once you deploy the Agent, you will have immediate access to the full range of Kubernetes cluster state and resource metrics discussed in Part 1. The Agent will also begin reporting additional system-level metrics from your nodes and containers.

We will go over how to deploy the Datadog Cluster Agent and node-based Datadog Agents across your EKS cluster. Note that the following instructions are tailored to monitoring Amazon EKS pods running on EC2 instances. See our documentation for instructions on deploying the Datadog Agent to any Amazon EKS pods running on AWS Fargate.

The Cluster Agent communicates with the Kubernetes API servers to collect cluster-level information, while the node-based Agents report data from each node's kubelet.

While it is possible to deploy the Datadog Agent without the Cluster Agent, using the Cluster Agent is recommended as it offers several benefits, particularly for large-scale EKS clusters:

It reduces overall load on the Kubernetes API by using a single Cluster Agent as a proxy for querying cluster-level metrics.
It provides additional security because only one Agent needs the permissions required to access the API server.
It lets you automatically scale your pods using any metric that is collected by Datadog.

You can read more about the Datadog Cluster Agent here.

Before turning to the Agent, however, make sure that you've deployed kube-state-metrics. Recall that kube-state-metrics is an add-on service that generates cluster state metrics from the Kubernetes API and exposes them over an HTTP endpoint. After you install the service, Datadog will be able to aggregate these metrics along with other resource and application data.

Deploying the Datadog Cluster Agent

The Datadog Cluster Agent runs on a single node and serves as a proxy between the API servers and the rest of the node-based Agents in your cluster. It also makes it possible to configure Kubernetes's Horizontal Pod Autoscaling to use any metric that Datadog collects (more on this below).

There are several steps needed to prepare your cluster for the Agent. These involve providing the appropriate permissions to the Cluster Agent and to the node-based Agents so each can access the information it needs. First, we need to configure RBAC permissions, and then create and deploy the Cluster Agent and node-based Agent manifests.

Configure RBAC permissions for the Cluster Agent and node-based Agents

EKS uses AWS IAM for user authentication and access to the cluster, but it relies on Kubernetes role-based access control (RBAC) to authorize calls by those users to the Kubernetes API. So, for both the Cluster Agent and the node-based Agents, we'll need to set up a service account, a ClusterRole with the necessary RBAC permissions, and then a ClusterRoleBinding that links them so that the service account can use those permissions.

First, create the Cluster Agent's RBAC file, cluster-agent-rbac.yaml:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: datadog-cluster-agent
rules:
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "autoscaling"
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  resourceNames:
  - datadogtoken             # Kubernetes event collection state
  - datadog-leader-election  # Leader election token
  verbs:
  - get
  - update
- apiGroups:                 # To create the leader election token
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - get
  - update
- nonResourceURLs:
  - "/version"
  - "/healthz"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-agent
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: default
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-cluster-agent
  namespace: default

Next, create the node-based Agent's RBAC file, datadog-rbac.yaml:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
- apiGroups:      # This is required by the agent to query the Kubelet API.
  - ""
  resources:
  - nodes/metrics
  - nodes/spec
  - nodes/proxy   # Required to get /pods
  - nodes/stats   # Required to get /stats/summary
  verbs:
  - get
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
- kind: ServiceAccount
  name: datadog-agent
  namespace: default

Deploy both of them:

$ kubectl apply -f /path/to/cluster-agent-rbac.yaml
$ kubectl apply -f /path/to/datadog-rbac.yaml

Confirm you have both ClusterRoles:

# Get a list of all ClusterRoles
$ kubectl get clusterrole
NAME                                                                   AGE
[...]
datadog-agent                                                          1h
datadog-cluster-agent                                                  1h
[...]

Secure communication between node-based Agents and the Cluster Agent

The next step is to ensure that the Cluster Agent and node-based Agents can securely communicate with each other. The best way to do this is by creating a Kubernetes secret that holds a shared token. Choose a 32-character string to use as the token, then base64-encode it for inclusion in the secret:

echo -n '<32_CHARACTER_LONG_STRING>' | base64

Copy the resulting string. Then, create a file, dca-secret.yaml, with the following:

apiVersion: v1
kind: Secret
metadata:
  name: datadog-auth-token
type: Opaque
data:
  token: <TOKEN>

Replace <TOKEN> with the string from the previous step. Then create the secret:

$ kubectl apply -f /path/to/dca-secret.yaml

You can confirm that the secret was created with the following:

# Get a list of all secrets
$ kubectl get secret
NAME                        TYPE                                  DATA      AGE
datadog-auth-token          Opaque                                1         21h

Now that we have a secret in Kubernetes, we can include it in our Cluster Agent and node-based Agent manifests so that they can securely communicate with each other.
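As an alternative to base64-encoding the token by hand, Kubernetes Secrets also accept a stringData field, which takes plain-text values and encodes them when the object is created. A minimal sketch of dca-secret.yaml written that way, keeping the same placeholder for the token string:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog-auth-token
type: Opaque
stringData:
  # Plain-text token; Kubernetes stores it base64-encoded under data.token
  token: "<32_CHARACTER_LONG_STRING>"
```

Either form produces a secret with a single token key, so the secretKeyRef entries in the Agent manifests do not need to change.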

Create and deploy the Cluster Agent manifest

To deploy the Cluster Agent, create a manifest, datadog-cluster-agent.yaml, which creates the Datadog Cluster Agent Deployment and Service, links them to the Cluster Agent service account we deployed above, and points to the newly created secret:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: default
spec:
  selector:
    matchLabels:
      app: datadog-cluster-agent
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
      name: datadog-agent
    spec:
      serviceAccountName: datadog-cluster-agent
      containers:
      - image: public.ecr.aws/datadog/cluster-agent:latest
        imagePullPolicy: Always
        name: datadog-cluster-agent
        env:
          - name: DD_API_KEY
            value: <YOUR_API_KEY>
          - name: DD_APP_KEY    # Optional
            value: <YOUR_APP_KEY>  # Optional
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: DD_LEADER_ELECTION
            value: "true"
          - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
            value: "true"
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-auth-token
                key: token
---
apiVersion: v1
kind: Service
metadata:
  name: datadog-cluster-agent
  labels:
    app: datadog-cluster-agent
spec:
  ports:
  - port: 5005 # Has to be the same as the one exposed in the DCA. Default is 5005.
    protocol: TCP
  selector:
    app: datadog-cluster-agent

Make sure to insert your Datadog API key as indicated in the manifest above. You can also include a Datadog app key. Both keys are accessible here in your Datadog account. Providing an app key is necessary if you want to autoscale your EKS applications based on any metric you're collecting with Datadog. See below for more information on this.
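If you would rather not keep the API key in plain text in the manifest, one option is to store it in a Kubernetes secret and reference it the same way the manifest references the auth token. The sketch below assumes a secret named datadog-api-key with a key named api-key; both names are hypothetical and would need to match a secret you create yourself.

```yaml
# Fragment of the Cluster Agent container spec (env section only)
env:
  - name: DD_API_KEY
    valueFrom:
      secretKeyRef:
        name: datadog-api-key   # hypothetical secret name
        key: api-key            # hypothetical key within that secret
```

The same pattern should work for the node-based Agent manifest, since it sets DD_API_KEY in the same way.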

The final environment variable—DD_CLUSTER_AGENT_AUTH_TOKEN—points the Cluster Agent to the datadog-auth-token secret we just created.

Note that the Datadog Cluster Agent is configured as a Deployment and Service, rather than as a DaemonSet, because we're not installing it on every node. Deploy the Cluster Agent and Service:

$ kubectl apply -f /path/to/datadog-cluster-agent.yaml

Deploy the node-based Agent DaemonSet

The final step is to deploy the node-based Agents as a DaemonSet. We use a DaemonSet here because, unlike the Cluster Agent, we want to deploy the node-based Agent to all of our nodes, including new ones as they are launched. (You can also use nodeSelectors to install it only on a specified subset of nodes.)
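For the nodeSelector case, a minimal sketch of the relevant part of the DaemonSet manifest might look like the following. The label monitoring: datadog is a hypothetical example; you would substitute a label that actually exists on the nodes you want to target.

```yaml
# Fragment of the DaemonSet manifest: only the fields relevant to node selection
spec:
  template:
    spec:
      nodeSelector:
        monitoring: datadog   # hypothetical node label; nodes without it get no Agent pod
```

With this in place, the DaemonSet controller schedules an Agent pod only on nodes carrying the matching label, including matching nodes that join the cluster later.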

Create a datadog-agent.yaml manifest file (making sure to fill in your Datadog API key):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  selector:
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      serviceAccountName: datadog-agent
      containers:
      - image: public.ecr.aws/datadog/agent:latest
        imagePullPolicy: Always
        name: datadog-agent
        ports:
          - containerPort: 8125
            # hostPort: 8125
            name: dogstatsdport
            protocol: UDP
          - containerPort: 8126
            # hostPort: 8126
            name: traceport
            protocol: TCP
        env:
          - name: DD_API_KEY
            value: "<YOUR_API_KEY>"
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: KUBERNETES
            value: "true"
          - name: DD_KUBERNETES_KUBELET_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: DD_CLUSTER_AGENT_ENABLED
            value: "true"
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-auth-token
                key: token
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
        livenessProbe:
          exec:
            command:
            - ./probe.sh
          initialDelaySeconds: 15
          periodSeconds: 5
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: dockersocket
        - hostPath:
            path: /proc
          name: procdir
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroups

Deploy the node-based Agent:

$ kubectl apply -f /path/to/datadog-agent.yaml

You can use the following kubectl command to verify that your Cluster Agent and node-based Agent pods are running (the -o wide flag includes more details in the output):

# Get running pods with "agent" in their name
$ kubectl get pods -o wide | grep agent
NAME                                    READY     STATUS    RESTARTS   AGE       IP                NODE
datadog-agent-44x9q                     1/1       Running   0          9d        192.168.124.130   ip-192-168-124-130.us-west-2.compute.internal
datadog-agent-8ngmb                     1/1       Running   0          9d        192.168.165.188   ip-192-168-165-188.us-west-2.compute.internal
datadog-agent-gfl98                     1/1       Running   0          9d        192.168.193.120   ip-192-168-193-120.us-west-2.compute.internal
datadog-agent-hbxqh                     1/1       Running   0          9d        192.168.182.134   ip-192-168-182-134.us-west-2.compute.internal
datadog-agent-hqbj2                     1/1       Running   0          9d        192.168.112.185   ip-192-168-112-185.us-west-2.compute.internal
datadog-agent-knd2j                     1/1       Running   0          9d        192.168.195.211   ip-192-168-195-211.us-west-2.compute.internal
datadog-cluster-agent-f797cfb54-b5qs8   1/1       Running   0          7s        192.168.248.206   ip-192-168-195-211.us-west-2.compute.internal

You should see that each node in your cluster is running a datadog-agent replica, and that one node is also running a datadog-cluster-agent pod.

Now that the Agent has been deployed to your cluster, you should see information from your EKS infrastructure automatically flowing into Datadog. This includes system-level metrics from your nodes and containers as well as metrics from Datadog's Kubernetes and Docker integrations, which you can immediately visualize in a customizable, out-of-the-box dashboard:

EKS monitoring with Datadog Kubernetes dashboard — Visualize your Kubernetes and Docker metrics together in Datadog.

Autoscale your EKS cluster with Datadog metrics

Out of the box, Kubernetes's Horizontal Pod Autoscaler (HPA) can autoscale a controller's replica count based on a targeted level of CPU utilization averaged across that controller's pods. (Note that this cannot apply to a DaemonSet, as a DaemonSet automatically launches a pod on each available node.) As of version 1.10, Kubernetes also supports autoscaling based on custom metrics so long as you have configured an External Metrics Provider resource that can communicate with the Kubernetes API server. The Datadog Cluster Agent can act as an External Metrics Provider, meaning that if you are using the Cluster Agent to monitor your EKS infrastructure, you can deploy an HPA that will autoscale your pods based on any metric collected by Datadog.

In the screenshot below, we've deployed an HPA that will monitor requests per second to pods running NGINX across our cluster, averaged by pod. If this metric crosses a threshold, it will spin up new NGINX pods in our cluster until the average requests per second per pod falls below the threshold, or the number of running pods meets the maximum limit we defined in the HPA manifest.

EKS monitoring with Datadog HPA scaling — Our HPA scales up NGINX pods in response to spikes in requests per second.

Deploying HPAs can help your cluster automatically respond to dynamic workloads by spinning up new pods, for example, to add resource capacity or to distribute requests. Find more information about deploying an HPA using Datadog metrics in our blog post.
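To make the NGINX scenario described above more concrete, here is a minimal sketch of what such an HPA manifest might look like. It assumes a cluster recent enough to serve the autoscaling/v2 API (older clusters use beta versions with a different schema) and a Cluster Agent that has been registered as the external metrics provider. The HPA name, the target Deployment name, the replica bounds, the tag used as a selector, and the threshold value are all illustrative assumptions rather than values taken from our cluster.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx              # hypothetical Deployment to scale
  minReplicas: 1
  maxReplicas: 5             # upper bound on the number of pods
  metrics:
  - type: External
    external:
      metric:
        name: nginx.net.request_per_s      # Datadog metric to query
        selector:
          matchLabels:
            kube_container_name: nginx     # assumed tag used to scope the query
      target:
        type: AverageValue
        averageValue: "9"                  # hypothetical per-pod threshold
```

With an AverageValue target, Kubernetes divides the metric value by the current number of pods, which matches the "averaged by pod" behavior described above.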


Enable Datadog's AWS integrations

So far, we have covered how to use Datadog to monitor Kubernetes and Docker. But as we discussed in Part 1, that's only part of the EKS story; you will also want to monitor the performance and health of the various infrastructure components in your cluster that are provisioned from AWS services, such as EBS volumes, ELB load balancers, and others.

Datadog's AWS integration pulls in CloudWatch metrics and events so that you can visualize and alert on them from a central platform, even if you don't install the Datadog Agent on your nodes. Datadog automatically collects any tags that you add in AWS as well as metadata about each AWS component. For example, your EC2 metrics will be tagged with the instance type, availability zone, Auto Scaling group, etc. Having these tags in Datadog lets you easily find and drill down to the specific cluster components you are interested in.

Get access to CloudWatch

Datadog needs read-only access to your AWS account in order to query CloudWatch metrics. To do this, create a new role in the AWS IAM Console and attach a policy that has the required permissions to query the CloudWatch API for metrics. You may also need to grant additional permissions to access data from any AWS services you want to monitor. See Datadog's documentation for detailed instructions on this process.

Configure Datadog's AWS integration

Once you've created the required role, go to Datadog's AWS integration tile. Enter your AWS account ID and the name of the role you created in the previous step. Under "Limit metric collection," check off the AWS services you want to monitor with Datadog. For our EKS cluster, we want to make sure to collect at least EC2 metrics. Here, you can see that we will also collect metrics from our EBS volumes and Elastic Load Balancers.

EKS monitoring with Datadog AWS integration

Once you've enabled the integrations for the services you want to monitor, Datadog will begin pulling in those metrics so you can view, correlate, and alert on them along with all your other data. Datadog includes customizable, out-of-the-box dashboards for many AWS services, and you can easily create your own dashboards to focus on the metrics that are most important to your organization.

EKS monitoring with Datadog ELB dashboard — Datadog's out-of-the-box AWS ELB dashboard.

Datadog will also import AWS event information for certain services. Events include, for example, scheduled maintenance for EC2 instances, or Auto Scaling triggers for starting or terminating instances. Viewing these alongside Kubernetes events can give you a better picture of what is going on with your cluster's infrastructure.

EKS monitoring with Datadog event stream

The full power of Datadog

With the Datadog Agent and AWS integrations working, you now have access to all of the metrics covered in Part 1 of this series. Now, we'll go over how to use Datadog to get full visibility into your EKS cluster and the applications and services running on it. This includes:

Detailed tagging for filtering and sorting
Powerful visualization and real-time monitoring features
Autodiscovery to automatically identify what's running on your containers
Enabling log, trace, process, and custom metric collection
Sophisticated alerting options

The beauty of tags

Datadog automatically imports metadata from Kubernetes, Docker, AWS services, and other technologies, then creates tags from that metadata. This makes it easier to visualize and alert on your metrics, traces, and logs at a more granular level. For example, you can filter and view your resources by Kubernetes Deployment (kube_deployment) or Service (kube_service), or by Docker image (image_name). Datadog also automatically pulls in any host tags from your EC2 instances (both those attached by AWS and any custom tags), so you can view your nodes by availability zone or by EC2 instance type.

In your node-based Datadog Agent manifest, you can add custom host-level tags with the environment variable DD_TAGS followed by key:value pairs separated by spaces. For example, below we've added two tags that will be applied to all nodes in your cluster that the Agent is deployed on:

[...]
env:
  - name: DD_TAGS
    value: owner:maxim role:eks-demo
[...]

You can also import Kubernetes pod labels as tags. This lets you pull pod-level metadata that you define in your manifests into Datadog as tags. For example, you may label certain pods related to a specific application and then filter down in Datadog to visualize the infrastructure for that application.
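One way to do this is with the DD_KUBERNETES_POD_LABELS_AS_TAGS environment variable on the node-based Agent, which takes a JSON map from pod label names to Datadog tag keys. The sketch below assumes your pods carry an app label; the tag key kube_app is an arbitrary example name.

```yaml
# Fragment of the node-based Agent manifest (env section only)
env:
  - name: DD_KUBERNETES_POD_LABELS_AS_TAGS
    # JSON map of <pod label name> to <Datadog tag key>
    value: '{"app":"kube_app"}'
```

With this mapping, a pod labeled app: checkout would report data tagged kube_app:checkout.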

Visualize your EKS cluster

Once the Datadog Agent has been deployed to your cluster, you should be able to see information about your EKS infrastructure flowing into Datadog. You should also be able to quickly drill down into specific sets of containers by using tags to sort and filter by pod, deployment, service, and more.

The host map gives you a high-level view of your nodes. You can group and filter your nodes by host-level tags, such as AWS availability zone or instance types, and the host map will fill them in based on the real-time value of any metric being collected on those nodes. This lets you see at a glance if, for example, one of your nodes is using higher levels of CPU than others, or if a particular zone is getting more requests than others.

Datadog's container map view provides similar insights into your container fleet. Datadog updates the map every few seconds to reflect changes, such as containers being launched or terminated.

EKS monitoring with Datadog container map — Use Datadog's container map to organize your container fleet and surface possible problem spots.

Similarly, Datadog's Live Container view gives you real-time insight into the status and performance of your containers, updated every two seconds. You can also sort your containers by resource usage to quickly surface resource-heavy containers.

EKS monitoring with Datadog Live Container View

Tags let you filter for specific sets of containers—for example, Kubernetes pods, deployments, or services—to get more granular insights into your EKS cluster. Below, we have drilled down to a container by a host tag (owner:maxim) and then by the service it's running (kube_service:redis). You can also get additional context by looking at the other tags from different sources that Datadog has automatically applied to the container.

EKS monitoring with Datadog container tags

Autodiscover your EKS cluster

Monitoring dynamic infrastructure can be difficult as containers churn—along with the applications and services running on them. The Datadog Agent's Autodiscovery feature solves this problem by continuously listening for Docker events related to the creation and destruction of containers. When new containers are launched, the Agent identifies if they are running a service Datadog should be monitoring. If so, the Agent then automatically configures and runs the appropriate check.

Autodiscovery is active by default. When you deploy a service (e.g., Redis) to your cluster, you can include pod annotations that will provide the information Autodiscovery needs to detect and monitor that service on any containers that are launched with a specific name. These annotations all begin with the following format:

ad.datadoghq.com/<CONTAINER_IDENTIFIER>

The container identifier tells Datadog what to look for in the names of new containers. So, let's say we want the Datadog Agent to automatically detect whenever a container is running Redis, and configure a check to start collecting Redis metrics from that container. We can add the following annotations to our Redis deployment manifest:

annotations:
  ad.datadoghq.com/redis.check_names: '["redisdb"]'
  ad.datadoghq.com/redis.init_configs: '[{}]'
  ad.datadoghq.com/redis.instances: '[{"host": "%%host%%","port":"6379"}]'

These annotations tell Datadog that any new container whose name in the deployment manifest matches the container identifier redis is running Redis. Datadog will then enable its Redis monitoring check (redisdb) and query port 6379 at that container's IP address for metrics.

For information that may change often in a containerized environment, like IP addresses and container ports, it's helpful to use template variables so that the Agent can dynamically detect and communicate this information. In this case, the template variable %%host%% resolves to the container's network IP at runtime. Note that if a container is named something other than redis, Autodiscovery will not include it in this check, even if it is built from the same image. See the documentation for more information on configuring Autodiscovery.
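To show where these annotations sit relative to the container name, here is a minimal sketch of a Redis Deployment. The app label, replica count, and image tag are assumptions for illustration; the annotations are the ones shown above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
      annotations:           # annotations go on the pod template, not on the Deployment itself
        ad.datadoghq.com/redis.check_names: '["redisdb"]'
        ad.datadoghq.com/redis.init_configs: '[{}]'
        ad.datadoghq.com/redis.instances: '[{"host": "%%host%%","port":"6379"}]'
    spec:
      containers:
      - name: redis          # must match the <CONTAINER_IDENTIFIER> used in the annotations
        image: redis:latest  # image tag is an assumption
        ports:
        - containerPort: 6379
```

Because the identifier is matched against the container name, renaming the container without updating the annotation keys would leave the check unconfigured.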

Processes and logs and traces and custom metrics (oh my)

Datadog's Agent will automatically collect metrics from your nodes and containers. To get even more insight into your cluster, you can also have Datadog collect process-level data from your containers, as well as logs, request traces, and custom metrics from the applications on your cluster.

These features are not configured by default, but you can easily enable them by adding a few more settings to your node-based Datadog Agent manifest (not the Cluster Agent manifest). We will cover how to collect:

Process data
Logs
Traces
Custom metrics

Monitor your processes in real time

With Datadog's Process Monitoring, you can get real-time, granular insight into the individual processes running on your hosts and containers. To enable it, you'll need to make a few updates to your node-based Datadog Agent manifest (datadog-agent.yaml).

Under env, include:

env:
  [...]
  - name: DD_PROCESS_AGENT_ENABLED
    value: "true"

Then, add the required volume mount and volume:

volumeMounts:
  [...]
  - name: passwd
    mountPath: /etc/passwd
    readOnly: true
  [...]
volumes:
  [...]
  - hostPath:
      path: /etc/passwd
    name: passwd

Deploy the changes:

$ kubectl apply -f /path/to/datadog-agent.yaml

With Process Monitoring enabled, you can navigate to Datadog's Live Process view to visualize all running processes in your cluster. You can use tags to easily filter, search, and drill down to see the exact data you need. For example, in the screenshot below, we're looking at process information for the dd-agent Kubernetes service across any nodes or pods that are in the maxim-eks environment.

EKS monitoring with Datadog Live Process Monitoring

See our documentation for more details on using Live Process Monitoring.

Collect and analyze EKS logs

Datadog can automatically collect logs for Docker, many AWS services, and other technologies you may be running on your EKS cluster. Logs can be invaluable for troubleshooting problems, identifying errors, and giving you greater insight into the behavior of your infrastructure and applications.

In order to enable log collection from your containers, add the following environment variables:

env:
  [...]
  - name: DD_LOGS_ENABLED
    value: "true"
  - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
    value: "true"

Then, add the following to volumeMounts and volumes:

volumeMounts:
  [...]
  - name: pointerdir
    mountPath: /opt/datadog-agent/run
  [...]
volumes:
  [...]
  - hostPath:
      path: /opt/datadog-agent/run
    name: pointerdir

Deploy (or redeploy) the manifest:

$ kubectl apply -f /path/to/datadog-agent.yaml

With log collection enabled, you should start seeing logs flowing into the Log Explorer.

EKS monitoring with Datadog Log Explorer

With Datadog's log integrations, you can automatically ingest, process, and parse logs from a variety of technologies for analysis and visualization. It's particularly useful to ensure that your logs have both a source tag and a service tag. The source sets the context for the log, letting you pivot from metrics to related logs. It also tells Datadog which log processing pipeline to use to properly parse key attributes from your logs, such as the timestamp and the severity.

Likewise, if you have Datadog APM enabled, the service tag lets you pivot seamlessly from logs to application-level metrics and request traces from the same service, for more detailed troubleshooting. As of the Datadog Agent version 6.8, Datadog will attempt to automatically generate these tags for your logs from the image name. For example, logs from our Redis containers will be tagged source:redis and service:redis. You can also provide custom values by including the following Kubernetes annotation in the manifest for the service you are deploying to your cluster:

annotations:
  ad.datadoghq.com/<CONTAINER_IDENTIFIER>.logs: '[{"source":"<SOURCE>","service":"<SERVICE>"}]'

For example, let's say our application uses a service, redis-cache. When we deploy Redis to our cluster, we can tell Datadog to ingest Redis logs from pods running that service using the following annotation:

ad.datadoghq.com/redis.logs: '[{"source": "redis","service":"redis-cache"}]'

This tells Datadog's Autodiscovery to look for containers identified by redis and tag logs coming from them with source:redis and service:redis-cache.

EKS monitoring with Datadog Redis logs — Datadog's log pipelines can parse logs based on their source.

Track application performance

Datadog APM traces individual requests as they propagate across your nodes, containers, and services. You can then use Datadog to visualize the full lifespan of these requests, end to end. This gives you deep visibility into your services, endpoints, and database queries and quickly surfaces errors and latency issues.

Datadog APM includes support for auto-instrumenting applications; consult the documentation for supported languages and details on how to get started.

First, enable tracing in the Datadog Agent by adding the following environment variable to your Datadog Agent manifest:

env:
  [...]
  - name: DD_APM_ENABLED
    value: "true"

Then, uncomment the hostPort for the Trace Agent so that your manifest includes:

ports:
  - containerPort: 8126
    hostPort: 8126
    name: traceport
    protocol: TCP

Apply the changes:

$ kubectl apply -f /path/to/datadog-agent.yaml

Next, we have to ensure that our application containers send traces only to the Datadog Agent instance running on the same node. This means configuring the application's Deployment manifest to provide the host node's IP as an environment variable using Kubernetes's Downward API. We can do this with the DATADOG_TRACE_AGENT_HOSTNAME environment variable, which tells the Datadog tracer in your instrumented application which host to send traces to.

In the Deployment manifest for your application containers, ensure you have the following in your container specs:

spec:
  containers:
    - name: <CONTAINER_NAME>
      image: <CONTAINER_IMAGE>:<TAG>
      env:
        - name: DATADOG_TRACE_AGENT_HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
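The Datadog tracer reads this variable itself, so no application code is required. Still, it can be useful to log the resolved address at startup to confirm that the Downward API value reached the container. The following is a rough sketch only: the function name, the `localhost` fallback, and the sample IP address are assumptions for illustration, while port 8126 matches the Trace Agent hostPort configured above.

```python
import os


def trace_agent_address(environ=None, default_host="localhost", port=8126):
    """Resolve the (host, port) pair that traces would be sent to.

    Hypothetical helper for startup diagnostics; the fallback host is an
    assumption of this sketch, not documented tracer behavior.
    """
    env = os.environ if environ is None else environ
    host = env.get("DATADOG_TRACE_AGENT_HOSTNAME") or default_host
    return host, port


# Simulate the environment a pod would see; 10.0.1.23 is a made-up node IP.
host, port = trace_agent_address({"DATADOG_TRACE_AGENT_HOSTNAME": "10.0.1.23"})
print(f"traces will be sent to {host}:{port}")
```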

When you deploy your instrumented application, it will automatically begin sending traces to Datadog. From the APM tab of your Datadog account, you can see a breakdown of key performance indicators for each of your instrumented services, with information about request throughput, latency, and errors.

EKS monitoring with Datadog APM service overview

Dive into an individual trace to inspect a flame graph that breaks that traced request down into spans. Spans represent each individual database query, function call, or operation executed over the lifetime of the request. If you select a span, you can view system metrics as well as relevant logs from the host that executed that span of work, scoped to the same timeframe.

EKS monitoring with Datadog request trace flame graph

Report custom metrics

In addition to the metrics that you get through Datadog's integrations, you can send custom metrics from your applications running on your EKS cluster to Datadog using the DogStatsD protocol. Datadog provides or supports a number of libraries you can use to emit custom metrics from your applications.

The Agent DaemonSet manifest in our deployment steps above includes a hostPort, 8125, which is the default port that DogStatsD listens on. Uncomment that line so that your manifest has the following:

ports:
  - containerPort: 8125
    hostPort: 8125
    name: dogstatsdport
    protocol: UDP

Deploy or redeploy the Agent:

$ kubectl apply -f /path/to/datadog-agent.yaml

You can now instrument your applications to send custom metrics on port 8125 of the node they are running on.
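In practice you would normally use one of the DogStatsD client libraries, but the wire format is simple enough to show with the standard library alone. The sketch below builds a DogStatsD datagram (`<name>:<value>|<type>|#<tags>`) and sends it over UDP. It assumes the node's IP is exposed to the container in an environment variable, named `DD_AGENT_HOST` here, in the same way the trace hostname was provided earlier; the metric name `checkout.orders` is made up for illustration.

```python
import os
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>[|#tag1,tag2]."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host=None, port=8125):
    """Send one datagram to the Agent's DogStatsD port over UDP.

    The DD_AGENT_HOST variable and the loopback fallback are assumptions
    of this sketch. UDP is fire-and-forget, so no delivery is confirmed.
    """
    host = host or os.environ.get("DD_AGENT_HOST", "127.0.0.1")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# "c" marks a counter; "g" would mark a gauge.
datagram = format_metric("checkout.orders", 1, "c", tags=["env:maxim-eks"])
send_metric(datagram)
print(datagram)
```

Tagging custom metrics with the same tags used elsewhere in the cluster lets you filter and alert on them alongside the integration metrics.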

Be alert(ed)

Datadog provides a number of powerful alerts so that you can detect possible issues before they cause serious problems for your infrastructure and its users, all without needing to constantly monitor your cluster. These alerts can apply to any of the metrics, logs, or APM data that Datadog collects. This means that you can set alerts not just on the EKS cluster itself but also on the applications and services running on it. Using tags, you can set different alerts that are targeted to specific resources.

As an example, below we're setting a threshold alert that monitors a Kubernetes metric, CPU requests, measured per node. We're also limiting this alert to nodes in our cluster that are tagged as being maxim-eks-workers and that are part of the maxim-eks environment. We know that the instance type for these nodes has a maximum CPU capacity of two vCPUs, or two cores, so we can set the alert thresholds to notify us when the total CPU requests on any single node begin to approach that limit. If we get alerted, this will give us enough time to see if we need to spin up more nodes to avoid problems launching new pods.

EKS monitoring with Datadog threshold alert
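To make the threshold arithmetic concrete, here is a small sketch of how warning and alert levels could be derived from the node's two-core capacity. The 75 percent and 90 percent fractions, the node names, and the sample values are hypothetical choices for illustration and are not taken from the alert shown above.

```python
NODE_CPU_CAPACITY_CORES = 2.0  # from the text: two vCPUs per node
WARNING_FRACTION = 0.75        # hypothetical: warn at 1.5 cores requested
ALERT_FRACTION = 0.90          # hypothetical: alert at 1.8 cores requested


def classify_cpu_requests(requested_cores, capacity=NODE_CPU_CAPACITY_CORES):
    """Classify a node's total CPU requests against its capacity."""
    if requested_cores >= ALERT_FRACTION * capacity:
        return "alert"
    if requested_cores >= WARNING_FRACTION * capacity:
        return "warning"
    return "ok"


# Made-up per-node totals of CPU requests, in cores.
requests_by_node = {"node-a": 1.2, "node-b": 1.6, "node-c": 1.85}
for node, cores in requests_by_node.items():
    print(node, classify_cpu_requests(cores))
```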

Datadog includes a number of checks based on Kubernetes indicators, such as node status, which you can also use to define alerts. For example, below we're looking at the same group of nodes as before, but this time we're setting an alert to notify us when any node in that group fails the check three consecutive times (to give EKS a chance to reload the node if possible). The alert will automatically resolve once the same node passes the check two consecutive times.

EKS monitoring with Datadog service check alert
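The trigger and recovery rule described above (alert after three consecutive failures, resolve after two consecutive passes) can be modeled as a small state machine. This is only an illustration of the logic, not Datadog's implementation, and the class name `CheckMonitor` is hypothetical.

```python
class CheckMonitor:
    """Track one node's check results and decide whether it is alerting."""

    def __init__(self, fail_threshold=3, recover_threshold=2):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.consecutive_failures = 0
        self.consecutive_passes = 0
        self.alerting = False

    def record(self, passed):
        """Record one check result and return the current alert state."""
        if passed:
            self.consecutive_passes += 1
            self.consecutive_failures = 0
            # Resolve only after enough passes in a row.
            if self.alerting and self.consecutive_passes >= self.recover_threshold:
                self.alerting = False
        else:
            self.consecutive_failures += 1
            self.consecutive_passes = 0
            # Trigger only after enough failures in a row.
            if self.consecutive_failures >= self.fail_threshold:
                self.alerting = True
        return self.alerting


monitor = CheckMonitor()
results = [False, False, True, False, False, False, True, True]
states = [monitor.record(result) for result in results]
print(states)
```

In this run, the first two failures do not trigger because a pass interrupts the streak; the alert fires on the third consecutive failure and clears on the second consecutive pass.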

Add smarts to your alerts

In addition to threshold alerts tied to specific metrics, you can also create machine-learning-driven alerts. For example, forecasting tracks metric trends in order to predict and reveal possible future problems. You can create a forecast alert to predict when, based on historical data, available disk space on a PersistentVolume will fall below a certain threshold, providing ample time to add resources before you actually run out of disk space.
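The idea behind a forecast alert can be illustrated with a simple least-squares line fitted to recent free-space readings and extended forward to the threshold. This is a deliberately simplified sketch and should not be read as Datadog's forecasting algorithm; the function names and the sample data are hypothetical.

```python
def fit_line(times, values):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    variance = sum((t - mean_t) ** 2 for t in times)
    slope = covariance / variance
    return slope, mean_v - slope * mean_t


def hours_until_threshold(times, values, threshold):
    """Estimate hours from the last sample until values drop to threshold.

    Returns None when the fitted trend is flat or rising.
    """
    slope, intercept = fit_line(times, values)
    if slope >= 0:
        return None  # free space is not shrinking
    crossing_time = (threshold - intercept) / slope
    return max(crossing_time - times[-1], 0.0)


# Made-up readings: free disk space in GiB, sampled hourly.
hours = [0, 1, 2, 3, 4]
free_gib = [50.0, 48.0, 46.0, 44.0, 42.0]
remaining = hours_until_threshold(hours, free_gib, threshold=10.0)
print(remaining)
```

With these sample readings the volume loses 2 GiB per hour, so the sketch estimates 16 hours until free space reaches the 10 GiB threshold.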

Datadog's Watchdog automatically detects anomalies in your application performance metrics without any manual configuration, surfacing abnormal behavior in services across your infrastructure.

Datadog alerts integrate with notification services like PagerDuty and Slack, letting you easily notify the right teams. You can read more about how to use Datadog's alerts in our documentation.

Dive deeper into your EKS cluster

In this post, we've gone over how to use Datadog to gain deep visibility into your EKS cluster and the applications running on it. The Datadog Agent aggregates Kubernetes state metrics and host- and container-level resource metrics from all of the nodes in your cluster. Enabling Datadog's AWS integrations lets you pull in CloudWatch metrics and events across your AWS services. These together with Datadog's Autodiscovery allow you to monitor, visualize, and alert on all of the key EKS metrics as well as logs, distributed request traces, and other data across a dynamic, containerized environment.

If you don't yet have a Datadog account, you can sign up for a 14-day free trial and start monitoring your EKS clusters today.

Acknowledgment

We wish to thank our friends at AWS for their technical review of this series.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.
