How to monitor Google Kubernetes Engine with Datadog

Published: December 19, 2017

Google Kubernetes Engine (GKE), a service on the Google Cloud Platform (GCP), is a hosted platform for running and orchestrating containerized applications. Similar to Amazon’s Elastic Container Service (ECS), GKE manages Docker containers deployed on a cluster of machines. However, unlike ECS, GKE uses Kubernetes, an increasingly popular open source orchestrator that can deploy, schedule, and scale containers on the fly.

A container cluster on GKE typically comprises multiple Google Compute Engine (GCE) instances (known as nodes) running groups of one or more Docker containers (known as pods). A cluster master runs Kubernetes processes that orchestrate the nodes and pods within the cluster.

With GKE’s Cluster Autoscaler and Kubernetes’s Horizontal Pod Autoscaler, the number of nodes in the cluster and the number of pod replicas can scale dynamically to match demand while limiting resource usage.

Architecture of a typical system running on GKE

Get visibility into your Google Kubernetes Engine cluster

Due to Kubernetes’s compartmentalized nature and dynamic scheduling, it can be difficult to diagnose points of failure or track down other issues in your infrastructure. By collecting metrics and other application performance data from across your cluster—such as CPU and memory usage, container and pod events, network throughput, and individual request traces—you can be ready to tackle any issue you might encounter.

This article will guide you through obtaining a comprehensive view of the health and performance of your GKE cluster. We will walk through:

  1. Setting up Datadog’s GCE integration to start collecting node-level metrics from your cluster
  2. Deploying the Datadog Agent to collect metrics from Docker, Kubernetes, and your containerized applications
  3. Collecting custom metrics and distributed traces with application performance monitoring for more granular insights

Before you begin

If you already have a GKE container cluster up and running, and have configured the Google Cloud SDK so you can run kubectl commands locally, you can move straight on to monitoring your container cluster. Otherwise, use the steps below to create a three-node cluster so you can follow along.

First, make sure your role in your GCP project has the permissions needed to use GKE, and enable the Google Container Engine API for the project. (A project is a managed environment with its own resources and settings.) If your role is at least editor, you will most likely have the required permissions.

Next, install the Google Cloud SDK, and then the kubectl command line tool using the SDK, on your local computer. Once you pair the Cloud SDK with your GCP account, you can control your clusters directly from your local machine using kubectl.

Finally, create a small GKE cluster named “doglib,” with the ability to access the Cloud Datastore (a NoSQL document store) and other Google Cloud Platform APIs, by running the following command:

$ gcloud container clusters create doglib --num-nodes 3 --zone "us-central1-b" --scopes "cloud-platform"

Once your cluster is up and running, you are ready to start monitoring its health and performance.

Set up the GCE integration and dashboard

A quick and easy way to start monitoring your cluster is by gathering system-level resource metrics from all your nodes. Because the nodes in a GKE cluster are just individual GCE instances, you can start collecting node-level metrics right away with Datadog’s GCE integration.

To install the integration, follow the steps listed in the Datadog documentation.

You can then access an out-of-the-box Google Compute Engine dashboard that displays key metrics such as disk I/O, CPU utilization, and network traffic.

Monitor your nodes with the out-of-the-box Google Compute Engine dashboard

Collect Docker and Kubernetes metrics

Although node-level metrics can be valuable for diagnosing some issues, such as CPU overutilization or disk throttling, collecting container- and pod-level metrics provides additional visibility into how your cluster and applications are performing. For example, these more specific metrics can help you pinpoint which application pod is causing high CPU usage on your nodes. As a result, you can zero in on potential resource issues faster, with a finer level of detail.

To start collecting these metrics from Docker and Kubernetes, you will need to deploy a containerized version of the Datadog Agent on your Kubernetes cluster. The Agent is deployed as a DaemonSet, which means that Kubernetes will ensure that each node in the cluster has a running copy of the Agent pod.

First, save a copy of the Datadog Agent manifest, dd-agent.yaml, to your local computer. A copy of the manifest, pre-filled with your primary Datadog API key, is available within your Datadog account. Then, deploy the Agent on your Kubernetes cluster by running the following command on your local machine:

$ kubectl create -f dd-agent.yaml

You can then check the status of the newly created pods by running:

$ kubectl get pods
NAME             READY     STATUS    RESTARTS   AGE
dd-agent-ng7b8   1/1       Running   0          5s
dd-agent-qqx00   1/1       Running   0          5s
dd-agent-tm1lz   1/1       Running   0          5s
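
Because the Agent runs as a DaemonSet, every node should host a running Agent pod. As a minimal sketch of how you might verify that, the helper below compares the JSON output of `kubectl get pods -o json` and `kubectl get nodes -o json`. The function name `nodes_missing_agent` and the `dd-agent` name prefix are assumptions made for illustration, not part of any Datadog tooling.

```python
import json


def nodes_missing_agent(pods_json, nodes_json, agent_prefix="dd-agent"):
    """Return the sorted names of nodes that have no running Agent pod.

    Both arguments are JSON strings as printed by `kubectl get ... -o json`.
    """
    pods = json.loads(pods_json)["items"]
    nodes = json.loads(nodes_json)["items"]

    # Nodes that host at least one Agent pod in the Running phase
    covered = {
        pod["spec"].get("nodeName")
        for pod in pods
        if pod["metadata"]["name"].startswith(agent_prefix)
        and pod.get("status", {}).get("phase") == "Running"
    }
    all_nodes = {node["metadata"]["name"] for node in nodes}
    return sorted(all_nodes - covered)
```

An empty list means each node is covered. Any names returned are nodes where the Agent pod has not been scheduled or is not yet running.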

Once the Agent pods are up and running, you can check the status of an individual Agent pod by running the datadog-agent info command, as shown below (replacing dd-agent-ng7b8 with your pod name):

$ kubectl exec dd-agent-ng7b8 -- service datadog-agent info
...

  Checks
  ======
  ...
  
    kubernetes (5.18.1)
    -------------------
      - instance #0 [OK]
      - Collected 120 metrics, 0 events & 3 service checks
  
    ...
  
    docker_daemon (5.18.1)
    ----------------------
      - instance #0 [OK]
      - Collected 152 metrics, 0 events & 1 service check
...

If all looks well, head over to the out-of-the-box dashboard for Docker. You can now see valuable container metrics, such as CPU utilization broken down by container, as well as an overview of which container images are running in your clusters.

Monitor your containers with the out-of-the-box Docker dashboard

In addition, with Datadog’s new Live Container monitoring, you can easily inspect your entire container infrastructure in one single hub. The Live Container view allows you to filter, sort, and aggregate your container metrics to better monitor the health and performance of your containers.

Monitor your entire container infrastructure quickly and easily
With Live Container monitoring, you can monitor important metrics from individual containers, or even compare and contrast metrics from your selected containers using summary graphs.

You can also view the out-of-the-box dashboard for Kubernetes, which will populate with Kubernetes metrics automatically once you deploy the dd-agent DaemonSet.

Next, we’ll show you how to configure the Agent to collect even more Kubernetes metrics by deploying the “kube-state-metrics” service.

Collect even more metrics with kube-state-metrics

To collect further Kubernetes metrics, such as the number of desired and available pods per deployment, you will need to deploy kube-state-metrics, a service that listens to the Kubernetes API server to generate additional data. With the Datadog Agent’s Autodiscovery feature, the Agent will automatically detect the kube-state-metrics service and start retrieving the additional metrics.

To set up kube-state-metrics, you will need to follow the instructions on the kube-state-metrics project page.

After setting up kube-state-metrics, you should start getting additional metrics in your Kubernetes dashboard.
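
To illustrate the kind of data kube-state-metrics adds, here is a rough sketch that parses its Prometheus-style text output and reports deployments with fewer available pods than desired. It assumes the metric names `kube_deployment_spec_replicas` and `kube_deployment_status_replicas_available`, which may differ across kube-state-metrics versions. The function name is hypothetical, and the simple regular expressions do not handle escaped quotes in label values.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",...} 3
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

DESIRED = "kube_deployment_spec_replicas"
AVAILABLE = "kube_deployment_status_replicas_available"


def underreplicated_deployments(metrics_text):
    """Map (namespace, deployment) to (desired, available) where available < desired."""
    counts = defaultdict(dict)
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") not in (DESIRED, AVAILABLE):
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace"), labels.get("deployment"))
        counts[key][match.group("name")] = float(match.group("value"))

    result = {}
    for key, values in counts.items():
        desired = values.get(DESIRED)
        available = values.get(AVAILABLE, 0.0)
        if desired is not None and available < desired:
            result[key] = (desired, available)
    return result
```

In practice the Agent collects these values for you. The sketch only shows what "desired versus available" means at the level of the raw metrics.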

Monitor your pods with the out-of-the-box Kubernetes dashboard

Analyze application-level metrics

These node-level and pod-level metrics can point you in the right direction when an issue occurs, but to better identify issues and understand root causes, you often need to look at the application itself. For example, high CPU usage may or may not be a concern on its own, but if you see increased application latency combined with saturated CPU, you will want to take action. If you have enough depth of visibility, you might even discover that a specific function in your call stack is to blame.

With the addition of application-level metrics, you can correlate and compare performance data across your infrastructure for better insights into the health and performance of your entire system.

Out of the box, the Datadog Agent will start collecting metrics from a number of common containerized services, such as Redis and Apache (httpd), thanks to its Autodiscovery feature. You can also configure the Agent to monitor any containerized services that require custom check configurations or authentication credentials, by defining configuration templates.

Here we will cover two complementary approaches for collecting data on application usage and performance:

  1. Instrumenting your app code to report custom metrics
  2. Configuring application performance monitoring (APM) with distributed tracing

Python sample application: dog-libs

For this guide, we have created a sample Python application called dog-libs. It is a containerized Flask application that takes a series of phrases and inserts them into a templated story. The application has been pre-configured to deploy as a Docker container on your GKE cluster.

To set up the sample application, you will first need to do the following:

After downloading the sample application to your local computer, enter your GCP Project ID in doglib-frontend.yaml.

To build and push the sample application as a Docker container to your private Google Container Registry, run the following commands in the root of the sample application directory, dog-libs:

$ docker build -t gcr.io/[YOUR_PROJECT_ID]/doglib .
$ gcloud docker -- push gcr.io/[YOUR_PROJECT_ID]/doglib

Then, you will need to deploy the application and load balancer service:

$ kubectl create -f doglib-frontend.yaml
$ kubectl create -f doglib-service.yaml

$ # Check the status of your newly created pods
$ kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
dd-agent-ng7b8                     1/1       Running   0          10m
dd-agent-qqx00                     1/1       Running   0          10m
dd-agent-tm1lz                     1/1       Running   0          10m
doglib-frontend-1525457026-htdg0   1/1       Running   0          5s
doglib-frontend-1525457026-txq3v   1/1       Running   0          5s
doglib-frontend-1525457026-vcq20   1/1       Running   0          5s

$ # Check the status of your newly created service
$ kubectl get service doglib-service
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
doglib-service   LoadBalancer   10.19.232.52   35.164.102.2   80:32402/TCP   5s

Visit the listed external IP in your browser to load the sample app (and feel free to complete some dog-libs):

Sample Python Flask application: dog-libs
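
If you would rather look up the external IP from a script than copy it from the table, one possible approach is to parse the output of `kubectl get service doglib-service -o json`. The helper name `external_ip` below is an assumption for illustration. It returns None while the load balancer is still being provisioned and no IP has been assigned.

```python
import json


def external_ip(service_json):
    """Return the first load balancer IP of a Service, or None while it is pending."""
    service = json.loads(service_json)
    # status.loadBalancer.ingress is absent or empty until an IP is assigned
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        if entry.get("ip"):
            return entry["ip"]
    return None
```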

Custom metrics monitoring

With minimal instrumentation, you can generate custom application metrics and send them to DogStatsD, a metric aggregator bundled in the Agent. Here we will use the datadogpy library for Python, one of the many Datadog instrumentation libraries.

To enable custom metrics monitoring on GKE, you must configure your application pods to communicate with your Agent pods. This requires you to expose port 8125/UDP on your Agent pods and expose the host IP on your application pods.

To expose the port, edit dd-agent.yaml so that port 8125/UDP is exposed:

ports:
  - containerPort: 8125
    hostPort: 8125
    name: dogstatsdport
    protocol: UDP

And then apply these changes to your Datadog Agent pods by running:

$ kubectl delete daemonset dd-agent
$ kubectl create -f dd-agent.yaml

The sample application is pre-configured to expose the host IP on your application pods. The application pod manifest, doglib-frontend.yaml, sets the DATADOG_AGENT_HOST_IP environment variable to the host IP. It also sets MY_POD_NAME to the pod name, which is used to identify the source of each metric, since multiple copies of the same pod can run on the same node:

env:
- name: DATADOG_AGENT_HOST_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
- name: MY_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name

To send metrics to the Agent on the correct host, the application reads the host IP from the DATADOG_AGENT_HOST_IP environment variable. In this example, we have added a simple counter that increases by one every time someone visits the index page:

import os
from datadog import initialize, statsd

# ...

# Point DogStatsD at the Agent running on this pod's node
options = {
    'statsd_host': os.environ['DATADOG_AGENT_HOST_IP'],
    'statsd_port': 8125,
}

initialize(**options)

@app.route('/', methods=['GET'])
def view_index():
    # Count each visit to the index page, tagged with the serving pod
    statsd.increment("doglib.index_web_counter", tags=['pod_name:' + os.environ['MY_POD_NAME']])
    # ...

After visiting the index page a few times, the metric data should start rolling in. Since we have added a tag that attaches the name of the pod to our metrics, we can easily aggregate visits by pod in Datadog.
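
To show what travels over port 8125/UDP, here is a small sketch that builds a DogStatsD counter datagram by hand using only the standard library. The helper names `format_counter` and `send_datagram` are made up for this illustration. In the sample application, the datadogpy library handles this for you.

```python
import socket


def format_counter(name, value=1, tags=None):
    """Build a DogStatsD counter datagram: <name>:<value>|c|#<tag1>,<tag2>."""
    payload = "{}:{}|c".format(name, value)
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_datagram(payload, host, port=8125):
    """Send one datagram over UDP. Fire-and-forget: there is no delivery confirmation."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("utf-8"), (host, port))
    finally:
        sock.close()
```

Because UDP is connectionless, a send succeeds even when nothing is listening on the other end. This is why the host port must be exposed on the Agent pods before any metrics appear in Datadog.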

Visualize your custom metrics and sort them by tags

Integrate distributed tracing with APM

With Datadog’s APM, you can trace real requests as they propagate across distributed services and infrastructure components, allowing you to precisely determine the performance of your applications. A trace is, essentially, an end-to-end collection of spans that individually measure the amount of time that each function, or specified scope, in your call stack takes to complete. APM allows you to monitor the performance of every application and service in aggregate and at the request level, ensuring your system runs at peak performance.
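
To make the idea of spans concrete, the following standard-library sketch times nested scopes and records each one's parent. It is purely illustrative and is not the ddtrace API. The `Trace` class and the span names are hypothetical.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects finished spans for one request (illustrative, not the ddtrace API)."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # names of the spans currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        record = {"name": name, "parent": parent, "duration": None}
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the elapsed time even if the wrapped code raises
            record["duration"] = time.perf_counter() - start
            self._stack.pop()
            self.spans.append(record)


trace = Trace()
with trace.span("web.request"):
    with trace.span("render_story"):
        time.sleep(0.01)  # stand-in for real work
```

Each inner span finishes before its parent, so the parent's duration always includes the time spent in its children. A flame graph visualizes this same nesting.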

You can use ddtrace-run, a command line wrapper application included in Datadog’s Python module, to automatically trace many web frameworks and database modules—without requiring any code changes. To trace the sample application, in the Dockerfile, we simply wrap the call that would typically start the web server with ddtrace-run:

# ...
CMD ddtrace-run gunicorn -b 0.0.0.0:$PORT main:app

Configuring the application to enable tracing is similar to setting up custom metrics monitoring—you will need to expose port 8126/TCP on your Agent pod. In addition, you will need to enable tracing in the Agent by setting the DD_APM_ENABLED environment variable to “true” in the Datadog Agent manifest, dd-agent.yaml:

ports:

# ...

  - containerPort: 8126
    hostPort: 8126
    name: datadogtracer
    protocol: TCP

# ...

env:
# ...
- name: DD_APM_ENABLED
  value: "true"

Apply these changes to your Datadog Agent pods by running:

$ kubectl delete daemonset dd-agent
$ kubectl create -f dd-agent.yaml

The wrapper application is configured by the environment variables set in the application pod manifest, doglib-frontend.yaml. Specifically, ddtrace-run uses DATADOG_TRACE_AGENT_HOSTNAME to connect to the Agent. The other environment variables are used to categorize and tag the traces.

env:

# ...
- name: DATADOG_TRACE_AGENT_HOSTNAME
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
- name: DATADOG_ENV
  value: "doglib"
- name: DATADOG_SERVICE_NAME
  value: "doglib-frontend"
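
Since a missing variable can be hard to notice once the pod is running, you might add a quick check at startup. The sketch below is a hypothetical helper, not part of ddtrace. It reports which of the three variables from the manifest above are unset or empty.

```python
import os

# Variables set in doglib-frontend.yaml for ddtrace-run
REQUIRED_TRACE_VARS = (
    "DATADOG_TRACE_AGENT_HOSTNAME",
    "DATADOG_ENV",
    "DATADOG_SERVICE_NAME",
)


def missing_trace_settings(environ=None):
    """Return the names of required tracing variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_TRACE_VARS if not environ.get(name)]
```

Logging the result when the application starts makes a misconfigured manifest visible right away.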

To differentiate traces from one pod to another, the sample application adds a pod_name tag to the tracer:

import os
from ddtrace import tracer

# ...

tracer.set_tags({"pod_name": os.environ.get("MY_POD_NAME")})

Now, if you visit any page on the website, your application will start sending traces to the Agent. You can then filter by the environment, “doglib”, and view your detailed request traces in Datadog APM. Note that the Agent samples traces by default, so you may need to make multiple requests before your traces start to appear.

Flame Graph of APM Traces
In Datadog APM, flame graphs visualize how a request was executed, so you can see which parts of the application are contributing the most to overall latency and dive into the details of each individual call or query.

Putting it all together

In this post we have set up monitoring to provide insight into every facet of your GKE infrastructure:

  1. Node-level monitoring with Datadog’s GCE integration
  2. Container- and pod-level monitoring with the Docker and Kubernetes integrations
  3. Application-level monitoring with custom metrics and APM

Even though we have used a small Kubernetes cluster and a simple Python app in this guide, you can apply the same steps to start monitoring your production infrastructure and applications quickly. You can then build dashboards with the metrics that matter most to you, set up flexible alerting, configure anomaly detection, and more to meet the specific needs of your organization. And with over 250 integrations with popular technologies, you can monitor and correlate key metrics and events across your complex infrastructure.

Get started

If you are already using Datadog, you can start monitoring not only Google Kubernetes Engine, but the entire Google Cloud Platform by following our integration guide. If you are not using Datadog yet and want to gain insight into the health and performance of your infrastructure and applications, you can get started by signing up.