Google Kubernetes Engine (GKE), a service on the Google Cloud Platform (GCP), is a hosted platform for running and orchestrating containerized applications. Similar to Amazon’s Elastic Container Service (ECS), GKE manages Docker containers deployed on a cluster of machines. However, unlike ECS, GKE uses Kubernetes, an increasingly popular open source orchestrator that can deploy, schedule, and scale containers on the fly.
A container cluster on GKE typically comprises multiple Google Compute Engine (GCE) instances (known as nodes) running groups of one or more Docker containers (known as pods). A control plane runs Kubernetes processes that orchestrate the nodes and pods within the cluster.
With GKE’s Cluster Autoscaler and Kubernetes’s Horizontal Pod Autoscaler, the size of the cluster and the number of pod replicas behind each of your workloads will scale dynamically to ensure peak performance with minimal resource usage.
Get visibility into your Google Kubernetes Engine cluster
Due to Kubernetes’s compartmentalized nature and dynamic scheduling, it can be difficult to diagnose points of failure or track down other issues in your infrastructure. By collecting metrics and other application performance data from across your cluster—such as CPU and memory usage, container and pod events, network throughput, and individual request traces—you can be ready to tackle any issue you might encounter.
This article will guide you through obtaining a comprehensive view of the health and performance of your GKE cluster. We will walk through:
- Setting up Datadog’s GCE integration to start collecting node-level metrics from your cluster
- Deploying the Datadog Agent to collect metrics from Docker, Kubernetes, and your containerized applications
- Collecting custom metrics and distributed traces with application performance monitoring for more granular insights
Note that we will focus on GKE Standard in this article. You can head over to our dedicated post to learn more about monitoring GKE Autopilot.
Before you begin
If you already have a GKE container cluster up and running, and have configured the Google Cloud SDK so you can run kubectl commands locally, you can move straight on to monitoring your container cluster. Otherwise, use the steps below to create a three-node cluster so you can follow along.
First, you will need to ensure that your role in your GCP project has the proper permissions to use GKE. You will also need to enable the Kubernetes Engine API (formerly the Container Engine API) for your project. A project is simply a managed environment with its own resources and settings. If your role is at least editor, you will most likely have the required permissions.
Next, install the Google Cloud SDK on your local computer, and then use the SDK to install the kubectl command line tool. Once you pair the Cloud SDK with your GCP account, you can control your clusters directly from your local machine using kubectl.
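With the SDK installed, you can also enable the required API from the command line instead of the Cloud Console. The service name below is the one GKE currently uses:

$ gcloud services enable container.googleapis.com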
Finally, create a small GKE cluster named “doglib,” with the ability to access the Cloud Datastore (a NoSQL document store) and other Google Cloud Platform APIs, by running the following command:
$ gcloud container clusters create doglib --num-nodes 3 --zone "us-central1-b" --scopes "cloud-platform"
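Creating the cluster normally configures kubectl credentials for it automatically. If it does not, or if you created the cluster elsewhere, you can fetch the credentials explicitly and verify connectivity with standard gcloud and kubectl commands:

$ gcloud container clusters get-credentials doglib --zone "us-central1-b"
$ kubectl get nodes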
Once your cluster is up and running, you are ready to start monitoring its health and performance.
Set up the GCE integration and dashboard
A quick and easy way to start monitoring your cluster is by gathering system-level resource metrics from all your nodes. Because the nodes in a GKE cluster are just individual GCE instances, you can start collecting node-level metrics right away with Datadog’s GCE integration.
To install the integration, follow the steps listed in the Datadog documentation.
You can then access an out-of-the-box Google Compute Engine dashboard that displays key metrics such as disk I/O, CPU utilization, and network traffic.
Collect Docker and Kubernetes metrics
Although node-level metrics can be valuable for diagnosing some issues, such as CPU overutilization or disk throttling, collecting container- and pod-level metrics provides additional visibility into how your cluster and applications are performing. For example, these more specific metrics can help you pinpoint which application pod is causing high CPU usage on your nodes. As a result, you can zero in on potential resource issues faster, with a finer level of detail.
To start collecting these metrics from Docker and Kubernetes, you will need to deploy a containerized version of the Datadog Agent on your Kubernetes cluster. The Agent is deployed as a DaemonSet, which means that Kubernetes will ensure that each node in the cluster has a running copy of the Agent pod.
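For reference, the rough shape of such a DaemonSet is sketched below. This is a simplified illustration rather than the actual Datadog manifest (which you will download in a later step); the image tag is a placeholder, and the service account name assumes the one created by the RBAC manifests in the next step.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dd-agent
spec:
  selector:
    matchLabels:
      app: dd-agent
  template:
    metadata:
      labels:
        app: dd-agent
    spec:
      serviceAccountName: datadog-agent   # created by the RBAC manifests below
      containers:
        - name: dd-agent
          image: datadog/agent:latest     # placeholder; use the image pinned in the official manifest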
The Datadog Agent requires the proper RBAC permissions to authorize data collection from the Kubernetes API. This means setting up a service account, a ClusterRole with the proper permissions, and a ClusterRoleBinding to link them. Datadog provides these manifests in the datadog-agent repository on GitHub. Deploy them with the following commands:
$ kubectl create -f "https://raw.githubusercontent.com/DataDog/datadog-agent/master/Dockerfiles/manifests/rbac/clusterrole.yaml"
$ kubectl create -f "https://raw.githubusercontent.com/DataDog/datadog-agent/master/Dockerfiles/manifests/rbac/serviceaccount.yaml"
$ kubectl create -f "https://raw.githubusercontent.com/DataDog/datadog-agent/master/Dockerfiles/manifests/rbac/clusterrolebinding.yaml"
Next, we highly recommend creating a secret that contains your Datadog API key. Create one with the following:
$ kubectl create secret generic datadog-secret --from-literal api-key="<YOUR_API_KEY>"
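The stock dd-agent.yaml manifest typically sets the API key directly as an environment variable. If you prefer to use the secret created above, the Agent container can reference it with a standard secretKeyRef. A minimal sketch (the exact variable name depends on your Agent version: API_KEY in Agent 5 manifests, DD_API_KEY in newer ones):

env:
  - name: DD_API_KEY           # or API_KEY, depending on the manifest version
    valueFrom:
      secretKeyRef:
        name: datadog-secret
        key: api-key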
We are now ready to deploy the Agent. First, save a copy of the Datadog Agent manifest, dd-agent.yaml, to your local computer. A copy of the manifest is available within your Datadog account. Then, deploy the Agent on your Kubernetes cluster by running the following command on your local machine:
$ kubectl create -f dd-agent.yaml
You can then check the status of the newly created pods by running:
$ kubectl get pods
NAME             READY     STATUS    RESTARTS   AGE
dd-agent-ng7b8   1/1       Running   0          5s
dd-agent-qqx00   1/1       Running   0          5s
dd-agent-tm1lz   1/1       Running   0          5s
Once the Agent pods are up and running, you can check the status of an individual Agent pod by running the datadog-agent info command, as shown below (replacing dd-agent-ng7b8 with your pod name):
$ kubectl exec dd-agent-ng7b8 -- service datadog-agent info
...
Checks
======
...
kubernetes (5.18.1)
-------------------
- instance #0 [OK]
- Collected 120 metrics, 0 events & 3 service checks
...
docker_daemon (5.18.1)
----------------------
- instance #0 [OK]
- Collected 152 metrics, 0 events & 1 service check
...
If all looks well, head over to the out-of-the-box dashboard for Docker. You can now see valuable container metrics, such as CPU utilization broken down by container, as well as an overview of which container images are running in your clusters.
In addition, with Datadog’s Live Container monitoring, you can inspect your entire container infrastructure in one place. The Live Container view allows you to filter, sort, and aggregate your container metrics to better monitor the health and performance of your containers.
You can also view the out-of-the-box dashboard for Kubernetes, which will populate with Kubernetes metrics automatically once you deploy the dd-agent DaemonSet.
Next, we’ll show you how to configure the Agent to collect even more Kubernetes metrics by deploying the “kube-state-metrics” service.
Collect even more metrics with kube-state-metrics
To collect further Kubernetes metrics, such as the number of desired and available pods per deployment, you will need to deploy kube-state-metrics, a service that listens to the Kubernetes API server to generate additional data. With the Datadog Agent’s Autodiscovery feature, the Agent will automatically detect the kube-state-metrics service and start retrieving the additional metrics.
To set up kube-state-metrics, download the manifests from the kube-state-metrics GitHub repository, then apply them to your cluster:
$ kubectl apply -f path/to/manifests/folder
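Before checking the dashboard, you can confirm that kube-state-metrics is running. The exact namespace depends on the manifests you applied; the standard examples deploy it into kube-system:

$ kubectl get deployment kube-state-metrics --namespace kube-system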
After deploying kube-state-metrics, you should start getting additional metrics in your Kubernetes dashboard.
Analyze application-level metrics
These node-level and pod-level metrics can point you in the right direction when an issue occurs, but to better identify issues and understand root causes, you often need to look at the application itself. For example, high CPU usage may or may not be a concern on its own, but if you see increased application latency combined with saturated CPU, you will want to take action. If you have enough depth of visibility, you might even discover that a specific function in your call stack is to blame.
With the addition of application-level metrics, you can correlate and compare performance data across your infrastructure for better insights into the health and performance of your entire system.
Out of the box, the Datadog Agent will start collecting metrics from a number of common containerized services, such as Redis and Apache (httpd), thanks to its Autodiscovery feature. You can also configure the Agent to monitor any containerized services that require custom check configurations or authentication credentials, by defining configuration templates.
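To illustrate what such a configuration template can look like: newer versions of the Agent can read Autodiscovery templates directly from pod annotations. The sketch below is not part of this article's setup (the annotation keys follow the ad.datadoghq.com format used by Agent 6 and later); it tells the Agent to run its Redis check against a Redis container:

apiVersion: v1
kind: Pod
metadata:
  name: redis
  annotations:
    ad.datadoghq.com/redis.check_names: '["redisdb"]'
    ad.datadoghq.com/redis.init_configs: '[{}]'
    ad.datadoghq.com/redis.instances: '[{"host": "%%host%%", "port": "6379"}]'
spec:
  containers:
    - name: redis
      image: redis:latest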
Here we will cover two complementary approaches for collecting data on application usage and performance:
- Instrumenting your app code to report custom metrics
- Configuring application performance monitoring (APM) with distributed tracing
Python sample application: dog-libs
For this guide, we have created a sample Python application called dog-libs. It is a containerized Flask application that takes a series of phrases and inserts them into a templated story. The application has been pre-configured to deploy as a Docker container on your GKE cluster.
To set up the sample application, you will first need to do the following:
- Install Docker on your local machine
- Clone the dog-libs repo on your local machine
- Ensure you can create an entity in Cloud Datastore
After downloading the sample application to your local computer, enter your GCP Project ID in doglib-frontend.yaml.
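Assuming the manifest references the container image you will build and push in the next step, the project ID most likely appears in the image field, for example:

containers:
  - name: doglib-frontend
    image: gcr.io/[YOUR_PROJECT_ID]/doglib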
To build and push the sample application as a Docker container to your private Google Container Registry, run the following commands in the root of the sample application directory, dog-libs:
$ docker build -t gcr.io/[YOUR_PROJECT_ID]/doglib .
$ gcloud docker -- push gcr.io/[YOUR_PROJECT_ID]/doglib
Then, you will need to deploy the application and load balancer service:
$ kubectl create -f doglib-frontend.yaml
$ kubectl create -f doglib-service.yaml
$ # Check the status of your newly created pods
$ kubectl get pods
NAME                               READY     STATUS    RESTARTS   AGE
dd-agent-ng7b8                     1/1       Running   0          10m
dd-agent-qqx00                     1/1       Running   0          10m
dd-agent-tm1lz                     1/1       Running   0          10m
doglib-frontend-1525457026-htdg0   1/1       Running   0          5s
doglib-frontend-1525457026-txq3v   1/1       Running   0          5s
doglib-frontend-1525457026-vcq20   1/1       Running   0          5s
$ # Check the status of your newly created service
$ kubectl get service doglib-service
NAME             TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)        AGE
doglib-service   LoadBalancer   10.19.232.52   35.164.102.2   80:32402/TCP   5s
Visit the listed external IP in your browser to load the sample app (and feel free to complete some dog-libs).
Custom metrics monitoring
With minimal instrumentation, you can generate custom application metrics and send them to DogStatsD, a metric aggregator bundled in the Agent. Here we will use the datadogpy library for Python, one of the many Datadog instrumentation libraries.
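The sample application already bundles this dependency; if you are instrumenting your own Python service, the library is published on PyPI as the datadog package:

$ pip install datadog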
To enable custom metrics monitoring on GKE, you must configure your application pods to communicate with your Agent pods. This requires exposing port 8125/UDP on your Agent pods and making the host IP available to your application pods.
To expose the port, edit dd-agent.yaml so that the Agent container publishes port 8125/UDP on the host:
ports:
  - containerPort: 8125
    hostPort: 8125
    name: dogstatsdport
    protocol: UDP
And then apply these changes to your Datadog Agent pods by running:
$ kubectl delete daemonset dd-agent
$ kubectl create -f dd-agent.yaml
The sample application is pre-configured to make the host IP available to your application pods: the application pod manifest, doglib-frontend.yaml, uses the Kubernetes Downward API to set the DATADOG_AGENT_HOST_IP environment variable to the node's host IP. The manifest also sets MY_POD_NAME to the pod name, which is used to identify the source of each metric, since multiple copies of the same pod can run on the same node:
env:
  - name: DATADOG_AGENT_HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
Then, to configure the application to send metrics to the Agent on the correct host, we programmatically retrieve the host IP stored in the DATADOG_AGENT_HOST_IP environment variable. In this example, we have added a simple counter that increments every time someone visits the index page:
import os
from datadog import initialize, statsd
# ...
options = {
    'statsd_host': os.environ['DATADOG_AGENT_HOST_IP'],
    'statsd_port': 8125
}

initialize(**options)

@app.route('/', methods=['GET'])
def view_index():
    statsd.increment("doglib.index_web_counter", tags=['pod_name:' + os.environ['MY_POD_NAME']])
    # ...
After visiting the index page a few times, the metric data should start rolling in. Since we have added a tag that attaches the name of the pod to our metrics, we can easily aggregate visits by pod in Datadog.
Integrate distributed tracing with APM
With Datadog’s APM, you can trace real requests as they propagate across distributed services and infrastructure components, allowing you to precisely determine the performance of your applications. A trace is, essentially, an end-to-end collection of spans that individually measure the amount of time that each function, or specified scope, in your call stack takes to complete. APM allows you to monitor the performance of every application and service in aggregate and at the request level, ensuring your system runs at peak performance.
You can use ddtrace-run, a command line wrapper included in Datadog's Python tracing library, to automatically trace many web frameworks and database modules without requiring any code changes. To trace the sample application, we simply wrap the command that would typically start the web server in the Dockerfile with ddtrace-run:
# ...
CMD ddtrace-run gunicorn -b 0.0.0.0:$PORT main:app
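For ddtrace-run to be available in the container, the image needs the ddtrace package installed (it is published on PyPI). In the sample app this is presumably handled along with the other Python dependencies; if you are adapting your own Dockerfile, the install step looks like the following:

RUN pip install ddtrace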
Configuring the application to enable tracing is similar to setting up custom metrics monitoring: you will need to expose port 8126/TCP on your Agent pods. In addition, you will need to enable tracing in the Agent by setting the DD_APM_ENABLED environment variable to "true" in the Datadog Agent manifest, dd-agent.yaml:
ports:
  # ...
  - containerPort: 8126
    hostPort: 8126
    name: datadogtracer
    protocol: TCP
# ...
env:
  # ...
  - name: DD_APM_ENABLED
    value: "true"
Apply these changes to your Datadog Agent pods by running:
$ kubectl delete daemonset dd-agent
$ kubectl create -f dd-agent.yaml
The wrapper is configured by the environment variables set in the application pod manifest, doglib-frontend.yaml. Specifically, ddtrace-run uses DATADOG_TRACE_AGENT_HOSTNAME to connect to the Agent; the other environment variables are used to categorize and tag the traces:
env:
  # ...
  - name: DATADOG_TRACE_AGENT_HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: DATADOG_ENV
    value: "doglib"
  - name: DATADOG_SERVICE_NAME
    value: "doglib-frontend"
To differentiate traces from one pod to another, the sample application adds a pod_name tag to the tracer:
import os
from ddtrace import tracer
# ...
tracer.set_tags({"pod_name": os.environ.get("MY_POD_NAME")})
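Beyond the automatic instrumentation that ddtrace-run provides, you can also create custom spans around specific functions. Below is a minimal, hypothetical sketch (the render_story function is illustrative, not part of the sample app) using ddtrace's tracer.wrap decorator:

from ddtrace import tracer

@tracer.wrap(name="doglib.render_story", service="doglib-frontend")
def render_story(phrases):
    # time spent in this function is reported as a span within each request trace
    ...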
Now, if you visit any page on the website, your application will start sending traces to the Agent. You can then filter by the environment, “doglib”, and view your detailed request traces in Datadog APM. Note that the Agent samples traces by default, so you may need to make multiple requests before your traces start to appear.
Putting it all together
In this post we have set up monitoring to provide insight into every facet of your GKE infrastructure:
- Node-level monitoring with Datadog’s GCE integration
- Container- and pod-level monitoring with the Docker and Kubernetes integrations
- Application-level monitoring with custom metrics and APM
Even though we have used a small Kubernetes cluster and a simple Python app in this guide, you can apply the same steps to start monitoring your production infrastructure and applications quickly. You can then build dashboards with the metrics that matter most to you, set up flexible alerting, configure anomaly detection, and more to meet the specific needs of your organization. And with over 600 integrations with popular technologies, you can monitor and correlate key metrics and events across your complex infrastructure.
Start monitoring GKE with Datadog today
If you are already using Datadog, you can start monitoring not only Google Kubernetes Engine, but the entire Google Cloud Platform by following our integration guide. If you are not using Datadog yet and want to gain insight into the health and performance of your infrastructure and applications, you can get started by signing up for a 14-day free trial.