Collecting metrics with built-in Kubernetes monitoring tools

Author Jean-Mathieu Saponaro
@JMSaponaro
Author John Matson
@jmtsn

Last updated: March 10, 2020

In the previous post in this series, we dug into the data you should track so you can properly monitor your Kubernetes cluster. Next, you will learn how you can start inspecting your Kubernetes metrics and logs using free, open source tools.

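In this post we’ll cover several ways of retrieving and viewing observability data from your Kubernetes cluster: querying resource metrics from Kubernetes objects, browsing cluster objects in Kubernetes Dashboard, collecting high-level cluster status metrics, and viewing pod logs.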

Collect resource metrics from Kubernetes objects

Resource metrics track the utilization and availability of critical resources such as CPU, memory, and storage. Kubernetes provides a Metrics API and a number of command line queries that allow you to retrieve snapshots of resource utilization with relative ease.

First things first: Deploy Metrics Server

Before you can query the Kubernetes Metrics API or run kubectl top commands to retrieve metrics from the command line, you’ll need to ensure that Metrics Server is deployed to your cluster. As detailed in Part 2, Metrics Server is a cluster add-on that collects resource usage data from each node and provides aggregated metrics through the Metrics API. Metrics Server makes resource metrics such as CPU and memory available for users to query, as well as for the Kubernetes Horizontal Pod Autoscaler to use for auto-scaling workloads.

Depending on how you run Kubernetes, Metrics Server may already be deployed to your cluster. For instance, Google Kubernetes Engine clusters include a Metrics Server deployment by default, whereas Amazon Elastic Kubernetes Service clusters do not. Run the following command using the kubectl command line utility to see if metrics-server is running in your cluster:

kubectl get pods --all-namespaces | grep metrics-server

If Metrics Server is already running, you’ll see details on the running pods, as in the response below:

kube-system   metrics-server-v0.3.1-57c75779f-8sm9r                       2/2     Running   0          16h
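If you want to script this check rather than read the output by eye, the following is a minimal Python sketch. It assumes the default column layout of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS, ...); the function name `find_metrics_server_pods` is hypothetical, and the header row in the sample text is added for illustration.

```python
def find_metrics_server_pods(get_pods_output):
    """Return metrics-server pods found in `kubectl get pods --all-namespaces` output."""
    pods = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Assumed columns: NAMESPACE NAME READY STATUS RESTARTS AGE
        if len(fields) < 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status = fields[:4]
        if name.startswith("metrics-server"):
            pods.append(
                {"namespace": namespace, "name": name, "ready": ready, "status": status}
            )
    return pods


# The pod line is the response shown above; the header row is illustrative.
SAMPLE_OUTPUT = """\
NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   metrics-server-v0.3.1-57c75779f-8sm9r   2/2     Running   0          16h
"""

for pod in find_metrics_server_pods(SAMPLE_OUTPUT):
    print(f"{pod['namespace']}/{pod['name']} is {pod['status']} ({pod['ready']} ready)")
```

An empty result means no matching pod was found in the text you passed in, which corresponds to the case where Metrics Server still needs to be deployed.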

If no pods are returned, you can deploy Metrics Server by cloning and applying a series of YAML manifests:

git clone https://github.com/kubernetes-sigs/metrics-server.git
cd metrics-server
kubectl create -f deploy/1.8+/

Use kubectl get to query the Metrics API

Once Metrics Server is deployed, you can query the Metrics API to retrieve current metrics from any node or pod using the below commands. You can find the name of your desired node or pod by running kubectl get nodes or kubectl get pods, respectively:

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/<NODE_NAME> | jq

kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/<NAMESPACE>/pods/<POD_NAME> | jq
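If you build these queries in a script, a small helper can assemble the paths for you. This is a sketch only: the function names are hypothetical, and it simply mirrors the two path patterns shown above.

```python
API_ROOT = "/apis/metrics.k8s.io/v1beta1"


def node_metrics_path(node_name):
    """Metrics API path for a single node."""
    return f"{API_ROOT}/nodes/{node_name}"


def pod_metrics_path(namespace, pod_name):
    """Metrics API path for a single pod in the given namespace."""
    return f"{API_ROOT}/namespaces/{namespace}/pods/{pod_name}"


def kubectl_raw_command(path):
    """Argument list for `kubectl get --raw <path>`, suitable for subprocess.run()."""
    return ["kubectl", "get", "--raw", path]


print(" ".join(kubectl_raw_command(pod_metrics_path("default", "busybox"))))
```

Returning an argument list rather than a single string lets you pass the command to `subprocess.run` without going through a shell.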

For example, the following command retrieves metrics on a busybox pod deployed in the default namespace:

kubectl get --raw /apis/metrics.k8s.io/v1beta1/namespaces/default/pods/busybox | jq

The Metrics API returns a JSON object, so (optionally) piping the response through jq displays the JSON in a more human-readable format:

{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "busybox",
    "namespace": "default",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/busybox",
    "creationTimestamp": "2019-12-10T18:23:20Z"
  },
  "timestamp": "2019-12-10T18:23:12Z",
  "window": "30s",
  "containers": [
    {
      "name": "busybox",
      "usage": {
        "cpu": "0",
        "memory": "364Ki"
      }
    }
  ]
}

If multiple containers are running in the same pod, the API response will include separate resource statistics for each container.
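To illustrate the per-container structure, here is a short Python sketch that reads a PodMetrics object and lists usage by container. The JSON is a trimmed copy of the busybox response above; the second `sidecar` container and its values are hypothetical, added only to show how a multi-container pod would appear.

```python
import json

# Trimmed busybox response, plus a hypothetical "sidecar" container
# (illustrative values, not real output).
POD_METRICS = json.loads("""
{
  "kind": "PodMetrics",
  "metadata": {"name": "busybox", "namespace": "default"},
  "containers": [
    {"name": "busybox", "usage": {"cpu": "0", "memory": "364Ki"}},
    {"name": "sidecar", "usage": {"cpu": "1m", "memory": "2Mi"}}
  ]
}
""")


def usage_by_container(pod_metrics):
    """Map each container name to its raw usage strings."""
    return {c["name"]: c["usage"] for c in pod_metrics["containers"]}


for name, usage in usage_by_container(POD_METRICS).items():
    print(f"{name}: cpu={usage['cpu']} memory={usage['memory']}")
```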

You can query the CPU and memory usage of a Kubernetes node with a similar command:

kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes/gke-john-m-research-default-pool-15c38181-m4xw | jq

{
  "kind": "NodeMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "gke-john-m-research-default-pool-15c38181-m4xw",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/nodes/gke-john-m-research-default-pool-15c38181-m4xw",
    "creationTimestamp": "2019-12-10T18:34:01Z"
  },
  "timestamp": "2019-12-10T18:33:41Z",
  "window": "30s",
  "usage": {
    "cpu": "62789706n",
    "memory": "641Mi"
  }
}
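The usage values in these responses are Kubernetes quantity strings: `62789706n` is a CPU value in nanocores, and `641Mi` is a memory value in mebibytes. The sketch below converts such strings to plain numbers. It handles only a subset of the suffixes Kubernetes accepts, and the function name `parse_quantity` is hypothetical.

```python
from decimal import Decimal

# Subset of Kubernetes quantity suffixes; others (such as "Ti" or "E")
# are not handled in this sketch.
_MULTIPLIERS = {
    "n": Decimal("1e-9"),
    "u": Decimal("1e-6"),
    "m": Decimal("1e-3"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}


def parse_quantity(quantity):
    """Convert a quantity string to a Decimal in base units (cores or bytes)."""
    # Check two-character suffixes ("Mi") before one-character ones ("M").
    for suffix in sorted(_MULTIPLIERS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * _MULTIPLIERS[suffix]
    return Decimal(quantity)


cpu_cores = parse_quantity("62789706n")   # node CPU usage from the response above
memory_bytes = parse_quantity("641Mi")    # node memory usage from the response above
print(f"CPU: {cpu_cores * 1000:.1f} millicores")
print(f"Memory: {memory_bytes} bytes")
```

For the node above, this works out to roughly 63 millicores of CPU, which is the unit that kubectl top uses when it reports CPU usage.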

View metric snapshots using kubectl top

Once Metrics Server is deployed, you can retrieve compact metric snapshots from the Metrics API using kubectl top. The kubectl top command returns current CPU and memory usage for a cluster’s pods or nodes, or for a particular pod or node if specified.

For example, you can run the following command to display a snapshot of near-real-time resource usage of all cluster nodes:

kubectl top node

NAME                                               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
gke-john-m-research-2-default-pool-42552c4a-fg80   89m          9%     781Mi           67%       
gke-john-m-research-2-default-pool-42552c4a-lx87   59m          6%     644Mi           55%       
gke-john-m-research-2-default-pool-42552c4a-rxmv   53m          5%     665Mi           57%

This output shows three worker nodes in a GKE cluster. Each line displays the total amount of CPU (in cores, or in this case m for millicores) and memory (in MiB) that the node is using, and the percentages of the node’s allocatable capacity those numbers represent.

Likewise, to query resource utilization by pod in the web-app namespace, run the command below (note that if you do not specify a namespace, the default namespace will be used):

kubectl top pod --namespace web-app

NAME                                CPU(cores)   MEMORY(bytes)
nginx-deployment-76bf4969df-65wmd   12m          1Mi
nginx-deployment-76bf4969df-mmqvt   16m          1Mi

You can also display a resource breakdown at the container level within pods by adding a --containers flag. The command below shows that one of our kube-dns pods, which run in the kube-system namespace, comprises four individual containers, and breaks down the pod’s resource usage among those containers:

kubectl top pod kube-dns-79868f54c5-58hq8 --namespace kube-system --containers

POD                         NAME               CPU(cores)   MEMORY(bytes)   
kube-dns-79868f54c5-58hq8   prometheus-to-sd   0m           6Mi             
kube-dns-79868f54c5-58hq8   sidecar            1m           10Mi            
kube-dns-79868f54c5-58hq8   kubedns            1m           7Mi             
kube-dns-79868f54c5-58hq8   dnsmasq            1m           5Mi             
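If you want pod-level totals from a container-level breakdown like this one, you can sum the rows yourself. The Python sketch below does so for the output above. It assumes CPU is always reported in millicores (`m`) and memory in `Mi`, as in this sample, and the function names are hypothetical.

```python
from collections import defaultdict

TOP_OUTPUT = """\
POD                         NAME               CPU(cores)   MEMORY(bytes)
kube-dns-79868f54c5-58hq8   prometheus-to-sd   0m           6Mi
kube-dns-79868f54c5-58hq8   sidecar            1m           10Mi
kube-dns-79868f54c5-58hq8   kubedns            1m           7Mi
kube-dns-79868f54c5-58hq8   dnsmasq            1m           5Mi
"""


def _strip_unit(value, unit):
    """Return the integer part of a value such as '10Mi', checking the unit."""
    if not value.endswith(unit):
        raise ValueError(f"expected a value in {unit!r}, got {value!r}")
    return int(value[: -len(unit)])


def sum_container_usage(top_output):
    """Total CPU (millicores) and memory (MiB) per pod across its containers."""
    totals = defaultdict(lambda: {"cpu_millicores": 0, "memory_mib": 0})
    for line in top_output.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0] == "POD":
            continue  # skip the header row and anything unexpected
        pod, _container, cpu, memory = fields
        totals[pod]["cpu_millicores"] += _strip_unit(cpu, "m")
        totals[pod]["memory_mib"] += _strip_unit(memory, "Mi")
    return dict(totals)


for pod, usage in sum_container_usage(TOP_OUTPUT).items():
    print(f"{pod}: {usage['cpu_millicores']}m CPU, {usage['memory_mib']}Mi memory")
```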

Query resource allocations with kubectl describe

If you want to see details about the resources that have been allocated to your nodes, rather than the current resource usage, the kubectl describe command provides a detailed breakdown of a specified pod or node. This can be particularly useful to list the resource requests and limits (as explained in Part 2) of all of the pods on a specific node. For example, to view details on one of the GKE hosts returned by the kubectl top node command above, you would run the following:

kubectl describe node gke-john-m-research-2-default-pool-42552c4a-fg80

The output is verbose, containing a full breakdown of the node’s workloads, system info, and metadata such as labels and annotations. Below, we’ll excerpt the workload portion of the output, which breaks down the resource requests and limits at the pod level, as well as for the entire node:

Non-terminated Pods:         (10 in total)
  Namespace                  Name                                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                           ------------  ----------  ---------------  -------------  ---
  default                    nginx-deployment-76bf4969df-65wmd                              100m (10%)    0 (0%)      0 (0%)           0 (0%)         4d23h
  kube-system                fluentd-gcp-scaler-59b7b75cd7-l8wdg                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d5h
  kube-system                fluentd-gcp-v3.2.0-77r6g                                       100m (10%)    1 (106%)    200Mi (17%)      500Mi (43%)    5d5h
  kube-system                kube-dns-79868f54c5-k4rpd                                      260m (27%)    0 (0%)      110Mi (9%)       170Mi (14%)    5d5h
  kube-system                kube-dns-autoscaler-bb58c6784-nxzzz                            20m (2%)      0 (0%)      10Mi (0%)        0 (0%)         5d5h
  kube-system                kube-proxy-gke-john-m-research-2-default-pool-42552c4a-fg80    100m (10%)    0 (0%)      0 (0%)           0 (0%)         5d5h
  kube-system                kubernetes-dashboard-57df4db6b-tlq7z                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         3d
  kube-system                metrics-server-v0.3.1-57c75779f-vg6rq                          48m (5%)      143m (15%)  105Mi (9%)       355Mi (30%)    5d5h
  kube-system                prometheus-to-sd-zsp8k                                         1m (0%)       3m (0%)     20Mi (1%)        20Mi (1%)      5d5h
  kubernetes-dashboard       kubernetes-dashboard-6fd7ddf9bb-66gkk                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d2h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests     Limits
  --------                   --------     ------
  cpu                        629m (66%)   1146m (121%)
  memory                     445Mi (38%)  1045Mi (90%)
  ephemeral-storage          0 (0%)       0 (0%)
  attachable-volumes-gce-pd  0            0

Note that kubectl describe returns the percentage of the node’s allocatable capacity that each resource request or limit represents. These statistics are not a measure of actual CPU or memory utilization, which is what kubectl top returns. (Because of this difference, the kubectl describe command will work even in the absence of Metrics Server.) In the above example, we see that the pod nginx-deployment-76bf4969df-65wmd has a CPU request of 100 millicores, accounting for roughly 10 percent of the CPU available on this one-core node. The percentages are calculated against allocatable capacity, which is slightly less than the full core, which is why the one-core CPU limit on the fluentd-gcp pod appears as 106%.
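The node-level totals under "Allocated resources" are the sums of the per-pod requests and limits. As a quick check, the Python sketch below adds up the values from the excerpt above. The helper names are hypothetical, and the parsing covers only the formats that appear in this excerpt (`m` for millicores, whole cores, and `Mi`).

```python
# (pod, CPU request, CPU limit, memory request, memory limit), from the excerpt above
PODS = [
    ("nginx-deployment-76bf4969df-65wmd", "100m", "0", "0", "0"),
    ("fluentd-gcp-scaler-59b7b75cd7-l8wdg", "0", "0", "0", "0"),
    ("fluentd-gcp-v3.2.0-77r6g", "100m", "1", "200Mi", "500Mi"),
    ("kube-dns-79868f54c5-k4rpd", "260m", "0", "110Mi", "170Mi"),
    ("kube-dns-autoscaler-bb58c6784-nxzzz", "20m", "0", "10Mi", "0"),
    ("kube-proxy-gke-john-m-research-2-default-pool-42552c4a-fg80", "100m", "0", "0", "0"),
    ("kubernetes-dashboard-57df4db6b-tlq7z", "0", "0", "0", "0"),
    ("metrics-server-v0.3.1-57c75779f-vg6rq", "48m", "143m", "105Mi", "355Mi"),
    ("prometheus-to-sd-zsp8k", "1m", "3m", "20Mi", "20Mi"),
    ("kubernetes-dashboard-6fd7ddf9bb-66gkk", "0", "0", "0", "0"),
]


def cpu_millicores(value):
    """'100m' -> 100, '1' -> 1000 (one full core), '0' -> 0."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(value) * 1000


def memory_mib(value):
    """'200Mi' -> 200, '0' -> 0."""
    if value == "0":
        return 0
    if not value.endswith("Mi"):
        raise ValueError(f"unsupported memory value: {value!r}")
    return int(value[:-2])


def node_totals(pods):
    """Sum per-pod requests and limits into node-level totals."""
    return {
        "cpu_requests_m": sum(cpu_millicores(p[1]) for p in pods),
        "cpu_limits_m": sum(cpu_millicores(p[2]) for p in pods),
        "memory_requests_mib": sum(memory_mib(p[3]) for p in pods),
        "memory_limits_mib": sum(memory_mib(p[4]) for p in pods),
    }


print(node_totals(PODS))
```

The sums come to 629m and 1146m of CPU and 445Mi and 1045Mi of memory, matching the cpu and memory rows in the output.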

Browse cluster objects in Kubernetes Dashboard

Kubernetes Dashboard is a web-based UI for monitoring and managing your cluster. Essentially, it is a graphical wrapper for the same functions that kubectl can provide: you can use Dashboard to deploy and manage applications, monitor Kubernetes objects, and more. Dashboard provides resource usage breakdowns for each node and pod, as well as detailed metadata about pods, services, Deployments, and other Kubernetes objects. Unlike kubectl top, Dashboard provides not only an instantaneous snapshot of resource usage but also some basic graphs tracking how those metrics have evolved over the previous 15 minutes.

Note that the metric graphs at the top of Dashboard’s main overview depend on the use of Heapster, which preceded Metrics Server as the primary source of resource usage data in Kubernetes. Heapster is officially deprecated, and its out-of-the-box deployment manifests will not work with some recent versions of Kubernetes. As of the time of this writing, a new version (2.0) of Dashboard is available, along with a lightweight Metrics Scraper to retrieve and store metrics from Metrics Server.

Install Dashboard

Installing Kubernetes Dashboard is fairly straightforward, as outlined in the project’s documentation. In fact, authenticating to Dashboard can be somewhat more complicated than actually deploying it, especially in a production environment. We’ll cover both installation and authentication below.

To install Kubernetes Dashboard, you can deploy the official manifest for the latest supported version. The command below deploys version 1.10.1, but you can check the GitHub page for the project to find the latest.

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml

Once Dashboard has been deployed, you can start an HTTP proxy to access its UI via a web browser:

kubectl proxy

Starting to serve on 127.0.0.1:8001

Once you see that the HTTP proxy is starting to serve requests, you can access Kubernetes Dashboard in a web browser at the proxy URL given in the project’s documentation. You’ll then be prompted with a login screen.

Generate an auth token to access Dashboard

In a demo environment, you can quickly generate a token to authenticate to Dashboard by following the instructions in the Dashboard project’s documentation. In short, the process involves creating an admin-user Service Account and an associated Cluster Role Binding, which grants admin permissions that allow the user to view all the data in Dashboard. Once you have created the Service Account and Cluster Role Binding, you can retrieve a valid token at any time (including after the initial session expires) by running the following command, which looks for the admin-user secret in the kubernetes-dashboard namespace (substitute the namespace in which you created the Service Account, if different):

kubectl --namespace kubernetes-dashboard describe secret $(kubectl --namespace kubernetes-dashboard get secret | grep admin-user | awk '{print $1}')

The output includes a token field under the Data section. You can copy the token value and paste it into the Dashboard authentication window:

[...]
Data
====
ca.crt:     1119 bytes
namespace:  20 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJhZG1pbi11c2VyLXRva2VuLXM4dGw5Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQubmFtZSI6ImFkbWluLXVzZXIiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiJlMjQ3OTE1ZC0xYzY2LTExZWEtOGJmMC00MjAxMGE4MDAxMjAiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6a3ViZXJuZXRlcy1kYXNoYm9hcmQ6YWRtaW4tdXNlciJ9.eVT-mDYfpjKQgJsmZmLsSUxMxbetNmd2FmPetQ_j5MprOcIRY84gd51k-7HtFWT3DyOV8t2zUoQz4ppG52Pzwygsg8lF1zOM81QcdrJMU7CEqmrvQS8HQWDiheeogR8CoauUD2RHRNxzQPELPktTM_sOdo73irh-R4mkdV9smYBOeqWe7CZM6gztrSJAe3ur07THTf-ZIG48TWmQztc0nXMllrqp8ehxoTBODvvXvnFjJ47LjeHU4_r4UKMAukxxxJN7wmiXVgwZJHsJiadb-RE3rKGrioa3EQyQEk6I3fLkII-C8_NbfyfiwSUn2Rvc41WljxIl1KcyHdDMfyc1tA
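If you would rather not copy the token by hand, a script can pull it out of the describe output. The Python sketch below assumes the `key: value` layout of the Data section shown above. The function name is hypothetical, and the token in the sample text is a short placeholder rather than a real credential.

```python
def extract_token(describe_output):
    """Return the value of the `token:` field, or None if it is not present."""
    for line in describe_output.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "token":
            return value.strip()
    return None


# Same layout as the output above, with a placeholder instead of a real token.
SAMPLE_DESCRIBE_OUTPUT = """\
Data
====
ca.crt:     1119 bytes
namespace:  20 bytes
token:      PLACEHOLDER.TOKEN.VALUE
"""

print(extract_token(SAMPLE_DESCRIBE_OUTPUT))
```

Because the token grants admin access in this setup, avoid writing it to logs or shared files.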

Once you deploy and log in to Kubernetes Dashboard, you’ll have access to metric summaries for each pod, node, and namespace in your cluster. You can also use the UI to edit Kubernetes objects—for instance, to scale up a Deployment or to change the image version in a pod’s specification.

Collect high-level cluster status metrics

In addition to monitoring the CPU and memory usage of cluster nodes and pods, you’ll need a way to collect metrics tracking the high-level status of the cluster and its constituent objects.

As covered in Part 2, the Kubernetes API server exposes data about the count, health, and availability of pods, nodes, and other Kubernetes objects. By installing the kube-state-metrics add-on in your cluster, you can consume these metrics more easily to help surface issues with cluster infrastructure, resource constraints, or pod scheduling.

Add kube-state-metrics to your cluster

The kube-state-metrics service provides additional cluster information that Metrics Server does not. Metrics Server exposes statistics about the resource utilization of Kubernetes objects, whereas kube-state-metrics listens to the Kubernetes API and generates metrics about the state of Kubernetes objects: node status, node capacity (CPU and memory), number of desired/available/unavailable/updated replicas per Deployment, pod status (e.g., waiting, running, ready), and so on. The kube-state-metrics docs detail all the metrics that are available once kube-state-metrics is deployed.

Deploy kube-state-metrics

The kube-state-metrics add-on runs as a Kubernetes Deployment with a single replica. To create the Deployment, service, and associated permissions, you can use a set of manifests from the official kube-state-metrics project. To download the manifests and apply them to your cluster, run the following series of commands:

git clone https://github.com/kubernetes/kube-state-metrics.git
cd kube-state-metrics
kubectl apply -f examples/standard

Note that some environments, including GKE clusters, have restrictive permissions settings that require a different installation approach. Details on deploying kube-state-metrics to GKE clusters and other restricted environments are available in the kube-state-metrics docs.

Collect cluster state metrics

Once kube-state-metrics is deployed to your cluster, it provides a vast array of metrics in text format on an HTTP endpoint. The metrics are exposed in Prometheus exposition format, so they can be easily consumed by any monitoring system that can collect Prometheus metrics. To browse the metrics, you can start an HTTP proxy:

kubectl proxy

Starting to serve on 127.0.0.1:8001

You can then view the text-based metrics at http://localhost:8001/api/v1/namespaces/kube-system/services/kube-state-metrics:http-metrics/proxy/metrics or by sending a curl request to the same endpoint:

curl localhost:8001/api/v1/namespaces/kube-system/services/kube-state-metrics:http-metrics/proxy/metrics

The list of returned metrics is very long—more than 1,000 lines of text at the time of this writing—so it is helpful to identify metric(s) of interest in the kube-state-metrics docs to grep or otherwise search for. For instance, the following command returns a metric definition for kube_node_status_capacity_cpu_cores, as well as the metric’s value for the sole node in a minikube cluster:

curl http://localhost:8001/api/v1/namespaces/kube-system/services/kube-state-metrics:http-metrics/proxy/metrics | grep kube_node_status_capacity_cpu_cores

# HELP kube_node_status_capacity_cpu_cores The total CPU resources of the node.
# TYPE kube_node_status_capacity_cpu_cores gauge
kube_node_status_capacity_cpu_cores{node="minikube"} 2
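Because the endpoint returns plain text in Prometheus exposition format, you can also parse it in a script rather than grepping. The following Python sketch extracts the samples for one metric from text like the output above. It uses simplified regular expressions that do not cover every detail of the format (for example, it does not unescape label values or read timestamps), and the function name `parse_samples` is hypothetical.

```python
import re

# Simplified patterns for lines such as: metric_name{label="value"} 2
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text, metric_name):
    """Return a list of (labels, value) pairs for the named metric."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((labels, float(match.group("value"))))
    return samples


METRICS_TEXT = """\
# HELP kube_node_status_capacity_cpu_cores The total CPU resources of the node.
# TYPE kube_node_status_capacity_cpu_cores gauge
kube_node_status_capacity_cpu_cores{node="minikube"} 2
"""

for labels, value in parse_samples(METRICS_TEXT, "kube_node_status_capacity_cpu_cores"):
    print(labels, value)
```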

Spot check via command line

Some metrics specific to Kubernetes cluster status can be easily spot-checked via the command line. The most useful command for high-level cluster checks is kubectl get, which returns the status of various Kubernetes objects. For example, you can see the number of pods desired, currently running, up to date, and available for each Deployment in the current namespace with this command:

kubectl get deployments

NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
app        3         3         3            0           17s
nginx      1         1         1            1           23m
redis      1         1         1            1           23m

The above example shows three Deployments on our cluster. For the app Deployment, we see that, although the three requested (DESIRED) pods are currently running (CURRENT), they are not yet ready for use (AVAILABLE). In this case, it’s because the configuration specifies that pods running in this Deployment must be healthy for 90 seconds before they will be made available, and the Deployment was launched just 17 seconds ago.

Likewise, we can see that the nginx and redis Deployments specify one replica (pod) each, and that both of them are currently running as desired, with one pod for each Deployment. We also see that these pods match the most recent desired state of their Deployments (UP-TO-DATE) and are available.
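To automate this kind of spot check, you can compare the DESIRED and AVAILABLE columns for each Deployment. The Python sketch below does this for the sample output above. It assumes the column layout shown here, which other kubectl versions may not print, and the function name is hypothetical.

```python
GET_DEPLOYMENTS_OUTPUT = """\
NAME       DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
app        3         3         3            0           17s
nginx      1         1         1            1           23m
redis      1         1         1            1           23m
"""


def unavailable_deployments(output):
    """Return (name, desired, available) for Deployments with too few available pods."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    flagged = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        desired, available = int(row["DESIRED"]), int(row["AVAILABLE"])
        if available < desired:
            flagged.append((row["NAME"], desired, available))
    return flagged


for name, desired, available in unavailable_deployments(GET_DEPLOYMENTS_OUTPUT):
    print(f"{name}: {available} of {desired} pods available")
```

A Deployment that was launched moments ago, like app in this example, will be flagged until its pods become available, so a check like this is most useful when repeated over time.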

Viewing pod logs with kubectl logs

Viewing metrics and metadata about your nodes and pods can alert you to problems with your cluster. For example, you can see if replicas for a deployment are not launching properly, or if your nodes are running low on available resources. But troubleshooting a problem may require more detailed information and application-specific context, which is where logs can come in handy.

The kubectl logs command dumps or streams logs written to stdout and stderr from a specific pod or container:

kubectl logs <POD_NAME> # query a specific pod's logs
kubectl logs <POD_NAME> -c <CONTAINER_NAME> # query a specific container's logs

If you don’t specify any other options, this command will simply dump all of the logs from the specified pod or container. And this does mean all, so filtering or reducing the log output can be useful. For example, the --tail flag lets you restrict the output to a specified number of the most recent log lines:

kubectl logs --tail=25 <POD_NAME>

Another useful flag is --previous, which returns logs for the previous instance of a container in the specified pod. The --previous flag allows you to view the logs of a crashed container for troubleshooting:

kubectl logs <POD_NAME> --previous
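If you collect logs from a script, it can help to assemble the command from the options covered above. This Python sketch builds the argument list for kubectl logs using only the -c, --tail, and --previous flags. The function name and the pod and container names in the example are hypothetical.

```python
def kubectl_logs_command(pod, container=None, tail=None, previous=False):
    """Build the argument list for a `kubectl logs` invocation."""
    command = ["kubectl", "logs", pod]
    if container is not None:
        command += ["-c", container]      # a specific container in the pod
    if tail is not None:
        command.append(f"--tail={tail}")  # only the most recent lines
    if previous:
        command.append("--previous")      # the previous container instance
    return command


# Hypothetical pod and container names, for illustration only.
print(" ".join(kubectl_logs_command("web-app-pod", container="nginx", tail=25)))
print(" ".join(kubectl_logs_command("web-app-pod", previous=True)))
```

You can pass the resulting list to `subprocess.run(command, capture_output=True, text=True)` to capture the log text for further processing.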

Note that you can also view the stream of logs from a pod in Kubernetes Dashboard. From the navigation bar at the top of the “Pods” view, click on the “Logs” tab to access a log stream from the pod in the browser, which can be further segmented by container if the pod comprises multiple containers.

Production Kubernetes monitoring with Datadog

As we’ve shown in this post, Kubernetes includes several useful monitoring tools, both as built-in features and cluster add-ons. The available tooling is valuable for spot checks and retrieving metric snapshots, and can even display several minutes of monitoring data in some cases. But for monitoring production environments, you need visibility into your Kubernetes infrastructure as well as your containerized applications themselves, with much longer data retention and lookback times. Using a monitoring service can also give you access to metrics from your control plane, providing more insight into your cluster’s health and performance.

Datadog provides full-stack visibility into Kubernetes environments, with:

  • out-of-the-box integrations with Kubernetes, Docker, containerd, and all your containerized applications, so you can see all your metrics, logs, and traces in one place
  • Autodiscovery so you can seamlessly monitor applications in large-scale dynamic environments
  • advanced monitoring features including outlier and anomaly detection, forecasting, and automatic correlation of observability data

From cluster status to low-level resource metrics to distributed traces and container logs, Datadog brings together all the data from your infrastructure and applications in one platform. Datadog automatically collects labels and tags from Kubernetes and your containers, so you can filter and aggregate your data using the same abstractions that define your cluster. The next and last part of this series describes how to use Datadog for monitoring Kubernetes clusters, and shows you how you can start getting visibility into every part of your containerized environment in minutes.

