---
title: "Monitor Tanzu Kubernetes Grid on vSphere with Datadog"
description: "Learn how to use Datadog to gain holistic oversight into your TKG containers and their underlying VMs."
author: "Aaron Kaplan"
date: 2023-01-06
tags: ["infrastructure monitoring", "vsphere", "tanzu kubernetes grid"]
blog_type_id: the-monitor
locale: en
---

With vSphere and Tanzu Kubernetes Grid (TKG), [VMware](https://www.vmware.com/) enables enterprise organizations to combine the economic advantages of virtual machines (VMs) with the agility, portability, and scalability provided by Kubernetes.

[vSphere](https://www.datadoghq.com/blog/vsphere-datadog.md) is VMware's platform for the provisioning and management of VMs. vSphere's vCenter Servers enable organizations to centrally manage and monitor their VMs, while its ESXi hypervisors help them optimize their infrastructure and reduce costs by strategically allocating bare-metal server resources. [TKG](https://tanzu.vmware.com/kubernetes-grid) is VMware's turnkey solution for deploying and managing Kubernetes clusters at enterprise scale.

We're pleased to announce that Datadog now supports monitoring TKG clusters deployed on vSphere as well as their underlying VM resources. Our [vSphere integration](https://www.datadoghq.com/blog/vsphere-datadog.md) now comes with an additional out-of-the-box (OOTB) dashboard and [base configurations](https://docs.datadoghq.com/containers/kubernetes/distributions.md?tab=helm#TKG) that enable you to start monitoring your TKG VMs immediately. And by installing the Datadog Agent on your TKG clusters, you can collect container-, pod-, and node-level metrics.

This post will guide you through monitoring TKG on vSphere holistically using real-time metrics and events from both your TKG clusters and their underlying vSphere hosts and VMs.

## Monitor your entire vCenter and Kubernetes environment in real time

Our new OOTB dashboard, shown below, provides a fine-grained overview of your entire TKG and vSphere environment.

![Get the big picture of your vSphere-hosted containers in real time](https://web-assets.dd-static.net/42588/1776302316-monitor-vsphere-tanzu-kubernetes-grid-with-datadog-vsphere-tkg-dashboard-containers-map-event-stream-1.png)

This dashboard foregrounds key data on your TKG clusters and their host VMs via the vSphere Containers map and the TKG event stream. The container map provides a high-level breakdown of your containers by namespace, while the event stream provides an up-to-the-minute record of container activity, highlighting any errors or warnings. You can use template variables to easily adjust the scope of your monitoring by homing in on individual containers, VMs, vCenters, pods, hosts, clusters, and namespaces.

The dashboard Overview panel, shown below, graphs the total number of pods running—both overall and by namespace—as well as the CPU and memory usage of your vSphere hosts. This data can be instrumental in ensuring that your VMs have sufficient resources, providing cues for scaling, as well as highlighting any unexpected dips or spikes in your pods.

![The dashboard overview provides a detailed breakdown of your running TKG pods as well as the resource usage of your vSphere hosts](https://web-assets.dd-static.net/42588/1776302321-monitor-vsphere-tanzu-kubernetes-grid-with-datadog-vsphere-tkg-dashboard-overview.png)

## Manage and troubleshoot your TKG resources

The OOTB dashboard also features dedicated overviews of your TKG pods and containers. These overviews utilize events alongside a broad array of metrics generated from Datadog's [Kubernetes](https://www.datadoghq.com/blog/monitoring-kubernetes-with-datadog.md) and [Kubernetes State Metrics Core](https://www.datadoghq.com/blog/kube-state-metrics-v2-monitoring-datadog.md) integrations so that you can oversee, optimize, and troubleshoot your vSphere environment's Kubernetes resources in a single pane of glass.

![Monitor your TKG environment with rich metrics on your individual pods and containers](https://web-assets.dd-static.net/42588/1776302326-monitor-vsphere-tanzu-kubernetes-grid-with-datadog-vsphere-tkg-dashboard-pods-containers-1.png)

The Pods overview panel provides detailed visibility into the overall status and resource consumption of your pods.

The `kubernetes_state.pod.status_phase` metric measures the number of active, failed, and successful pods in a given scope, providing a high-level breakdown of the health and performance of your overall TKG environment or any subset of it. For a measure of activity by namespace, the `kubernetes_state.pod.count` and `kubernetes_state.pod.ready` metrics are used to rank your namespaces both by number of pods running and by number of unavailable pods. The `kubernetes_state.pod.ready` metric is also used to measure the number of pods in a `Ready` state per node.
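To make the two breakdowns above concrete, here is a minimal Python sketch of the same kind of aggregation over hand-written sample data. It is not how Datadog computes these widgets: the `pods` list, the function names, and the rule that a pod counts as unavailable when it is not ready and has not succeeded are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample data: one (namespace, phase, ready) tuple per pod.
pods = [
    ("default", "Running", True),
    ("default", "Running", False),
    ("kube-system", "Running", True),
    ("batch", "Succeeded", False),
    ("batch", "Failed", False),
]


def count_by_phase(pods):
    """Count pods per phase, similar in spirit to a status-phase breakdown."""
    return Counter(phase for _, phase, _ in pods)


def rank_namespaces_by_unavailable(pods):
    """Rank namespaces by number of unavailable pods, highest first.

    Assumption: a pod is unavailable if it is not ready and did not
    complete successfully. Ties are broken alphabetically by namespace.
    """
    unavailable = Counter({namespace: 0 for namespace, _, _ in pods})
    for namespace, phase, ready in pods:
        if not ready and phase != "Succeeded":
            unavailable[namespace] += 1
    return sorted(unavailable.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(dict(count_by_phase(pods)))
    print(rank_namespaces_by_unavailable(pods))
```

In practice these counts come from the metrics themselves, scoped with the dashboard's template variables; the sketch only shows the shape of the grouping.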

To keep you apprised of any potential strain on your compute resources, the `kubernetes.cpu.usage.total` and `kubernetes.memory.usage` metrics are used to highlight resource-intensive pods, providing visibility that can be critical for pinpointing errors.

The Containers overview offers rich visibility into the states and performance of your TKG containers, providing further angles from which to troubleshoot and optimize performance.

The `kubernetes_state.container.status_report.count.waiting` metric can highlight potential issues by proportionally mapping the top reasons your containers are `Waiting`. These can range from `ContainerCreating` to `CrashLoopBackOff` states.

The Containers overview also provides several perspectives on the states of your containers as a whole, graphing the total numbers of `Ready`, `Running`, `Terminated`, and `Waiting` containers in a given scope. To facilitate troubleshooting, this overview also visualizes the number of inoperative or potentially faulty containers per pod via a range of metrics, including:

- `kubernetes.containers.state.terminated`: the number of containers `OOMKilled` (i.e., terminated due to insufficient memory resources)
- `kubernetes.containers.state.waiting`: the number of containers in a `CrashLoopBackOff` state
- `kubernetes.containers.restarts`: the number of container restarts
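If you want to explore the same per-pod breakdowns outside the dashboard, the metrics listed above can be expressed as queries of the form `aggregator:metric{scope} by {group}`. The Python sketch below only assembles such query strings. The helper name `per_pod_query`, the `reason` tag values (`oomkilled`, `crashloopbackoff`), and the `pod_name` grouping tag are assumptions here, so check them against the tags reported in your own environment.

```python
def per_pod_query(metric, reason=None):
    """Build a query string that sums `metric` per pod.

    Assumption: containers are tagged with `pod_name`, and state metrics
    carry a lowercase `reason` tag. Verify both in your environment.
    """
    scope = f"reason:{reason}" if reason else "*"
    return f"sum:{metric}{{{scope}}} by {{pod_name}}"


# One hypothetical query per metric in the list above.
queries = {
    "oom_killed": per_pod_query("kubernetes.containers.state.terminated", "oomkilled"),
    "crash_looping": per_pod_query("kubernetes.containers.state.waiting", "crashloopbackoff"),
    "restarts": per_pod_query("kubernetes.containers.restarts"),
}

if __name__ == "__main__":
    for name, query in queries.items():
        print(f"{name}: {query}")
```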

The `kubernetes.network.rx_bytes`, `kubernetes.network.tx_bytes`, `kubernetes.network.rx_errors`, and `kubernetes.network.tx_errors` metrics are used to track the network throughput and error rate of containers by pod.

Finally, for a broader picture of the health and performance of your TKG infrastructure, the `kubernetes.cpu.usage.total` and `kubernetes.memory.usage` metrics are used to graph resource usage by container.

## Manage and troubleshoot your vSphere resources

The vSphere overview, shown below, leverages metrics and events to provide critical visibility into the VMs and bare-metal hypervisors that underpin your TKG environment.

![Assess the health and performance of your vSphere hosts, VMs, and datastores](https://web-assets.dd-static.net/42588/1776302330-monitor-vsphere-tanzu-kubernetes-grid-with-datadog-vsphere-tkg-vsphere-metrics.png)

The `vsphere.cpu.usage.avg` and `vsphere.mem.usage.avg` metrics are used to graph the CPU and memory usage of your VMs and their ESXi hosts, and to highlight those consuming the most resources.

For visibility into your vSphere [datastores](https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.storage.doc/GUID-3CC7078E-9C30-402C-B2E1-2542BEE67E8F.html), the `vsphere.disk.capacity.latest` metric enables you to assess their storage capacity, while comparing `vsphere.disk.used.latest` against `vsphere.disk.capacity.latest` provides a clear picture of their disk utilization.
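Disk utilization is the ratio of used space to capacity. The Python sketch below works through that arithmetic for a single datastore. The function name and the sample values are hypothetical, and the calculation assumes only that both metrics are reported in the same unit.

```python
def disk_utilization_percent(used, capacity):
    """Return used space as a percentage of capacity.

    `used` and `capacity` stand in for values of vsphere.disk.used.latest
    and vsphere.disk.capacity.latest, reported in the same unit.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


if __name__ == "__main__":
    # Hypothetical datastore: 250 units used out of 1,000 units of capacity.
    print(disk_utilization_percent(250, 1000))
```

A datastore that trends toward 100 percent is a candidate for cleanup or expansion before the VMs backing your TKG nodes run out of space.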

By correlating these metrics with vSphere events, as well as Kubernetes metrics and events from your TKG clusters, you can stay on top of errors and make the most of your usage of TKG on vSphere.

## Optimize and troubleshoot TKG on vSphere

Our new OOTB dashboard and base configurations for Datadog's vSphere integration enable you to quickly start monitoring your TKG clusters and their underlying vSphere VMs. They provide the real-time insights you need to continuously optimize your organization's virtualized and containerized resources and rapidly troubleshoot issues with the aid of event and log tracking. Check out our [documentation](https://docs.datadoghq.com/containers/kubernetes/distributions.md?tab=helm#TKG) to get started. If you're brand-new to Datadog, sign up for a 14-day free trial today.