Efficient Kubernetes Monitoring with the Datadog Cluster Agent
November 11, 2024
The Datadog Cluster Agent is a critical component for monitoring Kubernetes clusters, especially at scale. Acting as a proxy between the Kubernetes API server and the node-based Datadog Agents, the Cluster Agent reduces the load on the API server by centrally collecting cluster-level data and caching metadata. The node-based Agents use this metadata, such as pod names and service names, to enrich the metrics they gather from their respective nodes, which gives a comprehensive view of application performance within the cluster. The Cluster Agent also runs cluster checks against external services, such as load balancers and databases, and sends this data, along with the cluster-level metrics, to the Datadog SaaS platform over HTTPS. Node-based Agents send their locally collected, enriched metrics, along with APM traces and logs, to the Datadog SaaS platform over HTTPS as well. This centralized approach to data collection provides insight into the health and performance of both the Kubernetes cluster and its external dependencies.
Explanation of the architecture
The Datadog Cluster Agent collects Kubernetes metadata (e.g. deployments, services) from the API server every 30 seconds and caches it. It then serves this metadata to node-based agents, allowing them to enrich local metrics with cluster-level context. This approach reduces load on the API server and improves scalability by avoiding direct queries from each node agent.
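The caching behavior described above can be illustrated with a minimal Python sketch. This is not the Cluster Agent's actual implementation: the class name MetadataCache, the fetch callback (standing in for a query to the API server), and the injectable clock are all hypothetical names introduced for illustration. It only shows the general idea of refreshing metadata at most once per interval and serving every other request from the cache.

```python
import time
from typing import Callable, Dict, Optional


class MetadataCache:
    """Caches cluster metadata and refreshes it at most once per interval."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, dict]],
        refresh_interval: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch                # hypothetical stand-in for an API server query
        self._interval = refresh_interval  # 30 seconds, as described in the text
        self._clock = clock                # injectable so the sketch can be tested
        self._data: Dict[str, dict] = {}
        self._last_refresh: Optional[float] = None

    def get(self, pod_name: str) -> dict:
        """Return metadata for a pod, refreshing the cache first if it is stale."""
        now = self._clock()
        stale = (
            self._last_refresh is None
            or now - self._last_refresh >= self._interval
        )
        if stale:
            # Only this branch reaches the API server.
            self._data = self._fetch()
            self._last_refresh = now
        return self._data.get(pod_name, {})
```

With this shape, any number of node agents calling get() within the same interval trigger a single upstream fetch, which is the load reduction the architecture relies on.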
- Step 1
The Kubelet is the primary agent on each node in a Kubernetes cluster. It starts, stops, and manages the containers on a node based on the desired state. The Kubelet also retrieves system-level data, such as resource usage, from the node and makes this information available to the control plane.
- Step 2
The Datadog Cluster Agent runs as a Deployment and functions as an intermediary between the Kubernetes API server and the node-based Datadog Agents. This setup allows the Cluster Agent to build a cache of cluster-level metadata.
- Step 3
The Datadog node agent collects metrics from sources on its node, including system-level metrics (e.g. system, kubelet), and enriches the data using cached metadata retrieved from the Datadog Cluster Agent.
- Step 4
After the node-based Datadog Agent enriches its locally collected metrics with the cluster-level metadata from the Datadog Cluster Agent, it sends these metrics, along with APM traces and logs, to the Datadog SaaS platform. The Datadog node agent transmits this data over HTTPS, enabling secure and reliable communication with the platform.
- Step 5
The Datadog Cluster Agent sends cluster-level metrics (e.g. kube-state-metrics, kube-apiserver) and metadata to the Datadog SaaS platform over HTTPS.
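The enrichment in Steps 3 and 4 can be sketched in Python as follows. This is an illustrative model only, not the Agent's real data structures: the Metric class, the enrich function, and the sample metric name, pod name, and tag values are assumptions made for the example. It shows a node-level metric picking up cluster-level tags from metadata that would have been retrieved from the Cluster Agent's cache.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass
class Metric:
    """A hypothetical node-level metric before it is sent to the platform."""
    name: str
    value: float
    pod_name: str
    tags: List[str] = field(default_factory=list)


def enrich(metric: Metric, cluster_metadata: Mapping[str, Dict[str, str]]) -> Metric:
    """Return a copy of the metric with cluster-level tags appended.

    cluster_metadata maps a pod name to key/value metadata, standing in for
    what a node agent would retrieve from the Cluster Agent's cache.
    """
    extra = cluster_metadata.get(metric.pod_name, {})
    # Sort keys so the resulting tag order is deterministic.
    new_tags = metric.tags + [f"{key}:{val}" for key, val in sorted(extra.items())]
    return Metric(metric.name, metric.value, metric.pod_name, new_tags)


# Illustrative values only.
cached = {"web-1": {"kube_service": "web", "kube_deployment": "web"}}
local = Metric("container.cpu.usage", 0.42, "web-1", ["host:node-a"])
enriched = enrich(local, cached)
print(enriched.tags)
# ['host:node-a', 'kube_deployment:web', 'kube_service:web']
```

A metric for a pod that is not in the cached metadata passes through with its original tags, so enrichment never blocks local collection in this sketch.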
Authors
Kennon Kwok - Product Solutions Architect