
Monitor Cilium-managed infrastructure with Datadog

Author Mallory Mooney

Published: July 25, 2022

In Part 2 of this series, we showed how Hubble, Cilium’s observability platform, enables you to view network-level details about service dependencies and traffic flows. Cilium also integrates with various standalone monitoring tools, so you can track the other key metrics discussed in Part 1. But since the platform is an integral part of your infrastructure, you need the ability to easily correlate Cilium network and resource metrics with data from your Kubernetes resources. Otherwise, you may miss issues that could lead to an outage.

Datadog brings together all of Cilium’s observability data in a single platform, providing end-to-end visibility into your Cilium network and Kubernetes environment. In this post, we’ll show how to use Datadog to:

- Collect and visualize key Cilium metrics
- Analyze Cilium logs for network and performance issues
- Review the health of Cilium-managed pods
- Monitor network traffic across your Cilium-managed infrastructure

Enable Datadog’s Cilium integration

You can forward Cilium’s metrics and logs to Datadog using the Datadog Agent, which can either be deployed directly onto the physical or virtual hosts supporting your Cilium-managed clusters, or as part of the Kubernetes manifests that manage your containerized environment. In this section, we’ll look at enabling the Agent’s Cilium integration via Kubernetes manifests.
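If you deploy the Agent with Helm instead, a values file along these lines enables log collection cluster-wide. This is a minimal sketch: the secret name `datadog-secret` is a placeholder for whatever Kubernetes secret holds your Datadog API key.

```yaml
# Hypothetical values.yaml excerpt for the Datadog Helm chart;
# "datadog-secret" is a placeholder for your own API key secret.
datadog:
  apiKeyExistingSecret: datadog-secret  # Kubernetes secret holding your Datadog API key
  logs:
    enabled: true                       # turn on the Agent's log collection
    containerCollectAll: true           # collect logs from all containers by default
```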

Datadog provides Autodiscovery templates that you can incorporate into your manifests, allowing the Agent to automatically identify Cilium services running in each of your clusters. These templates simplify the process for enabling the Cilium integration across your containerized environment so you do not have to individually configure hosts.

The manifest snippet below configures the Datadog Agent to leverage its built-in OpenMetrics check in order to scrape metrics from Prometheus endpoints for Cilium’s operator and agent:


apiVersion: v1
kind: Pod
# (...)
metadata:
  name: 'cilium-pod'
  annotations:
    ad.datadoghq.com/cilium-agent.check_names: '["cilium"]'
    ad.datadoghq.com/cilium-agent.init_configs: '[{...}]'
    ad.datadoghq.com/cilium-agent.logs: |
      [
        {
          "source": "cilium-agent",
          "service": "cilium-agent"
        }
      ]
    ad.datadoghq.com/cilium-agent.instances: |
      [
        {
          "agent_endpoint": "http://%%host%%:9090/metrics",
          "use_openmetrics": "true"
        }
      ]
    ad.datadoghq.com/cilium-operator.check_names: '["cilium"]'
    ad.datadoghq.com/cilium-operator.init_configs: '[{...}]'
    ad.datadoghq.com/cilium-operator.logs: |
      [
        {
          "source": "cilium-operator",
          "service": "cilium-operator"
        }
      ]
    ad.datadoghq.com/cilium-operator.instances: |
      [
        {
          "operator_endpoint": "http://%%host%%:6942/metrics",
          "use_openmetrics": "true"
        }
      ]
spec:
  containers:
    - name: 'cilium-agent'
    # (...)
    - name: 'cilium-operator'
# (...)

In addition to enabling metric and log collection, this YAML file configures source and service tags for Cilium data. Tags create a link between metrics and logs and enable you to pivot between dashboards, log analytics, and network maps for easier troubleshooting. Once you deploy the manifest for your clusters, the Datadog Agent will automatically collect Cilium data and forward it to the Datadog platform.

Visualize Cilium metrics and clusters

You can view all of the Cilium metrics collected by the Agent in the integration’s dashboard, which provides a high-level overview of the state of your network, policies, and Cilium resources. For example, you can review the total number of deployed endpoints and unreachable nodes in your environment. You can also clone the integration dashboard and customize it to fit your needs. The example dashboard below includes log and event streams for Cilium’s operator and agent, enabling you to compare Cilium-generated events, such as a sudden increase in errors, with relevant metrics.

Datadog's built-in Cilium dashboard

The dashboard also enables you to monitor agent, operator, and Hubble metrics for historical trends in performance, enhancing Cilium’s built-in monitoring capabilities. Metric trends can surface anomalies in both your network and Cilium resources so you can resolve any issues before they become more serious. For example, the screenshot below shows a sudden spike in the number of inbound packets that were dropped (i.e., drop_count_total) due to a stale destination IP address.

Cilium dashboard widget for dropped packets

An uptick in dropped packets can occur when the Cilium operator fails to release an IP address from a deleted pod, causing the Cilium agent to route traffic to an endpoint that no longer exists. You can troubleshoot further by reviewing your logs, which provide more details about the state of your Kubernetes clusters and network.

It’s important to note that Cilium provides the option to replace the IP address of a deleted pod with an unreachable route. This capability ensures that services that communicate with the affected pod are notified that its IP address is no longer available, giving you more visibility into the state of your network.
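As a sketch, this behavior is toggled through Cilium’s configuration. The exact key varies by Cilium version, so treat the `enable-unreachable-routes` flag below as an assumption to verify against your release’s documentation:

```yaml
# Hypothetical excerpt of the cilium-config ConfigMap; the
# enable-unreachable-routes key is an assumption to verify
# against your Cilium version's documentation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  # Replace the IPs of deleted pods with unreachable routes
  enable-unreachable-routes: "true"
```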

Analyze Cilium logs for network and performance issues

Datadog’s Log Explorer enables you to view, filter, and search through all of your infrastructure logs, including those generated by Cilium’s operator and agent. But large Kubernetes environments can generate a significant volume of logs at any given time, so it can be difficult to sift through that data in order to identify the root cause of an issue. Datadog gives you the ability to quickly identify trends in Cilium log activity and surface error outliers via custom alerts. In the example setup below, Datadog’s anomaly alert will notify you of any unusual spikes in the number of unreachable nodes across Kubernetes services.
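A monitor along these lines could be defined as follows. The metric name `cilium.unreachable_nodes` comes from Datadog’s Cilium integration; the grouping tag, anomaly parameters, and notification handle are placeholders to adapt to your environment:

```yaml
# Hypothetical Datadog monitor definition (shown as YAML for readability);
# the grouping tag and @slack-k8s-oncall handle are placeholders.
name: "Anomalous count of unreachable Cilium nodes"
type: "query alert"
query: "avg(last_4h):anomalies(avg:cilium.unreachable_nodes{*} by {kube_service}, 'basic', 2) >= 1"
message: |
  The number of unreachable nodes is anomalous for {{kube_service.name}}.
  @slack-k8s-oncall
```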

Anomaly alert for Cilium unreachable nodes

This kind of issue can indicate that a particular node lacks the disk space or memory needed to manage its running pods. Without adequate resources, a node transitions into the NotReady status, and it will start evicting running pods if it remains in that state for more than five minutes. As a next step for troubleshooting, you may need to review the status of the pods on an affected node to determine whether any were terminated or failed to spin up.

Review pods in the Live Containers view

The overall health of your network depends largely on the state of your Kubernetes resources, and poorly performing clusters can limit Cilium’s ability to manage their traffic. You can visualize all of your Cilium-managed clusters in the Live Containers view and drill down to specific pods to better understand their performance and status. For example, you can view all pods within a particular service or application to determine whether they are still running. The example screenshot below shows more details about an application pod in the terminating status, which indicates that its containers are not running as expected. The status of each of the pod’s containers shows that they were either intentionally deleted (terminated) or failed to spin up properly (exited), either of which affects Cilium’s ability to route traffic to them.

Review Cilium pods with Datadog's Live Container view

This view also includes the pod’s YAML configuration to help you determine if the problem is the result of a misconfiguration in your cluster (e.g., insufficient resource allocation for the Cilium agent to run alongside your pod’s containerized workloads).
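For instance, confirming that the pod spec reserves headroom for its workloads can rule out resource starvation. The container name, image, and numbers below are illustrative only:

```yaml
# Illustrative resource requests/limits for an application container;
# the image name is a placeholder and the values should be tuned to your workload.
spec:
  containers:
    - name: app
      image: my-app:latest    # placeholder image
      resources:
        requests:
          cpu: 100m           # guaranteed CPU share
          memory: 128Mi       # guaranteed memory
        limits:
          cpu: 500m           # hard CPU ceiling
          memory: 256Mi       # hard memory ceiling
```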

Monitor network traffic across Cilium-managed infrastructure

In addition to monitoring the performance of your Cilium-managed clusters, you can also view network traffic as it flows through your Kubernetes environment with Datadog Network Performance Monitoring and DNS monitoring. These tools are available as soon as you deploy the Datadog Agent to your Kubernetes clusters and enable the option in your Helm chart or manifest. NPM and DNS monitoring extend Hubble’s capabilities by giving you more visibility into the performance of your network and its underlying infrastructure. You can not only ensure that your policies are working as expected but also easily trace the cause of any connectivity issues back to their source.
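With the Helm chart, enabling NPM is a small values change. A minimal sketch of the relevant section:

```yaml
# values.yaml excerpt enabling Datadog Network Performance Monitoring
datadog:
  networkMonitoring:
    enabled: true   # deploys the system-probe component that NPM requires
```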

For example, you can use the network map to confirm that endpoints are able to communicate with each other after updating a DNS domain in one of your L7 policies. Datadog can automatically highlight which endpoints managed by a particular policy have the highest volume of DNS-related issues, as seen below.

View Cilium traffic with Datadog's network map

DNS monitoring can help you troubleshoot further by providing more details about the different types of DNS errors affecting a particular pod. The example screenshot below shows an increase in the number of NXDOMAIN errors across several DNS queries, indicating that the affected pod (tina) is attempting to communicate with domains that may not exist.

View Cilium DNS queries with Datadog NPM

NXDOMAIN errors are often the result of simple misconfigurations in your network policies. If your policies are correct, however, caching could be the culprit. Cilium can leverage Kubernetes’ NodeLocal DNSCache feature to enable caching for certain responses, such as NXDOMAIN errors. Caching attempts to decrease latency by limiting the number of times a Kubernetes resource (e.g., pods) queries a DNS service for a domain. But in some cases, pods can cache outdated responses, triggering a DNS error for legitimate domains. Restarting the affected pod can help mitigate these kinds of issues.
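For reference, negative caching in NodeLocal DNSCache is governed by the `denial` setting in its Corefile. This excerpt mirrors the defaults shipped in the standard node-local-dns ConfigMap; the TTLs and upstream placeholder may differ in your distribution:

```yaml
# Excerpt of a node-local-dns Corefile; the cache block controls how long
# successful and denial (e.g., NXDOMAIN) responses are cached.
cluster.local:53 {
    cache {
        success 9984 30   # cache up to 9984 successful responses for 30s
        denial 9984 5     # cache up to 9984 negative responses for 5s
    }
    forward . __PILLAR__CLUSTER__DNS__ {
        force_tcp
    }
}
```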

Start monitoring Cilium with Datadog

In this post, we looked at how Datadog provides deep visibility into your Cilium environment. You can review key Cilium metrics in Datadog’s integration dashboard and pivot to logs or the Live Containers view for more insights into cluster performance. You can also leverage NPM and DNS monitoring to view traffic to and from pods in order to troubleshoot issues in your network. Check out our documentation to learn more about our Cilium integration and start monitoring your Kubernetes applications today. If you don’t already have a Datadog account, you can sign up for a free trial.