In Part 2, we showed you how to use Istio’s built-in features and integrations with third-party tools to visualize your service mesh, including the metrics that we introduced in Part 1. While Istio’s containerized architecture makes it straightforward to plug in different kinds of visualization software like Kiali and Grafana, you can get deeper visibility into your service mesh and reduce the time you spend troubleshooting by monitoring Istio with a single platform.
In this post, we’ll show you how to use Datadog to monitor Istio, including how to:
- Collect metrics, traces, and logs automatically from Istio’s internal components and the services running within your mesh
- Use dashboards to visualize Istio metrics alongside metrics from Kubernetes and your containerized applications
- Visualize request traces between services in your mesh to find bottlenecks and misconfigurations
- Search and analyze all of the logs in your mesh to understand trends and get context
- Set alerts to get notified automatically of issues within your mesh
With Datadog, you can seamlessly navigate between Istio metrics, traces, and logs to place your Istio data in the context of your infrastructure as a whole. You can also use alerts to get notified automatically of possible issues within your Istio deployment.
Istio currently has full support only for Kubernetes, with alpha support for Consul and Nomad. As a result, we’ll assume that you’re running Istio with Kubernetes.
How to run Datadog in your Istio mesh
The Datadog Agent is open source software that collects metrics, traces, and logs from your environment and sends them to Datadog. Datadog’s Istio integration queries your Prometheus endpoints automatically, meaning that you don’t need to run your own Prometheus server to collect data from Istio. In this section, we’ll show you how to set up the Datadog Agent to get deep visibility into your Istio service mesh.
Set up the Datadog Agent
To start monitoring your Istio Kubernetes cluster, you’ll need to deploy:
- A node-based Agent that runs on every node in your cluster, gathering metrics, traces, and logs to send to Datadog
- A Cluster Agent that runs on one of your nodes, communicating with the Kubernetes API server and providing cluster-level metadata and configurations to node-based Agents
Using this approach, we can avoid the overhead of having all node-based Agents communicate with the Kubernetes control plane, as well as enrich metrics collected from node-based Agents with cluster-level metadata, such as the names of services running within the cluster.
Since Istio components expose their own Prometheus endpoints using Kubernetes services, we’ll only need to query each endpoint once per interval. We’ll use Datadog Cluster Checks to ensure that only one node-based Agent runs the Istio integration at any given time. The Cluster Agent will detect which node-based Agent is running the Istio integration and, if that Agent becomes unavailable, configure the Istio integration on another node-based Agent.
You can install the Datadog Cluster Agent and node-based Agents by taking the following steps, which we’ll lay out in more detail below.
- Assign permissions that allow the Cluster Agent and node-based Agents to communicate with each other and to access your metrics, traces, and logs.
- Apply Kubernetes manifests for both the Cluster Agent and node-based Agents to deploy them to your cluster.
Configure permissions for the Cluster Agent and node-based Agents
Both the Cluster Agent and node-based Agents take advantage of Kubernetes’ built-in role-based access control (RBAC). The first step is to create a manifest for the Cluster Agent that includes the following:
- A ClusterRole that declares a named set of permissions for accessing Kubernetes resources, in this case allowing the Agent to collect data about your cluster
- A ClusterRoleBinding that assigns the ClusterRole to the service account that the Cluster Agent will use to access the Kubernetes API server
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: datadog-cluster-agent
rules:
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "autoscaling"
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  resourceNames:
  - datadogtoken             # Kubernetes event collection state
  - datadog-leader-election  # Leader election token
  verbs:
  - get
  - update
- apiGroups:  # To create the leader election token
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - get
  - update
- nonResourceURLs:
  - "/version"
  - "/healthz"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-agent
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: default
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-cluster-agent
  namespace: default
You’ll also need to create a manifest that grants the appropriate permissions to the node-based Agents by declaring their own ClusterRole, ClusterRoleBinding, and ServiceAccount.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
- apiGroups:  # This is required by the agent to query the Kubelet API.
  - ""
  resources:
  - nodes/metrics
  - nodes/spec
  - nodes/proxy  # Required to get /pods
  verbs:
  - get
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-agent
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
- kind: ServiceAccount
  name: datadog-agent
  namespace: default
Next, deploy the resources you’ve created.
$ kubectl apply -f /path/to/rbac-cluster-agent.yaml
$ kubectl apply -f /path/to/rbac-agent.yaml
You can verify that all of the appropriate ClusterRoles exist in your cluster by running this command:
$ kubectl get clusterrole | grep datadog
datadog-agent           1h
datadog-cluster-agent   1h
Enable secure communication between Agents
Next, we’ll ensure that the Cluster Agent and node-based Agents can securely communicate by creating a Kubernetes secret, which stores a cryptographic token that the Agents can access.
To generate the token (a 32-character string that we’ll encode in Base64), run the following:
$ echo -n '<32_CHARACTER_LONG_STRING>' | base64
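If you don’t already have a suitable string on hand, one option (an assumption on our part, not a required step) is to generate a random 32-character value with openssl and encode it in a single command:

$ echo -n "$(openssl rand -hex 16)" | base64

Here openssl rand -hex 16 prints 32 hexadecimal characters, which the pipeline then Base64-encodes for use in the secret below.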
Create a file named dca-secret.yaml and add your newly created token:
apiVersion: v1
kind: Secret
metadata:
  name: datadog-auth-token
type: Opaque
data:
  token: <NEW_SECRET_TOKEN>
Once you’ve added your token to the manifest, apply it to create the secret:
$ kubectl apply -f /path/to/dca-secret.yaml
Run the following command to confirm that you’ve created the secret:
$ kubectl get secret | grep datadog
datadog-auth-token   Opaque   1   21h
Configure the Cluster Agent
To configure the Cluster Agent, create the following manifest, which declares two Kubernetes resources:
- A Deployment that adds an instance of the Cluster Agent container to your cluster
- A Service that allows the Datadog Cluster Agent to communicate with the rest of your cluster
This manifest links these resources to the service account we deployed above, and points to the newly created secret. Make sure to add your Datadog API key where indicated.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
        name: datadog-agent
    spec:
      serviceAccountName: datadog-cluster-agent
      containers:
      - image: datadog/cluster-agent:latest
        imagePullPolicy: Always
        name: datadog-cluster-agent
        env:
        - name: DD_API_KEY
          value: "<DATADOG_API_KEY>"
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: DD_LEADER_LEASE_DURATION
          value: "15"
        - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
          value: "true"
        - name: DD_CLUSTER_AGENT_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              name: datadog-auth-token
              key: token
---
apiVersion: v1
kind: Service
metadata:
  name: datadog-cluster-agent
  labels:
    app: datadog-cluster-agent
spec:
  ports:
  - port: 5005  # Has to be the same as the one exposed in the Cluster Agent. Default is 5005.
    protocol: TCP
  selector:
    app: datadog-cluster-agent
Configure the node-based Agent
The node-based Agent collects metrics, traces, and logs from each node and sends them to Datadog. We’ll ensure that an Agent pod runs on each node in the cluster, even for newly launched nodes, by declaring a DaemonSet. Create the following manifest, adding your Datadog API key where indicated:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: datadog-agent
spec:
  template:
    metadata:
      labels:
        app: datadog-agent
        name: datadog-agent
    spec:
      serviceAccountName: datadog-agent
      containers:
      - image: datadog/agent:latest
        imagePullPolicy: Always
        name: datadog-agent
        ports:
        - containerPort: 8125
          hostPort: 8125
          name: dogstatsdport
          protocol: UDP
        env:
        - name: DD_EXTRA_CONFIG_PROVIDERS
          value: "clusterchecks"
        - name: DD_API_KEY
          value: "<DATADOG_API_KEY>"
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: KUBERNETES
          value: "true"
        - name: DD_KUBERNETES_KUBELET_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: DD_CLUSTER_AGENT_ENABLED
          value: "true"
        - name: DD_CLUSTER_AGENT_AUTH_TOKEN
          valueFrom:
            secretKeyRef:
              name: datadog-auth-token
              key: token
        - name: DD_TAGS
          value: "env:<YOUR_ENV_NAME>"
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        volumeMounts:
        - name: dockersocket
          mountPath: /var/run/docker.sock
        - name: procdir
          mountPath: /host/proc
          readOnly: true
        - name: cgroups
          mountPath: /host/sys/fs/cgroup
          readOnly: true
        livenessProbe:
          exec:
            command:
            - ./probe.sh
          initialDelaySeconds: 15
          periodSeconds: 5
      volumes:
      - hostPath:
          path: /var/run/docker.sock
        name: dockersocket
      - hostPath:
          path: /proc
        name: procdir
      - hostPath:
          path: /sys/fs/cgroup
        name: cgroups
Disable automatic sidecar injection for Datadog Agent pods
You’ll also want to prevent Istio from automatically injecting Envoy sidecars into your Datadog Agent pods, since this can interfere with data collection. Disable automatic sidecar injection for both the Cluster Agent and node-based Agents by revising each manifest to include the following annotation:
[...]
spec:
  [...]
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    [...]
Then deploy the Datadog Agents:
$ kubectl apply -f /path/to/datadog-cluster-agent.yaml
$ kubectl apply -f /path/to/datadog-agent.yaml
Use the following kubectl command to verify that your Cluster Agent and node-based Agent pods are running. There should be one pod named datadog-agent-<STRING> running per node, and a single instance of the datadog-cluster-agent pod:
$ kubectl get pods | grep -E "NAME|datadog"
NAME                                    READY   STATUS    RESTARTS   AGE
datadog-agent-bqtdt                     1/1     Running   0          4d22h
datadog-agent-gb5fs                     1/1     Running   0          4d22h
datadog-agent-lttmq                     1/1     Running   0          4d22h
datadog-agent-vnkqx                     1/1     Running   0          4d22h
datadog-cluster-agent-9b5b56d6d-jwg2l   1/1     Running   0          5d22h
Once you’ve deployed the Cluster Agent and node-based Agents, Datadog will start to report host- and platform-level metrics from your Kubernetes cluster.
Before you can get metrics from Pilot, Galley, Mixer, Citadel, and services within your mesh, you’ll need to set up Datadog’s Istio integration.
Set up the Istio integration
The Datadog Agent’s Istio integration automatically queries Istio’s Prometheus metrics endpoints, enriches all of the data with tags, and forwards it to the Datadog platform. Since Istio handles DNS resolution for its Prometheus endpoints, the node-based Datadog Agent can query these endpoints no matter where in your cluster Istio’s components are running.
We can enable the Istio integration by using a Kubernetes ConfigMap, which the Cluster Agent will mount to its configuration directory as a new file. First, we’ll create a ConfigMap that contains our configuration for the Istio integration. Second, we’ll add the ConfigMap to the Cluster Agent’s list of volumes.
Apply the following ConfigMap:
kind: ConfigMap
apiVersion: v1
metadata:
  name: istio-config-map
  namespace: default
data:
  istio-config: |-
    cluster_check: true
    init_config:
    instances:
      - istio_mesh_endpoint: "http://istio-telemetry.istio-system:42422/metrics"
        mixer_endpoint: "http://istio-telemetry.istio-system:15014/metrics"
        galley_endpoint: "http://istio-galley.istio-system:15014/metrics"
        pilot_endpoint: "http://istio-pilot.istio-system:15014/metrics"
        citadel_endpoint: "http://istio-citadel.istio-system:15014/metrics"
        send_histograms_buckets: "true"
The cluster_check option instructs the Cluster Agent to ensure that only one node-based Agent runs the Istio check at a time. The instances object lists five endpoints that the Agent will query for metrics: one for the Envoy-based network metrics collected via Mixer, and one for each Istio component (Mixer, Galley, Pilot, and Citadel). The example above shows Istio’s default Prometheus endpoints. You’ll also need to provide a value for the send_histograms_buckets option: if it is enabled, the Datadog Agent tags any histogram-based metrics with the upper_bound prefix, indicating the name of the metric’s quantile bucket.
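If you want to confirm that these endpoints are reachable before wiring up the integration, one quick check (a sketch that assumes the default istio-system service names and ports shown above) is to port-forward one of the services and request its metrics path:

$ kubectl -n istio-system port-forward svc/istio-pilot 15014:15014 &
$ curl -s http://localhost:15014/metrics | head -n 5

You should see Prometheus-formatted metric output; repeat the check against the other services if you’ve customized any ports.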
Enable Cluster Checks within the Cluster Agent by adding the following environment variables to its manifest:
[...]
spec:
  template:
    [...]
    spec:
      [...]
      containers:
      - image: datadog/cluster-agent:latest
        [...]
        env:
        [...]
        - name: DD_CLUSTER_CHECKS_ENABLED
          value: "true"
        - name: DD_CLUSTER_NAME  # Value for the cluster_name tag
          value: "my-istio-cluster"
        - name: DD_CLUSTER_CHECKS_EXTRA_TAGS  # Use this to set a tag for all cluster check metrics
          value: "env:<MY_ENV_NAME>"
Because node-based Agents collect Cluster Checks data from external services, rather than from their local host, Datadog doesn’t assign a host tag to this data, nor does it add the tag you set in DD_TAGS. If you want to organize your Istio data under the same env tag as, for instance, system metrics from your hosts, make sure that the values of DD_TAGS and DD_CLUSTER_CHECKS_EXTRA_TAGS are the same.
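For example, a minimal sketch (using a hypothetical env:istio-demo tag) would set the same value in both manifests:

# Node-based Agent manifest (DaemonSet)
        - name: DD_TAGS
          value: "env:istio-demo"

# Cluster Agent manifest (Deployment)
        - name: DD_CLUSTER_CHECKS_EXTRA_TAGS
          value: "env:istio-demo"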
Next, give the Cluster Agent access to your new ConfigMap by adding the following YAML to your manifest:
[...]
        volumeMounts:
        [...]
        - name: dd-istio-config
          mountPath: /etc/datadog-agent/conf.d/istio.d/
      volumes:
      [...]
      - name: dd-istio-config
        configMap:
          name: istio-config-map
          items:
          - key: istio-config
            path: istio.yaml
[...]
Apply the new configuration:
$ kubectl apply -f path/to/istio-configmap.yaml
$ kubectl apply -f path/to/datadog-cluster-agent.yaml
After running these commands, you should expect to see Istio metrics flowing into Datadog. The easiest way to confirm this is to navigate to our out-of-the-box dashboard for Istio, which we’ll explain in more detail later.
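You can also check from the command line which node-based Agent the Cluster Agent has dispatched the Istio check to. A quick way to do this (the pod name below is taken from the example output earlier; substitute your own) is to run the Cluster Agent’s clusterchecks command:

$ kubectl exec -it datadog-cluster-agent-9b5b56d6d-jwg2l -- datadog-cluster-agent clusterchecks

The output lists each cluster check and the node-based Agent currently running it; the Istio check should appear exactly once.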
Finally, enable the Istio integration by clicking the tile in your Datadog account.
You can also collect metrics, traces, and logs from the applications running in your mesh with minimal configuration. With Autodiscovery, Datadog will track containers in your cluster as they spin up, and enable the correct integration automatically. Consult Datadog’s documentation for the configuration details you’ll need to include.
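As a rough sketch of what Autodiscovery configuration looks like, the pod annotations below (for a hypothetical Redis container named redis) tell the Agent which check to run and how to reach the container; the exact annotation values for your own services will come from the documentation linked above:

apiVersion: v1
kind: Pod
metadata:
  name: redis
  annotations:
    ad.datadoghq.com/redis.check_names: '["redisdb"]'
    ad.datadoghq.com/redis.init_configs: '[{}]'
    ad.datadoghq.com/redis.instances: '[{"host": "%%host%%", "port": "6379"}]'
spec:
  containers:
  - name: redis  # Must match the container name used in the annotation keys
    image: redis:latest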
Get high-level views of your Istio mesh
When running a complex distributed system using Istio, you’ll want to ensure that your nodes, containers, and services are performing as expected. This goes for both Istio’s internal components (Pilot, Mixer, Galley, Citadel, and your mesh of Envoy proxies) and the services that Istio manages. Datadog helps you visualize the health and performance of your entire Istio deployment in one place.
Visualize all of your Istio metrics together
After installing the Datadog Agent and enabling the Istio integration, you’ll have access to an out-of-the-box dashboard showing key Istio metrics. You can see request throughput and latency from throughout your mesh, as well as resource utilization metrics for each of Istio’s internal components.
You can then clone the out-of-the-box Istio dashboard and customize it to produce the most helpful view for your environment. Datadog imports tags automatically from Docker, Kubernetes, and Istio, as well as from the mesh-level metrics that Mixer exports to Prometheus (e.g., destination_service_name). You can use tags to group and filter dashboard widgets to get visibility into Istio’s performance. For example, the following timeseries graph and toplist use the adapter tag to show how many dispatches Mixer makes to each adapter.
You can also quickly understand the scope of an issue (does it affect a host, a pod, or your whole cluster?) by using Datadog’s mapping features: the host map and container map. Using the container map, you can easily localize issues within your Kubernetes cluster. And if issues are due to resource constraints within your Istio nodes, this will become apparent within the host map.
You can color the host map based on the current value of any metric (and the container map based on any resource metric), making it clear which parts of your infrastructure are underperforming or overloaded. You can then use tags to group and filter the maps, helping you answer any questions about your infrastructure.
The dashboard above shows CPU utilization in our Istio deployment. In the upper-left widget, we can see that this metric is high for two hosts. To investigate, we can use the container map on the bottom left to see if any container running within those hosts is facing unusual load. Istio’s components might run on any node in your cluster, and the same goes for the pods running your services. To monitor our pods regardless of where they are running, we can group containers by the service tag, making it clear which Istio components or mesh-level services are facing the heaviest demand. The kube_namespace tag allows us to view components and services separately.
Get insights into mesh activity
Getting visibility into traffic between Istio-managed services is key to understanding the health and performance of your service mesh. With Datadog’s distributed tracing and application performance monitoring, you can trace requests between your Istio-managed services to understand your mesh and troubleshoot issues. You can display your entire service topology using the Service Map, visualize the path of each request through your mesh using flame graphs, and get a detailed performance portrait of each service. From APM, you can easily navigate to related metrics and logs, allowing you to troubleshoot more quickly than you would with dedicated graphing, tracing, and log collection tools.
Set up tracing
First, you’ll need to instruct the node-based Agents to accept traces. Edit the node-based Agent manifest to include the following environment variables:
[...]
        env:
        [...]
        - name: DD_APM_ENABLED
          value: "true"
        - name: DD_APM_NON_LOCAL_TRAFFIC
          value: "true"
        - name: DD_APM_ENV
          value: "istio-demo"
[...]
DD_APM_ENABLED instructs the Agent to collect traces. DD_APM_NON_LOCAL_TRAFFIC configures the Agent to listen for traces from containers on other hosts. Finally, if you want to keep traces from your Istio cluster separate from other projects within your organization, use the DD_APM_ENV variable to customize the env: tag for your traces (env:none by default). You can then filter by this tag within Datadog.
Next, forward port 8126 from the node-based Agent container to its host, allowing the host to listen for distributed traces.
[...]
        ports:
        [...]
        - containerPort: 8126
          hostPort: 8126
          name: traceport
          protocol: TCP
[...]
This example configures Datadog to trace requests between Envoy proxies, so you can visualize communication between your services without having to instrument your application code. If you want to trace activity within an application, e.g., a function call, you can use Datadog’s tracing libraries to either auto-instrument your application or declare traces within your code for fine-grained benchmarking and troubleshooting.
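For instance, a minimal sketch for a Python service (the app.py entry point is hypothetical; ddtrace is Datadog’s Python tracing library) installs the tracer and launches the app with its auto-instrumentation wrapper, pointing DD_AGENT_HOST at the headless datadog-agent Service created in the next step:

$ pip install ddtrace
$ DD_AGENT_HOST=datadog-agent.default.svc.cluster.local ddtrace-run python app.py

In a real deployment you would more likely set DD_AGENT_HOST in the pod spec (for example from status.hostIP, since the DaemonSet above exposes hostPort 8126) rather than on the command line.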
Finally, create a service for the node-based Agent, so it can receive traces from elsewhere in the mesh. We’ll use a headless service to avoid needlessly allocating a cluster IP to the Agent. Create the following manifest and apply it using kubectl apply:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: datadog-agent
  name: datadog-agent
spec:
  clusterIP: None
  ports:
  - name: dogstatsdport
    port: 8125
    protocol: UDP
    targetPort: 8125
  - name: traceport
    port: 8126
    protocol: TCP
    targetPort: 8126
  selector:
    app: datadog-agent
After you apply this configuration, the Datadog Agent should be able to receive traces from Envoy proxies throughout your cluster. In the next step, you’ll configure Istio to send traces to the Datadog Agent.
You’ll do this by setting three Helm configuration options:
- pilot.traceSampling is the percentage of requests that Istio will record as traces. Set this to 100.0 to send all traces to Datadog; you can then determine within Datadog how long to retain your traces.
- global.proxy.tracer instructs Istio to use a particular tracing backend, in our case Datadog.
- tracing.enabled instructs Istio to record traces of requests within your service mesh.
Run the following command to enable Istio to send traces automatically to Datadog:
$ helm upgrade --install istio <ISTIO_INSTALLATION_PATH>/install/kubernetes/helm/istio \
    --namespace istio-system \
    --set pilot.traceSampling=100.0,global.proxy.tracer=datadog,tracing.enabled=true
Visualize mesh topology with the Service Map
Datadog automatically generates a Service Map from distributed traces, allowing you to quickly understand how services communicate within your mesh. The Service Map gives you a quick read into the results of your Istio configuration, so you can identify issues and determine where you might begin to optimize your network.
If you have set up alerts for any of your services (we’ll introduce these in a moment), the Service Map will show their status. In this example, an alert has triggered for the productpage service in the default namespace. We can navigate directly from the Service Map to see which alerts have triggered.
And if you click on “View service overview,” you can get more context into service-level issues by viewing request rates, error rates, and latencies for a single service over time. For example, we can navigate to the overview of the productpage service to see when the service started reporting a high rate of errors, and correlate the beginning of the issue with metrics, traces, and logs from the same time.
Understand your Istio logs
If services within your mesh fail to communicate as expected, you’ll want to consult logs to get more context. As traffic flows throughout your Istio mesh, Datadog can help you cut through the complexity by collecting all of your Istio logs in one platform for visualization and analysis.
Set up Istio log collection
To start collecting logs from your mesh, add the following environment variables to the node-based Agent manifest:
- DD_LOGS_ENABLED: switches on Datadog log collection
- DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL: tells each node-based Agent to collect logs from all containers running on that node
- DD_AC_EXCLUDE: filters out logs from certain containers before they reach Datadog, such as, in our case, those from Datadog Agent containers
[...]
        env:
        [...]
        - name: DD_LOGS_ENABLED
          value: "true"
        - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
          value: "true"
        - name: DD_AC_EXCLUDE
          value: "name:datadog-agent name:datadog-cluster-agent"
[...]
Next, edit the file to mount the local node’s Docker socket into the node-based Agent container. Since you’ll be deploying the Datadog Agent pod as a DaemonSet, each Agent will read logs from the Docker socket on its local node, enrich them with tags imported from Docker, Kubernetes, and your cloud provider, and send them to Datadog. Istio’s components publish logs to stderr by default, meaning that the Datadog Agent can collect all of your Istio logs from the Docker socket.
[...]
        volumeMounts:
        [...]
        - name: dockersocket
          mountPath: /var/run/docker.sock
      [...]
      volumes:
      [...]
      - hostPath:
          path: /var/run/docker.sock
        name: dockersocket
[...]
Note that if you plan to run more than 10 containers in each pod, you’ll want to configure the Agent to use a Kubernetes-managed log file instead of the Docker socket.
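As a sketch of that alternative, based on Datadog’s documented file-based log collection setup (check the documentation for the exact paths and options for your Agent version), you would enable file collection and mount the Kubernetes log directories instead of relying on the Docker socket:

[...]
        env:
        [...]
        - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
          value: "true"
        volumeMounts:
        [...]
        - name: pointerdir
          mountPath: /opt/datadog-agent/run
        - name: logpodpath
          mountPath: /var/log/pods
          readOnly: true
      volumes:
      [...]
      - hostPath:
          path: /opt/datadog-agent/run
        name: pointerdir
      - hostPath:
          path: /var/log/pods
        name: logpodpath
[...]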
Once you run kubectl apply -f path/to/datadog-agent.yaml, you should start seeing your logs within Datadog.
Discover trends with Log Patterns
Once you’re collecting logs from your Istio mesh, you can start exploring them in Datadog. The Log Patterns view helps you extract trends by displaying common strings within your logs and generalizing the fields that vary into regular expressions. The result is a summary of common log types. This is especially useful for reducing noise within your Istio-managed environment, where you might be gathering logs from all of Istio’s internal components in addition to Envoy proxies and the services in your mesh.
In this example, we used the sidebar to display only the patterns having to do with our Envoy proxies. We also filtered out INFO-level logs. Now that we know which error messages are especially common—Mixer is having trouble connecting to its upstream services—we can determine how urgent these errors are and how to go about resolving them.
Set alerts for automatic monitoring
When running a complex distributed system, it’s impossible to watch every host, pod, and container for possible issues. You’ll want some way to automatically get notified when something goes wrong in your Istio mesh. Datadog allows you to set alerts on any kind of data it collects, including metrics, logs, and request traces.
In this example, we’re creating an alert that will notify us whenever requests to the productpage service in Istio’s “Bookinfo” sample application take place at an unusual frequency, using APM data and Datadog’s anomaly detection algorithm.
You can also get automated insights into aberrant trends with Datadog’s Watchdog feature, which automatically flags performance anomalies in your dynamic service mesh. With Watchdog, you can easily detect issues like heavy request traffic, service outages, or spikes in demand, without setting up any alerts. Watchdog searches your APM-based metrics (request rates, request latencies, and error rates) for possible issues, and presents these to you as a feed when you first log in.
A view of your mesh at every scale
In this post, we’ve shown you how to use Datadog to get comprehensive visibility into metrics, traces, and logs from throughout your Istio mesh. Integrated views allow you to navigate easily between data sources, troubleshoot issues, and manage the complexity that comes with running a service mesh. If you’re not already using Datadog, you can sign up for a free trial.