In Part 2, we showed you how to use Istio’s built-in features and integrations with third-party tools to visualize your service mesh, including the metrics that we introduced in Part 1. While Istio’s containerized architecture makes it straightforward to plug in different kinds of visualization software like Kiali and Grafana, you can get deeper visibility into your service mesh and reduce the time you spend troubleshooting by monitoring Istio with a single platform.
In this post, we’ll show you how to use Datadog to monitor Istio, including how to:
- Collect metrics, traces, and logs automatically from Istio’s internal components and the services running within your mesh
- Use dashboards to visualize Istio metrics alongside metrics from Kubernetes and your containerized applications
- Visualize request traces between services in your mesh to find bottlenecks and misconfigurations
- Search and analyze all of the logs in your mesh to understand trends and get context
- Set alerts to get notified automatically of issues within your mesh
With Datadog, you can seamlessly navigate between Istio metrics, traces, and logs to place your Istio data in the context of your infrastructure as a whole. You can also use alerts to get notified automatically of possible issues within your Istio deployment.
Istio currently has full support only for Kubernetes, with alpha support for Consul and Nomad. As a result, we’ll assume that you’re running Istio with Kubernetes.
How to run Datadog in your Istio mesh
The Datadog Agent is open source software that collects metrics, traces, and logs from your environment and sends them to Datadog. Datadog’s Istio integration queries Istio’s Prometheus endpoints automatically, meaning that you don’t need to run your own Prometheus server to collect data from Istio. In this section, we’ll show you how to set up the Datadog Agent to get deep visibility into your Istio service mesh.
These instructions are intended for users of Istio versions prior to 1.5. For instructions on setting up Datadog to monitor Istio versions 1.5 and later, see our dedicated post.
Set up the Datadog Agent
To start monitoring your Istio Kubernetes cluster, you’ll need to deploy:
- A node-based Agent that runs on every node in your cluster, gathering metrics, traces, and logs to send to Datadog
- A Cluster Agent that runs as a Deployment, communicating with the Kubernetes API server and providing cluster-level metadata to node-based Agents
With this approach, we can avoid the overhead of having all node-based Agents communicate with the Kubernetes control plane, as well as enrich metrics collected from node-based Agents with cluster-level metadata, such as the names of services running within the cluster.
You can install the Datadog Cluster Agent and node-based Agents by taking the following steps, which we’ll lay out in more detail below.
- Assign permissions that allow the Cluster Agent and node-based Agents to communicate with each other and to access your metrics, traces, and logs.
- Apply Kubernetes manifests for both the Cluster Agent and node-based Agents to deploy them to your cluster.
Configure permissions for the Cluster Agent and node-based Agents
Both the Cluster Agent and the node-based Agents take advantage of Kubernetes’ built-in role-based access control (RBAC), so the first step is to create the following resources for each Agent:
- A ClusterRole that declares a named set of permissions for accessing Kubernetes resources, in this case to allow the Agent to collect data on your cluster
- A ClusterRoleBinding that assigns the ClusterRole to the service account that the Datadog Agent will use to access the Kubernetes API server
The Datadog Agent GitHub repository contains manifests that enable RBAC for the Cluster Agent and node-based Agents. The first declares the Datadog Cluster Agent’s ClusterRole and the permissions it grants, binds that role to a service account, and creates the service account:
rbac-cluster-agent.yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: datadog-cluster-agent
  namespace: <DATADOG_NAMESPACE>
rules:
- apiGroups:
  - ""
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "autoscaling"
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  resourceNames:
  - datadogtoken
  - datadog-leader-election
  verbs:
  - get
  - update
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - create
  - get
  - update
- nonResourceURLs:
  - "/version"
  - "/healthz"
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-cluster-agent
  namespace: <DATADOG_NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-cluster-agent
subjects:
- kind: ServiceAccount
  name: datadog-cluster-agent
  namespace: <DATADOG_NAMESPACE>
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-cluster-agent
  namespace: <DATADOG_NAMESPACE>
You’ll also need to create a manifest that grants the appropriate permissions to the node-based Agent’s ClusterRole.
rbac-agent.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
  namespace: <DATADOG_NAMESPACE>
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/spec
  - nodes/proxy
  verbs:
  - get
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: datadog-agent
  namespace: <DATADOG_NAMESPACE>
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
  namespace: <DATADOG_NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
- kind: ServiceAccount
  name: datadog-agent
  namespace: <DATADOG_NAMESPACE>
Next, deploy the resources you’ve created.
$ kubectl apply -f /path/to/rbac-cluster-agent.yaml
$ kubectl apply -f /path/to/rbac-agent.yaml
You can verify that all of the appropriate ClusterRoles exist in your cluster by running this command:
$ kubectl get clusterrole | grep datadog
datadog-agent 1h
datadog-cluster-agent 1h
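If you also want to confirm the bindings and service accounts created by these manifests, you can check them the same way (assuming you deployed them to the namespace shown above):

$ kubectl get clusterrolebinding | grep datadog
$ kubectl -n <DATADOG_NAMESPACE> get serviceaccount | grep datadog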
Enable secure communication between Agents
Next, we’ll ensure that the Cluster Agent and node-based Agents can securely communicate by creating a Kubernetes secret, which stores a cryptographic token that the Agents can access.
To generate the token (a 32-character string encoded in Base64), run the following. Node-based Agents use this as a bearer token for communicating with the Cluster Agent, so we need to remove control characters to ensure that this is a valid HTTP header value.
$ echo -n '<32_CHARACTER_LONG_STRING>' | base64 | tr -d "[:cntrl:]"
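If you don’t have a 32-character string handy, one option (assuming openssl is available on your workstation) is to generate a random one and encode it the same way:

$ echo -n "$(openssl rand -hex 16)" | base64 | tr -d "[:cntrl:]"

Here, openssl rand -hex 16 produces 32 hexadecimal characters, which are then Base64 encoded just as above.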
Create a file named dca-secret.yaml and add your newly created token:
dca-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog-auth-token
  namespace: <DATADOG_NAMESPACE>
type: Opaque
data:
  token: <NEW_SECRET_TOKEN>
Once you’ve added your token to the manifest, apply
it to create the secret:
$ kubectl apply -f /path/to/dca-secret.yaml
Run the following command to confirm that you’ve created the secret:
$ kubectl get secret | grep datadog
datadog-auth-token Opaque 1 21h
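If you ever need to check that the stored token matches the one you generated, you can decode it from the secret (a quick spot check, assuming the secret name above):

$ kubectl -n <DATADOG_NAMESPACE> get secret datadog-auth-token -o jsonpath='{.data.token}' | base64 --decode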
Configure the Cluster Agent
To configure the Cluster Agent, create the following manifest, which declares two Kubernetes resources:
- A Deployment that adds an instance of the Cluster Agent container to your cluster
- A Service that allows the Datadog Cluster Agent to communicate with the rest of your cluster
This manifest links these resources to the service account we deployed above and points to the newly created secret. Make sure to add your Datadog API key where indicated. (Or use a Kubernetes secret as we did for the Cluster Agent authorization token.)
datadog-cluster-agent.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datadog-cluster-agent
  namespace: <DATADOG_NAMESPACE>
spec:
  selector:
    matchLabels:
      app: datadog-cluster-agent
  template:
    metadata:
      labels:
        app: datadog-cluster-agent
      name: datadog-agent
    spec:
      serviceAccountName: datadog-cluster-agent
      containers:
      - image: datadog/cluster-agent:latest
        imagePullPolicy: Always
        name: datadog-cluster-agent
        env:
          - name: DD_API_KEY
            value: "<DATADOG_API_KEY>"
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: DD_LEADER_ELECTION
            value: "true"
          - name: DD_EXTERNAL_METRICS_PROVIDER_ENABLED
            value: "true"
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-auth-token
                key: token
---
apiVersion: v1
kind: Service
metadata:
  name: datadog-cluster-agent
  namespace: <DATADOG_NAMESPACE>
  labels:
    app: datadog-cluster-agent
spec:
  ports:
  - port: 5005 # Has to be the same as the one exposed in the Cluster Agent. Default is 5005.
    protocol: TCP
  selector:
    app: datadog-cluster-agent
Configure the node-based Agent
The node-based Agent collects metrics, traces, and logs from each node and sends them to Datadog. We’ll ensure that an Agent pod runs on each node in the cluster, even for newly launched nodes, by declaring a DaemonSet. Create the following manifest, adding your Datadog API key where indicated:
datadog-agent.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
  namespace: <DATADOG_NAMESPACE>
spec:
  selector:
    matchLabels:
      app: datadog
  template:
    metadata:
      labels:
        app: datadog
      name: datadog
    spec:
      serviceAccountName: datadog-agent
      containers:
      - image: datadog/agent:latest
        imagePullPolicy: Always
        name: datadog-agent
        ports:
          - containerPort: 8125
            hostPort: 8125
            name: dogstatsdport
            protocol: UDP
        env:
          - name: DD_API_KEY
            value: "<DATADOG_API_KEY>"
          - name: DD_COLLECT_KUBERNETES_EVENTS
            value: "true"
          - name: KUBERNETES
            value: "true"
          - name: DD_KUBERNETES_KUBELET_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: DD_CLUSTER_AGENT_ENABLED
            value: "true"
          - name: DD_CLUSTER_AGENT_AUTH_TOKEN
            valueFrom:
              secretKeyRef:
                name: datadog-auth-token
                key: token
          - name: DD_TAGS
            value: "env:<YOUR_ENV_NAME>"
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
        livenessProbe:
          exec:
            command:
            - ./probe.sh
          initialDelaySeconds: 15
          periodSeconds: 5
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: dockersocket
        - hostPath:
            path: /proc
          name: procdir
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroups
Disable automatic sidecar injection for Datadog Agent pods
You’ll also want to prevent Istio from automatically injecting Envoy sidecars into your Datadog Agent pods, where they would interfere with data collection. Disable automatic sidecar injection for both the Cluster Agent and the node-based Agents by adding the following annotation to each manifest:
[...]
spec:
  [...]
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    [...]
Then deploy the Datadog Agents:
$ kubectl apply -f /path/to/datadog-cluster-agent.yaml
$ kubectl apply -f /path/to/datadog-agent.yaml
Use the following kubectl command to verify that your Cluster Agent and node-based Agent pods are running. There should be one pod named datadog-agent-<STRING> running per node, and a single instance of datadog-cluster-agent-<STRING>.
$ kubectl -n <DATADOG_NAMESPACE> get pods
NAME READY STATUS RESTARTS AGE
datadog-agent-bqtdt 1/1 Running 0 4d22h
datadog-agent-gb5fs 1/1 Running 0 4d22h
datadog-agent-lttmq 1/1 Running 0 4d22h
datadog-agent-vnkqx 1/1 Running 0 4d22h
datadog-cluster-agent-9b5b56d6d-jwg2l 1/1 Running 0 5d22h
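Because sidecar injection is disabled for these pods, each should contain only the Datadog Agent container, which the 1/1 READY column already suggests. If you want to confirm that Istio didn’t inject an istio-proxy container, one quick check (assuming the app: datadog label from the DaemonSet above) is to list the container names directly; the output should contain only datadog-agent entries:

$ kubectl -n <DATADOG_NAMESPACE> get pods -l app=datadog -o jsonpath='{.items[*].spec.containers[*].name}'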
Once you’ve deployed the Cluster Agent and node-based Agents, Datadog will start to report host- and platform-level metrics from your Kubernetes cluster.
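To confirm that an individual node-based Agent is collecting and forwarding data, you can also run the Agent’s built-in status command inside one of the pods listed above (substitute any of your datadog-agent pod names):

$ kubectl -n <DATADOG_NAMESPACE> exec -it datadog-agent-bqtdt -- agent status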
Before you can get metrics from Pilot, Galley, Mixer, Citadel, and services within your mesh, you’ll need to set up Datadog’s Istio integration.
Set up the Istio integration
The Datadog Agent’s Istio integration automatically queries Istio’s Prometheus metrics endpoints, enriches all of the data with tags, and forwards it to the Datadog platform. The Datadog Cluster Agent uses a feature called endpoints checks to detect Istio’s Kubernetes services, identify the pods that back them, and send configurations to the Agents on the nodes running those pods. Each node-based Agent then uses these configurations to query the Istio pods running on the local node for data.
If you horizontally scale an Istio component, requests to that component’s Kubernetes service will be load balanced at random across the component’s pods. Endpoints checks enable the Datadog Agent to bypass Istio’s Kubernetes services and query each backing pod directly, so metric collection isn’t subject to that load balancing.
The Datadog Agent uses Autodiscovery to track the services exposing Istio’s Prometheus endpoints. We can enable the Istio integration by annotating these services. The annotations contain Autodiscovery templates—when the Cluster Agent detects that a currently deployed service contains a relevant annotation, it will identify each backing pod, populate the template with the pod’s IP address, and send the resulting configuration to a node-based Agent. We’ll create one Autodiscovery template per Istio component—each Agent will only load configurations for Istio pods running on its own node.
Note that you’ll need to run versions 6.17+ or 7.17+ of the node-based Agent and version 1.5.2+ of the Datadog Cluster Agent.
Run the following script to annotate each Istio service using kubectl patch. Since there are multiple ways to install Istio, this approach lets you annotate your services without touching their manifests.
#!/bin/bash
kubectl -n istio-system patch service istio-telemetry --patch "$(cat<<EOF
metadata:
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["istio"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "istio_mesh_endpoint": "http://%%host%%:42422/metrics",
          "mixer_endpoint": "http://%%host%%:15014/metrics",
          "send_histograms_buckets": true
        }
      ]
EOF
)"
kubectl -n istio-system patch service istio-galley --patch "$(cat<<EOF
metadata:
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["istio"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "galley_endpoint": "http://%%host%%:15014/metrics",
          "send_histograms_buckets": true
        }
      ]
EOF
)"
kubectl -n istio-system patch service istio-pilot --patch "$(cat<<EOF
metadata:
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["istio"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "pilot_endpoint": "http://%%host%%:15014/metrics",
          "send_histograms_buckets": true
        }
      ]
EOF
)"
kubectl -n istio-system patch service istio-citadel --patch "$(cat<<EOF
metadata:
  annotations:
    ad.datadoghq.com/endpoints.check_names: '["istio"]'
    ad.datadoghq.com/endpoints.init_configs: '[{}]'
    ad.datadoghq.com/endpoints.instances: |
      [
        {
          "citadel_endpoint": "http://%%host%%:15014/metrics",
          "send_histograms_buckets": true
        }
      ]
EOF
)"
When the Cluster Agent identifies a Kubernetes service that contains these annotations, it uses them to fill in configuration details for the Istio integration. The %%host%%
template variable becomes the IP of a pod backing the service. The Cluster Agent sends the configuration to a Datadog Agent running on the same node, and the Agent uses the configuration to query the pod’s metrics endpoint.
You can also provide a value for the send_histograms_buckets option. If this option is enabled (the default), the Datadog Agent tags any histogram-based metrics with upper_bound, indicating the name of the metric’s quantile bucket.
Next, update the node-based Agent and Cluster Agent manifests to enable endpoints checks. The Datadog Cluster Agent sends endpoint check configurations to node-based Agents using cluster checks, and you will need to enable these as well. In the node-based Agent manifest, add the following environment variables:
datadog-agent.yaml
# [...]
spec:
  template:
    spec:
      containers:
      - image: datadog/agent:latest
        # [...]
        env:
          # [...]
          - name: DD_EXTRA_CONFIG_PROVIDERS
            value: "endpointschecks clusterchecks"
Setting DD_EXTRA_CONFIG_PROVIDERS to endpointschecks tells the node-based Agents to collect endpoint check configurations from the Cluster Agent. We also need to include clusterchecks, because the Cluster Agent dispatches those endpoint check configurations through its cluster check mechanism.
Now add the following environment variables to the Cluster Agent manifest:
datadog-cluster-agent.yaml
# [...]
spec:
  template:
    spec:
      containers:
      - image: datadog/cluster-agent:latest
        # [...]
        env:
          # [...]
          - name: DD_CLUSTER_CHECKS_ENABLED
            value: "true"
          - name: DD_EXTRA_CONFIG_PROVIDERS
            value: "kube_endpoints kube_services"
          - name: DD_EXTRA_LISTENERS
            value: "kube_endpoints kube_services"
The DD_EXTRA_CONFIG_PROVIDERS
and DD_EXTRA_LISTENERS
variables tell the Cluster Agent to query the Kubernetes API server for the status of currently active endpoints and services.
Finally, apply the changes.
$ kubectl apply -f path/to/datadog-agent.yaml
$ kubectl apply -f path/to/datadog-cluster-agent.yaml
After running these commands, you should expect to see Istio metrics flowing into Datadog. The easiest way to confirm this is to navigate to our out-of-the-box dashboard for Istio, which we’ll explain in more detail later.
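You can also spot-check the dispatching from inside the cluster: the Cluster Agent can list the check configurations it has sent to each node-based Agent. This is a quick sanity check rather than a required step, and it assumes the Cluster Agent pod name from the earlier listing; the output should include the istio endpoint checks:

$ kubectl -n <DATADOG_NAMESPACE> exec -it datadog-cluster-agent-9b5b56d6d-jwg2l -- datadog-cluster-agent clusterchecks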
To complete the setup, enable the Istio integration by clicking the tile in your Datadog account.
You can also use Autodiscovery to collect metrics, traces, and logs from the applications running in your mesh with minimal configuration. Consult Datadog’s documentation for the configuration details you’ll need to include.
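As a minimal sketch of what that looks like, the fragment below attaches an Autodiscovery log hint to an application pod template; the container name (productpage) and the source and service values are placeholders for your own workloads:

[...]
template:
  metadata:
    annotations:
      ad.datadoghq.com/productpage.logs: '[{"source": "python", "service": "productpage"}]'
  [...]

Each node-based Agent then tags that container’s logs with the given source and service, so they line up with the metrics and traces collected from the same pod.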
Visualize all of your Istio metrics together
After installing the Datadog Agent and enabling the Istio integration, you’ll have access to an out-of-the-box dashboard showing key Istio metrics. You can see request throughput and latency from throughout your mesh, as well as resource utilization metrics for each of Istio’s internal components.
You can then clone the out-of-the-box Istio dashboard and customize it to produce the most helpful view for your environment. Datadog imports tags automatically from Docker, Kubernetes, and Istio, as well as from the mesh-level metrics that Mixer exports to Prometheus (e.g., source_app
and destination_service_name
). You can use tags to group and filter dashboard widgets to get visibility into Istio’s performance. For example, the following timeseries graph and toplist use the adapter
tag to show how many dispatches Mixer makes to each adapter.
You can also quickly understand the scope of an issue (does it affect a host, a pod, or your whole cluster?) by using Datadog’s mapping features: the host map and container map. Using the container map, you can easily localize issues within your Kubernetes cluster. And if issues are due to resource constraints within your Istio nodes, this will become apparent within the host map.
You can color the host map based on the current value of any metric (and the container map based on any resource metric), making it clear which parts of your infrastructure are underperforming or overloaded. You can then use tags to group and filter the maps, helping you answer any questions about your infrastructure.
The dashboard above shows CPU utilization in our Istio deployment. In the upper-left widget, we can see that this metric is high for two hosts. To investigate, we can use the container map on the bottom left to see if any container running within those hosts is facing unusual load. Istio’s components might run on any node in your cluster—the same goes for the pods running your services. To monitor our pods regardless of where they are running, we can group containers by the service
tag, making it clear which Istio components or mesh-level services are facing the heaviest demand. The kube_namespace
tag allows us to view components and services separately.
Get insights into mesh activity
Getting visibility into traffic between Istio-managed services is key to understanding the health and performance of your service mesh. With Datadog’s distributed tracing and application performance monitoring (APM), you can trace requests between your Istio-managed services to understand your mesh and troubleshoot issues. You can display your entire service topology using the Service Map, visualize the path of each request through your mesh using flame graphs, and get a detailed performance portrait of each service. From APM, you can easily navigate to related metrics and logs, allowing you to troubleshoot more quickly than you would with dedicated graphing, tracing, and log collection tools.
Set up tracing
Receiving traces
First, you’ll need to instruct the node-based Agents to accept traces. Edit the node-based Agent manifest to include the following attributes.
datadog-agent.yaml
[...]
env:
  [...]
  - name: DD_APM_ENABLED
    value: "true"
  - name: DD_APM_NON_LOCAL_TRAFFIC
    value: "true"
  - name: DD_APM_ENV
    value: "<YOUR_ENV_NAME>"
[...]
DD_APM_ENABLED instructs the Agent to collect traces. DD_APM_NON_LOCAL_TRAFFIC configures the Agent to accept traces sent from containers outside its own pod. Finally, if you want to keep traces from your Istio cluster separate from other projects within your organization, use the DD_APM_ENV variable to customize the env: tag for your traces (env:none by default). You can then filter by this tag within Datadog.
Next, forward port 8126 from the node-based Agent container to its host, allowing the host to listen for distributed traces.
datadog-agent.yaml
[...]
ports:
  [...]
  - containerPort: 8126
    hostPort: 8126
    name: traceport
    protocol: TCP
[...]
This example configures Datadog to trace requests between Envoy proxies, so you can visualize communication between your services without having to instrument your application code. If you want to trace activity within an application, e.g., a function call, you can use Datadog’s tracing libraries to either auto-instrument your application or declare traces within your code for fine-grained benchmarking and troubleshooting.
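If you do instrument your application code, each tracer needs to know where to reach the local Datadog Agent. A common pattern, shown here as a sketch you would add to your own application’s Deployment manifest (not to datadog-agent.yaml), is to pass the node’s IP to the pod through the downward API so the tracing library can read it from DD_AGENT_HOST and send spans to port 8126 on that host:

[...]
env:
  [...]
  - name: DD_AGENT_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
[...]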
Finally, create a service for the node-based Agent, so it can receive traces from elsewhere in the mesh. We’ll use a headless service to avoid needlessly allocating a cluster IP to the Agent. Create the following manifest and apply it using kubectl apply
:
dd-agent-service.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: datadog-agent
  name: datadog-agent
  namespace: <DATADOG_NAMESPACE>
spec:
  clusterIP: None
  ports:
  - name: dogstatsdport
    port: 8125
    protocol: UDP
    targetPort: 8125
  - name: traceport
    port: 8126
    protocol: TCP
    targetPort: 8126
  selector:
    app: datadog-agent
After you apply this configuration, the Datadog Agent should be able to receive traces from Envoy proxies throughout your cluster. In the next step, you’ll configure Istio to send traces to the Datadog Agent.
Sending traces
Istio has built-in support for distributed tracing using several possible backends, including Datadog. You need to configure tracing by setting three options:
1. pilot.traceSampling is the percentage of requests that Istio will record as traces. Set this to 100.0 to send all traces to Datadog; you can then determine within Datadog how long to retain your traces.
2. global.proxy.tracer instructs Istio to use a particular tracing backend, in our case datadog.
3. tracing.enabled instructs Istio to record traces of requests within your service mesh.
Run the following command to enable Istio to send traces automatically to Datadog:
$ helm upgrade --install istio <ISTIO_INSTALLATION_PATH>/install/kubernetes/helm/istio --namespace istio-system --set pilot.traceSampling=100.0,global.proxy.tracer=datadog,tracing.enabled=true
Visualize mesh topology with the Service Map
Datadog automatically generates a Service Map from distributed traces, allowing you to quickly understand how services communicate within your mesh. The Service Map gives you a quick read into the results of your Istio configuration, so you can identify issues and determine where you might begin to optimize your network.
If you have set up alerts for any of your services (we’ll introduce these in a moment), the Service Map will show their status. In this example, an alert has triggered for the productpage
service in the default
namespace. We can navigate directly from the Service Map to see which alerts have triggered.
And if you click on “View service overview,” you can get more context into service-level issues by viewing request rates, error rates, and latencies for a single service over time. For example, we can navigate to the overview of the productpage
service to see when the service started reporting a high rate of errors, and correlate the beginning of the issue with metrics, traces, and logs from the same time.
Visualize mesh requests with flame graphs
Once you set up APM in your Istio mesh, you can inspect individual request traces using flame graphs. A flame graph is a visualization that displays the service calls that were executed to fulfill a request. The duration of each service call is represented by the width of the span, and in the sidebar, you can see the services called and the percent of time spent on each. You can click any span to see further information, such as metadata and error messages.
Note that in several spans, envoy.proxy
precedes the name of the resource (which is the specific endpoint to which the call is addressed, e.g., main-app.apm-demo.svc.cluster.local:80
). This is because Envoy proxies all requests within an Istio mesh. This architecture also explains why envoy.proxy
spans are generated in pairs: the first span is created by the sidecar proxying the outgoing request, and the matching second span is from the sidecar that receives it.
You can get even deeper visibility into requests within your mesh by configuring your applications to report spans from functions and packages of your choice. In the example above, we have instrumented our applications to report spans for individual functions called by the render-svc
service, using one of Datadog’s custom tracing libraries. You can also auto-instrument your applications to visualize function calls within popular libraries in a number of languages.
Along with other APM features like Trace Search and Analytics and the Service Map, flame graphs can help you troubleshoot and investigate errors in your Istio mesh. In the next screenshot, we see that the reviews.default service took 387 microseconds to execute and returned a 500 error code.
With Datadog APM you can see exactly where an error originates, and use the tabs below the flame graph—Span Metadata, Host, Logs, and Error—to see related information that can help you better understand the span you’re inspecting.
For more information about monitoring your distributed services with APM, see our documentation.
Understand your Istio logs
If services within your mesh fail to communicate as expected, you’ll want to consult logs to get more context. As traffic flows throughout your Istio mesh, Datadog can help you cut through the complexity by collecting all of your Istio logs in one platform for visualization and analysis.
Set up Istio log collection
To enable log collection, edit the datadog-agent.yaml manifest you created earlier to provide a few more environment variables:
- DD_LOGS_ENABLED: switches on Datadog log collection
- DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL: tells each node-based Agent to collect logs from all containers running on that node
- DD_AC_EXCLUDE: filters out logs from certain containers before they reach Datadog, such as, in our case, those from Datadog Agent containers
datadog-agent.yaml
[...]
env:
  [...]
  - name: DD_LOGS_ENABLED
    value: "true"
  - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
    value: "true"
  - name: DD_AC_EXCLUDE
    value: "name:datadog-agent name:datadog-cluster-agent"
[...]
Next, make sure the manifest mounts the local node’s Docker socket into the node-based Agent container, as shown below. Since you’ll be deploying the Datadog Agent pod as a DaemonSet, each Agent will read logs from the Docker socket on its local node, enrich them with tags imported from Docker, Kubernetes, and your cloud provider, and send them to Datadog. Istio’s components publish logs to stdout and stderr by default, meaning that the Datadog Agent can collect all of your Istio logs from the Docker socket.
datadog-agent.yaml
[...]
volumeMounts:
  [...]
  - name: dockersocket
    mountPath: /var/run/docker.sock
[...]
volumes:
  [...]
  - hostPath:
      path: /var/run/docker.sock
    name: dockersocket
[...]
Note that if you plan to run more than 10 containers in each pod, you’ll want to configure the Agent to use a Kubernetes-managed log file instead of the Docker socket.
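As a rough sketch, the file-based approach involves setting one more environment variable on the node-based Agent and mounting the node’s pod log directory; the volume name below is illustrative, and you should confirm the exact settings for your Agent version in Datadog’s documentation:

datadog-agent.yaml
[...]
env:
  [...]
  - name: DD_LOGS_CONFIG_K8S_CONTAINER_USE_FILE
    value: "true"
[...]
volumeMounts:
  [...]
  - name: logpodpath
    mountPath: /var/log/pods
    readOnly: true
[...]
volumes:
  [...]
  - hostPath:
      path: /var/log/pods
    name: logpodpath
[...]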
Once you run kubectl apply -f path/to/datadog-agent.yaml
, you should start seeing your logs within Datadog.
Discover trends with Log Patterns
Once you’re collecting logs from your Istio mesh, you can start exploring them in Datadog. The Log Patterns view helps you extract trends by displaying common strings within your logs and generalizing the fields that vary into regular expressions. The result is a summary of common log types. This is especially useful for reducing noise within your Istio-managed environment, where you might be gathering logs from all of Istio’s internal components in addition to Envoy proxies and the services in your mesh.
In this example, we used the sidebar to display only the patterns having to do with our Envoy proxies. We also filtered out INFO-level logs. Now that we know which error messages are especially common—Mixer is having trouble connecting to its upstream services—we can determine how urgent these errors are and how to go about resolving them.
Know what’s flowing through your mesh
Datadog Network Performance Monitoring (NPM) automatically visualizes the topology of your Istio-managed network, giving you instant insights into dependencies between services, pods, and containers. You can use NPM to locate possible root causes of network issues, get real-time architecture visualizations, and spot inefficient designs. You can then track network data in the context of traces, logs, and process data from your infrastructure and applications.
Network Performance Monitoring receives data from the system probe, an eBPF program managed by the Datadog Agent that monitors traffic passing through each host’s kernel network stack. Network Performance Monitoring automatically follows Istio’s network address translation logic, giving you complete visibility into your Istio traffic with no configuration.
How to install Network Performance Monitoring in your Istio cluster
To enable NPM, edit the manifest you use to deploy the node-based Datadog Agent to:
- Enable the Process Agent and system probe
- Share data between the node-based Agent and system probe
- Add the system probe as a sidecar to the node-based Agent
Enable the Process Agent and system probe
You can configure the Datadog Agent to enable NPM by adding environment variables to the node-based Agent manifest. To set these environment variables, modify the spec.template.spec.containers[*].env
object for the datadog-agent
container to include the following:
datadog-agent.yaml
- name: DD_PROCESS_AGENT_ENABLED
  value: 'true'
- name: DD_SYSTEM_PROBE_ENABLED
  value: 'true'
- name: DD_SYSTEM_PROBE_EXTERNAL
  value: 'true'
- name: DD_SYSPROBE_SOCKET
  value: /var/run/sysprobe/sysprobe.sock
NPM needs to collect per-process data, so you should enable the Process Agent with DD_PROCESS_AGENT_ENABLED
. Since you’ll be running the system probe as a sidecar to the node-based Agent, use the DD_SYSTEM_PROBE_EXTERNAL
environment variable to prevent the datadog-agent
container from starting the system probe itself. Finally, DD_SYSPROBE_SOCKET
indicates the path to the Unix socket that the Datadog Agent uses to communicate with the system probe (/opt/datadog-agent/run/sysprobe.sock
by default).
Share data between the node-based Agent and system probe
The system probe uses Kubernetes volumes to share data with the Datadog Agent. To configure these, you’ll need to modify the node-based Agent manifest to declare two volumes in the spec.volumes
object: debugfs
and sysprobe-socket-dir
. The first accesses the underlying host’s debugfs
, which enables the system probe to make kernel information available to user-space processes via the filesystem. The second volume creates an empty directory that the Datadog Agent uses to initialize the system probe socket.
datadog-agent.yaml
- name: debugfs
  hostPath:
    path: /sys/kernel/debug
- name: sysprobe-socket-dir
  emptyDir: {}
Next, mount these volumes on the Datadog Agent container by adding them to the datadog-agent
container’s spec.template.spec.containers[*].volumeMounts
object:
datadog-agent.yaml
- name: debugfs
  mountPath: /sys/kernel/debug
- name: sysprobe-socket-dir
  mountPath: /var/run/sysprobe
You’ll also need to make sure that the procdir
and cgroups
volume mounts we assigned earlier are still included here, since Network Performance Monitoring needs to access certain system information from the underlying host.
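For reference, these are the same two mounts defined on the node-based Agent container earlier in this manifest:

datadog-agent.yaml
- name: procdir
  mountPath: /host/proc
  readOnly: true
- name: cgroups
  mountPath: /host/sys/fs/cgroup
  readOnly: true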
Add the system probe as a sidecar
The system probe runs as a separate process from the Datadog Agent in order to initialize and manage the kernel-level eBPF program. You should declare the system probe as a sidecar container within the spec.template.spec.containers
section of the node-based Agent manifest:
datadog-agent.yaml
- name: system-probe
  image: 'datadog/agent:latest'
  imagePullPolicy: Always
  securityContext:
    capabilities:
      add:
        - SYS_ADMIN
        - SYS_RESOURCE
        - SYS_PTRACE
        - NET_ADMIN
        - IPC_LOCK
  command:
    - /opt/datadog-agent/embedded/bin/system-probe
  env:
    - name: DD_SYSPROBE_SOCKET
      value: /var/run/sysprobe/sysprobe.sock
  resources:
    requests:
      memory: 150Mi
      cpu: 200m
    limits:
      memory: 150Mi
      cpu: 200m
  volumeMounts:
    - name: procdir
      mountPath: /host/proc
      readOnly: true
    - name: cgroups
      mountPath: /host/sys/fs/cgroup
      readOnly: true
    - name: debugfs
      mountPath: /sys/kernel/debug
    - name: sysprobe-socket-dir
      mountPath: /var/run/sysprobe
    - name: sysprobe-config
      mountPath: /etc/datadog-agent/
This configuration includes a security context that spells out the Linux capabilities the container needs to get visibility into the underlying host. You’ll also notice that the system probe shares the procdir
, cgroups
, debugfs
, and sysprobe-socket-dir
volumes with the node-based Datadog Agent. The system probe retrieves its runtime configuration from the sysprobe-config
volume, which you should create using the following ConfigMap (saved as system-probe-config.yaml):
system-probe-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-system-probe-config
  namespace: <DATADOG_NAMESPACE>
  labels: {}
data:
  system-probe.yaml: |
    system_probe_config:
      enabled: true
Expose this ConfigMap as a volume in your Datadog Agent DaemonSet by adding this to the spec.volumes
list in the node-based Agent manifest:
datadog-agent.yaml
- name: sysprobe-config
  configMap:
    name: datadog-system-probe-config
Once you’ve updated the node-based Agent manifest, go ahead and apply it along with the new ConfigMap (create the ConfigMap first since the node-based Agent needs to refer to it):
$ kubectl apply -f system-probe-config.yaml
$ kubectl apply -f datadog-agent.yaml
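Once the updated DaemonSet rolls out, each datadog-agent pod should run two containers: the Agent and the system probe. A quick check (assuming the app: datadog label from the DaemonSet above) is to confirm that the READY column reports 2/2 for each pod:

$ kubectl -n <DATADOG_NAMESPACE> get pods -l app=datadog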
See your network traffic in context
Since Network Performance Monitoring tracks all of the traffic through the containers in your mesh, it gives you a good starting point for understanding your mesh topology and investigating misconfigurations and failed dependencies.
You can use the Network Map to get an instant view of your network architecture without having to deploy software beyond the Datadog Agent. This makes it easier to identify the scope of networking issues in your mesh, contain the blast radius, and prevent cascading failures. The color of each node in the Network Map indicates the status of any alerts associated with the service, pod, container, or other tag the node visualizes. You can inspect the upstream and downstream dependencies of a node—e.g., a service receiving an unusually low volume of network traffic—and check whether any of them are in an alerting state. You can then view these alerts to get context, such as related traces, helpful dashboards, and troubleshooting instructions.
If the Network Map is showing you an unexpected dependency, unwanted cross-regional traffic, an abnormal drop in throughput, or some other unintended characteristic of your network topology, you can navigate to the Network Page and adjust the filters to display only flows you want to investigate.
You can then export graphs from the Network Page to a dashboard, copy timeseries graphs from Istio’s out-of-the-box dashboard using the Datadog Clipboard, and create a new dashboard that visualizes network-level traffic metrics in context with application-level metrics from Istio’s components. In this case, we can see that a brief decline in bytes received over the network correlates with a wave of xDS pushes. With this knowledge in hand, we can better plan our configuration changes so we don’t sacrifice service availability.
Set alerts for automatic monitoring
When running a complex distributed system, it’s impossible to watch every host, pod, and container for possible issues. You’ll want some way to automatically get notified when something goes wrong in your Istio mesh. Datadog allows you to set alerts on any kind of data it collects, including metrics, logs, and request traces.
In this example, we’re creating an alert that will notify us whenever requests to the productpage
service in Istio’s “Bookinfo” sample application take place at an unusual frequency, using APM data and Datadog’s anomaly detection algorithm.
You can also get automated insights into aberrant trends with Datadog’s Watchdog feature, which automatically flags performance anomalies in your dynamic service mesh. With Watchdog, you can easily detect issues like heavy request traffic, service outages, or spikes in demand, without setting up any alerts. Watchdog searches your APM-based metrics (request rates, request latencies, and error rates) for possible issues, and presents these to you as a feed when you first log in.
A view of your mesh at every scale
In this post, we’ve shown you how to use Datadog to get comprehensive visibility into metrics, traces, and logs from throughout your Istio mesh. Integrated views allow you to navigate easily between data sources, troubleshoot issues, and manage the complexity that comes with running a service mesh. If you’re not already using Datadog, you can sign up for a free trial.