Azure Kubernetes Service (AKS) enables you to easily deploy and manage containerized applications in Azure while leveraging Microsoft resources such as development tools, security features, and more. As with any Kubernetes service, the sheer volume of containers being orchestrated makes monitoring AKS cluster health challenging, which can slow response times to critical incidents and create bottlenecks around long-term optimizations.
Datadog’s AKS integration already provides complete visibility into your AKS clusters. Once you’ve enabled the integration and deployed the Datadog Agent to your clusters, Datadog automatically begins collecting metrics and logs from your entire AKS setup and organizing them into high-level visualizations. However, many teams install the Datadog Agent with third-party tools such as Helm and Ansible, which can add complexity to workflows and increase overhead. With the Datadog cluster extension for AKS, you can now deploy the Datadog Agent to your Kubernetes clusters directly within Azure, with no other tools needed.
In this post, we’ll explore how you can:
- Quickly deploy the Datadog Agent across your AKS clusters
- Visualize AKS cluster and control plane activity
AKS cluster extensions make it easy to deploy services to your AKS clusters at scale and manage them from Azure Resource Manager. Like other cluster extensions, the Datadog AKS extension provides two methods of installation, enabling you to choose the deployment method that works best for your workflows. One way is to search for the Datadog AKS extension within Azure Marketplace. The extension setup page then enables you to configure details such as the relevant resource group, region, and cluster name.
Alternatively, you can access the extension setup page directly from your Azure Kubernetes clusters by selecting the service you want to monitor, then choosing the Extensions and applications option from the sidebar.
Finally, before you can begin collecting AKS metrics in Datadog, you’ll need to enable the Azure integration. You can do so either from the Azure Portal via our Azure Native integration, or within Datadog by accessing the Azure integration tile.
Once you’ve deployed the Datadog Agent to your clusters, metrics and logs from your AKS setup immediately begin streaming into Datadog. By using Datadog’s monitors and the out-of-the-box AKS dashboard, you can quickly detect issues in your nodes before they bring processing to a halt.
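To make the monitor idea concrete, here is a minimal sketch of what a CPU monitor definition for an AKS cluster might look like when expressed as a request body for Datadog's monitor API. The function name `build_cpu_monitor`, the cluster name, the 90 percent threshold, and the choice of the `system.cpu.user` metric scoped by a `kube_cluster_name` tag are all assumptions made for illustration; adjust them to the metrics and tags your Agent actually reports. The sketch only builds the payload and does not send it.

```python
import json


def build_cpu_monitor(cluster_name: str, threshold_pct: float = 90.0) -> dict:
    """Build a metric monitor definition for per-node CPU usage (illustrative only)."""
    # Assumption: nodes report `system.cpu.user` and carry a `kube_cluster_name` tag.
    query = (
        f"avg(last_5m):avg:system.cpu.user{{kube_cluster_name:{cluster_name}}} "
        f"by {{host}} > {threshold_pct:g}"
    )
    return {
        "name": f"High CPU on AKS nodes in {cluster_name}",
        "type": "metric alert",
        "query": query,
        # Hypothetical notification text; replace with your own message and handles.
        "message": "CPU usage is high on {{host.name}}. Check the AKS dashboard.",
        "tags": [f"kube_cluster_name:{cluster_name}"],
        "options": {"thresholds": {"critical": threshold_pct}},
    }


if __name__ == "__main__":
    # Print the payload so it can be reviewed before it is submitted anywhere.
    print(json.dumps(build_cpu_monitor("my-aks-cluster"), indent=2))
```

Grouping the query by `host` means each node is evaluated separately, so a single overloaded node can trigger the alert without being averaged away by the rest of the cluster.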
The information Datadog ingests includes logs from the AKS control plane, which manages cluster resources. These logs contain critical information about the status of various orchestration components, including your API server, scheduler, and controller manager. With the AKS dashboard, you can view your control plane logs alongside performance metrics from the rest of your clusters, enabling you to quickly trace the root cause of issues no matter where they occur in your Kubernetes setup.
Let’s say that you receive an alert about a spike in CPU utilization across several nodes. While inspecting the dashboard, you notice an increase in clusters reporting an unhealthy status and, by scrolling down and viewing the logs for your control plane components, you also see an increase in error messages for the controller manager. The issues with your controller manager have led to fewer pods being created, causing the CPU on your existing nodes to overload. From here, you can take steps to debug the problematic node and prevent future issues, such as switching to a high-availability cluster with multiple control plane nodes.
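The correlation in this scenario can be sketched in a few lines of code. The example below is a hypothetical illustration rather than anything Datadog runs: the `triage` function, its thresholds, and the sample node and component names are assumptions. It flags nodes whose CPU is at or above a limit and control plane components whose error count is well above a baseline.

```python
def triage(node_cpu_pct, component_errors, baseline_errors,
           cpu_limit=90.0, error_ratio=2.0):
    """Return (overloaded nodes, control plane components with elevated errors)."""
    # Nodes at or above the CPU limit are treated as overloaded.
    hot_nodes = sorted(n for n, cpu in node_cpu_pct.items() if cpu >= cpu_limit)
    # A component is flagged when its errors reach error_ratio times its baseline.
    # The baseline is floored at 1 so a zero baseline does not flag everything.
    noisy = sorted(
        c for c, count in component_errors.items()
        if count >= error_ratio * max(baseline_errors.get(c, 0), 1)
    )
    return hot_nodes, noisy


if __name__ == "__main__":
    # Made-up sample values that mirror the scenario described above.
    cpu = {"node-a": 95.0, "node-b": 40.0, "node-c": 92.5}
    errors = {"kube-apiserver": 3, "kube-controller-manager": 40, "kube-scheduler": 1}
    baseline = {"kube-apiserver": 4, "kube-controller-manager": 5, "kube-scheduler": 1}
    print(triage(cpu, errors, baseline))
```

With these sample values, two nodes are flagged as overloaded and the controller manager is the only component with elevated errors, which matches the line of reasoning in the example.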
The Datadog AKS integration helps you catch issues across all your Azure clusters, but using third-party tools to install the Datadog Agent on your services can lead to increased overhead and tool sprawl. With the Datadog AKS cluster extension, you can easily deploy the Datadog Agent to your clusters directly within Azure. Once the extension is enabled and the Agent is installed, metrics and logs from your AKS setup start streaming to the AKS dashboard in Datadog, so you can immediately begin analyzing troubleshooting data.