The Monitor

Accelerate Kubernetes issue resolution with AI-powered guided remediation



Jessica Hsiao


As organizations scale their Kubernetes environments, managing dynamic workloads and interdependent services becomes increasingly complex. Detecting and resolving incidents promptly requires deep visibility and context, which can be hard to obtain amid a flood of telemetry data and alerts. Furthermore, application development teams often lack the Kubernetes expertise to investigate and mitigate errors on their own, so platform teams inadvertently become a support bottleneck for issue resolution.

Datadog Kubernetes Active Remediation, currently available in Preview, addresses these problems by helping you identify and fix common infrastructure issues in your Kubernetes clusters. It provides clear contextual guidance and suggested actions before issues escalate into business-impacting incidents. Now, we're excited to announce the latest enhancement to Kubernetes Active Remediation: AI-powered explanations that provide deeper insights into the root cause of issues. This capability speeds up investigations and reduces mean time to resolution (MTTR).

In this post, we’ll explain how the new AI-powered reasoning in Kubernetes Active Remediation can help you troubleshoot issues more efficiently and increase the self-sufficiency of application developers.

Troubleshoot issues more efficiently

When Kubernetes errors such as CrashLoopBackOff and OOMKilled occur, Kubernetes Active Remediation detects the issue and generates a remediation guide for each problematic workload. Aggregating all the related troubleshooting information in one place reduces the time spent on gathering context, empowering teams to respond to issues more quickly and get ahead of potential service disruptions.
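To make the two error names concrete, here is a minimal Python sketch of how they appear in the Kubernetes pod status. It is an illustration only, not how Kubernetes Active Remediation detects issues. It assumes the input is a pod list already parsed from JSON; the function name `find_problem_containers` and the shape of the returned records are hypothetical.

```python
from typing import Any

# The two error reasons named in the article.
WATCHED_REASONS = {"CrashLoopBackOff", "OOMKilled"}


def find_problem_containers(pod_list: dict[str, Any]) -> list[dict[str, str]]:
    """Return one record per (container, reason) that matches a watched reason.

    Hypothetical helper for illustration. It inspects the standard pod status
    fields: status.containerStatuses[].state and .lastState, each of which may
    hold a "waiting" or "terminated" entry with a "reason" string.
    """
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            reasons = set()
            for state_key in ("state", "lastState"):
                state = cs.get(state_key) or {}
                for phase in ("waiting", "terminated"):
                    reason = (state.get(phase) or {}).get("reason")
                    if reason in WATCHED_REASONS:
                        reasons.add(reason)
            # Sort so the output order is deterministic.
            for reason in sorted(reasons):
                findings.append(
                    {
                        "namespace": meta.get("namespace", ""),
                        "pod": meta.get("name", ""),
                        "container": cs.get("name", ""),
                        "reason": reason,
                    }
                )
    return findings
```

One way to try it is to feed it the parsed output of `kubectl get pods --all-namespaces -o json`, for example with `json.load`. A container that is waiting in CrashLoopBackOff after its previous run was OOMKilled would produce two records, one per reason.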

Context-rich AI feedback makes it even easier to understand why an issue occurred and how to fix it. Suppose that one of your workloads is experiencing a CrashLoopBackOff error, causing your Datadog monitor to trigger an alert. You can view the remediation guide by clicking the link in the alert, by using the Kubernetes Remediation tab, or by selecting an impacted pod in the Kubernetes Explorer. The issue summary now includes AI-powered explanations for likely root causes, based on collected telemetry data and known patterns.

In the example shown in the following screenshot, Kubernetes Active Remediation detects an application error that has caused a container to terminate. The analysis from Bits AI is based on errors in the application logs and explains that the problem was likely caused by database connectivity issues.

Information about an application error. The screen shows what happened, analysis from Bits AI, and recommended next steps.
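As a rough illustration of the idea of inferring a likely cause from application log errors, the sketch below counts keyword matches per cause. It is a toy heuristic under stated assumptions, not the logic behind Bits AI: the cause labels, the regular expressions, and the function name `likely_causes` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword patterns, chosen only for illustration.
CAUSE_PATTERNS = [
    (
        "database connectivity",
        re.compile(
            r"connection refused|could not connect to (the )?(database|server)"
            r"|too many connections",
            re.IGNORECASE,
        ),
    ),
    ("out of memory", re.compile(r"out of memory|OutOfMemoryError", re.IGNORECASE)),
    (
        "missing configuration",
        re.compile(
            r"no such file or directory"
            r"|missing (required )?(env|environment variable|config)",
            re.IGNORECASE,
        ),
    ),
]


def likely_causes(log_lines: list[str]) -> list[str]:
    """Rank candidate causes by how many log lines match each pattern."""
    counts: Counter[str] = Counter()
    for line in log_lines:
        for cause, pattern in CAUSE_PATTERNS:
            if pattern.search(line):
                counts[cause] += 1  # count each line at most once per cause
    return [cause for cause, _ in counts.most_common()]
```

A fixed keyword list like this misses anything it was not written for, which is part of why an explanation grounded in collected telemetry and known patterns, as described above, is useful.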

Increase self-sufficiency of application developers

Not all application developers have the necessary Kubernetes knowledge to troubleshoot issues that affect their individual services. The AI explanations from Kubernetes Active Remediation can serve as a useful learning tool for developers who might be unfamiliar with Kubernetes.

Platform teams that enable downstream application teams can use Kubernetes Active Remediation as an educational resource in their onboarding and triage processes. This use of AI explanations helps application developers become more self-sufficient in troubleshooting issues and frees up the platform teams to focus on higher-priority work.

Implement faster, smarter issue detection and response today

Kubernetes Active Remediation empowers your organization to respond to incidents and take corrective action more quickly and confidently. AI-powered reasoning enables teams of all experience levels to easily understand underlying problems, accelerating root cause analysis and fixes. As your teams become more efficient at addressing complex Kubernetes issues, reduced MTTR leads to fewer escalations and better uptime for your customers.

Sign up to join the Preview for Kubernetes Active Remediation and receive feature updates. Not a Datadog user yet? Get started today.

Related Articles

Automatically identify issues and generate fixes with the Bits AI Dev Agent


Java on containers: a guide to efficient deployment


How to support a growing Kubernetes cluster with a small etcd


Key metrics for monitoring etcd

