
David Lentz
In Part 1 of this series, we explored how Karpenter’s architecture enables just-in-time provisioning and active node consolidation. Because Karpenter is constantly making infrastructure decisions based on real-time scheduling pressure, its metrics can give you early warning of provisioning slowdowns, cloud API throttling, and misconfigurations that prevent it from scaling the way you expect. In this post, we’ll show you key metrics you can monitor to understand Karpenter’s behavior and performance. As you collect Karpenter metrics, note that each one is marked as STABLE, BETA, ALPHA, or DEPRECATED. BETA and ALPHA metrics are useful, but they’re more likely to change across versions, so you should treat them as a signal to double-check your dashboards after upgrades.
Track Karpenter metrics to monitor performance
Karpenter exposes Prometheus-formatted metrics via an HTTP endpoint at /metrics on the Karpenter controller. The default metrics port is 8080, which you can override at install time via the METRICS_PORT environment variable.
You can collect Karpenter metrics in either of two ways. If you use the Prometheus Operator with a ServiceMonitor, you can determine the metrics endpoint port by examining the Karpenter service, such as with this command:
kubectl -n karpenter get svc karpenter -o wide
If you use a standard Prometheus scrape config to collect metrics, you can determine the port by inspecting the controller’s METRICS_PORT setting.
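As an illustrative sketch, a minimal scrape config for Karpenter might look like the following. The job name, namespace, and service port name are assumptions based on a default Helm install and the default port of 8080; adjust them to match your deployment:

```yaml
scrape_configs:
  - job_name: karpenter
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [karpenter]   # assumes Karpenter runs in the "karpenter" namespace
    relabel_configs:
      # Keep only the Karpenter service's metrics endpoint; the port name
      # "http-metrics" is an assumption, so verify it against your service
      - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        regex: karpenter;http-metrics
        action: keep
```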
In this section, we’ll describe the key metrics you should monitor to track Karpenter’s health and performance. Specifically, we’ll look at metrics from these categories:
- Scheduling and pod life cycle metrics
- Disruption and consolidation metrics
- Cloud provider metrics
- Controller internals and cluster state metrics
- Cost optimization and interruption metrics
Scheduling and pod life cycle metrics
If Karpenter is working well, you should see a predictable pattern during scale-out: The Kubernetes scheduler marks pods unschedulable, Karpenter reacts by creating capacity, nodes join the cluster, and pods transition to running. The metrics in this section help you measure that end-to-end experience and then pinpoint the source of any latency that arises.
Metric to alert on: karpenter_pods_startup_duration_seconds
This metric measures the total time it takes Karpenter to provision capacity: it spans the whole path from a pod being created to that pod reaching a running state. If your primary goal is to alert when users are likely to feel latency in the scaling process, this is the metric to watch, because it reflects what workloads actually experience rather than what any single component is doing. If you observe an increase in this metric, look for the cause in upstream processes such as Karpenter’s scheduling simulation, cloud API latency or errors, or node life cycle delays.
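For example, a Prometheus alerting rule on the mean startup time might look like the sketch below. It uses the metric's _sum and _count series (which exist whether the metric is exposed as a histogram or a summary), and the threshold and windows are placeholders to tune for your environment:

```yaml
groups:
  - name: karpenter-scheduling
    rules:
      - alert: KarpenterPodStartupSlow
        # Mean pod startup time over 10 minutes; the 300s threshold is a placeholder
        expr: |
          rate(karpenter_pods_startup_duration_seconds_sum[10m])
            /
          rate(karpenter_pods_startup_duration_seconds_count[10m]) > 300
        for: 10m
        labels:
          severity: warning
```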
Metric to watch: karpenter_scheduler_scheduling_duration_seconds
Karpenter’s scheduler simulation time helps you understand whether a delay is occurring before Karpenter reaches out to the cloud provider. You may see this metric increase when Karpenter has to evaluate more possibilities to satisfy constraints—the rules that limit which nodes a pod can run on and what capacity Karpenter is allowed to provision. Those constraints can come from the workload (for example, node selectors, taints and tolerations, or large resource requests), from your NodePool requirements (such as restricting instance families, zones, or capacity type), or from placement rules that narrow options (like pod anti-affinity or topology spread). The tighter the constraints, the smaller the set of valid options and the more work Karpenter may need to do before it finds a viable match.
Metric to alert on: karpenter_scheduler_queue_depth
Queue depth tracks the number of pods currently waiting to be scheduled by Karpenter. A queue that rises briefly during bursts and then drains is normal. But if the queue grows and stays elevated, Karpenter isn’t keeping up. That can happen because Karpenter is taking longer than usual to evaluate feasible capacity (often due to more complex scheduling requirements), because it’s retrying failed requests, or because it’s blocked downstream—for example, when the cloud provider can’t supply capacity that matches your requirements.
This metric provides an early warning of issues that cause Karpenter to fall behind, and it often signals the problem before the worst impact is visible. You should investigate the cause of the slowdown by looking for correlated latency or errors in Karpenter’s scheduling, and for cloud provider errors that indicate that the requested capacity is unavailable.
To pinpoint the cause, correlate it with karpenter_scheduler_scheduling_duration_seconds to see if scheduling is slow. You can also look for correlations with the karpenter_cloudprovider_duration_seconds and karpenter_cloudprovider_errors_total metrics to see if the calls to the cloud provider API are slow or failing.
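To alert on a sustained backlog rather than a brief burst, you could use a rule along these lines (the threshold and the for: duration are placeholders, not recommendations):

```yaml
groups:
  - name: karpenter-queue
    rules:
      - alert: KarpenterSchedulerQueueBacklog
        # Fires only if the queue stays elevated, ignoring short bursts
        expr: karpenter_scheduler_queue_depth > 50
        for: 15m
```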
Disruption and consolidation metrics
Karpenter’s disruption features—consolidation, drift remediation, and other voluntary node replacement—are where cost and efficiency gains often come from. The tricky part is that disruption depends on factors outside Karpenter’s control (evictions, budgets, and workload behavior). The metrics in this section reveal how actively Karpenter is optimizing the cluster by removing underutilized or drifted nodes.
Metric to watch: karpenter_voluntary_disruption_eligible_nodes
This metric indicates whether Karpenter is finding opportunities to save money. A consistently large number of eligible nodes means Karpenter is identifying candidates for disruption but is unable to disrupt them. This can happen if pods on the node are protected by PodDisruptionBudgets (PDBs) or if Karpenter can’t create suitable replacement capacity.
A low count in a well-packed cluster is normal. But few disruption-eligible nodes in a cluster with visibly underutilized nodes (high karpenter_nodes_allocatable but low karpenter_nodes_total_pod_requests) may indicate that consolidation is disabled or blocked.
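To spot that pattern, you can record the ratio of requested to allocatable CPU across the cluster, as in this sketch. The resource_type label value is an assumption; check the label names your Karpenter version actually emits:

```yaml
groups:
  - name: karpenter-utilization
    rules:
      # Cluster-wide CPU request ratio; values well below 1 alongside few
      # disruption-eligible nodes suggest consolidation is disabled or blocked
      - record: karpenter:cpu_request_ratio
        expr: |
          sum(karpenter_nodes_total_pod_requests{resource_type="cpu"})
            /
          sum(karpenter_nodes_allocatable{resource_type="cpu"})
```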
Metric to alert on: karpenter_nodeclaims_termination_duration_seconds
This metric measures the time from a deletion request to the final removal of the NodeClaim. Ideally, nodes get deleted quickly to head off unnecessary cloud costs. But if the process of draining workloads from the node is stalling—which can happen if a PodDisruptionBudget blocks eviction—termination duration (and cloud costs) can remain elevated. If termination duration tail latency rises, look in your Karpenter logs for contributors such as workloads that can’t be evicted, stuck finalizers, or prolonged draining behavior.
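As a starting point, you might alert on the mean termination time, again using the metric's _sum and _count series; the 30-minute threshold is a placeholder:

```yaml
groups:
  - name: karpenter-termination
    rules:
      - alert: KarpenterNodeTerminationStalled
        # Mean NodeClaim termination time over the last hour
        expr: |
          rate(karpenter_nodeclaims_termination_duration_seconds_sum[1h])
            /
          rate(karpenter_nodeclaims_termination_duration_seconds_count[1h]) > 1800
        for: 30m
```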
Metrics to watch: karpenter_nodeclaims_created_total, karpenter_nodeclaims_terminated_total
Use these counters to confirm that Karpenter is both adding capacity when needed and removing underutilized nodes when it can. Start by monitoring NodeClaims created, which increments whenever Karpenter creates a NodeClaim in response to scheduling demand. Pair it with the terminated metric to track the raw volume of node churn over time. If Karpenter is not terminating instances even when karpenter_voluntary_disruption_eligible_nodes is above zero, the termination controller is blocked. You can investigate this issue by looking at Karpenter logs and the karpenter_nodeclaims_termination_duration_seconds metric.
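To make churn easy to chart, you can precompute hourly creation and termination rates with recording rules like these (the rule names are illustrative):

```yaml
groups:
  - name: karpenter-churn
    rules:
      - record: karpenter:nodeclaims_created:rate1h
        expr: sum(rate(karpenter_nodeclaims_created_total[1h]))
      - record: karpenter:nodeclaims_terminated:rate1h
        expr: sum(rate(karpenter_nodeclaims_terminated_total[1h]))
```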
Metric to alert on: karpenter_nodeclaims_disrupted_total (ALPHA)
To understand why nodes are turning over—for example, whether due to consolidation, drift remediation, or expiration—you can look at this metric’s reason label.
A rise in karpenter_nodeclaims_disrupted_total{reason="registration_timeout"} indicates that nodes were increasingly unable to join the cluster. You’ll see a corresponding increase in the karpenter_nodeclaims_created_total metric but no increase in usable nodes.
You should alert on a rise in these timeouts. When this alert fires, you know the issue isn’t with Karpenter’s scheduling decisions, and you should instead investigate node launch failure modes like IAM or UserData misconfigurations, networking blockages, or bad AMIs. Because this metric is marked as ALPHA, you should verify after each Karpenter upgrade that the metric name and reason label values are unchanged before relying on this alert.
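A sketch of such an alert is below. Because the metric is ALPHA, treat the metric name and the reason label value as version-dependent and re-verify them after upgrades:

```yaml
groups:
  - name: karpenter-registration
    rules:
      - alert: KarpenterNodeRegistrationTimeouts
        # Any registration timeout in the last 30 minutes is worth investigating
        expr: |
          increase(karpenter_nodeclaims_disrupted_total{reason="registration_timeout"}[30m]) > 0
```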
Cloud provider metrics
You may experience Karpenter issues that are actually the effect of cloud provider limits that arise during rapid provisioning. These include capacity shortages, quota limits, throttling, and API latency. The metrics in this section can surface the root cause of Karpenter performance lags by highlighting the specific error rates and latencies within your cloud provider’s provisioning APIs.
Metrics to watch: karpenter_cloudprovider_errors_total, karpenter_cloudprovider_duration_seconds
These metrics surface problems that often look like Karpenter issues—such as pending pods, slow scale-out, or stalled replacements—but are actually driven by API failures or latency in the cloud provider layer.
The karpenter_cloudprovider_errors_total metric counts Karpenter requests to the cloud provider API that result in an error. You can filter on the metric’s error label to understand why requests are failing. For example, to track throttling specifically, look at karpenter_cloudprovider_errors_total{error="RequestLimitExceeded"}. To find requests that failed because the desired instance type was unavailable, look at error="InsufficientInstanceCapacity".
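For example, an alert on sustained API throttling might look like this sketch (the window and the for: duration are placeholders):

```yaml
groups:
  - name: karpenter-cloudprovider
    rules:
      - alert: KarpenterCloudAPIThrottled
        # Fires when throttling errors occur continuously, not just once
        expr: sum(rate(karpenter_cloudprovider_errors_total{error="RequestLimitExceeded"}[5m])) > 0
        for: 10m
```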
The karpenter_cloudprovider_duration_seconds metric measures the latency of Karpenter’s requests to the cloud provider API. An increase here indicates that scale-out will slow down even when requests are succeeding. The slowdown compounds as Karpenter has more work to do and the rate of API calls goes up.
You can correlate these two metrics with the end-to-end scale-out metric, karpenter_pods_startup_duration_seconds. If startup is slowing and cloud provider errors are rising, Karpenter is being blocked by cloud API failures. When both startup duration and cloud provider duration increase, the issue is probably with the cloud provider, not with Karpenter’s scheduling performance.
Controller internals and cluster state metrics
These metrics provide critical information about Karpenter’s health to help you see whether Karpenter is keeping up with the rate of change in your cluster. Specifically, the metrics here can help you understand whether the Kubernetes API latency plays a role in any observed Karpenter latency.
Metric to alert on: controller_runtime_reconcile_time_seconds
This metric measures the duration of Karpenter’s reconciliation loops across all of its controllers, covering work such as detecting pending pods, running scheduling simulations, creating NodeClaims, and managing disruption. An increase in the time Karpenter takes to process each unit of work can indicate that the controller is overwhelmed or waiting on slow Kubernetes API calls. Use the controller label to isolate which reconciler is slow: a sustained increase in the provisioner controller indicates scheduling pressure, while an increase in the disruption controller may reflect PDB contention or blocked consolidation.
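controller_runtime_reconcile_time_seconds is the standard controller-runtime histogram, so you can break out a per-controller p95 with a recording rule like this sketch (the rule name and window are illustrative):

```yaml
groups:
  - name: karpenter-controllers
    rules:
      # p95 reconcile latency per controller over the last 10 minutes
      - record: karpenter:reconcile_time_p95:by_controller
        expr: |
          histogram_quantile(0.95,
            sum(rate(controller_runtime_reconcile_time_seconds_bucket[10m])) by (le, controller)
          )
```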
Metrics to watch: controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total
The counts of reconciliations and errors give you a quick understanding of Karpenter’s throughput and health. Watch these alongside reconcile time when you’re troubleshooting slow provisioning. If reconcile time and errors rise together for the controllers that reconcile pods, NodeClaims, and NodePools, Karpenter may be stuck in a retry loop. But if reconcile time rises while these remain flat, look for cloud provider latency or other downstream bottlenecks.
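To watch error rates in proportion to throughput, you can record a per-controller error ratio, as in this sketch (the rule name is illustrative):

```yaml
groups:
  - name: karpenter-reconcile-errors
    rules:
      # Fraction of reconciliations that errored, per controller
      - record: karpenter:reconcile_error_ratio:by_controller
        expr: |
          sum(rate(controller_runtime_reconcile_errors_total[10m])) by (controller)
            /
          sum(rate(controller_runtime_reconcile_total[10m])) by (controller)
```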
Metric to watch: workqueue_depth
If the depth of the controller’s work queue is consistently high, it signals that the controller can’t keep up with the pace of changes in the cluster. Karpenter can fall behind if its reconciliation loop is slowing down (see controller_runtime_reconcile_time_seconds) or if the controller is blocked and retrying. As a result, Karpenter is slow to replace nodes and scale out your application, even when cloud-provider capacity is available.
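A simple sustained-backlog alert might look like the following; the threshold is a placeholder, and the name label is the standard workqueue label that identifies which queue is backed up:

```yaml
groups:
  - name: karpenter-workqueue
    rules:
      - alert: KarpenterWorkqueueBacklog
        # Fires only when a queue stays deep, not on momentary spikes
        expr: sum(workqueue_depth) by (name) > 100
        for: 15m
```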
Cost optimization and interruption metrics
Karpenter optimizes your Kubernetes operations in two ways: It cuts costs through instance consolidation and shields your workloads from node churn. The metrics in this section can help you validate that Karpenter is making cost-aware choices and ensure that its interruption frequency doesn’t contribute to instability.
Metric to watch: karpenter_cloudprovider_instance_type_offering_price_estimate
This metric provides the estimated price for instance types that Karpenter considers. Compare this metric to cloud provider prices to verify that you’re launching cost-efficient instances, especially after you change NodePool requirements.
Note, however, that this metric has high cardinality. Its labels can include instance type, region, zone, and capacity type, so enabling it broadly can increase your monitoring costs. You can manage this by filtering aggressively and limiting the metric to ad hoc inquiries rather than ongoing dashboard use.
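If you decide not to collect this metric continuously, you can drop it at scrape time with a metric relabeling rule like this sketch (shown as a fragment to merge into your existing Karpenter scrape job):

```yaml
scrape_configs:
  - job_name: karpenter
    # ...your existing discovery and relabeling config...
    metric_relabel_configs:
      # Drop the high-cardinality price-estimate series before ingestion
      - source_labels: [__name__]
        regex: karpenter_cloudprovider_instance_type_offering_price_estimate
        action: drop
```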
Metric to watch: karpenter_interruption_received_messages_total
This metric counts the number of Spot interruption or maintenance events received from the cloud provider. If you see this rise, it means that your cluster is churning more frequently than usual. This increased rate of node replacements can create brief capacity gaps—where nodes are terminated faster than replacement nodes register—so more pods may be stuck pending. That volatility can impact your application’s performance.
A corresponding rise in karpenter_pods_startup_duration_seconds confirms that these interruptions aren’t just transparent cost optimization, but are contributing to measurable workload impact.
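To catch interruption spikes early, you could alert on the message rate, as in this sketch (the threshold of roughly three messages per minute is an arbitrary placeholder):

```yaml
groups:
  - name: karpenter-interruption
    rules:
      - alert: KarpenterInterruptionSpike
        # Sustained rate of Spot interruption/maintenance messages
        expr: sum(rate(karpenter_interruption_received_messages_total[15m])) > 0.05
        for: 15m
```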
Gain visibility into your just-in-time provisioning
Monitoring Karpenter requires more than just tracking CPU usage; it requires visibility into the decisions your autoscaler is making. By correlating scheduling latency with cloud provider errors and consolidation logs, you can ensure your cluster remains both performant and cost efficient. In Part 3 of this series, we’ll look at vendor-agnostic tooling for monitoring Karpenter. In Part 4, we’ll explore how to visualize these costs by using Datadog Cloud Cost Management to see exactly how your NodePool policies impact your bottom line.
