
How Delivery Hero uses Kubecost and Datadog to manage Kubernetes costs in the cloud

Guto Costa, Staff Site Reliability Engineer, Delivery Hero
Smit Thakkar, Site Reliability Engineer, Delivery Hero

Published: March 1, 2023

This is a guest post by Guto Costa and Smit Thakkar, Site Reliability Engineers at Delivery Hero.

As the world’s leading local delivery platform, Delivery Hero brings groceries and household goods to customers in more than 70 countries. Their technology stack comprises over 200 services across 20 Kubernetes clusters running on Amazon EKS. This cloud-based, containerized infrastructure enabled them to scale their operation to support increasing demand as the volume of orders placed on their platform doubled during the pandemic.

But operating shared Kubernetes clusters made it difficult for Delivery Hero to fully understand their cloud costs, since each Amazon EC2 instance in a cluster might host pods from more than one service. As a result, the organization could see how much they were spending on EC2, but individual teams could not see how much of that spend was attributable to their service.

To equip engineering teams to meet their goals for reducing the cloud cost of each order placed on the platform, Delivery Hero needed to provide them with visibility into their services’ cloud costs and resource usage. Delivery Hero already relies on Datadog for observability, and they didn’t want to introduce an additional tool for cloud cost visibility.

In this post, we’ll detail how Delivery Hero visualizes their cloud spend using custom Datadog dashboards that combine Kubernetes usage metrics with detailed cost data. Later, we’ll look at how they use memory and CPU request recommendations from Vertical Pod Autoscaler to evaluate and revise their clusters’ resource allocations. But first we’ll show you how Delivery Hero tracks the cost of each component in their Kubernetes environment through Kubecost.

A Datadog dashboard shows Delivery Hero's Kubernetes cost data, including memory cost, CPU cost, and cost per squad.

Collecting cost data with Kubecost

Kubecost is a service that helps you monitor and visualize the costs of your cloud-based Kubernetes infrastructure. It combines Kubernetes resource usage data—the amount of CPU and memory used by each pod—with AWS, Azure, or GCP pricing data to determine the hourly cost per pod. Kubecost can also monitor the cost of operating Kubernetes services, namespaces, and deployments by aggregating pod-level costs across native Kubernetes constructs.
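To make that cost model concrete, consider a hypothetical example (the prices here are illustrative, not actual AWS rates): if a node's CPU costs $0.02 per vCPU-hour and its memory costs $0.005 per GB-hour, a pod allocated 0.5 vCPU and 2 GB of memory costs (0.5 × $0.02) + (2 × $0.005) = $0.02 per hour, or about $14.60 over a 730-hour month.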

When monitoring EKS costs, by default Kubecost uses AWS’s public API to get pricing information. However, Delivery Hero instead leverages Kubecost’s integration with the AWS Cost and Usage Report (CUR), which enables them to see not only their usage of AWS services, but also their customized unit cost for each of those services. By using cost data from their own CUR rather than the public API, Delivery Hero is able to account for custom pricing such as EC2 Spot Instances, Reserved Instances, and Savings Plans.
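The snippet below sketches what this integration looks like in Kubecost's Helm values, based on Kubecost's AWS integration documentation. Every value is a placeholder rather than Delivery Hero's actual configuration, and key names can vary between chart versions.

# Sketch of Kubecost Helm values for the AWS CUR/Athena integration.
# All values are placeholders; key names may differ by chart version.
kubecostProductConfigs:
  athenaProjectID: "<AWS_ACCOUNT_ID>"            # account that owns the Athena resources
  athenaBucketName: "s3://<ATHENA_RESULTS_BUCKET>"
  athenaRegion: "<AWS_REGION>"
  athenaDatabase: "<CUR_GLUE_DATABASE>"
  athenaTable: "<CUR_TABLE>"
  awsSpotDataBucket: "<SPOT_DATA_FEED_BUCKET>"   # EC2 Spot Instance data feed
  awsSpotDataRegion: "<AWS_REGION>"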

To ensure a consistent level of cost-observability across all of their existing and future clusters, Delivery Hero uses Terraform and Helm to create a repeatable process for deploying Kubecost and configuring it to send metrics to Datadog. Before installing Kubecost in each cluster, they use a Terraform module to create the necessary AWS dependencies. These include the CUR and an associated Amazon S3 bucket, AWS Glue and Amazon Athena artifacts to provide SQL access to the CUR data, an EC2 Spot Instance data feed, and IAM roles and policies to provision access to these components.

Once the necessary AWS dependencies have been created, Delivery Hero leverages a Helm chart to install and configure Kubecost. They use Datadog's OpenMetrics check to collect Kubecost metrics, which Kubecost exposes in Prometheus format. To configure that check, they provide a custom podAnnotations item in their Helm values that points the Agent at the Prometheus endpoint—/metrics—on each Kubecost pod. The annotations also specify which metrics the Agent will collect and designate a namespace for them (kubecost), as shown in the code snippet below.

[...]
podAnnotations:
  ad.datadoghq.com/<CONTAINER_ID>.checks: |
    {
      "openmetrics": {
        "init_config": {},
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:%%port%%/metrics",
            "namespace": "kubecost",
            "metrics": [
              {
                "node_cpu_hourly_cost": "node_cpu_hourly_cost",
                "node_ram_hourly_cost": "node_ram_hourly_cost"
              }
            ]
          }
        ]
      }
    }

If you’re interested in learning more about using annotations to configure data collection from a Prometheus endpoint, see our OpenMetrics documentation.

By configuring the Agent to collect Kubecost metrics, Delivery Hero is able to visualize cost data on custom dashboards in Datadog. This allows engineering teams to view it alongside Kubernetes metrics and infrastructure metrics within the platform they already use for monitoring and alerting. While this enabled teams to see the CPU and memory costs they incurred operating their pods, services, and deployments, it didn’t help them analyze how those resources were being used, or provide guidance on how to optimize resource allocation to reduce cloud costs. To visualize wasted expenditures and guide teams toward allocating Kubernetes resources more efficiently, Delivery Hero expanded their dashboards to include data from Kubernetes’ Vertical Pod Autoscaler.

Gaining insight through Vertical Pod Autoscaler

Vertical Pod Autoscaler (VPA) is a Kubernetes object that calculates the optimal memory and CPU requests for containers based on their resource usage over time. The Kubernetes scheduler uses the CPU and memory requests in a pod’s specification to determine which node should run the pod. Request values that are too high can prevent the scheduler from placing pods efficiently by forcing it to select nodes with more capacity than is strictly necessary to run the workload. This can cause the cluster autoscaler to add nodes—each with more resources than necessary—which increases the overall cost to operate the service.
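As a generic illustration (this is not one of Delivery Hero's manifests), the requests and limits the scheduler works from are declared per container in the pod spec:

# A node needs at least 500m of CPU and 1Gi of memory unreserved
# by other pods' requests to be eligible to run this pod.
apiVersion: v1
kind: Pod
metadata:
  name: example-service
spec:
  containers:
    - name: app
      image: example/app:latest
      resources:
        requests:
          cpu: "500m"     # half a vCPU reserved at scheduling time
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"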

By default, VPA automatically evicts pods from the cluster if it determines that their memory or CPU is under- or overprovisioned and replaces them with new ones of optimal size. But Delivery Hero uses VPA in recommendation mode, so that it only recommends optimal resource requests without making any changes to the cluster.
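A minimal sketch of a VPA object in recommendation mode looks like the following (the target workload name is hypothetical); setting updateMode to "Off" tells VPA to publish recommendations without evicting any pods:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-service   # hypothetical workload
  updatePolicy:
    updateMode: "Off"       # recommend only; never evict or resize pods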

Visualizing usage, cost, and waste on custom dashboards

With dashboards that display Kubernetes resource metrics, Kubecost data, and VPA recommendations in one place, Delivery Hero’s engineering teams can easily see whether they’re overprovisioning resources. Visualizing the gap between resource requests and usage helps teams identify waste, and surfacing VPA recommendations gives them specific guidance on how to minimize that waste.

The screenshot below shows an excerpt of a dashboard that visualizes the compute cost efficiency of a Kubernetes service. The line graph shows metrics collected by the Datadog Kubernetes integration that illustrate the CPU usage, requests, and limits of the pod with the highest CPU utilization. It also visualizes the kubernetes_state.vpa.upperbound metric, which is collected by the Kubernetes State Metrics Core integration and tracks the VPA’s recommended maximum value for CPU requests across the cluster.

A dashboard shows CPU resource usage data.

If a pod’s CPU request is greater than its CPU utilization, this graph makes it clear that the pod is wasting compute resources. The query value widget named Pod’s CPU Waste shows the percentage of requested CPU cores that go unutilized. In this example, because the CPU usage metric is greater than the CPU request metric throughout the time frame visualized, the widget indicates that there is no CPU waste.
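The post doesn't show the exact query behind this widget, but one plausible construction (a sketch, with a placeholder service tag) combines two metrics from Datadog's Kubernetes integration; kubernetes.cpu.usage.total is reported in nanocores, hence the division:

a = sum:kubernetes.cpu.requests{service:<SERVICE>}
b = sum:kubernetes.cpu.usage.total{service:<SERVICE>}   # reported in nanocores
Pod's CPU Waste (%) = max(0, (a − b / 10^9) / a × 100)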

Similarly, the dashboard shown below makes it easy to see the relationship between the service’s memory requests, usage, and recommendations. It’s clear that in this example the service’s memory usage is consistently below the amount requested, indicating that the team can reduce the request values in their pod specs to save money without affecting the service’s performance. The graph also surfaces VPA’s recommended memory allocation, which can guide the team in determining how far they can safely reduce the requests.

A dashboard shows memory resource usage data.

The dashboards also display the dollar amount of wasted cloud spend that results from overprovisioned resources. This is the difference between the amount of resources requested and the amount used across all of the service’s pods, multiplied by that resource’s hourly cost (as reported by Kubecost). For example, if a service’s memory request is greater than its memory usage, the dashboard query multiplies that difference by Kubecost’s hourly memory cost metric (kubecost.node_ram_hourly_cost) to surface the amount of money wasted by that overprovisioning. This appears on the dashboard in the Overprovisioned Memory query value widget, as shown below.

A dashboard shows memory cost data.
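As an illustrative calculation (the numbers are hypothetical): if a service’s pods collectively request 12 GB of memory but use only 4 GB, and kubecost.node_ram_hourly_cost reports $0.005 per GB-hour, the widget would show (12 − 4) × $0.005 = $0.04 of wasted spend per hour, or roughly $29 over a 730-hour month.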

By making it easy to see overprovisioned Kubernetes resources and expressing that waste as a dollar amount, Delivery Hero aims to keep teams informed of their services’ efficiency and motivate them to optimize cloud spend by sizing their containers appropriately.

Greater visibility drives cost optimization

Custom dashboards that combine Kubecost, VPA, and Kubernetes metrics are one component in Delivery Hero’s strategy to improve the cost efficiency of their Kubernetes services. Teams consider the VPA’s resource recommendations—along with the historical performance of the service and the cost per order—when they decide whether and how to revise the resource requests of their services.

In some cases, the dashboards have shown that a service spanning multiple regions uses different amounts of resources in different geographies due to varying usage patterns. Teams have found that they can gain even more cost efficiency from their services by analyzing usage and waste in each region and adjusting resource allocations accordingly.

Increasing teams’ cost visibility has contributed to a 10 percent decrease in Delivery Hero’s cloud costs over 48 days, which is an initial step in a larger FinOps goal of reducing overall cloud costs by 30 percent. To keep the focus on efficiency rather than the overall cloud expenditure, teams have set goals for reducing the cloud cost incurred each time their service processes an order.