Key metrics for CoreDNS monitoring

David Lentz

CoreDNS is an open source DNS server that can resolve requests for internet domain names and provide service discovery within a Kubernetes cluster. CoreDNS is the default DNS provider in Kubernetes as of v1.13. Though it can be used independently of Kubernetes, this series will focus on its role in providing Kubernetes service discovery, which simplifies cluster networking by enabling clients to access services using DNS names rather than IP addresses. It's important to monitor CoreDNS to ensure that elevated latency or error rates are not disrupting communication among your services and causing bottlenecks in your application. In this post, we'll walk through the following key categories of CoreDNS metrics you should monitor:

Throughput metrics
Performance metrics
Scaling and resource metrics
Go metrics
Cache metrics

Finally, we'll show you how CoreDNS logs can help you gain further visibility into the performance of your cluster's DNS. But first, we'll explain how CoreDNS works and how it enables service discovery in Kubernetes.

How CoreDNS works

In this section, we'll show you how CoreDNS provides service discovery inside Kubernetes clusters. Then we'll give you an overview of CoreDNS plugins, server configuration, and caching.

How CoreDNS processes requests

DNS servers function as a hierarchy, in which each server responds to requests either by using information it already has or by requesting that information from a server above it in the hierarchy. If the server has the requested information, it acts as a resolver and provides what is known as an authoritative response to the client. Otherwise, it acts as a forwarder by querying an upstream server for that information and returning it as a non-authoritative response. Like many DNS servers, CoreDNS can act as both a resolver and a forwarder.

CoreDNS enables service discovery in the cluster, acting as a resolver and providing authoritative data to pods trying to find services.

To resolve requests for resources outside a Kubernetes cluster—for example, if a pod needs to send a request to an API endpoint on the internet—you can configure CoreDNS to forward those requests to a public resolver, such as 8.8.8.8. If the public resolver doesn't have the requested DNS record, it queries the DNS root zone to find a top-level domain (TLD) server that knows where to find an authoritative name server and forwards the query to that server. CoreDNS can then cache the result to respond more quickly to future requests for the same record. Forwarding and caching are implemented via CoreDNS plugins, which we'll cover in the next section.

The diagram below illustrates how CoreDNS processes two DNS requests from an application pod. In one request, the pod uses the domain name lookup.ecommerce.svc.cluster.local to resolve the address of a service within the cluster. CoreDNS is able to provide the IP address corresponding to that service, so its response to the request is authoritative. In another request, the pod tries to resolve a domain name on the internet, www.shopist.io. CoreDNS does not hold an authoritative DNS record for that domain, so it forwards the request to the upstream server 8.8.8.8, which then forwards the request to the authoritative name server, nameserver.shopist.io. The response is sent back downstream, and CoreDNS provides a non-authoritative response to the client and caches the result.

A diagram shows that CoreDNS gives an authoritative response to a request for a Kubernetes service and a non-authoritative response for a request to an internet site—www.shopist.io.

When a service is created, modified, or deleted, CoreDNS detects that change via the Kubernetes API server. This allows CoreDNS to maintain authoritative and up-to-date DNS data for all services in the cluster.

Pods act as DNS clients, and the DNS servers they communicate with are specified in their resolv.conf file. This file contains a nameserver keyword that tells the pod where to find CoreDNS (i.e., the address of the ClusterIP service that exposes CoreDNS within the cluster). It also includes a search keyword that contains a list of search domains. These are strings that the client will append to the requested domain if it's not a fully qualified domain name (FQDN)—i.e., if it does not include a trailing dot and if it contains fewer dots than specified in the client's configured ndots value (which we describe below). The list of search domains comprises the search path.

A sample resolv.conf file is shown below.

nameserver 10.100.0.10
search ecommerce.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
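To make the three keywords concrete, here is a minimal sketch of how a tool might read them from a resolv.conf file. The helper name parse_resolv_conf is hypothetical, and the parser handles only the keywords discussed here; a real resolver supports more options.

```python
# Minimal sketch: extract the resolv.conf settings discussed in this section.
# parse_resolv_conf is a hypothetical helper, not part of any library.

def parse_resolv_conf(text):
    """Return the nameservers, search domains, and ndots value found in text."""
    # An ndots value of 1 is the usual resolver default when the option is absent.
    config = {"nameservers": [], "search": [], "ndots": 1}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values
        elif keyword == "options":
            for option in values:
                if option.startswith("ndots:"):
                    config["ndots"] = int(option.split(":", 1)[1])
    return config


SAMPLE = """\
nameserver 10.100.0.10
search ecommerce.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
"""

config = parse_resolv_conf(SAMPLE)
print(config["nameservers"], config["search"], config["ndots"])
```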

The search path allows applications to use shortened domain names to discover services within the cluster. Consider an application that needs to look up an item in an online inventory by calling a service named lookup. If the application tries to resolve the service's IP address using only the service name, the DNS client will iterate through the search path, appending each search domain to expand the abbreviated request and form an FQDN it can resolve.

The first search domain includes the namespace of the application pod, which in this example is ecommerce. The client expands the request for lookup and tries to resolve lookup.ecommerce.svc.cluster.local. If the service exists in the ecommerce namespace, CoreDNS sends a NOERROR response that includes the service's IP address. But if the requested service does not exist there, CoreDNS responds with NXDOMAIN, and the client repeats the process using the next search domain in the search path.

If the lookup service exists in a different namespace—for example, customers—the application can still use an abbreviated name, but must include in its request both the service name and the namespace. In this case, the DNS client will use the second search domain to form the FQDN lookup.customers.svc.cluster.local, and CoreDNS will resolve the request if that service exists in the customers namespace.

The example resolv.conf file shown above also sets the value of the ndots option to 5. This means that the DNS client will automatically consider a domain name to be fully qualified (which will allow it to skip the search path iteration) if it has five or more dots. If the requested domain has fewer than five dots, the client will iterate over the search path until it resolves the request or runs out of search domains. The last search domain in the example file, ec2.internal, is supplied by the cloud provider rather than by Kubernetes. Queries that use it are answered by the cloud provider's DNS server, which is also capable of resolving requests for domain names on the internet.
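The search path and ndots rules described above can be sketched in a few lines of Python. This is a simplified model, not the resolver's actual code: candidate_names is a hypothetical helper, and it does not model whether a given resolver falls back to the search path after an absolute lookup fails, which varies between implementations.

```python
# Simplified model of how a DNS client expands a name using the search path.
# candidate_names is a hypothetical helper written for illustration only.

SEARCH = [
    "ecommerce.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ec2.internal",
]


def candidate_names(name, search, ndots):
    """Return the FQDNs the client would try, in order, for the given name."""
    if name.endswith("."):
        return [name]  # a trailing dot marks the name as fully qualified
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute]  # enough dots: treated as fully qualified
    # Otherwise, try each search domain in turn, then the name as given.
    return [f"{name}.{domain}." for domain in search] + [absolute]


for fqdn in candidate_names("www.shopist.io", SEARCH, ndots=5):
    print(fqdn)
```

With ndots set to 5, www.shopist.io expands to five candidates (one per search domain, then the name itself). With ndots set to 2, the same call returns only www.shopist.io. because the name already contains two dots.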

The CoreDNS log lines below illustrate the sequence of requests resulting from a client lookup of an address outside the cluster, www.shopist.io. The log shows five requests: one for each item in the search path, plus a final request that treats www.shopist.io as an FQDN. Notice that the first four responses include NXDOMAIN, which is the DNS response code indicating that the requested domain name does not exist.

[INFO] 10.244.0.19:36249 - 518 "A IN www.shopist.io.ecommerce.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000313196s
[INFO] 10.244.0.19:33212 - 44373 "A IN www.shopist.io.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.000209746s
[INFO] 10.244.0.19:59367 - 19133 "A IN www.shopist.io.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000214739s
[INFO] 10.244.0.19:39809 - 20260 "A IN www.shopist.io.ec2.internal. udp 45 false 512" NXDOMAIN qr,rd,ra 120 0.013290877s
[INFO] 10.244.0.19:51397 - 13283 "A IN www.shopist.io. udp 32 false 512" NOERROR qr,rd,ra 152 0.034610018s
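If you want to analyze log lines like these programmatically, a regular expression is often enough. The sketch below assumes every line has the same shape as the five lines above. The field names (query_id, do_bit, bufsize, and so on) are our own labels for each position, and parse_log_line is a hypothetical helper.

```python
import re
from collections import Counter

# Pattern for query log lines shaped like the samples above. The group names
# are illustrative labels, not identifiers defined by CoreDNS.
LOG_PATTERN = re.compile(
    r'^\[(?P<level>\w+)\] (?P<client>\S+):(?P<port>\d+) - (?P<query_id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do_bit>\S+) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<flags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["duration"] = float(fields["duration"])  # seconds
    # The aa flag marks an authoritative answer.
    fields["authoritative"] = "aa" in fields["flags"].split(",")
    return fields


LOG_LINES = """\
[INFO] 10.244.0.19:36249 - 518 "A IN www.shopist.io.ecommerce.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd 151 0.000313196s
[INFO] 10.244.0.19:33212 - 44373 "A IN www.shopist.io.svc.cluster.local. udp 50 false 512" NXDOMAIN qr,aa,rd 143 0.000209746s
[INFO] 10.244.0.19:59367 - 19133 "A IN www.shopist.io.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000214739s
[INFO] 10.244.0.19:39809 - 20260 "A IN www.shopist.io.ec2.internal. udp 45 false 512" NXDOMAIN qr,rd,ra 120 0.013290877s
[INFO] 10.244.0.19:51397 - 13283 "A IN www.shopist.io. udp 32 false 512" NOERROR qr,rd,ra 152 0.034610018s
"""

entries = [parse_log_line(line) for line in LOG_LINES.splitlines()]
rcode_counts = Counter(entry["rcode"] for entry in entries)
print(rcode_counts)
```

Counting by RCODE confirms the pattern described above: four NXDOMAIN responses followed by one NOERROR. The authoritative field also shows that only the three cluster.local lookups were answered authoritatively, while the last two were forwarded upstream.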

By setting ndots to a lower value, you might reduce the number of DNS queries CoreDNS receives for each lookup. For example, if ndots equals 2, the DNS client will immediately recognize www.shopist.io as fully qualified, and CoreDNS can resolve it with a single DNS request outside the cluster. However, the client would also view a request for a domain like recommendation.ecommerce.svc as an FQDN (which CoreDNS would be unable to resolve), so you may need to refactor your applications to call services using an unabbreviated domain name such as recommendation.ecommerce.svc.cluster.local.

CoreDNS plugins

To meet a wide range of use cases, CoreDNS uses a customizable collection of plugins that determine how it processes and responds to requests. Plugins operate in a sequence—or plugin chain—and they handle each incoming request by either responding to it or passing it on to the next plugin in the chain. Plugins can modify a request before passing it to the next plugin. They can also modify a response before returning it to the client.
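The plugin chain follows a familiar pattern: each handler wraps the next one. The Python sketch below is a conceptual model only. CoreDNS plugins are written in Go against its own plugin interface, and every name here (make_chain, lowercase, static_records, annotate, the sample record, and the SERVFAIL fallback) is an illustrative assumption.

```python
# Conceptual model of a plugin chain. This is not the CoreDNS plugin API.

def make_chain(plugins):
    """Link plugins so that each one receives the next handler in the chain."""
    def end_of_chain(request):
        # No plugin answered; this sketch assumes a SERVFAIL fallback.
        return {"name": request["name"], "rcode": "SERVFAIL"}

    handler = end_of_chain
    for plugin in reversed(plugins):
        handler = plugin(handler)
    return handler


def lowercase(next_handler):
    """Modify the request before passing it on."""
    def handle(request):
        return next_handler({**request, "name": request["name"].lower()})
    return handle


def static_records(records):
    """Respond if the name is known; otherwise pass the request on."""
    def plugin(next_handler):
        def handle(request):
            name = request["name"]
            if name in records:
                return {"name": name, "rcode": "NOERROR", "answer": records[name]}
            return next_handler(request)
        return handle
    return plugin


def annotate(next_handler):
    """Modify the response before returning it to the client."""
    def handle(request):
        return {**next_handler(request), "handled_by_chain": True}
    return handle


# Hypothetical record, used only for this example.
RECORDS = {"lookup.ecommerce.svc.cluster.local.": "10.100.20.7"}
chain = make_chain([annotate, lowercase, static_records(RECORDS)])

print(chain({"name": "LOOKUP.ecommerce.svc.cluster.local."}))
print(chain({"name": "missing.example."}))
```

Because each plugin wraps everything after it, the order of the list determines which plugins see a request first and which get the last word on the response.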

A subset of CoreDNS's built-in plugins is enabled by default. When CoreDNS is deployed in a Kubernetes cluster, for example, the default plugins allow it to provide service discovery in the cluster, capture errors and metrics, cache responses from upstream servers, and more.

To enable and configure other built-in plugins, you can modify the Corefile, a text file that prescribes CoreDNS's behavior at runtime. Additionally, CoreDNS allows you to include external plugins from sources outside the CoreDNS source code. And you can write your own plugins to meet specific needs and even emit metrics that are useful to you. For example, you could create a plugin that detects the client's IP address and responds differently to requests from internal clients and external clients.

To use external or custom plugins, you first need to specify them in the plugin.cfg file and then recompile CoreDNS. This makes the plugins available to your CoreDNS server. Next, you need to enable the plugins by specifying them in your Corefile.

You will also need to recompile CoreDNS if you ever need to remove plugins or change the sequence of plugins in the chain. Creating a new sequence for the plugin chain can allow you to use the output of one plugin as input for plugins later in the chain. For example, if you create a plugin that can emit an error, you should add that plugin to plugin.cfg ahead of the errors plugin, and then recompile CoreDNS.

CoreDNS servers

The Corefile specifies one or more CoreDNS servers. Each server processes requests for one or more DNS zones—portions of the DNS namespace that hold data about specific domains and subdomains. Servers are defined in stanzas in the Corefile called server blocks, which specify the server's plugin chain and optionally the protocol and port to use.

The code snippet below shows a simple Corefile that defines two servers. The first server uses a single plugin, and the second server uses two plugins.

datadoghq.com {
    file db.datadoghq.com
}
. {
    forward . 8.8.8.8
    cache 30
}

The first server block causes CoreDNS to process requests for any address that ends in datadoghq.com by using the file plugin to read DNS data from the local file, db.datadoghq.com.

The second server block is authoritative for the root zone—indicated by the dot character—which means that the CoreDNS server will process all requests except those for the more specific zone named in the first server block. This server is configured to use the forward plugin to retrieve DNS records from the upstream server, 8.8.8.8. It also configures the cache plugin with a maximum time-to-live (TTL) of 30 seconds, meaning that the server will hold a copy of each record for up to that length of time (though individual DNS records may have shorter TTL values). Neither server block specifies a port or protocol to use, so both servers will use the default port (53) and protocol (UDP).
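The way a query is routed to the most specific matching zone can be illustrated with a short sketch. This is a simplified model of longest-suffix matching, not CoreDNS's implementation, and most_specific_zone is a hypothetical helper.

```python
# Simplified model: choose the server block whose zone most specifically
# matches the query name. most_specific_zone is a hypothetical helper.

def normalize(name):
    """Lowercase a domain name and make sure it ends with a single dot."""
    return name.lower().rstrip(".") + "."


def most_specific_zone(qname, zones):
    """Return the longest zone that qname falls under, or None if none match."""
    qname = normalize(qname)
    best = None
    for zone in map(normalize, zones):
        matches = zone == "." or qname == zone or qname.endswith("." + zone)
        if matches and (best is None or len(zone) > len(best)):
            best = zone
    return best


ZONES = ["datadoghq.com", "."]
print(most_specific_zone("www.datadoghq.com", ZONES))
print(most_specific_zone("www.shopist.io", ZONES))
```

The first query lands in the datadoghq.com server block, and the second falls through to the root zone. Matching on a leading dot ensures that a name such as notdatadoghq.com is not mistaken for a subdomain of datadoghq.com.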

In a Kubernetes cluster, the Corefile takes the form of a ConfigMap. The default ConfigMap defines a single server which is authoritative for the root zone, uses port 53 to process all queries, and enables a subset of the built-in plugins. The code snippet below shows a partial output of the command to display the contents of that ConfigMap—kubectl describe -n kube-system cm coredns.

[...]
Data
====
Corefile:
----
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

The kubernetes plugin enables CoreDNS to provide service discovery for your cluster. The snippet above shows how the kubernetes plugin specifies zones for which it is authoritative. The first zone listed—cluster.local—is the cluster's own domain; all Kubernetes services inside the cluster have DNS names that end in cluster.local. The other two zones listed—in-addr.arpa and ip6.arpa—allow CoreDNS to execute reverse lookups using PTR records.
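To see why the in-addr.arpa and ip6.arpa zones matter, it helps to look at the names a reverse lookup actually queries. Python's standard ipaddress module can generate them. The IPv4 address below is the nameserver from the sample resolv.conf, and the IPv6 address is an arbitrary example.

```python
import ipaddress

# A reverse (PTR) lookup queries a name built from the IP address itself.
ipv4_ptr = ipaddress.ip_address("10.100.0.10").reverse_pointer
ipv6_ptr = ipaddress.ip_address("fd00::a").reverse_pointer  # arbitrary example

print(ipv4_ptr)  # 10.0.100.10.in-addr.arpa
print(ipv6_ptr)
```

IPv4 names reverse the four octets under in-addr.arpa, while IPv6 names reverse every hexadecimal digit of the fully expanded address under ip6.arpa.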

The kubernetes plugin's pods insecure directive allows CoreDNS to resolve requests for pod IP addresses without verifying that the pod exists in the namespace, and is included to make CoreDNS backwards compatible with kube-dns (which was the built-in cluster DNS provider prior to Kubernetes v1.13). The fallthrough directive tells the plugin that in case it is unable to resolve a reverse lookup, it should pass the request on to the next plugin in the chain. Finally, the ttl directive sets a TTL of 30 seconds on the plugin's responses, meaning that clients and caches can hold the results of a lookup for up to 30 seconds.

NodeLocal DNSCache

NodeLocal DNSCache is a Kubernetes add-on that uses a DaemonSet to run a DNS caching pod on each worker node in the cluster, which can speed up the response time for DNS lookups. To provide this local cache, the pods in the DaemonSet also run CoreDNS, so you can track CoreDNS metrics to understand the performance of CoreDNS at the cluster level as well as each NodeLocal DNSCache pod.

When an application pod makes a DNS request, it's served from the NodeLocal cache when possible, reducing latency. The upstream servers specified in the cluster DNS configuration are automatically applied to the CoreDNS pod running the local cache. If the requested data isn't in the local cache, the request is forwarded directly to the upstream server(s) specified at the cluster level.

If the cluster DNS specifies a CoreDNS server that is authoritative for the root zone (as shown in the example below), a NodeLocal DNSCache pod will request a record directly from 8.8.8.8 if it doesn't already have it in the cache, avoiding a call to the cluster DNS.

. {
    forward . 8.8.8.8
}

The response is then saved in the local cache so that future requests can be resolved more quickly.

NodeLocal DNSCache provides a layer of DNS caching close to the client and is optionally part of a multi-layered approach to DNS caching. Later in this post, we'll look at latency metrics and cache metrics. These can help you understand how your cluster's performance might be affected if you enable NodeLocal DNSCache or show you how well it's performing if you've already enabled it.

Key metrics for CoreDNS

So far in this post, we've described how CoreDNS works and how it provides DNS within a Kubernetes cluster. In this section, we'll describe the key metrics you should monitor to track the performance of your CoreDNS service. We'll explore metrics from these categories:

Throughput metrics
Performance metrics
Scaling and resource metrics
Go metrics
Cache metrics

Terminology in this section comes from Datadog's Monitoring 101, a series of blog posts that describe effective monitoring.

Throughput metrics

Throughput metrics show you how much work CoreDNS is doing. The rate at which CoreDNS processes queries and the size of the requests and responses can change with your application's activity. But those changes might also correlate with issues affecting the application's dependencies or underlying infrastructure. You should watch for correlations between the metrics listed in this section and those in the scaling and resource metrics section.

Metric	Description	Metric type	Availability
coredns_dns_requests_total	Total number of queries processed by CoreDNS	Work: Throughput	`prometheus` plugin
coredns_dns_responses_total, coredns_forward_responses_total	The number of responses sent by CoreDNS or by an upstream server, aggregated by response code	Work: Throughput	`prometheus` plugin (for CoreDNS responses) and `forward` plugin (for upstream responses)

Metric to alert on: coredns_dns_requests_total

You can monitor the aggregated rate of requests to understand how busy your CoreDNS service is. This metric is likely to rise as your application sees spikes in activity or during cyclical increases in usage. To accommodate these changes, you can enable horizontal autoscaling on your CoreDNS Deployment to automatically add and remove pods as necessary.
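Because coredns_dns_requests_total is a cumulative counter, you typically convert it to a per-second rate before graphing or alerting on it. Most monitoring systems do this for you. The sketch below shows the arithmetic, assuming two samples taken a known interval apart. The helper name and the sample values are hypothetical.

```python
# Sketch: convert two samples of a cumulative counter into a per-second rate.
# per_second_rate is a hypothetical helper; the values below are made up.

def per_second_rate(previous, current):
    """Each sample is a (timestamp_seconds, counter_value) pair."""
    (t0, v0), (t1, v1) = previous, current
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    # A counter that goes down means the process restarted and began again
    # from zero, so only the new value counts toward the increase.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / elapsed


print(per_second_rate((0, 1000), (60, 4000)))  # 50.0
print(per_second_rate((0, 5000), (60, 600)))   # 10.0 (counter reset)
```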

You can also implement NodeLocal DNSCache to store DNS data locally on each node in your cluster. NodeLocal DNSCache will respond to DNS queries using cached data when possible, which will reduce the rate of requests to your CoreDNS service.

If you see the rate of queries fall unexpectedly without a corresponding dip in application activity, it could indicate that an infrastructure issue or a configuration issue is causing a disruption in your cluster's network.

In a Kubernetes cluster, CoreDNS is exposed as a ClusterIP service, which automatically load balances requests across all CoreDNS pods. If you see that the rate of requests is unequal across your CoreDNS pods, it may indicate that some pods are unavailable.

Datadog's Network Map visualizes the rate of requests sent to different DNS servers.

A network monitoring tool can help you spot relative differences in the rate of DNS queries sent to your CoreDNS service, so you can better understand which parts of your application the traffic is coming from.

Metrics to watch: coredns_dns_responses_total, coredns_forward_responses_total

Each response from a DNS server contains an RCODE—a response code that indicates the outcome of the server's effort to resolve the query. For example, if a client requests a domain that does not exist, the RCODE value contained in the response will be NXDOMAIN. If CoreDNS or an upstream server encounters an error processing the request, the RCODE is SERVFAIL. An RCODE of NOERROR indicates that the request was processed successfully.

By tracking the number of times CoreDNS replies with each RCODE value, you may be able to discover patterns in your application's performance or changes in its health. For example, if you see an increasing number of SERVFAIL responses, you may have a problem with your CoreDNS configuration, and CoreDNS logs can help you troubleshoot that. An increase in NXDOMAIN responses, on the other hand, could indicate that clients are trying to contact a nonexistent endpoint due to an error in the application code.
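As a rough illustration of tracking responses by RCODE, the sketch below totals coredns_dns_responses_total samples from Prometheus-style exposition text and computes the share of SERVFAIL responses. The sample text, its label set, and its values are made up for this example, and your CoreDNS version may expose additional labels.

```python
import re
from collections import defaultdict

# Illustrative sample only: the labels and values here are assumptions.
SAMPLE_METRICS = """\
coredns_dns_responses_total{rcode="NOERROR",server="dns://:53",zone="."} 9200
coredns_dns_responses_total{rcode="NXDOMAIN",server="dns://:53",zone="."} 700
coredns_dns_responses_total{rcode="SERVFAIL",server="dns://:53",zone="."} 100
"""

SAMPLE_LINE = re.compile(r'^coredns_dns_responses_total\{([^}]*)\}\s+(\S+)$')
RCODE_LABEL = re.compile(r'rcode="([^"]*)"')


def responses_by_rcode(text):
    """Sum coredns_dns_responses_total samples, grouped by their rcode label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        sample = SAMPLE_LINE.match(line.strip())
        if sample is None:
            continue  # comment, blank line, or a different metric
        rcode = RCODE_LABEL.search(sample.group(1))
        if rcode is not None:
            totals[rcode.group(1)] += float(sample.group(2))
    return dict(totals)


totals = responses_by_rcode(SAMPLE_METRICS)
servfail_share = totals.get("SERVFAIL", 0.0) / sum(totals.values())
print(totals, servfail_share)
```

In practice you would compute this share over a time window (using the increase in each counter) rather than over the lifetime totals, so that recent changes stand out.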

Performance metrics

CoreDNS performance metrics are a promising place to look when you're troubleshooting application latency or errors. If CoreDNS responds slowly to incoming queries, your application may slow down as clients wait to communicate with upstream servers. If CoreDNS is slow to update its data when Kubernetes makes changes to the cluster's pods or services, clients may receive outdated information such as stale IP addresses, leading to application errors.

Metric	Description	Metric type	Availability
coredns_dns_request_duration_seconds	How long it takes CoreDNS to process a request, in seconds	Work: Performance	`prometheus` plugin
coredns_kubernetes_dns_programming_duration_seconds	The number of seconds between a change in the cluster's pod configuration and a successful update of the CoreDNS data	Work: Performance	`kubernetes` plugin
coredns_forward_request_duration_seconds	How long it takes the upstream server to respond to a forwarded request, in seconds	Work: Performance	`forward` plugin

Metric to alert on: coredns_dns_request_duration_seconds

The time CoreDNS requires to process a request depends on several factors, including the plugins in the chain and the performance of the cluster's network and infrastructure. It can also vary based on the backend. For example, if the DNS data can be fetched from a file, resolution is usually faster than if the backend is a database or an upstream server. You should create an alert to notify you if CoreDNS latency rises substantially beyond a baseline value so that you can troubleshoot and reduce user-facing latency. If you manage your Corefile in a version control system, you may be able to correlate a change in configuration with a change in latency to understand the effect of updating your Corefile.

If you're operating multiple CoreDNS servers, and you configured them to use different plugin chains, you can aggregate request processing time by server to compare their performance. This might help you identify specific plugins that are responsible for the most latency.
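Latency metrics like this one are commonly exposed as histograms (we assume here that your setup reports cumulative bucket counts), so a baseline is usually expressed as a percentile. The sketch below estimates a quantile by interpolating linearly within a bucket, similar in spirit to what Prometheus-style tooling does. The function name, bucket boundaries, and counts are all made up for illustration.

```python
import math

# Sketch: estimate a latency quantile from cumulative histogram buckets.
# The bucket boundaries and counts below are illustrative, not CoreDNS defaults.

def histogram_quantile(q, buckets):
    """buckets: (upper_bound_seconds, cumulative_count) pairs in ascending
    order, ending with an infinite upper bound that holds the total count."""
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # can't interpolate into the open bucket
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


BUCKETS = [(0.001, 600), (0.002, 800), (0.004, 950), (0.008, 990), (math.inf, 1000)]
p50 = histogram_quantile(0.5, BUCKETS)
p90 = histogram_quantile(0.9, BUCKETS)
print(f"p50={p50:.6f}s p90={p90:.6f}s")
```

Comparing a percentile such as p90 against its historical baseline is one way to define the "rises substantially" condition for an alert.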

Metric to alert on: coredns_kubernetes_dns_programming_duration_seconds

CoreDNS watches the Kubernetes API to detect changes in the cluster (for example, to learn that a pod has been removed from a service's list of endpoints). The coredns_kubernetes_dns_programming_duration_seconds metric measures DNS update latency, or how long it takes CoreDNS to account for these updates. If update latency is high, clients may not be able to reach a newly created pod and might receive stale IP addresses when they try to resolve the name of a service. All of this will cause application errors and latency as clients fail to communicate with services. It can also reduce the effectiveness of autoscaling: until CoreDNS has fetched current data from the control plane, it isn't aware of newly added pods and can't direct traffic to them.

You should alert on a sudden rise in this metric, especially if it correlates with increased errors from upstream servers. If the latency is the result of insufficient CPU or memory within your cluster's worker nodes, you may be able to remediate it by scaling out your CoreDNS Deployment or your kube-apiserver.

Metric to watch: coredns_forward_request_duration_seconds

Monitoring this metric can help you understand whether an upstream server is contributing to your application's latency. The forward plugin reports the duration required for the upstream server to resolve requests, broken down by RCODE so you can see, for example, the average time it takes the upstream server to determine that a domain name does not exist (NXDOMAIN) or successfully resolve the request (NOERROR).

You may not have control over the performance of your upstream server—for example, if it's a public DNS service—but if you see an increase in this metric, you may be able to change your CoreDNS configuration to compensate. You can use the cache plugin to reduce latency, and if you're already using it, you may be able to improve your cache hit rate by revising the plugin's settings, as described in a later section. You can also use the cache plugin's prefetch directive to cache data from the upstream server before it's requested, allowing CoreDNS to respond to clients more quickly.

Scaling and resource metrics

You can use horizontal autoscaling to accommodate varying amounts of traffic to your CoreDNS Deployment. You should monitor scaling activity and resource usage metrics to understand whether your autoscaling parameters and CoreDNS container resource requests and limits are appropriate.

Metric	Description	Metric type	Availability
Memory utilization	The amount of memory used by a CoreDNS pod, in bytes	Resource: Utilization	Metrics server
CPU utilization	The amount of CPU used by a CoreDNS pod, in cores	Resource: Utilization	Metrics server
kube_deployment_status_replicas_ready	The number of replicas in a CoreDNS Deployment that are ready to receive requests	Other	kube-state-metrics
coredns_forward_healthcheck_failures_total	The number of times an upstream server has failed its health check since the CoreDNS process was started	Resource: Availability	`forward` plugin

Metric to alert on: Memory utilization per pod

CoreDNS uses more memory as the number of pods and services in the cluster increases and as the rate of requests rises. It will also consume more memory if you're using the autopath plugin. autopath requires the kubernetes plugin's pods verified option, which also increases CoreDNS's memory usage. And if you're using the cache plugin, CoreDNS will consume more memory as the amount of data stored in the cache increases.

You should create an alert to notify you as CoreDNS pods approach their memory limit. By default, CoreDNS containers are created with a memory request of 70 mebibytes (Mi) and a memory limit of 170 Mi. If you get alerted, you may want to revise your pod specification to allocate more memory to CoreDNS containers. If you do this, you may need to scale up the nodes on which they run to ensure that those resources are available.
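The alert condition comes down to simple arithmetic on the container's memory limit. The sketch below converts a Kubernetes-style quantity such as 170Mi to bytes and checks usage against a threshold. The helper names, the 80 percent threshold, and the sample usage figure are assumptions for illustration, and only the binary suffixes shown are handled.

```python
# Sketch: compare a pod's memory usage with its limit.
# The 80 percent threshold and the sample usage value are illustrative.

BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Convert a quantity such as '170Mi' to bytes (binary suffixes only)."""
    for suffix, factor in BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already in bytes


def memory_utilization(used_bytes, limit):
    """Return usage as a fraction of the limit."""
    return used_bytes / to_bytes(limit)


def should_alert(used_bytes, limit, threshold=0.8):
    return memory_utilization(used_bytes, limit) >= threshold


used = to_bytes("153Mi")  # hypothetical current usage
print(memory_utilization(used, "170Mi"), should_alert(used, "170Mi"))
```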

You can also scale out your CoreDNS Deployment to split traffic across a greater number of pods. The horizontal autoscaler can do this automatically, but you can also manually edit your Deployment manifest to add pods.

Metric to alert on: CPU utilization per pod

CoreDNS's CPU usage increases as the cluster's rate of requests goes up and also when pods run garbage collection (GC). You can expect to see CPU usage rise and fall throughout the GC cycle, and the duration of the cycle may change with the amount of memory in use. You should create an alert to notify you as CoreDNS containers approach their allocated CPU limit. Kubernetes will throttle the processes in containers that reach their CPU limits, which can contribute to errors and latency. To avoid this, you should be prepared to scale out your cluster or scale up your nodes to provide more CPU resources.

Metric to watch: coredns_forward_healthcheck_failures_total

By default, the CoreDNS server in a Kubernetes cluster is authoritative for the cluster.local zone and holds data to resolve requests for services in the cluster. When clients request resources outside the cluster, the forward plugin tells CoreDNS to rely on an upstream server to resolve those requests.

If an upstream server returns an error, CoreDNS marks it as unhealthy and then runs a health check on that server every 0.5 seconds (by default). Once the upstream server passes a health check, it's marked as healthy and CoreDNS pauses the health check until the server returns another error.

This metric enables you to track the availability of each upstream server over time. When an upstream server is unavailable, your clients could be working with stale records (if your cache plugin is configured to serve them), or they may receive a SERVFAIL response from CoreDNS when the request to the upstream server times out. These issues can trigger errors or increase latency in your application, so you can monitor health check failures alongside your application metrics to detect correlations that could reveal a root cause of degraded application performance.

Metric to watch: kube_deployment_status_replicas_ready

If you've enabled horizontal autoscaling on your CoreDNS Deployment, the number of CoreDNS pods can change as the volume of traffic changes. If you're using autoscaling, you can correlate this metric with throughput metrics such as the number of requests received and responses sent, and with performance metrics such as request processing latency to ensure that your Deployment is scaling effectively. If you're not using autoscaling, you can use this metric to determine whether unavailable CoreDNS pods are a root cause of degraded throughput or performance.

Go metrics

You can monitor the metrics in this section to see how the CoreDNS binary uses system resources. Although you can get similar information from pod-level metrics, you can use these runtime metrics to monitor your servers' health and performance even if you're using CoreDNS outside of a Kubernetes cluster.

Metric	Description	Metric type	Availability
go_memstats_heap_inuse_bytes, go_memstats_stack_inuse_bytes	The amount of heap and stack memory used by the CoreDNS program, in bytes	Resource: Utilization	`prometheus` plugin
go_gc_duration_seconds	The amount of time CoreDNS has spent on garbage collection, in seconds	Resource: Utilization	`prometheus` plugin
go_memstats_gc_cpu_fraction	The fraction of the program's available CPU time used by garbage collection since the program started	Resource: Utilization	`prometheus` plugin

Metrics to watch: go_memstats_heap_inuse_bytes, go_memstats_stack_inuse_bytes

The amount of memory used by CoreDNS's Go processes should move in parallel with the pod's overall memory utilization. If you're using CoreDNS outside of Kubernetes, you can use this as your primary metric to understand the memory usage patterns of your servers.

Metric to watch: go_gc_duration_seconds

Garbage collection cycles can slow down as memory becomes constrained. Even if CoreDNS pods aren't emitting OOM errors, slow GC may indicate that the memory in the pod is running low. You may benefit from redeploying CoreDNS onto larger nodes and using higher memory limits.

Metric to watch: go_memstats_gc_cpu_fraction

As its cache grows, CoreDNS will use more heap memory. As a result, the Go garbage collector will have more work to do, and will require more CPU resources. You can manage this by scaling out your cluster and expanding your CoreDNS Deployment to spread the work across a greater number of nodes, or by scaling up the nodes in your cluster to provide more CPU resources.

Cache metrics

The CoreDNS cache plugin can help you improve cluster DNS performance. If you're using this plugin, you should monitor these metrics to understand the effectiveness of your cache and to guide you in optimizing its configuration.

Metric	Description	Metric type	Availability
coredns_cache_entries	The number of entries in the cache	Other	`cache` plugin
coredns_cache_hits_total, coredns_cache_misses_total	The cache hit rate (calculated from these two metrics) measures the proportion of requests that are served from the CoreDNS cache	Other	`cache` plugin
coredns_cache_prefetch_total	The number of times since the CoreDNS process was started that CoreDNS has prefetched an item to add it to the cache before it's requested	Other	`cache` plugin

Metric to watch: coredns_cache_entries

The default capacity of the CoreDNS cache is 9,984 items. You can improve the performance of your application by caching more data—for example, by increasing the capacity of your CoreDNS cache—but this requires more memory. If your CoreDNS memory usage is rising, you can check this metric to see if the number of items in the cache is a factor. On the other hand, if your CoreDNS pod is underutilizing its memory allocation, you may be able to increase the capacity of your cache.

Metrics to watch: coredns_cache_hits_total, coredns_cache_misses_total

You can use these two metrics to calculate your cache hit rate: coredns_cache_hits_total / (coredns_cache_hits_total + coredns_cache_misses_total). Optimizing your cache hit rate is a key way to improve CoreDNS performance. You may be able to increase your hit rate by using a longer TTL value, but this can increase the risk of stale data. You can also increase your cache capacity, though this may require you to move to a larger instance and update your CoreDNS pod specification to request more memory.
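Because both metrics are cumulative counters, the hit rate is most useful when computed over a window, using the change in each counter between two samples. The sketch below shows that calculation. The function name and the sample values are made up.

```python
# Sketch: cache hit rate over a window, from two samples of each counter.
# cache_hit_rate is a hypothetical helper; the sample values are made up.

def cache_hit_rate(hits_before, hits_after, misses_before, misses_after):
    """Return hits / (hits + misses) for the interval between two samples."""
    hits = hits_after - hits_before
    misses = misses_after - misses_before
    total = hits + misses
    if total <= 0:
        return 0.0  # no cache lookups during the window
    return hits / total


print(cache_hit_rate(10_000, 10_900, 2_000, 2_100))  # 0.9
```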

The cache plugin provides a prefetch directive that configures CoreDNS to proactively fetch records from upstream servers and refresh them in the cache before clients request them. To specify which records to prefetch, you can quantify the number of times a name must be requested (using the amount argument) over a specified period (using the duration argument) and how soon its TTL will expire (using the percentage argument).
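To show how the three arguments interact, here is a simplified model of the prefetch decision as described above. It is one possible reading, not CoreDNS's actual bookkeeping, which may differ in detail. The function name and every value in the example are illustrative.

```python
# Simplified model of the prefetch decision described above.
# should_prefetch is a hypothetical helper, not CoreDNS's implementation.

def should_prefetch(request_times, now, remaining_ttl, original_ttl,
                    amount, duration, percentage):
    """Return True if a cached record is both popular and close to expiring.

    request_times: timestamps (in seconds) of earlier requests for the name
    amount:        minimum number of requests within the window
    duration:      length of the window, in seconds
    percentage:    prefetch once the remaining TTL falls below this share
    """
    recent = [t for t in request_times if 0 <= now - t <= duration]
    popular = len(recent) >= amount
    # Integer-friendly form of: remaining_ttl / original_ttl < percentage / 100
    expiring = remaining_ttl * 100 < original_ttl * percentage
    return popular and expiring


# Three requests in the last minute, and 2 of the original 30 seconds remain.
print(should_prefetch([0, 20, 50], now=55, remaining_ttl=2, original_ttl=30,
                      amount=2, duration=60, percentage=10))
```

In this model, raising amount makes prefetching more selective, while raising percentage starts refreshing records earlier in their lifetime.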

As you make changes like these, you should monitor your cache hit rate and your application performance metrics to ensure that you benefit as expected.

Metric to watch: coredns_cache_prefetch_total

If your cache hit rate is low, you can use the cache plugin's prefetch directive to refresh records in the cache before they expire. You can track this metric to guide you in configuring your cache to optimally prefetch records based on how often they're requested and how soon they'll expire from the cache.

An overview of CoreDNS logs

CoreDNS logs provide information about its status and performance as well as the activity of its plugins. For example, CoreDNS will write an info-level log message each time it successfully loads a revised Corefile, but it will log an error-level message if the new Corefile contains a typo or syntax error.

Plugins may use the logging package in the CoreDNS source code to log any relevant messages, and information that gets sent to the logger varies across the available plugins. If you enable the debug plugin, debug messages sent by the plugins will also appear in the logs.

CoreDNS sends all logs to standard output. In Part 2 of this series, we'll look at how you can configure CoreDNS to collect logs and view them using kubectl logs. In Part 3, we'll show you how you can collect cluster-level logs from CoreDNS and explore and analyze them in Datadog.

Monitor CoreDNS to ensure Kubernetes performance

As Kubernetes' built-in service discovery mechanism, CoreDNS can be a critical factor in the performance of your clusters. In this post, we've shown you the key metrics you should monitor to fully understand how well CoreDNS is performing and to effectively troubleshoot errors and latency. Coming up in Part 2 of this series, we'll show you the tools that are available to collect these metrics as well as CoreDNS logs. Then, in Part 3, we'll walk you through how to monitor CoreDNS with Datadog.