What is Kafka?
Apache Kafka is a distributed, partitioned, replicated log service developed by LinkedIn and open sourced in 2011. Basically it is a massively scalable pub/sub message queue architected as a distributed transaction log. It was created to provide a “unified platform for handling all the real-time data feeds a large company might have.”
Despite bring pre-1.0, it is production-ready and powers a large number of high-profile companies including LinkedIn, Yahoo!, Netflix and Datadog.
Kafka Dashboard Overview
Kafka performance is measured by four main metric categories: broker metrics, producer metrics, consumer metrics, and ZooKeeper metrics. As you build a dashboard to monitor Kafka, you need to have a comprehensive implementation that includes all the layers of your deployment, including host-level metrics where appropriate, and not just the metrics emitted by Kafka itself.
Below is an example of the customizable Kafka dashboard in Datadog, which helps you visualize the multiple layers of metrics. However, even if you’re not a Datadog user, this example can act as a template for assembling your own comprehensive Kafka monitoring dashboard.
Read on for a widget-by-widget breakdown of the graphs and query values in the Kafka dashboard, grouped by metric categories—broker metrics, producer metrics, consumer metrics, and ZooKeeper metrics.
Kafka broker metrics provide a window into brokers, the backbone of the pipeline. Because all messages must pass through a Kafka broker in order to be consumed, monitoring and alerting on issues as they emerge in your broker cluster is critical.
Clean/unclean leader elections
This graph tracks the
LeaderElectionRateAndTimeMs metric to report the rate of leader elections (per second) and the total time the cluster went without a leader (in milliseconds).
You want to keep an eye on this metric because a leader election is triggered when contact with the current leader is lost, which could translate to an offline broker. Depending on the type of leader election (clean/unclean), it can even signal data loss.
Broker network throughput
Tracking network throughput on your brokers via this graph gives you more information as to where potential bottlenecks may lie, and can inform decisions like whether or not you should enable end-to-end compression of your messages.
This graph tracks the number of produce and fetch requests added to purgatory.
Keeping an eye on the size of the request purgatory is useful to determine the underlying causes of latency. Increases in consumer fetch times, for example, can be easily explained if there is a corresponding increase in the number of fetch requests in purgatory.
TotalTimeMs metric family measures the total time taken to service a request (be it a produce, fetch-consumer, or fetch-follower request).
Under normal conditions, this value should be fairly static, with minimal fluctuations. If you are seeing anomalous behavior, you may want to check the individual queue, local, remote and response values to pinpoint the exact request segment that is causing the slowdown.
These values are monitored via three counters on the dashboard: consumer fetch time, produce request time, and follower fetch time.
ISR delta (counter)
This counter monitors the difference between the reported values of
IsrExpandsPerSec. The number of in-sync replicas (ISRs) should remain fairly static, the only exceptions are when you are expanding your broker cluster or removing partitions.
Under replicated (counter)
This counter monitors when partition replicas fall too far behind their leaders and the follower partition is removed from the ISR pool, causing a corresponding increase in
Offline partitions (counter)
This counter monitors the number of partitions without an active leader. A non-zero value for this metric should be alerted on to prevent service interruptions. Any partition without an active leader will be completely inaccessible, and both consumers and producers of that partition will be blocked until a leader becomes available.
Kafka producers are independent processes which push messages to broker topics for consumption. Below are some of the most useful producer metrics to monitor to ensure a steady stream of incoming data.
Bytes out by topic
This graph tracks the average number of outgoing/incoming bytes per second.
Monitoring producer network traffic will help to inform decisions on infrastructure changes, as well as to provide a window into the production rate of producers and identify sources of excessive traffic.
This graph monitors the rate at which producers send data to brokers. Keeping an eye on peaks and drops is essential to ensure continuous service availability.
Request average latency
Average request latency is a measure of the amount of time between when
KafkaProducer.send() was called until the producer receives a response from the broker.
Since latency has a strong correlation with throughput, it is worth mentioning that modifying
batch.size in your producer configuration can lead to significant gains in throughput.
This graph represents the percentage of time the CPU is idle and there is at least one I/O operation in progress. If you are seeing excessive wait times, it means your producers can’t get the data they need fast enough.
The following important consumer metrics track the performance and throughput of both simple consumers and high-level consumers in Kafka 0.8.2.2, as well as the new consumer introduced in Kafka 0.9.0.0.
Consumer lag by group
ConsumerLag measures the difference between a consumer’s current log offset and a producer’s current log offset.
BytesPerSec metric helps you monitor your consumer network throughput.
A sudden drop in
MessagesPerSec could indicate a failing consumer, but if its
BytesPerSec remains constant, it’s still healthy, just consuming fewer, larger-sized messages.
This graph monitors the rate of messages consumed per second, which may not strongly correlate with the rate of bytes consumed because messages can vary in size.
Monitoring this metric over time can help you discover trends in your data consumption and create baselines against which you can alert.
Min fetch rate
This graph measures the fetch rate of a consumer, which can be a good indicator of overall consumer health.
Broker JVM Metrics
Because Kafka is written in Scala and runs in the Java Virtual Machine (JVM), it relies on Java garbage collection processes to free up memory. These are the essential metrics to monitor:
JVM GC per min
This graph monitors the JVM garbage collection processes that are actively freeing up memory. The more activity in your Kafka cluster, the more often the garbage collector will run.
ParNew time by broker
The young generation garbage collections monitored by this graph pause all application threads during garbage collection.
CMS time by broker
ConcurrentMarkSweep (CMS) metric monitors the collections that free up unused memory in the old generation of the heap.
ZooKeeper plays an important role in Kafka deployments. It is responsible for maintaining consumer offsets and topic lists, leader election, and general state information. The following are important ZooKeeper metrics to monitor for Kafka:
Available file descriptors
This graph monitors the number of available file descriptors. Because each broker must maintain a connection with ZooKeeper, and each connection to ZooKeeper uses multiple file descriptors, after increasing the number of available file descriptors in your ZooKeeper ensemble, you should keep an eye on them to ensure they are not exhausted.
Average request latency
This graph tracks the average time it takes (in milliseconds) for ZooKeeper to respond to a request.
This graph visualizes the number of clients connected to ZooKeeper. You should be aware of unanticipated drops in this value; since Kafka uses ZooKeeper to coordinate work, a loss of connection to ZooKeeper could have a number of different effects, depending on the disconnected client.
zk_outstanding_requests metric tracks the number of client requests that can’t be processed by ZooKeeper.
Tracking both outstanding requests and latency can give you a clearer picture of the causes behind degraded performance.
Pending syncs (leader)
This graph monitors the transaction log, which is the most performance-critical part of ZooKeeper.
You should definitely monitor this metric and consider alerting on larger (> 10) values.
ZooKeeper commits/sec by consumer
ZooKeeperCommitsPerSec metric tracks the rate of consumer offset commits to ZooKeeper.
If you consistently see a high rate of commits to ZooKeeper, you could consider either enlarging your ensemble, or changing the offset storage backend to Kafka.
Monitor Kafka with Datadog
For a deep dive on Apache Kafka metrics and how to monitor them, check out our three-part How to Monitor Kafka Performance Metrics series.