What is Hadoop?
Apache Hadoop is an open source framework for distributed storage and processing of very large data sets on computer clusters.
Hadoop began as a project to implement Google’s MapReduce programming model and has since become synonymous with a rich ecosystem of related technologies, including but not limited to Apache Pig, Apache Hive, Apache Spark, and Apache HBase.
Hadoop dashboard overview
When working properly, a Hadoop cluster can handle a truly massive amount of data—there are plenty of production clusters managing petabytes of data each. Monitoring each of Hadoop’s subcomponents—HDFS, MapReduce and YARN—is essential to keep jobs running and the cluster humming.
Below is an example of the customizable Hadoop dashboard in Datadog, which helps you visualize the different metrics to monitor for each subcomponent. Even if you’re not a Datadog user, this example can act as a template when assembling your own comprehensive Hadoop monitoring dashboard.
Read on for a widget-by-widget breakdown of the metrics in this sample Hadoop dashboard, organized by the three main subcomponents, along with the application metrics you should monitor.
Key HDFS metrics to monitor
The Hadoop Distributed File System (HDFS) is the underlying file system of a Hadoop cluster. It provides scalable, fault-tolerant, rack-aware data storage designed to be deployed on commodity hardware.
Total nodes (counter) This counter tracks the number of live DataNodes. Ideally this number will be equal to the number of DataNodes you’ve provisioned for the cluster.
Dead nodes (counter) This counter tracks the number of dead DataNodes. It is important to alert on the NumDeadDataNodes metric because the death of a DataNode causes a flurry of network activity as the NameNode initiates re-replication of the blocks that were stored on the dead node.
Losing multiple DataNodes will start to be very taxing on cluster resources, and could result in data loss.
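As a rough illustration of alerting on dead DataNodes, the sketch below inspects a NameNode JMX response that has already been fetched and parsed as JSON. It assumes the NumLiveDataNodes and NumDeadDataNodes values are exposed on the FSNamesystemState bean; the function name, the sample payload, and the provisioned count are hypothetical.

```python
from typing import Any, Dict, List

# Assumed name of the JMX bean that carries the DataNode counts.
BEAN = "Hadoop:service=NameNode,name=FSNamesystemState"


def datanode_alerts(jmx: Dict[str, Any], provisioned: int) -> List[str]:
    """Return alert messages based on live and dead DataNode counts."""
    beans = {b.get("name"): b for b in jmx.get("beans", [])}
    state = beans.get(BEAN)
    if state is None:
        return ["FSNamesystemState bean not found in JMX payload"]

    alerts = []
    dead = int(state.get("NumDeadDataNodes", 0))
    live = int(state.get("NumLiveDataNodes", 0))
    if dead > 0:
        alerts.append(f"{dead} dead DataNode(s)")
    if live < provisioned:
        alerts.append(f"only {live} of {provisioned} DataNodes are live")
    return alerts


# Hypothetical payload shaped like a NameNode JMX response.
sample = {"beans": [{"name": BEAN, "NumLiveDataNodes": 9, "NumDeadDataNodes": 1}]}
print(datanode_alerts(sample, provisioned=10))
# ['1 dead DataNode(s)', 'only 9 of 10 DataNodes are live']
```

Keeping the check separate from the HTTP fetch makes it easy to run against saved payloads; in practice the dictionary would come from whatever collector you already use.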
Volume failures (counter) This counter monitors the number of failed volumes in your Hadoop cluster. Though a failed volume will not bring your cluster to a grinding halt, you most likely want to know when hardware failures occur, if only so that you can replace the failed hardware.
Blocks (counter) This counter monitors the BlocksTotal metric. Keeping an eye on the total number of blocks across the cluster is essential to continued operation: the NameNode holds metadata for every block in memory, so a steadily growing block count puts increasing pressure on the NameNode heap.
Under replicated blocks (counter) This counter tracks the number of under-replicated blocks, that is, blocks with fewer replicas than their target replication factor. If you see a large, sudden spike in the number of under-replicated blocks, it is likely that a DataNode has died. You can verify this by correlating under-replicated block metric values with the status of DataNodes.
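The correlation described above can be sketched as a simple pass over paired samples. This is only one possible approach: the tuple layout, the `min_jump` threshold, and the sample history are assumptions made for illustration, not values taken from the dashboard.

```python
from typing import List, Tuple

# Each sample is (under_replicated_blocks, dead_datanodes) at one point in time.
Sample = Tuple[int, int]


def spikes_with_dead_nodes(samples: List[Sample], min_jump: int) -> List[int]:
    """Return indexes where under-replicated blocks rise by at least
    min_jump over the previous sample while the dead-node count also rises."""
    hits = []
    for i in range(1, len(samples)):
        prev_blocks, prev_dead = samples[i - 1]
        blocks, dead = samples[i]
        if blocks - prev_blocks >= min_jump and dead > prev_dead:
            hits.append(i)
    return hits


# Hypothetical history: two spikes, each coinciding with a newly dead DataNode.
history = [(0, 0), (3, 0), (4200, 1), (3900, 1), (9000, 2)]
print(spikes_with_dead_nodes(history, min_jump=1000))
# [2, 4]
```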
HDFS Disk usage This graph depicts the total disk usage across the entire HDFS cluster.
HDFS remaining/node This graph monitors the disk space remaining on each DataNode. If left unrectified, a single DataNode running out of space could quickly cascade into failures across the entire cluster as data is written to an ever-shrinking pool of available DataNodes.
You may want to alert on this metric when the remaining space falls dangerously low (less than 10 percent).
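A minimal sketch of that threshold check follows. It assumes you already have remaining and total capacity in bytes for each DataNode; the hostnames, figures, and function name are hypothetical.

```python
from typing import Dict, List, Tuple


def low_space_nodes(
    nodes: Dict[str, Tuple[int, int]], threshold_pct: float = 10.0
) -> List[str]:
    """Return hostnames whose remaining space is below threshold_pct.

    nodes maps hostname -> (remaining_bytes, capacity_bytes).
    """
    low = []
    for host, (remaining, capacity) in sorted(nodes.items()):
        if capacity <= 0:
            continue  # skip nodes that report no capacity
        pct_remaining = 100.0 * remaining / capacity
        if pct_remaining < threshold_pct:
            low.append(host)
    return low


# Hypothetical per-node figures: dn1 has 5% left, dn2 40%, dn3 exactly 10%.
cluster = {"dn1": (50, 1000), "dn2": (400, 1000), "dn3": (100, 1000)}
print(low_space_nodes(cluster))
# ['dn1']
```

The comparison is strict, so a node sitting exactly at the threshold does not alert; adjust that if you prefer an earlier warning.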
Key YARN metrics to monitor
YARN (Yet Another Resource Negotiator) is the framework responsible for assigning computational resources for application execution.
Application metrics provide detailed information on the execution of individual YARN applications. The counters provide insight into the main Progress metric.
Progress Progress gives you a real-time window into the execution of a YARN application. Because application execution can often be opaque when running hundreds of applications on thousands of nodes, tracking progress alongside other metrics can help you better determine the cause of any performance degradation.
The counters in this section of the dashboard—Submitted, Running, Done, Pending, Killed and Failed—offer context that can help clarify progress metric values. Applications that go extended periods without making progress should be investigated.
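One way to flag applications that have stopped making progress is to compare two snapshots taken some interval apart. The sketch below assumes each snapshot maps an application ID to a progress percentage between 0 and 100; the IDs, values, and function name are hypothetical.

```python
from typing import Dict, List


def stalled_apps(earlier: Dict[str, float], later: Dict[str, float]) -> List[str]:
    """Return IDs of unfinished applications whose progress did not advance
    between the earlier and later snapshots."""
    stalled = []
    for app_id, progress in sorted(later.items()):
        before = earlier.get(app_id)
        # Ignore apps that are new in the later snapshot or already complete.
        if before is not None and progress < 100.0 and progress <= before:
            stalled.append(app_id)
    return stalled


# Hypothetical snapshots taken a few minutes apart.
snapshot_a = {"app_1": 40.0, "app_2": 10.0, "app_3": 100.0}
snapshot_b = {"app_1": 40.0, "app_2": 55.0, "app_3": 100.0, "app_4": 0.0}
print(stalled_apps(snapshot_a, snapshot_b))
# ['app_1']
```

How long an application may sit idle before it counts as stalled depends on the workload, so the snapshot interval is the main tuning knob here.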
Allocated Memory/App This is a high-level view of the amount of RAM allocated per application.
Allocated vCores/App This is the number of virtual cores allocated per application.
Memory usage (counter) This counter combines the totalMB and allocatedMB metrics to give a high-level view of your cluster’s memory usage.
Total vCores (counter) The number of virtual cores in the cluster.
Containers (counter) The number of containers in the cluster, where containers represent a collection of physical resources—an abstraction used to bundle resources into distinct, allocatable units.
Memory Usage This graph visualizes the memory usage counter. Keep in mind that YARN may over-commit resources, which can occasionally translate to reported values of allocatedMB that are higher than totalMB.
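To make the over-commit caveat concrete, here is a small sketch that turns totalMB and allocatedMB into a usage percentage. It assumes the two values arrive together in a dictionary; the function name and sample figures are hypothetical.

```python
def memory_usage_pct(cluster_metrics: dict) -> float:
    """Return allocated memory as a percentage of total cluster memory.

    The result is deliberately not capped at 100, because allocatedMB
    can exceed totalMB when resources are over-committed.
    """
    total = cluster_metrics["totalMB"]
    allocated = cluster_metrics["allocatedMB"]
    if total <= 0:
        raise ValueError("totalMB must be positive")
    return 100.0 * allocated / total


# Hypothetical readings: one normal, one over-committed.
normal = {"totalMB": 8192, "allocatedMB": 6144}
over_committed = {"totalMB": 8192, "allocatedMB": 9216}
print(memory_usage_pct(normal))          # 75.0
print(memory_usage_pct(over_committed))  # 112.5
```

If you alert on this value, a threshold slightly above 100 percent may be more useful than one at exactly 100, since brief over-commit is expected behavior.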
Virtual Core Usage This timeseries graph depicts the virtual core usage across the cluster.
Container Usage This graph depicts the total number of containers allocated, aggregated by cluster.
Active (counter) The number of currently active nodes. These are the normally operating nodes given by the activeNodes metric.
Sick (counter) This counter monitors the unhealthy nodes in the cluster. YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to be unhealthy.
Rebooted (counter) The number of nodes in the cluster that have rebooted.
Lost (counter) This counter tracks the lostNodes metric. If a NodeManager fails to maintain contact with the ResourceManager, it will eventually be marked as “lost” and its resources will become unavailable for allocation.
Avg containers/node (counter) This counter tracks the average number of containers running per host.
Memory Usage This graph visualizes the memory in use by node.
Virtual Core usage This graph tracks the virtual core usage of each node.
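The node counters above can be derived from a single set of cluster-level metrics. The sketch below assumes a dictionary holding activeNodes, unhealthyNodes, rebootedNodes, lostNodes, and containersAllocated, and it computes the per-node average as allocated containers divided by active nodes. That formula, the function name, and the sample values are assumptions for illustration, not a description of how the Datadog widget is calculated.

```python
def node_summary(metrics: dict) -> dict:
    """Summarize node health counters and average containers per active node."""
    active = metrics["activeNodes"]
    # Avoid dividing by zero when no nodes are active.
    avg = metrics["containersAllocated"] / active if active else 0.0
    return {
        "active": active,
        "sick": metrics["unhealthyNodes"],
        "rebooted": metrics["rebootedNodes"],
        "lost": metrics["lostNodes"],
        "avg_containers_per_node": avg,
    }


# Hypothetical cluster-level readings.
reading = {
    "activeNodes": 8,
    "unhealthyNodes": 1,
    "rebootedNodes": 0,
    "lostNodes": 1,
    "containersAllocated": 36,
}
summary = node_summary(reading)
print(summary["avg_containers_per_node"])  # 4.5
```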
Key MapReduce metrics to monitor
The MapReduce framework exposes a number of counters to track statistics on MapReduce job execution. Counters are an invaluable mechanism that lets you see what is actually happening during a MapReduce job run.
Maps running/completed This graph shows the number of map tasks that are currently running and the number that have completed successfully on the cluster.
Pending maps/reduces This graph shows the number of map and reduce tasks that are queued for processing.
Reduces running/completed This timeseries graph shows the number of reduce tasks that are currently running and the number that have completed successfully on the cluster.
Monitoring Hadoop with the Datadog dashboard
For a deep dive on Hadoop metrics and how to monitor them, check out our four-part How to Monitor Hadoop Metrics series.