How to collect Hadoop metrics

Evan Mouzakitis

This post is part 3 of a 4-part series on monitoring Hadoop health and performance. Part 1 gives a general overview of Hadoop's architecture and subcomponents, Part 2 dives into the key metrics to monitor, and Part 4 explains how to monitor a Hadoop deployment with Datadog.

If you’ve already read our guide to key Hadoop performance metrics, you’ve seen that Hadoop provides a vast array of metrics on job execution performance, health, and resource utilization.

In this post we'll step through several different ways to access those metrics. We'll show you how to collect metrics from core Hadoop components (HDFS, MapReduce, YARN), as well as from ZooKeeper, using standard development tools as well as specialized tools like Apache Ambari and Cloudera Manager.

Collecting HDFS metrics

HDFS emits metrics from two sources, the NameNode and the DataNodes, and for the most part each metric type must be collected at the point of origination. Both the NameNode and DataNodes emit metrics over an HTTP interface as well as via JMX.

Collecting NameNode metrics via API
Collecting DataNode metrics via API
Collecting HDFS metrics via JMX

NameNode HTTP API

The NameNode offers a summary of health and performance metrics through an easy-to-use web UI. By default, the UI is accessible via port 50070 (Hadoop 3.x moved the default to 9870), so point a web browser at: http://<namenodehost>:50070

While a summary is good to have, it is likely you will want to drill deeper into the metrics mentioned in part two of this series; to see all the metrics, point your browser to http://<namenodehost>:50070/jmx, which will result in JSON output like this:

  {
    "beans" : [ {
      "name" : "java.lang:type=Memory",
      "modelerType" : "sun.management.MemoryImpl",
      "ObjectPendingFinalizationCount" : 0,
      "HeapMemoryUsage" : {
        "committed" : 241172480,
        "init" : 262144000,
        "max" : 241172480,
        "used" : 110505832
      },
      "NonHeapMemoryUsage" : {
        "committed" : 136773632,
        "init" : 136773632,
        "max" : 318767104,
        "used" : 35047040
      },
      "Verbose" : true,
      "ObjectName" : "java.lang:type=Memory"
    }, {
      "name" : "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
      "modelerType" : "sun.management.GarbageCollectorImpl",
      "LastGcInfo" : null,
      "CollectionCount" : 0,
      "CollectionTime" : 0,
      "Valid" : true,
      "MemoryPoolNames" : [ "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ],
      "Name" : "ConcurrentMarkSweep",
      "ObjectName" : "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep"
    }, {
      "name" : "java.nio:type=BufferPool,name=mapped",
      "modelerType" : "sun.management.ManagementFactoryHelper$1",
      "TotalCapacity" : 2144,
      "MemoryUsed" : 2144,
      "Name" : "mapped",
      "Count" : 1,
      "ObjectName" : "java.nio:type=BufferPool,name=mapped"
    }, {
      "name" : "java.lang:type=Compilation",
      "modelerType" : "sun.management.CompilationImpl",
      "CompilationTimeMonitoringSupported" : true,
      "TotalCompilationTime" : 9808,
      "Name" : "HotSpot 64-Bit Tiered Compilers",
      "ObjectName" : "java.lang:type=Compilation"
    }, {
      "name" : "java.lang:type=MemoryPool,name=Par Eden Space",
      "modelerType" : "sun.management.MemoryPoolImpl",
      "Valid" : true,
      "Usage" : {
        "committed" : 167772160,
        "init" : 167772160,
        "max" : 167772160,
        "used" : 95843104
      },
      "PeakUsage" : {
        "committed" : 167772160,
        "init" : 167772160,
        "max" : 167772160,
        "used" : 167772160
      },
  [...]

You can also filter by specific MBeans, like so:

  http://<namenodehost>:50070/jmx?qry=java.lang:type=Memory

  {
    "beans" : [ {
      "name" : "java.lang:type=Memory",
      "modelerType" : "sun.management.MemoryImpl",
      "Verbose" : true,
      "ObjectPendingFinalizationCount" : 0,
      "HeapMemoryUsage" : {
        "committed" : 251658240,
        "init" : 262144000,
        "max" : 251658240,
        "used" : 86238032
      },
      "NonHeapMemoryUsage" : {
        "committed" : 137166848,
        "init" : 136773632,
        "max" : 318767104,
        "used" : 57894936
      },
      "ObjectName" : "java.lang:type=Memory"
    } ]
  }
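If you would rather script this than use a browser, the same request can be made from a few lines of Python. The following is a minimal sketch using only the standard library; the helper names (jmx_url, fetch_beans, heap_used_fraction) and the host namenode.example.com are our own placeholders, not part of Hadoop, and the sample bean is trimmed from the filtered output above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def jmx_url(host, port, query=None):
    """Build a /jmx URL, optionally filtered with the qry parameter."""
    url = f"http://{host}:{port}/jmx"
    if query:
        # Keep ':', '=', ',' and '*' readable; escape everything else.
        url += "?qry=" + quote(query, safe=":=,*")
    return url


def fetch_beans(host, port, query=None, timeout=5):
    """Fetch and decode the 'beans' list (this makes a network call)."""
    with urlopen(jmx_url(host, port, query), timeout=timeout) as resp:
        return json.load(resp)["beans"]


def heap_used_fraction(memory_bean):
    """Return used/max heap for a java.lang:type=Memory bean."""
    heap = memory_bean["HeapMemoryUsage"]
    return heap["used"] / heap["max"]


# Sample trimmed from the filtered output shown above.
SAMPLE_MEMORY_BEAN = {
    "name": "java.lang:type=Memory",
    "HeapMemoryUsage": {
        "committed": 251658240,
        "init": 262144000,
        "max": 251658240,
        "used": 86238032,
    },
}

# Placeholder hostname; substitute your own NameNode.
print(jmx_url("namenode.example.com", 50070, "java.lang:type=Memory"))
print(f"heap used: {heap_used_fraction(SAMPLE_MEMORY_BEAN):.1%}")
```

Against a live cluster you would call fetch_beans("<namenodehost>", 50070, "java.lang:type=Memory") and pass the first returned bean to heap_used_fraction.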

Whether you use the API or JMX, most of the NameNode metrics from part two of this series can be found under the MBean Hadoop:name=FSNamesystem,service=NameNode. The remaining ones (VolumeFailuresTotal, NumLiveDataNodes, NumDeadDataNodes, NumLiveDecomDataNodes, and NumStaleDataNodes) can be found under the MBean Hadoop:name=FSNamesystemState,service=NameNode.
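One practical wrinkle when picking these MBeans out of the JSON: the key order in a bean's name can differ between what you query and what comes back (compare the qry value and the "name" field in the DataNode example later in this post). The sketch below matches beans by domain and key properties rather than by raw string. The helper names are hypothetical, the parser assumes simple unquoted names with no commas inside values, and the sample payload uses made-up values.

```python
def parse_object_name(name):
    """Split 'domain:k1=v1,k2=v2' into (domain, {k1: v1, k2: v2}).

    Assumes simple, unquoted names with no commas inside values.
    """
    domain, _, props = name.partition(":")
    pairs = (item.split("=", 1) for item in props.split(",") if "=" in item)
    return domain, dict(pairs)


def find_bean(beans, wanted):
    """Return the first bean whose name matches `wanted`, ignoring key order."""
    target = parse_object_name(wanted)
    for bean in beans:
        if parse_object_name(bean["name"]) == target:
            return bean
    return None


def pick(bean, attributes):
    """Keep only the requested attributes that are present in the bean."""
    return {attr: bean[attr] for attr in attributes if attr in bean}


DATANODE_HEALTH = [
    "NumLiveDataNodes", "NumDeadDataNodes", "NumLiveDecomDataNodes",
    "NumStaleDataNodes", "VolumeFailuresTotal",
]

# Illustrative payload: values are made up, and SomeOtherAttribute is a
# placeholder standing in for the attributes we do not care about.
beans = [
    {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
     "NumLiveDataNodes": 3, "NumDeadDataNodes": 0, "NumStaleDataNodes": 0,
     "VolumeFailuresTotal": 0, "SomeOtherAttribute": 42},
]

state = find_bean(beans, "Hadoop:name=FSNamesystemState,service=NameNode")
print(pick(state, DATANODE_HEALTH))
```

Comparing parsed names is a design choice rather than a requirement; an exact string comparison also works as long as you copy the name exactly as the endpoint returns it.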

DataNode HTTP API

A high-level overview of the health of your DataNodes is available in the NameNode dashboard, under the Datanodes tab (http://<namenodehost>:50070/dfshealth.html#tab-datanode).

Hadoop YARN stats - DataNode information panel

To get a more detailed view of an individual DataNode, you can access its metrics through the DataNode API.

By default, DataNodes expose all of their metrics on port 50075 (9864 in Hadoop 3.x), via the jmx endpoint. Hitting this endpoint on your DataNode with your browser or curl gives you all of the metrics from part two of this series, and then some:

  $ curl http://<datanodehost>:50075/jmx

  {
      "name" : "Hadoop:service=DataNode,name=DataNodeActivity-evan.hadoop",
      "modelerType" : "DataNodeActivity-evan.hadoop",
      "tag.SessionId" : null,
      "tag.Context" : "dfs",
      "tag.Hostname" : "evan.hadoop",
      "BytesWritten" : 387072,
      "TotalWriteTime" : 0,
      "BytesRead" : 0,
      "TotalReadTime" : 0,
      "BlocksWritten" : 0,
      "BlocksRead" : 0,
      "BlocksReplicated" : 0,
      "BlocksRemoved" : 0,
      "BlocksVerified" : 0,
      "VolumeFailures" : 0,
  [...]
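Values such as BytesWritten and BlocksRead only ever grow, so a single reading says little; what you usually want is the rate of change between two polls. Here is a minimal sketch of that calculation. The function names are our own, the first sample reuses the numbers above, and the second sample (taken 60 seconds later) is hypothetical.

```python
def per_second_rate(previous, current, interval_seconds):
    """Rate of change of a monotonically increasing counter.

    Returns None when the counter went backwards, which usually means
    the daemon restarted and the counter was reset.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    if delta < 0:
        return None
    return delta / interval_seconds


def activity_rates(prev_bean, curr_bean, interval_seconds,
                   counters=("BytesWritten", "BytesRead",
                             "BlocksWritten", "BlocksRead")):
    """Compute per-second rates for selected DataNodeActivity counters."""
    return {
        name: per_second_rate(prev_bean[name], curr_bean[name], interval_seconds)
        for name in counters
    }


# First poll: values from the output above.
first = {"BytesWritten": 387072, "BytesRead": 0,
         "BlocksWritten": 0, "BlocksRead": 0}
# Second poll, 60 seconds later: hypothetical values for illustration.
second = {"BytesWritten": 1001472, "BytesRead": 6000,
          "BlocksWritten": 3, "BlocksRead": 0}

print(activity_rates(first, second, 60))
```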

To retrieve all of the metrics detailed in part two (and only those metrics), use this query:

  $ curl "http://<datanodehost>:50075/jmx?qry=Hadoop:name=FSDatasetState,service=DataNode"

  {
    "beans" : [ {
      "name" : "Hadoop:service=DataNode,name=FSDatasetState",
      "modelerType" : "FSDatasetState",
      "tag.Context" : "FSDatasetState",
      "tag.StorageInfo" : "FSDataset{dirpath='[/hadoop/hdfs/data/current]'}",
      "tag.Hostname" : "sandbox.hortonworks.com",
      "Capacity" : 44716605440,
      "DfsUsed" : 1870278656,
      "Remaining" : 28905246720,
      "NumFailedVolumes" : 0,
      "LastVolumeFailureDate" : 0,
      "EstimatedCapacityLostTotal" : 0,
      "CacheUsed" : 0,
      "CacheCapacity" : 0,
      "NumBlocksCached" : 0,
      "NumBlocksFailedToCache" : 0,
      "NumBlocksFailedToUnCache" : 0
    } ]
  }
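Raw byte counts are hard to eyeball, so it often helps to reduce this bean to a couple of percentages and a failure flag. The sketch below does that with the values from the output above; storage_summary and its result keys are names we made up for illustration.

```python
def storage_summary(dataset_bean):
    """Summarize an FSDatasetState bean as percentages of total capacity."""
    capacity = dataset_bean["Capacity"]
    if capacity <= 0:
        raise ValueError("Capacity must be positive")
    return {
        "dfs_used_pct": 100.0 * dataset_bean["DfsUsed"] / capacity,
        "remaining_pct": 100.0 * dataset_bean["Remaining"] / capacity,
        "has_failed_volumes": dataset_bean["NumFailedVolumes"] > 0,
    }


# Values copied from the FSDatasetState output above.
SAMPLE_DATASET_BEAN = {
    "Capacity": 44716605440,
    "DfsUsed": 1870278656,
    "Remaining": 28905246720,
    "NumFailedVolumes": 0,
}

summary = storage_summary(SAMPLE_DATASET_BEAN)
print(f"DFS used:  {summary['dfs_used_pct']:.2f}%")
print(f"Remaining: {summary['remaining_pct']:.2f}%")
print(f"Failed volumes present: {summary['has_failed_volumes']}")
```

Note that the two percentages do not add up to 100: the rest of the disk is taken up by non-HDFS data.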

NameNode and DataNode metrics via JMX

Like Kafka, Cassandra, and other Java-based systems, both the NameNode and DataNodes also expose metrics via JMX.

The JMX remote agent interfaces are disabled by default; to enable them, set the following JVM options in hadoop-env.sh (usually found in $HADOOP_HOME/conf):

  export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.password.file=$HADOOP_HOME/conf/jmxremote.password
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.port=8004 $HADOOP_NAMENODE_OPTS"

  export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.password.file=$HADOOP_HOME/conf/jmxremote.password
    -Dcom.sun.management.jmxremote.ssl=false
    -Dcom.sun.management.jmxremote.port=8008 $HADOOP_DATANODE_OPTS"

These settings will open port 8004 on the NameNode and 8008 on the DataNode, with password authentication enabled (see "To Set up a Single-User Environment" here for more information on configuring the JMX remote agent).

Once enabled, you can connect using any JMX console, like JConsole or Jmxterm. The following shows a Jmxterm connection to the NameNode, first listing available MBeans, and then drilling into the Hadoop:name=FSNamesystem,service=NameNode MBean:

  [hadoop@evan.hadoop conf]# java -jar /opt/datadog-agent/agent/checks/libs/jmxterm-1.0-DATADOG-uber.jar --url localhost:8004
  Welcome to JMX terminal. Type "help" for available commands.
  $>beans
  #domain = Hadoop:
  Hadoop:name=BlockStats,service=NameNode
  Hadoop:name=FSNamesystem,service=NameNode
  Hadoop:name=FSNamesystemState,service=NameNode
  Hadoop:name=JvmMetrics,service=NameNode
  Hadoop:name=MetricsSystem,service=NameNode,sub=Control
  Hadoop:name=MetricsSystem,service=NameNode,sub=Stats
  [...]

  $>bean Hadoop:name=FSNamesystem,service=NameNode
  $>info
  #mbean = Hadoop:name=FSNamesystem,service=NameNode
  #class name = FSNamesystem
  # attributes
    %0   - BlockCapacity (java.lang.Integer, r)
    %1   - BlocksTotal (java.lang.Long, r)
    %2   - CapacityRemaining (java.lang.Long, r)
    %3   - CapacityRemainingGB (java.lang.Float, r)
    %4   - CapacityTotal (java.lang.Long, r)
    %5   - CapacityTotalGB (java.lang.Float, r)
    %6   - CapacityUsed (java.lang.Long, r)
    %7   - CapacityUsedGB (java.lang.Float, r)
    %8   - CapacityUsedNonDFS (java.lang.Long, r)
  [...]

  $>get BlocksTotal
  #mbean = Hadoop:name=FSNamesystem,service=NameNode:
  BlocksTotal = 695;

Once you've enabled JMX, connecting to DataNodes is the same as with the NameNode, with a different port (8008) and MBean (Hadoop:name=FSDatasetState,service=DataNode):

  [hadoop@evan.hadoop conf]# java -jar /opt/datadog-agent/agent/checks/libs/jmxterm-1.0-DATADOG-uber.jar --url localhost:8008
  Welcome to JMX terminal. Type "help" for available commands.
  $>bean Hadoop:name=FSDatasetState,service=DataNode
  #bean is set to Hadoop:name=FSDatasetState,service=DataNode

  $>info
  #mbean = Hadoop:name=FSDatasetState,service=DataNode
  #class name = FSDatasetState
  # attributes
    %0   - CacheCapacity (java.lang.Long, r)
    %1   - CacheUsed (java.lang.Long, r)
    %2   - Capacity (java.lang.Long, r)
    %3   - DfsUsed (java.lang.Long, r)
    %4   - EstimatedCapacityLostTotal (java.lang.Long, r)
    %5   - LastVolumeFailureDate (java.lang.Long, r)
    %6   - NumBlocksCached (java.lang.Long, r)
    %7   - NumBlocksFailedToCache (java.lang.Long, r)
    %8   - NumBlocksFailedToUnCache (java.lang.Long, r)
    %9   - NumFailedVolumes (java.lang.Integer, r)
    %10  - Remaining (java.lang.Long, r)
    %11  - tag.Context (java.lang.String, r)
    %12  - tag.Hostname (java.lang.String, r)
    %13  - tag.StorageInfo (java.lang.String, r)

  $>get NumFailedVolumes
  #mbean = Hadoop:name=FSDatasetState,service=DataNode:
  NumFailedVolumes = 0;
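If you capture Jmxterm output in a script (for example, by redirecting a session to a file), the "Name = value;" lines printed by get are easy to turn into a dictionary. The following is a rough sketch that assumes the output looks exactly like the sessions above; parse_jmxterm_get is a hypothetical helper of ours, not something Jmxterm provides.

```python
import re

# Matches lines such as "  BlocksTotal = 695;"
_VALUE_LINE = re.compile(r"^\s*([\w.]+)\s*=\s*(.*?);\s*$")


def _coerce(raw):
    """Convert a raw value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def parse_jmxterm_get(output):
    """Parse the output of one or more Jmxterm 'get' commands."""
    values = {}
    for line in output.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip '#mbean = ...' header lines
        match = _VALUE_LINE.match(line)
        if match:
            name, raw = match.groups()
            values[name] = _coerce(raw)
    return values


# Output copied from the two sessions above.
SAMPLE_OUTPUT = """\
#mbean = Hadoop:name=FSNamesystem,service=NameNode:
BlocksTotal = 695;
#mbean = Hadoop:name=FSDatasetState,service=DataNode:
  NumFailedVolumes = 0;
"""

print(parse_jmxterm_get(SAMPLE_OUTPUT))
```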

Collecting MapReduce counters

MapReduce counters provide information on MapReduce task execution, like CPU time and memory used. They are dumped to the console when invoking Hadoop jobs from the command line, which is great for spot-checking as jobs run, but more detailed analysis requires monitoring counters over time.

The ResourceManager also exposes all MapReduce counters for each job. To access MapReduce counters on your ResourceManager, first navigate to the ResourceManager web UI at http://<resourcemanagerhost>:8088.

Find the application you're interested in, and click "History" in the Tracking UI column:

Then, on the next page, click "Counters" in the navigation menu on the left:

Hadoop YARN stats - MapReduce counter navigation

And finally, you should see all of the counters collected for your job:

Hadoop YARN stats - MapReduce counters in YARN
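To track counters over time rather than clicking through the UI, you can also pull them programmatically. The sketch below assumes your cluster runs the MapReduce JobHistory Server with its REST API on the default port 19888, and that the counters response has the jobCounters / counterGroup / counter shape we recall from the Hadoop documentation; check both against your Hadoop version. The helper names, the job ID, and every value in the sample payload are made up for illustration.

```python
def counters_url(host, job_id, port=19888):
    """URL of the JobHistory Server counters resource for one job (assumed path)."""
    return f"http://{host}:{port}/ws/v1/history/mapreduce/jobs/{job_id}/counters"


def flatten_counters(payload):
    """Map 'group.counter' to its total value from a jobCounters response."""
    flat = {}
    for group in payload["jobCounters"].get("counterGroup", []):
        group_name = group["counterGroupName"]
        for counter in group.get("counter", []):
            flat[f"{group_name}.{counter['name']}"] = counter["totalCounterValue"]
    return flat


# Illustrative payload: the job ID and all values are invented.
SAMPLE_COUNTERS = {
    "jobCounters": {
        "id": "job_0000000000000_0001",
        "counterGroup": [
            {
                "counterGroupName": "org.apache.hadoop.mapreduce.TaskCounter",
                "counter": [
                    {"name": "CPU_MILLISECONDS", "mapCounterValue": 1200,
                     "reduceCounterValue": 300, "totalCounterValue": 1500},
                    {"name": "PHYSICAL_MEMORY_BYTES", "mapCounterValue": 524288000,
                     "reduceCounterValue": 209715200, "totalCounterValue": 734003200},
                ],
            },
        ],
    }
}

for name, value in sorted(flatten_counters(SAMPLE_COUNTERS).items()):
    print(name, value)
```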

Collecting Hadoop YARN metrics

Like HDFS metrics, YARN metrics are also exposed via an HTTP API.

YARN HTTP API

By default, YARN exposes all of its metrics on port 8088, via the jmx endpoint. Hitting this API endpoint on your ResourceManager gives you all of the metrics from part two of this series, and more:

  $ curl http://localhost:8088/jmx

  {
    "beans" : [{
      "name" : "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root",
      "AppsSubmitted" : 10,
      "AppsRunning" : 5,
      "AppsPending" : 4,
      "AppsCompleted" : 0,
      "AppsKilled" : 0,
      "AppsFailed" : 1,
      "AllocatedMB" : 2250,
      "AllocatedVCores" : 9,
      "AllocatedContainers" : 9,
      "AvailableMB" : 0,
      "AvailableVCores" : 2,
      "PendingMB" : 1250,
      "PendingVCores" : 5,
      "PendingContainers" : 5,
      "ReservedMB" : 0,
      "ReservedVCores" : 0,
      "ReservedContainers" : 0,
      "ActiveUsers" : 0,
      "ActiveApplications" : 5
    }, {
      "name" : "Hadoop:service=ResourceManager,name=JvmMetrics",
      "MemNonHeapUsedM" : 65.38923,
      "MemNonHeapCommittedM" : 65.9375,
      "MemNonHeapMaxM" : 214.0,
      "MemHeapUsedM" : 63.52308,
      "MemHeapCommittedM" : 148.5,
      "MemHeapMaxM" : 222.5,
      "MemMaxM" : 222.5,
      "GcCount" : 31,
      "GcTimeMillis" : 3987,
  [...]

And as with HDFS, when querying the JMX endpoint you can specify MBeans with the qry parameter:

  $ curl "http://<resourcemanagerhost>:8088/jmx?qry=java.lang:type=Memory"

To get only the metrics from part two of the series, you can also query the ws/v1/cluster/metrics endpoint:

  $ curl http://<resourcemanagerhost>:8088/ws/v1/cluster/metrics

  {
    "clusterMetrics": {
      "appsSubmitted": 10,
      "appsCompleted": 0,
      "appsPending": 4,
      "appsRunning": 5,
      "appsFailed": 1,
      "appsKilled": 0,
      "reservedMB": 0,
      "availableMB": 0,
      "allocatedMB": 2250,
      "reservedVirtualCores": 0,
      "availableVirtualCores": 2,
      "allocatedVirtualCores": 9,
      "containersAllocated": 9,
      "containersReserved": 0,
      "containersPending": 5,
      "totalMB": 2250,
      "totalVirtualCores": 8,
      "totalNodes": 1,
      "lostNodes": 0,
      "unhealthyNodes": 0,
      "decommissionedNodes": 0,
      "rebootedNodes": 0,
      "activeNodes": 1
    }
  }
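Because this endpoint returns a small, flat JSON document, it is a convenient source for a quick health summary. The sketch below derives three numbers from the response above (trimmed to the fields used); cluster_summary and its result keys are hypothetical names chosen for this example.

```python
import json


def cluster_summary(payload):
    """Derive a few headline numbers from a ws/v1/cluster/metrics response."""
    metrics = payload["clusterMetrics"]
    total_mb = metrics["totalMB"]
    return {
        # Share of cluster memory currently allocated to containers.
        "memory_utilization": metrics["allocatedMB"] / total_mb if total_mb else 0.0,
        # Applications that are either waiting or running.
        "apps_in_flight": metrics["appsPending"] + metrics["appsRunning"],
        # Nodes that the ResourceManager considers lost or unhealthy.
        "problem_nodes": metrics["lostNodes"] + metrics["unhealthyNodes"],
    }


# Trimmed copy of the response shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "clusterMetrics": {
    "appsPending": 4,
    "appsRunning": 5,
    "allocatedMB": 2250,
    "totalMB": 2250,
    "lostNodes": 0,
    "unhealthyNodes": 0
  }
}
""")

print(cluster_summary(SAMPLE_RESPONSE))
```

In this sample all 2,250 MB of cluster memory are allocated while four applications are still pending, which is the kind of pattern worth alerting on.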

Third-party tools

Native collection methods are useful for spot-checking metrics in a pinch, but seeing the big picture requires collecting and aggregating metrics from all your systems for correlation.

Two projects, Apache Ambari and Cloudera Manager, offer users a unified platform for Hadoop administration and management. These projects both provide tools for the collection and visualization of Hadoop metrics, as well as tools for common troubleshooting tasks.

Apache Ambari

The Apache Ambari project aims to make Hadoop cluster management easier by creating software for provisioning, managing, and monitoring Apache Hadoop clusters. It is a great tool not only for administering your cluster, but for monitoring, too.

Installation instructions for multiple platforms can be found here. Once installed, configure Ambari with

  ambari-server setup

Most users should be fine with the default configuration options, though you might want to change the Ambari user from the default root user. Be aware that Ambari installs and uses the PostgreSQL database package by default; if you already have your own database server installed, be sure to select "Enter advanced database configuration" when prompted.

Once configured, start the server with:

  service ambari-server start

To connect to the Ambari dashboard, point your browser to <AmbariHost>:8080 and log in with the default user admin and password admin.

Once logged in, you should be met with a screen similar to the one below:

Hadoop YARN stats - Apache Ambari configuration screen

To get started, select "Launch Install Wizard". On the series of screens that follow, you will be prompted for hosts to be monitored and credentials to connect to each host in your cluster, then you'll be prompted to configure application-specific settings. Configuration details will be specific to your deployment and the services you use. Once you're all set up, you'll have a detailed dashboard like the one below, complete with health and performance information on your entire cluster, as well as links to connect to the web UIs for specific daemons like the NameNode and ResourceManager.

Hadoop YARN stats - Ambari dashboard image

Cloudera Manager

Cloudera Manager is a cluster-management tool that ships as part of Cloudera's commercial Hadoop distribution and is also available as a free download.

Installation instructions for multiple platforms can be found here. Once you've downloaded and installed the installation packages, and set up a database for Cloudera Manager, start the server with:

  service cloudera-scm-server start

Then, continue the installation of Cloudera Manager through its web UI. To connect to the Cloudera Manager dashboard, point your browser to <ClouderaHost>:7180 and log in with the default user admin and password admin.

Once logged in, complete the configuration steps on the next few screens.

On the series of screens that follow, you will be prompted for hosts to be monitored and credentials to connect to each host in your cluster, then you'll be prompted to configure application-specific settings. Configuration details will be specific to your deployment and the services you use. Once you're all set up, you'll have a customizable dashboard like the one below, complete with health and performance information on your entire cluster.

Hadoop YARN stats - Cloudera Manager installed

Collecting ZooKeeper metrics

There are several ways you can collect metrics from ZooKeeper. We will focus on the two most popular, JMX and the so-called "four-letter words". Though we won't go into it here, the zktop utility is also noteworthy for providing a useful top-like interface to ZooKeeper.

Using only the four-letter words, you can collect all of the native ZooKeeper metrics listed in part 2 of this series. JMX coverage is nearly as complete.

Collecting ZooKeeper metrics with Jmxterm

ZooKeeper randomizes its JMX port on each run, making it a bit more complex to connect to ZooKeeper with JMX tools. To set a static port for ZooKeeper's JMX metrics, make sure to add the following lines to your zkEnv.sh, usually located in /usr/share/zookeeper/bin (on *NIX):

  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port=9999
  -Dcom.sun.management.jmxremote.local.only=false
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false
  -Djava.rmi.server.hostname=<HOST IP>

These settings will open up port 9999 for JMX connections, without authentication or SSL enabled (for simplicity). To enable password authentication, see "To Set up a Single-User Environment" here.

Using JMX with an MBean browser like JConsole or Jmxterm, you can collect all of the metrics listed in part 2 (except zk_followers, for which you'll need the four-letter words). Below is a walkthrough using Jmxterm:

Connect to ZooKeeper's JMX port

  /usr/share/zookeeper/bin# java -jar /opt/datadog-agent/agent/checks/libs/jmxterm-1.0-DATADOG-uber.jar --url localhost:9999

Switch to the org.apache.ZooKeeperService domain

  $>domain org.apache.ZooKeeperService
  #domain is set to org.apache.ZooKeeperService

List the beans and select the first MBean from the output

  $>beans
  #domain = org.apache.ZooKeeperService:
  org.apache.ZooKeeperService:name0=StandaloneServer_port-1
  org.apache.ZooKeeperService:name0=StandaloneServer_port-1,name1=InMemoryDataTree

  $>bean org.apache.ZooKeeperService:name0=StandaloneServer_port-1
  #bean is set to org.apache.ZooKeeperService:name0=StandaloneServer_port-1

  $>info
  #mbean = org.apache.ZooKeeperService:name0=StandaloneServer_port-1
  #class name = org.apache.zookeeper.server.ZooKeeperServerBean
  # attributes
    %0   - AvgRequestLatency (long, r)
    %1   - ClientPort (java.lang.String, r)
    %2   - MaxClientCnxnsPerHost (int, rw)
    %3   - MaxRequestLatency (long, r)
    %4   - MaxSessionTimeout (int, rw)
    %5   - MinRequestLatency (long, r)
    %6   - MinSessionTimeout (int, rw)
    %7   - NumAliveConnections (long, r)
    %8   - OutstandingRequests (long, r)
    %9   - PacketsReceived (long, r)
    %10  - PacketsSent (long, r)
    %11  - StartTime (java.lang.String, r)
    %12  - TickTime (int, rw)
    %13  - Version (java.lang.String, r)
  # operations
    %0   - void resetLatency()
    %1   - void resetMaxLatency()
    %2   - void resetStatistics()

Get metric values

  $>get AvgRequestLatency
  #mbean = org.apache.ZooKeeperService:name0=StandaloneServer_port-1:
  AvgRequestLatency = 1;

The four-letter words

ZooKeeper emits operational data in response to a limited set of commands known as "the four-letter words". You can issue a four-letter word to ZooKeeper via telnet or nc.

The most important of the four-letter words is the mntr command.

If you are on your ZooKeeper node, you can see all of the ZooKeeper metrics from part 2, including zk_followers, with mntr:

  echo mntr | nc localhost 2181

  zk_version  3.4.5--1, built on 06/10/2013 17:26 GMT
  zk_avg_latency  0
  zk_max_latency  0
  zk_min_latency  0
  zk_packets_received 70
  zk_packets_sent 69
  zk_outstanding_requests 0
  zk_server_state leader
  zk_znode_count   4
  zk_watch_count  0
  zk_ephemerals_count 0
  zk_approximate_data_size    27
  zk_followers    4                   - only exposed by the Leader
  zk_synced_followers 4               - only exposed by the Leader
  zk_pending_syncs    0               - only exposed by the Leader
  zk_open_file_descriptor_count 23    - only available on Unix platforms
  zk_max_file_descriptor_count 1024   - only available on Unix platforms
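The same exchange is straightforward to script: open a TCP connection to the client port, send the four letters, and read until ZooKeeper closes the connection. Below is a minimal sketch using only the standard library. The function names are our own, the sample reply reuses values from the output above (without the explanatory annotations, which are not part of the real response), and we assume key and value are separated by whitespace. Note that newer ZooKeeper releases (3.5.3 and later, as far as we know) only answer commands listed in the 4lw.commands.whitelist setting, so mntr may need to be enabled there first.

```python
import socket


def four_letter_word(command, host="localhost", port=2181, timeout=5.0):
    """Send a four-letter word and return the reply (this makes a network call)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(reply):
    """Parse mntr output into a dict, converting integer values to int."""
    metrics = {}
    for line in reply.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        try:
            metrics[key] = int(value)
        except ValueError:
            metrics[key] = value  # e.g. zk_version, zk_server_state
    return metrics


# Sample reply assembled from the output above.
SAMPLE_REPLY = "\n".join([
    "zk_version\t3.4.5--1, built on 06/10/2013 17:26 GMT",
    "zk_avg_latency\t0",
    "zk_packets_received\t70",
    "zk_server_state\tleader",
    "zk_followers\t4",
])

print(parse_mntr(SAMPLE_REPLY))
```

On a ZooKeeper node you would combine the two as parse_mntr(four_letter_word("mntr")).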

Collection is only half the battle

In this post we've covered a few of the ways to access Hadoop and ZooKeeper metrics natively and with cluster-management tools. For production-ready monitoring, you will likely want a comprehensive monitoring system that ingests Hadoop performance metrics as well as key metrics from every other technology in your data-processing stack.

At Datadog, we have developed HDFS, MapReduce, YARN, and ZooKeeper integrations so that you can start collecting, graphing, and alerting on metrics from your clusters with a minimum of overhead. You can easily correlate Hadoop performance with system metrics from your cluster nodes, or with monitoring data from related technologies such as Kafka, Cassandra, and Spark.

For more details, check out our guide to monitoring Hadoop performance metrics with Datadog, or get started right away with a free trial.

Acknowledgments

Special thanks to Ian Wrigley, Director of Education Services at Confluent, for graciously sharing his Hadoop expertise for this article.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.