
Key metrics for RabbitMQ monitoring

Published: January 24, 2018

What is RabbitMQ?

RabbitMQ is a message broker, a tool for implementing a messaging architecture. Some parts of your application publish messages, others consume them, and RabbitMQ routes them between producers and consumers. The broker is well suited for loosely coupled microservices. If no service or part of the application can handle a given message, RabbitMQ keeps the message in a queue until it can be delivered. RabbitMQ leaves it to your application to define the details of routing and queuing, which depend on the relationships of objects in the broker: exchanges, queues, and bindings.

If your application is built around RabbitMQ messaging, then comprehensive monitoring requires gaining visibility into the broker itself. RabbitMQ exposes metrics for all of its main components, giving you insight into your message traffic and how it affects the rest of your system.

Out-of-the-box screenboard for RabbitMQ monitoring

How RabbitMQ works

RabbitMQ runs as an Erlang runtime called a node. A RabbitMQ server can include one or more nodes, and a cluster of nodes can operate on a single machine or across several. Connections to RabbitMQ take place over TCP, making RabbitMQ suitable for a distributed setup. While RabbitMQ supports a number of protocols, it natively implements AMQP (Advanced Message Queuing Protocol) and extends some of its concepts.

At the heart of RabbitMQ is the message. Messages feature a set of headers and a binary payload. Any sort of data can make up a message. It is up to your application to parse the headers and use this information to interpret the payload.

The parts of your application that join up with the RabbitMQ server are called producers and consumers. A producer is anything that publishes a message, which RabbitMQ then routes to another part of your application: the consumer. RabbitMQ clients are available in a range of languages, letting you implement messaging in most applications.

RabbitMQ passes messages through abstractions within the server called exchanges and queues. When your application publishes a message, it publishes to an exchange. An exchange routes a message to a queue. Queues wait for a consumer to be available, then deliver the message.

You’ll notice that a message going from a producer to a consumer moves through two intermediary points, an exchange and a queue. This separation lets you specify the logic of routing messages. There can be multiple exchanges per queue, multiple queues per exchange, or a one-to-one mapping of queues and exchanges. Which queue an exchange delivers to depends on the type of the exchange and the bindings between them. While RabbitMQ defines the basic behaviors of queues and exchanges, how they relate is up to the needs of your application.
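To make this concrete, here is a minimal sketch using the pika Python client; the exchange, queue, and routing key names are hypothetical, and a broker is assumed to be listening on localhost:

```python
import pika

# Connect to a RabbitMQ broker (assumed to be listening on localhost:5672).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Declare an exchange and a queue, then bind them with a routing key.
channel.exchange_declare(exchange="orders", exchange_type="direct")
channel.queue_declare(queue="order_processing")
channel.queue_bind(queue="order_processing", exchange="orders", routing_key="new_order")

# Publishing targets the exchange; the exchange routes the message to the
# bound queue, which holds it until a consumer is available.
channel.basic_publish(exchange="orders", routing_key="new_order", body=b"order #1")

connection.close()
```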

There are many possible design patterns. You might use work queues, a publish/subscribe pattern, or a Remote Procedure Call (as seen in OpenStack Nova), just to name examples from the official tutorial. The design of your RabbitMQ setup depends on how you configure its application objects (nodes, queues, exchanges…). RabbitMQ exposes metrics for each of these, letting you measure message traffic, resource use, and more.

RabbitMQ monitoring - a diagram of RabbitMQ's core elements

Key RabbitMQ metrics

With so many moving parts within the RabbitMQ server, and so much room for configuration, you’ll want to make sure your messaging setup is working as efficiently as possible. As we’ve seen, RabbitMQ has a whole cast of abstractions, and each has its own metrics. These include exchange performance, node resource use, connection performance, and queue performance.

This post, the first in the series, is a tour through these metrics. In some cases, the metrics have to do with RabbitMQ-specific abstractions, such as queues and exchanges. Other components of a RabbitMQ application demand attention to the same metrics that you’d monitor in the rest of your infrastructure, such as storage and memory resources.

You can gather RabbitMQ metrics through a set of plugins and built-in tools. One is rabbitmqctl, a RabbitMQ command line interface that lists queues, exchanges, and so on, along with various metrics. Another is a management plugin that reports metrics from a local web server. Several tools report events. We’ll tell you how to use these tools in Part 2.
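For a taste of what that looks like, here is a sketch of querying the management plugin’s HTTP API with Python. It assumes the plugin is enabled on its default port (15672) and uses the default guest/guest credentials, which only work from localhost:

```python
import requests

BASE = "http://localhost:15672/api"  # management plugin's HTTP API
AUTH = ("guest", "guest")            # default, localhost-only credentials

overview = requests.get(f"{BASE}/overview", auth=AUTH).json()
# message_stats holds broker-wide counts and rates (absent until messages flow).
print(overview.get("message_stats", {}))
```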

Exchange performance

Exchanges tell your messages where to go. Monitoring exchanges lets you see whether messages are being routed as expected.

| Name | Description | Metric type | Availability |
| --- | --- | --- | --- |
| Messages published in | Messages published to an exchange (as a count and a rate per second) | Work: Throughput | management plugin |
| Messages published out | Messages that have left an exchange (as a count and a rate per second) | Work: Throughput | management plugin |
| Messages unroutable | Count of messages not routed to a queue | Work: Errors | management plugin |

Metrics to watch: Messages published in, Messages published out

When RabbitMQ performs work, that work is done with messages: routing, queuing, and delivering them. Counts and rates of deliveries are available as metrics, including the number of messages that have entered an exchange and the number that have left it. Both metrics are available as rates (see the discussion of the management plugin in Part 2). These are key indicators of throughput.

Metric to alert on: Messages unroutable

In RabbitMQ, you specify how a message will move from an exchange to a queue by defining bindings. If a message falls outside the rules of your bindings, it is considered unroutable. In some cases, such as a publish/subscribe pattern, it may not be important for consumers to receive every message. In others, you may want to keep missed messages to a minimum. RabbitMQ’s implementation of AMQP includes a way to detect unroutable messages, sending them to a dedicated alternate exchange. In the management plugin (see Part 2), use the return_unroutable metric, constraining the count to a given time interval. If some messages have not been routed properly, the rate of messages published into an exchange will also exceed the rate published out of it, suggesting that messages have been lost.
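If you want unroutable messages captured rather than dropped, one approach is to declare a catch-all alternate exchange and point the main exchange at it. A pika sketch (exchange and queue names hypothetical):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A fanout exchange and queue to catch anything the main exchange can't route.
channel.exchange_declare(exchange="unroutable", exchange_type="fanout")
channel.queue_declare(queue="unroutable_messages")
channel.queue_bind(queue="unroutable_messages", exchange="unroutable")

# The "alternate-exchange" argument sends unroutable messages to the catch-all.
channel.exchange_declare(
    exchange="orders",
    exchange_type="direct",
    arguments={"alternate-exchange": "unroutable"},
)
```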

Nodes

RabbitMQ runs inside an Erlang runtime system called a node. For this reason, the node is the primary reference point for observing the resource use of your RabbitMQ setup.

When use of certain resources reaches a threshold, RabbitMQ triggers an alarm and blocks connections. These connections appear as blocking in built-in monitoring tools, but it is left to the user to set up notifications (see Part 2). For this reason, monitoring resource use across your RabbitMQ system is necessary for ensuring availability.

| Name | Description | Metric type | Availability |
| --- | --- | --- | --- |
| File descriptors used | Count of file descriptors used by RabbitMQ processes | Resource: Utilization | management plugin, rabbitmqctl |
| File descriptors used as sockets | Count of file descriptors used as network sockets by RabbitMQ processes | Resource: Utilization | management plugin, rabbitmqctl |
| Disk space used | Bytes of disk used by a RabbitMQ node | Resource: Utilization | management plugin, rabbitmqctl |
| Memory used | Bytes in RAM used by a RabbitMQ node (categorized by use) | Resource: Utilization | management plugin, rabbitmqctl |

Metrics to alert on: File descriptors used, file descriptors used as sockets

As you increase the number of connections to your RabbitMQ server, RabbitMQ uses a greater number of file descriptors and network sockets. Since RabbitMQ will block new connections for nodes that have reached their file descriptor limit, monitoring the number of available file descriptors helps you keep your system running (how you configure the file descriptor limit depends on your operating system). On the front page of the management plugin UI, you’ll see a count of file descriptors for each node. You can also fetch this information through the HTTP API (see Part 2). This timeseries graph shows what happens to the count of file descriptors used when we add, then remove, connections to the RabbitMQ server.
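For instance, a sketch along these lines (again assuming the management plugin on localhost with default credentials) reads each node’s file descriptor and socket counts from the /api/nodes endpoint:

```python
import requests

AUTH = ("guest", "guest")
nodes = requests.get("http://localhost:15672/api/nodes", auth=AUTH).json()

for node in nodes:
    # fd_total and sockets_total are the limits RabbitMQ enforces per node.
    print(node["name"])
    print("  file descriptors:", node["fd_used"], "/", node["fd_total"])
    print("  sockets:         ", node["sockets_used"], "/", node["sockets_total"])
```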

RabbitMQ Monitoring: file descriptors used by RabbitMQ nodes

Metric to alert on: Disk space used

RabbitMQ goes into a state of alarm when the available disk space of a given node drops below a threshold. Alarms notify your application by passing an AMQP method, connection.blocked, which RabbitMQ clients handle differently (e.g. Ruby, Python). The default threshold is 50MB, and the number is configurable. RabbitMQ checks the storage of a given drive or partition every 10 seconds, and checks more frequently closer to the threshold. Disk alarms impact your whole cluster: once one node hits its threshold, the rest will stop accepting messages. By monitoring storage at the level of the node, you can make sure your RabbitMQ cluster remains available. If storage becomes an issue, you can check queue-level metrics and see which parts of your RabbitMQ setup demand the most disk space.

Metric to alert on: Memory used

As with storage, RabbitMQ alerts on memory. Once a node’s RAM utilization exceeds a threshold, RabbitMQ blocks all connections that are publishing messages. If your application requires a different threshold than the default of 40 percent, you can set the vm_memory_high_watermark in your RabbitMQ configuration file. Monitoring the memory your nodes consume can help you avoid surprise memory alarms and throttled connections.
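The same /api/nodes endpoint reports each node’s position relative to both alarm thresholds, so a rough check (same localhost and credential assumptions as above) might look like:

```python
import requests

AUTH = ("guest", "guest")
nodes = requests.get("http://localhost:15672/api/nodes", auth=AUTH).json()

for node in nodes:
    # mem_limit reflects the vm_memory_high_watermark; disk_free_limit is
    # the disk alarm threshold (50MB by default).
    mem_headroom = node["mem_limit"] - node["mem_used"]
    disk_headroom = node["disk_free"] - node["disk_free_limit"]
    if mem_headroom <= 0 or disk_headroom <= 0:
        print(f"{node['name']} has hit a resource alarm threshold")
```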

The challenge for monitoring memory in RabbitMQ is that it’s used across your setup, at different scales and different points within your architecture, for application-level abstractions such as queues as well as for dependencies like Mnesia, Erlang’s internal database management system. A crucial step in monitoring memory is to break it down by use. In Part 2, we’ll cover tools that let you list application objects by memory and visualize that data in a graph.

Connection performance

Any traffic in RabbitMQ flows through a TCP connection. Messages in RabbitMQ implement the structure of the AMQP frame: a set of headers for attributes like content type and routing key, as well as a binary payload that contains the content of the message. RabbitMQ is well suited for a distributed network, and even single-machine setups work through local TCP connections. Like monitoring exchanges, monitoring your connections helps you understand your application’s messaging traffic. While exchange-level metrics are observable in terms of RabbitMQ-specific abstractions such as message rates, connection-level metrics are reported in terms of computational resources.

| Name | Description | Metric type | Availability |
| --- | --- | --- | --- |
| Data rates | Number of octets sent/received within a TCP connection per second | Resource: Utilization | management plugin |

Metrics to watch: Data rates

The logic of publishing, routing, queuing, and subscribing is independent of a message’s size. RabbitMQ queues deliver messages first in, first out, and leave it to consumers to parse message content. From the perspective of a queue, all messages are equal.

One way to get insight into the payloads of your messages, then, is by monitoring the data that travels through a connection. If you’re seeing a rise in memory or storage in your nodes, the messages moving to consumers through a connection may be holding a greater payload. Whether the messages use memory or storage depends on your persistence settings, which you can monitor along with your queues. A rise in the rate of sent octets may explain spikes in storage and memory use downstream.
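One way to watch these rates, sketched here under the same management plugin assumptions, is the /api/connections endpoint, whose per-connection objects include cumulative byte counts and derived rates:

```python
import requests

AUTH = ("guest", "guest")
conns = requests.get("http://localhost:15672/api/connections", auth=AUTH).json()

for conn in conns:
    # send_oct/recv_oct are cumulative octet counts; the *_details
    # objects carry the corresponding per-second rates.
    sent = conn.get("send_oct_details", {}).get("rate", 0)
    received = conn.get("recv_oct_details", {}).get("rate", 0)
    print(f"{conn['name']}: sent {sent} B/s, received {received} B/s")
```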

Queue performance

Queues receive, push, and store messages. After the exchange, the queue is a message’s final stop within the RabbitMQ server before it reaches your application. In addition to observing your exchanges, then, you will want to monitor your queues. Since the message is the top-level unit of work in RabbitMQ, monitoring queue traffic is one way of measuring your application’s throughput and performance.

| Name | Description | Metric type | Availability |
| --- | --- | --- | --- |
| Queue depth | Count of all messages in the queue | Resource: Saturation | rabbitmqctl |
| Messages unacknowledged | Count of messages a queue has delivered without receiving acknowledgment from a consumer | Resource: Error | rabbitmqctl |
| Messages ready | Count of messages available to consumers | Other | rabbitmqctl |
| Message rates | Messages that move in or out of a queue per second, whether unacknowledged, delivered, acknowledged, or redelivered | Work: Throughput | management plugin |
| Messages persistent | Count of messages written to disk | Other | rabbitmqctl |
| Message bytes persistent | Sum in bytes of messages written to disk | Resource: Utilization | rabbitmqctl |
| Message bytes RAM | Sum in bytes of messages stored in memory | Resource: Utilization | rabbitmqctl |
| Number of consumers | Count of consumers for a given queue | Other | rabbitmqctl |
| Consumer utilization | Proportion of time that the queue can deliver messages to consumers | Resource: Availability | management plugin |

Metrics to watch: Queue depth, messages unacknowledged, and messages ready

Queue depth, or the count of messages currently in the queue, tells you a lot and very little: a queue depth of zero can indicate that your consumers are behaving efficiently or that a producer has thrown an error. The usefulness of queue depth depends on your application’s expected performance, which you can compare against queue depths for messages in specific states.

For instance, messages_ready indicates the number of messages that your queues have exposed to subscribing consumers. Meanwhile, messages_unacknowledged tracks messages that have been delivered but remain in a queue pending explicit acknowledgment (an ack) by a consumer. By comparing the values of messages, messages_ready, and messages_unacknowledged, you can understand the extent to which queue depth is due to success or failure elsewhere.
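As a sketch of that comparison (same management plugin assumptions as above), the /api/queues endpoint returns all three counts per queue:

```python
import requests

AUTH = ("guest", "guest")
queues = requests.get("http://localhost:15672/api/queues", auth=AUTH).json()

for q in queues:
    depth = q.get("messages", 0)
    ready = q.get("messages_ready", 0)
    unacked = q.get("messages_unacknowledged", 0)
    # A depth dominated by unacked messages points at slow or failing
    # consumers; a depth of ready messages points at a lack of consumers.
    print(f"{q['name']}: depth={depth} ready={ready} unacked={unacked}")
```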

Metrics to watch: Message rates

You can also retrieve rates for messages in different states of delivery. If your messages_unacknowledged rate is higher than usual, for example, there may be errors or performance issues downstream. If your deliveries per second are lower than usual, there may be issues with a producer, or your routing logic may have changed.

This dashboard shows message rates for three queues, all part of a test application that collects data about New York City.

RabbitMQ monitoring: Screenboard of message-related metrics

Metrics to watch: Messages persistent, message bytes persistent, and message bytes RAM

A queue may hold messages in memory or persist them to disk, preserving them as pairs of keys and values in a message store. The way RabbitMQ stores messages depends on whether your queues and messages are configured to be, respectively, durable and persistent. Transient messages, too, are written to disk under conditions of memory pressure. Since a queue consumes both storage and memory, and does so dynamically, it’s important to keep track of your queues’ resource metrics. For instance, you can compare two metrics, message_bytes_persistent and message_bytes_ram, to understand how your queue is allocating messages between resources.
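Whether a given message ends up in the persistent store is something you control at publish time. A minimal pika sketch (queue name hypothetical) declares a durable queue and marks the message itself as persistent:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A durable queue survives a broker restart...
channel.queue_declare(queue="tasks", durable=True)

# ...and delivery_mode=2 marks the message as persistent, so RabbitMQ
# writes it to disk instead of keeping it only in RAM.
channel.basic_publish(
    exchange="",           # default exchange routes by queue name
    routing_key="tasks",
    body=b"do the thing",
    properties=pika.BasicProperties(delivery_mode=2),
)
```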

Metric to alert on: Number of consumers

Since you configure consumers manually, an application running as expected should have a stable consumer count. A lower-than-expected count of consumers can indicate failures or errors in your application.

Metric to alert on: Consumer utilization

A queue’s consumers are not always able to receive messages. If you have configured a consumer to acknowledge messages manually, you can stop your queues from releasing more than a certain number at a time before they are consumed. This is your channel’s prefetch setting. If a consumer encounters an error and terminates, the proportion of time in which it can receive messages will shrink. By measuring consumer utilization, which the management plugin (see Part 2) reports as a percentage and as a decimal between 0 and 1, you can determine the availability of your consumers.
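A consumer-side sketch in pika (queue name hypothetical) shows both of the knobs mentioned here, the prefetch limit and the manual acknowledgment:

```python
import pika

def handle(channel, method, properties, body):
    print("received:", body)
    # Manual ack: until this call, the message counts as unacknowledged
    # against the channel's prefetch window.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tasks", durable=True)

# Prefetch: at most 10 unacknowledged messages in flight to this consumer.
channel.basic_qos(prefetch_count=10)
channel.basic_consume(queue="tasks", on_message_callback=handle)
channel.start_consuming()
```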

Get inside your messaging stack

Much of the work that takes place in your RabbitMQ setup is only observable in terms of abstractions within the server, such as exchanges and queues. RabbitMQ reports metrics on these abstractions in their own terms, for instance counting the messages that move through them. Abstractions you can monitor, and the metrics RabbitMQ reports for them, include:

  • Exchanges: Messages published in, messages published out, messages unroutable
  • Queues: Queue depth, messages unacknowledged, messages ready, messages persistent, message bytes persistent, message bytes RAM, number of consumers, consumer utilization

Monitoring your message traffic, you can make sure that the loosely coupled services within your application are communicating as intended.

You will also want to track the resources that your RabbitMQ setup consumes. Here you’ll monitor:

  • Nodes: File descriptors used, file descriptors used as sockets, disk space used, memory used
  • Connections: Octets sent and received

In Part 2 of this series, we’ll show you how to use a number of RabbitMQ monitoring tools. In Part 3, we’ll introduce you to comprehensive RabbitMQ monitoring with Datadog, including the RabbitMQ integration.

Acknowledgments

We wish to thank our friends at Pivotal for their technical review of this series.

Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.