
How to monitor Redis performance metrics

By Evan Mouzakitis (@vagelim)

Last updated: November 7, 2017

Editor’s note: Redis uses the terms “master” and “slave” to describe its architecture and certain metrics. When possible, Datadog does not use these terms. Except when referring to specific metric names for clarity, we will replace these words with “primary” and “replica.”

What is Redis?

Redis is a popular in-memory key/value data store. Known for its performance and simple onboarding, Redis has found uses across industries and use cases, including as a:

  • Database: as an alternative to a traditional disk-based database, Redis trades durability for speed, though asynchronous disk persistence is available. Redis offers a rich set of data primitives and an unusually extensive list of commands.
  • Message queue: Redis’s blocking list commands and low latency make it a good backend for a message broker service.
  • Memory cache: Configurable key eviction policies, including the popular Least Recently Used policy and, as of Redis 4.0, the Least Frequently Used policy, make Redis a great choice as a cache server. Unlike a traditional cache, Redis also allows persistence to disk to improve reliability.

Redis is available as a free, open-source product. Commercial support is available, as is fully managed Redis-as-a-service.

Redis is employed by many high-traffic websites and applications such as Twitter, GitHub, Docker, Pinterest, Datadog, and Stack Overflow.

Key Redis Metrics

Monitoring Redis can help catch problems in two areas: resource issues with Redis itself, and problems arising elsewhere in your supporting infrastructure.

In this article we go through the most important Redis metrics in each of the following categories: performance, memory, basic activity, persistence, and errors.

Note that we use metric terminology introduced in our Monitoring 101 series, which provides a basic framework for metric collection and alerting.

Performance metrics

Along with low error rates, good performance is one of the best top-level indicators of system health. Poor performance is most commonly caused by memory issues, as described in the Memory section.

| Name | Description | Metric Type |
| --- | --- | --- |
| latency | Average time for the Redis server to respond to a request | Work: Performance |
| instantaneous_ops_per_sec | Total number of commands processed per second | Work: Throughput |
| hit rate (calculated) | keyspace_hits / (keyspace_hits + keyspace_misses) | Work: Success |

Metric to alert on: latency

Latency is the measurement of the time between a client request and the actual server response. Tracking latency is the most direct way to detect changes in Redis performance. Due to the single-threaded nature of Redis, outliers in your latency distribution could cause serious bottlenecks. A long response time for one request increases the latency for all subsequent requests.

Once you have determined that latency is an issue, there are several measures you can take to diagnose and address performance problems. Refer to the section “Using the latency command to improve performance” on page 14 of our white paper, Understanding the Top 5 Redis Performance Metrics.
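One quick way to sample latency is the built-in `redis-cli --latency` mode. As a rough equivalent, here is a minimal sketch using the redis-py client (the connection details and sample count are assumptions for illustration):

```python
import time

import redis

# Connection details are assumptions for this sketch
r = redis.Redis(host="localhost", port=6379)

# Time PING round trips as a rough latency sample,
# similar in spirit to `redis-cli --latency`.
samples = []
for _ in range(100):
    start = time.perf_counter()
    r.ping()
    samples.append((time.perf_counter() - start) * 1000)

avg = sum(samples) / len(samples)
print(f"min/avg/max: {min(samples):.2f}/{avg:.2f}/{max(samples):.2f} ms")
```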

Redis Latency graph

Metric to watch: instantaneous_ops_per_sec

Tracking the throughput of commands processed is critical for diagnosing causes of high latency in your Redis instance. High latency can be caused by a number of issues, from a backlogged command queue, to slow commands, to network link overutilization. You could investigate by measuring the number of commands processed per second—if it remains nearly constant, the cause is not a computationally intensive command. If one or more slow commands are causing the latency issues, you would see your number of commands per second drop or stall completely.

A drop in the number of commands processed per second as compared to historical norms could be a sign of either low command volume or slow commands blocking the system. Low command volume could be normal, or it could be indicative of problems upstream. Identifying slow commands is detailed in Part 2 about Collecting Redis Metrics.
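You can read this throughput gauge directly from the stats section of INFO; a minimal redis-py sketch (connection details are assumptions):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# instantaneous_ops_per_sec is a gauge Redis computes over a short window
ops = r.info("stats")["instantaneous_ops_per_sec"]
print(f"commands processed per second: {ops}")
```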

Redis Commands per second graph

Metric to watch: hit rate

When using Redis as a cache, monitoring the cache hit rate can tell you if your cache is being used effectively or not. A low hit rate means that clients are looking for keys that no longer exist. Redis does not report a hit rate metric directly, but we can calculate it as follows:

hit rate = keyspace_hits / (keyspace_hits + keyspace_misses)

The keyspace_misses metric is discussed in the Error metrics section.
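As a concrete example, here is a minimal sketch of that calculation with redis-py (connection details are assumptions for illustration):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
stats = r.info("stats")

hits = stats["keyspace_hits"]
misses = stats["keyspace_misses"]
total = hits + misses

# Guard against division by zero on a freshly started instance
hit_rate = hits / total if total else None
print(f"hit rate: {hit_rate:.2%}" if hit_rate is not None else "no lookups yet")
```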

A low cache hit rate can be caused by a number of factors, including data expiration and insufficient memory allocated to Redis (which could cause key eviction). Low hit rates could cause increases in the latency of your applications, because they have to fetch data from a slower, alternative resource.

Memory metrics

| Name | Description | Metric Type |
| --- | --- | --- |
| used_memory | Amount of memory (in bytes) used by Redis | Resource: Utilization |
| mem_fragmentation_ratio | Ratio of memory allocated by the operating system to memory requested by Redis | Resource: Saturation |
| evicted_keys | Number of keys removed due to reaching the maxmemory limit | Resource: Saturation |
| blocked_clients | Clients blocked while waiting on BLPOP, BRPOP, or BRPOPLPUSH | Other |

Metric to watch: used_memory

Memory usage is a critical component of Redis performance. If used_memory exceeds the total available system memory, the operating system will begin swapping old/unused sections of memory. Every swapped section is written to disk, severely affecting performance. Writing or reading from disk is up to 5 orders of magnitude (100,000x!) slower than writing or reading from memory (0.1 µs for memory vs. 10 ms for disk).

You can configure Redis to remain confined to a specified amount of memory. Setting the maxmemory directive in the redis.conf file gives you direct control over Redis’s memory usage. Enabling maxmemory requires you to configure an eviction policy for Redis to determine how it should free up memory. Read more about configuring the maxmemory-policy directive in the evicted_keys section.
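The limit can also be changed at runtime. A minimal sketch with redis-py (the 1gb value is purely illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Equivalent to `redis-cli CONFIG SET maxmemory 1gb`;
# note that runtime changes are not written back to redis.conf.
r.config_set("maxmemory", "1gb")
print(r.config_get("maxmemory"))  # e.g. {'maxmemory': '1073741824'}
```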

Redis Memory used by host
This “flat line” pattern is common for Redis when it is used as a cache; all available memory is consumed, and old data is evicted at the same rate that new data is inserted.

Metric to alert on: mem_fragmentation_ratio

The mem_fragmentation_ratio metric gives the ratio of memory used as seen by the operating system (used_memory_rss) to memory allocated by Redis (used_memory).

mem_fragmentation_ratio = used_memory_rss / used_memory

The operating system is responsible for allocating physical memory to each process. The operating system’s virtual memory manager handles the actual mapping, mediated by a memory allocator.

What does this mean? If your Redis instance has a memory footprint of 1GB, the memory allocator will first attempt to find a contiguous memory segment to store the data. If no contiguous segment is found, the allocator must divide the process’s data across segments, leading to increased memory overhead.

Tracking fragmentation ratio is important for understanding your Redis instance’s performance. A fragmentation ratio greater than 1 indicates fragmentation is occurring. A ratio in excess of 1.5 indicates excessive fragmentation, with your Redis instance consuming 150% of the physical memory it requested. A fragmentation ratio below 1 tells you that Redis needs more memory than is available on your system, which leads to swapping. Swapping to disk will cause significant increases in latency (see used memory). Ideally, the operating system would allocate a contiguous segment in physical memory, with a fragmentation ratio equal to 1 or slightly greater.

If your server is suffering from a fragmentation ratio above 1.5, restarting your Redis instance will allow the operating system to recover memory previously unusable due to fragmentation. In this case, an alert as a notification is probably sufficient.

If, however, your Redis server has a fragmentation ratio below 1, you may want to alert as a page so that you can quickly increase available memory or reduce memory usage.
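A minimal sketch of such a check with redis-py, using the thresholds discussed above (connection details and alert wording are assumptions):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# mem_fragmentation_ratio = used_memory_rss / used_memory
ratio = r.info("memory")["mem_fragmentation_ratio"]

if ratio < 1.0:
    print(f"PAGE: ratio {ratio} < 1 -- Redis memory is likely being swapped")
elif ratio > 1.5:
    print(f"NOTIFY: ratio {ratio} > 1.5 -- consider restarting the instance")
else:
    print(f"OK: fragmentation ratio is {ratio}")
```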

Starting in Redis 4, a new active defragmentation feature is available when Redis is configured to use the included copy of jemalloc. This tool can be configured to kick in when fragmentation reaches a certain level; it then copies values into contiguous memory regions and releases the old copies, reducing fragmentation while the server is running.
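On a supported build, active defragmentation can be toggled at runtime; a minimal sketch (requires a Redis 4+ server compiled with jemalloc):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Equivalent to `redis-cli CONFIG SET activedefrag yes`;
# Redis returns an error if the build does not support defragmentation.
r.config_set("activedefrag", "yes")
```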

Metric to alert on: evicted_keys (Cache only)

If you are using Redis as a cache, you may want to configure it to automatically purge keys upon hitting the maxmemory limit. If you are using Redis as a database or a queue, you may prefer swapping to eviction, in which case you can skip this metric.

Tracking key eviction is important because Redis processes each operation sequentially, meaning that evicting a large number of keys can lead to lower hit rates and thus longer latency times. If you are using TTLs, you may not expect ever to evict keys; in that case, if this metric is consistently above zero, you are likely to see an increase in latency in your instance. Most other configurations that don't use TTLs will eventually run out of memory and start evicting keys. So long as your response times remain acceptable, a stable eviction rate is no cause for alarm.
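One way to watch eviction in practice is to sample the evicted_keys counter and derive a rate; a minimal redis-py sketch (the interval length is arbitrary):

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)

# evicted_keys is a monotonically increasing counter;
# sample it twice to derive evictions per second.
before = r.info("stats")["evicted_keys"]
time.sleep(10)
after = r.info("stats")["evicted_keys"]

print(f"evictions/sec: {(after - before) / 10:.2f}")
```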

You can configure the key expiration policy with the command:

redis-cli CONFIG SET maxmemory-policy <policy>

where policy is one of the following:

  • noeviction returns an error when the memory limit has been reached and a user attempts to add additional keys
  • volatile-lru removes the least recently used key among the ones with an expiration set
  • volatile-ttl removes a key with the shortest remaining time to live among the ones with an expiration set
  • volatile-random removes a random key among the ones with an expiration set
  • allkeys-lru removes the least recently used key from the set of all keys
  • allkeys-random removes a random key from the set of all keys
  • volatile-lfu (added in Redis 4) removes the least frequently used key among the ones with an expiration set
  • allkeys-lfu (added in Redis 4) removes the least frequently used key from the set of all keys

NOTE: For performance reasons, Redis does not actually sample the entire keyspace when using an LRU, TTL, or (as of Redis 4) LFU policy. Instead, it first samples a random subset of the keyspace, and then applies the eviction strategy to that sample. Generally, newer (>=3) versions of Redis employ an LRU sampling strategy that more closely approximates true LRU. The LFU policy can be tuned by setting, for example, how much time must pass without access for an item to move down in rank. See Redis’s documentation for more details.

Metric to watch: blocked_clients

Redis offers a number of blocking commands which operate on lists. BLPOP, BRPOP, and BRPOPLPUSH are blocking variants of the commands LPOP, RPOP, and RPOPLPUSH, respectively. When the source list is non-empty, the commands perform as expected. However, when the source list is empty, the blocking commands will wait until the source is filled, or a timeout is reached.
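For illustration, a minimal sketch of this blocking behavior with redis-py (the key name and timeout are arbitrary); while it waits, this client counts toward blocked_clients:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# BLPOP waits up to 5 seconds for an element to appear in "jobs";
# it returns a (key, value) tuple if data arrives, or None on timeout.
item = r.blpop("jobs", timeout=5)
print(item if item else "timed out waiting on an empty list")
```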

An increase in the number of blocked clients waiting on data could be a sign of trouble. Latency or other issues could be preventing the source list from being filled. Although a blocked client in itself is not cause for alarm, if you are seeing a consistently nonzero value for this metric you should investigate.

Basic activity metrics

Note: This section includes metrics that use the terms “master” and “slave.” Except when referring to specific metric names, this article replaces them with “primary” and “replica.”

| Name | Description | Metric Type |
| --- | --- | --- |
| connected_clients | Number of clients connected to Redis | Resource: Utilization |
| connected_slaves | Number of replicas connected to the current primary instance | Other |
| master_last_io_seconds_ago | Time in seconds since the last interaction between replica and primary | Other |
| keyspace | Total number of keys in your database | Resource: Utilization |

Metric to alert on: connected_clients

Because access to Redis is usually mediated by an application (users do not generally directly access the database), for most uses, there will be reasonable upper and lower bounds for the number of connected clients. If the number leaves the normal range, this could indicate a problem. If it is too low, upstream connections may have been lost, and if it is too high, the large number of concurrent client connections could overwhelm your server’s ability to handle requests.

Regardless, the maximum number of client connections is always a limited resource—whether imposed by the operating system, Redis’s configuration, or the network. Monitoring client connections helps you ensure you have enough free resources available for new clients or an administrative session.
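Both the connected and blocked client counts are exposed in the clients section of INFO; a minimal redis-py sketch:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
clients = r.info("clients")

print("connected clients:", clients["connected_clients"])
print("blocked clients:", clients["blocked_clients"])
```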

Metric to alert on: connected_slaves

If your database is read-heavy, you are probably making use of the primary-replica database replication features available in Redis. In this case, monitoring the number of connected replicas is key. Should the number of connected replicas change unexpectedly, it could indicate a down host or problem with the replica instance.

Redis Primary-Replica

NOTE: In the diagram above, the Redis primary would show that it has two connected replicas, and each first-level replica would likewise report two connected replicas of its own. Because the second-level replicas are not directly connected to the Redis primary, they are not included in the primary’s count of connected replicas.

Metric to alert on: master_last_io_seconds_ago

When using Redis’s replication features, replica instances regularly check in with their primary. A long interval without communication could indicate a problem on your primary Redis server, on the replica, or somewhere in between. You also run the risk of replicas serving old data that may have changed since the last sync. Minimizing disruption of primary-replica communication is critical due to the way Redis performs synchronization. When a replica reconnects to a primary after an interruption, it sends a PSYNC command to attempt a partial synchronization of only the commands missed during the outage. If partial synchronization isn’t possible, the replica requests a full SYNC, which causes the primary to immediately begin a background save of the database to disk while buffering all new commands that will modify the dataset. When the background save is complete, the data is sent to the replica along with the buffered commands. Each time a replica performs a full SYNC, it can cause a significant spike in latency on the primary instance.
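A minimal sketch that reads the relevant fields from the replication section of INFO (which fields are present depends on the instance’s role; the connection details are assumptions):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
repl = r.info("replication")

if repl["role"] == "master":
    print("connected replicas:", repl["connected_slaves"])
else:  # INFO still reports a replica's role as "slave"
    print("seconds since last primary I/O:", repl["master_last_io_seconds_ago"])
    print("link status:", repl["master_link_status"])
```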

Metric to watch: keyspace

Keeping track of the number of keys in your database is generally a good idea. As an in-memory data store, the larger the keyspace, the more physical memory Redis requires to ensure optimal performance. Redis will continue to add keys until it reaches the maxmemory limit, at which point it begins evicting keys at the same rate new ones come in. This results in a “flat line” graph, like the one above.
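The per-database key counts are exposed in the keyspace section of INFO; a minimal redis-py sketch that totals them:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# INFO keyspace returns one entry per non-empty database,
# e.g. {"db0": {"keys": 1500, "expires": 3, "avg_ttl": 0}}
keyspace = r.info("keyspace")
total_keys = sum(db["keys"] for db in keyspace.values())
print(f"total keys: {total_keys}")
```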

If you are using Redis as a cache and see keyspace saturation—as in the graph above—coupled with a low hit rate, you may have clients requesting old or evicted data. Tracking your number of keyspace_misses over time will help you pinpoint the cause.

Alternatively, if you are using Redis as a database or queue, volatile keys may not be an option. As your keyspace grows, you may want to consider adding memory to your box or splitting your dataset across hosts, if possible. Adding more memory is a simple and effective solution. When more resources are needed than one box can provide, partitioning or sharding your data allows you to combine the resources of many computers. With a partitioning plan in place, Redis can store more keys without evictions or swapping. However, applying a partitioning plan is much more challenging than swapping in a few memory sticks. Thankfully, the Redis documentation has a great section on implementing a partitioning scheme with your Redis instances.

Redis Keys by host

Persistence metrics

| Name | Description | Metric Type |
| --- | --- | --- |
| rdb_last_save_time | Unix timestamp of the last save to disk | Other |
| rdb_changes_since_last_save | Number of changes to the database since the last dump | Other |

Enabling persistence can be necessary, especially if you are using Redis’s replication features. Because replicas blindly copy any changes made to the primary, if your primary instance were to restart without persistence, all replicas connected to it would copy its now-empty dataset.

Persistence may not be necessary if you are using Redis as a cache or in a use case where loss of data would be inconsequential.

Metrics to watch: rdb_last_save_time and rdb_changes_since_last_save

Generally, it is a good idea to keep an eye on the volatility of your data set. Excessively long time intervals between writes to disk could cause data loss in the event of server failure. Any changes made to the data set between the last save time and time of failure are lost.

Monitoring rdb_changes_since_last_save gives you more insight into your data volatility. Long time intervals between writes are not a problem if your data set has not changed much in that interval. Tracking both metrics gives you a good idea of how much data you would lose should failure occur at a given point in time.
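A minimal sketch combining the two metrics to estimate exposure to data loss at a given moment (connection details are assumptions):

```python
import time

import redis

r = redis.Redis(host="localhost", port=6379)
persistence = r.info("persistence")

seconds_since_save = time.time() - persistence["rdb_last_save_time"]
pending_changes = persistence["rdb_changes_since_last_save"]

# If the server failed right now, roughly `pending_changes` writes
# accumulated over the last `seconds_since_save` seconds would be lost.
print(f"{pending_changes} unsaved changes over the last {seconds_since_save:.0f}s")
```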

Error metrics

Note: This section includes metrics using the terms “master” and “slave.” Except when referring to specific metric names, this article replaces them with “primary” and “replica.”

Redis error metrics can alert you to anomalous conditions. The following metrics track common errors:

| Name | Description | Metric Type |
| --- | --- | --- |
| rejected_connections | Number of connections rejected due to hitting the maxclients limit | Resource: Saturation |
| keyspace_misses | Number of failed lookups of keys | Resource: Errors / Other |
| master_link_down_since_seconds | Time in seconds that the link between primary and replica has been down | Resource: Errors |

Metric to alert on: rejected_connections

Redis is capable of handling many active connections; by default, 10,000 client connections are available. You can set a different maximum with the maxclients directive in redis.conf. Once your Redis instance is at its connection limit, any new connection attempt will be refused with an error.


Note that your system may not support the number of connections you request with the maxclients directive. Redis checks with the kernel to determine the number of available file descriptors. If the number of available file descriptors is smaller than maxclients + 32 (Redis reserves 32 file descriptors for its own use), the maxclients directive is ignored and the number of available file descriptors is used.

Refer to the documentation on redis.io for more information on how Redis handles client connections.
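A minimal sketch comparing current connections against the configured limit and checking the rejection counter (connection details are assumptions):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

maxclients = int(r.config_get("maxclients")["maxclients"])
connected = r.info("clients")["connected_clients"]
rejected = r.info("stats")["rejected_connections"]

print(f"{connected}/{maxclients} connections in use; {rejected} rejected so far")
```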

Metric to alert on: keyspace_misses

Every time Redis looks up a key, there are only two possible outcomes: the key exists, or it does not. Looking up a nonexistent key increments the keyspace_misses counter. A consistently nonzero value for this metric means that clients are attempting to find keys in your database that do not exist. If you are not using Redis as a cache, keyspace_misses should be at or near zero. Note that any blocking operation (BLPOP, BRPOP, or BRPOPLPUSH) called on an empty key will also increment keyspace_misses.

Metric to alert on: master_link_down_since_seconds

This metric is only available when the connection between a primary and its replica has been lost. Ideally, this value should never exceed zero: the primary and replica should be in constant communication to ensure the replica is not serving stale data. A large interval between connections should be addressed. Remember, upon reconnecting, your primary Redis instance will need to devote resources to updating the data on the replica, which can cause an increase in latency.

Conclusion

In this post we’ve mentioned some of the most useful metrics you can monitor to keep tabs on your Redis servers. If you are just getting started with Redis, monitoring the metrics highlighted above will provide good visibility into the health and performance of your database infrastructure.

Eventually you will recognize additional metrics that are particularly relevant to your own infrastructure and use cases. Of course, what you monitor will depend on the tools you have and the metrics available to you. In the next part of this series, we’ll provide step-by-step instructions on how to collect Redis metrics. Then, follow on to Part 3 where we’ll show you how to visualize key metrics, request traces, and logs with Datadog Application Performance Monitoring.


Source Markdown for this post is available on GitHub. Questions, corrections, additions, etc.? Please let us know.