ElastiCache dashboard (Redis)

What Is Amazon ElastiCache?

Amazon ElastiCache is a fully managed in-memory cache service offered through Amazon Web Services (AWS). Using a cache greatly improves throughput and reduces the latency of read-intensive workloads.

AWS allows you to choose between Redis and Memcached as ElastiCache’s caching engine. We will present the ElastiCache dashboard for Redis since it’s the most widely used caching engine.

AWS ElastiCache Dashboard Overview

ElastiCache can be monitored via several types of metrics, including host-level metrics, throughput and performance metrics, and memory metrics. An efficient cache can significantly improve your application's performance and responsiveness for users. That's why understanding these metrics is essential.

Below is an example of the customizable ElastiCache dashboard in Datadog, which helps you visualize the different metrics to monitor. However, even if you’re not a Datadog user, this example can act as a template when assembling your own comprehensive AWS ElastiCache monitoring dashboard using the Redis caching engine.

Datadog dashboard for ElastiCache

Read on for a widget-by-widget breakdown of the metrics in this sample ElastiCache dashboard, grouped into three key metric categories.

Host-Level Metrics

CPU Utilization by Node (Top 10)

It is important to track CPU utilization by node because sustained high CPU usage on a cache node often translates into higher request latency.

All AWS cache nodes with more than 2.78 GB of memory or good network performance are multicore. Be aware that with Redis, the extra cores will sit idle since this caching engine is single-threaded. Because the reported value is averaged across all cores, the actual utilization of the single active core is equal to the reported value multiplied by the number of cores.

AWS recommends that you set an alert threshold of 90 percent divided by the number of cores.
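Here is a minimal sketch of that threshold calculation in Python. The node types and core counts below are illustrative assumptions, not an authoritative table; check the vCPU count of your own node type.

```python
# Sketch of the recommended CPUUtilization alert threshold for Redis nodes.
# Redis is single-threaded, so only one core does the work; the reported metric
# is averaged across all cores, which is why the 90 percent target is divided
# by the core count. Node types and core counts here are illustrative.

NODE_CORES = {
    "cache.m5.large": 2,
    "cache.m5.xlarge": 4,
    "cache.r5.2xlarge": 8,
}

def redis_cpu_alert_threshold(node_type: str) -> float:
    """Return the recommended CPUUtilization alert threshold, in percent."""
    return 90.0 / NODE_CORES[node_type]

for node_type in NODE_CORES:
    print(f"{node_type}: alert at {redis_cpu_alert_threshold(node_type):.1f}% CPUUtilization")
# cache.m5.xlarge -> 22.5%, i.e. a single saturated core on a four-core host.
```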

Network Incoming Bytes by Node (Top 10)

The number of bytes read from the network by the host.

Network Outgoing Bytes by Node (Top 10)

The number of bytes written to the network by the host.

Throughput and Performance Metrics

Connections by Cluster

The number of current client connections to the cache. It is important to alert on this metric to make sure it never reaches the connection limit.
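As one way to watch this, the sketch below pulls the CurrConnections CloudWatch metric with boto3 and compares it to a limit. The cluster ID and the 65,000-connection limit are assumptions; substitute your own cluster and the maxclients value from your parameter group. It also assumes AWS credentials are configured in the environment.

```python
# Sketch: compare CurrConnections to an assumed connection limit.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def connection_usage(cache_cluster_id: str, limit: int = 65_000) -> float:
    """Return the recent peak connection count as a fraction of the assumed limit."""
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName="CurrConnections",
        Dimensions=[{"Name": "CacheClusterId", "Value": cache_cluster_id}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Maximum"],
    )
    datapoints = stats["Datapoints"]
    peak = max(dp["Maximum"] for dp in datapoints) if datapoints else 0.0
    return peak / limit

# "my-redis-cluster" is a hypothetical cluster ID.
if connection_usage("my-redis-cluster") > 0.9:
    print("Warning: connections are above 90% of the assumed limit")
```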

Hit Rate (%) by Cluster

This timeseries graph shows the cache hit rate, calculated from cache hits and misses: hits / (hits + misses). Cache hits are requests that were served from the cache without touching the backend; misses are requests that had to be answered by the backend because the item was not cached.

The hit rate measures your cache's efficiency. If it is too low, the cache might be too small for the working data set. A high hit rate, on the other hand, helps reduce your application's response time, ensure a smooth user experience, and protect your databases from excessive load.
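In Python, the calculation behind this widget looks like the following sketch. With the Redis engine, the inputs correspond to hit and miss counts summed over the graph's time window; the sample numbers are made up.

```python
# Sketch of the hit-rate formula used by this widget: hits / (hits + misses).

def hit_rate(hits: int, misses: int) -> float:
    """Return the cache hit rate as a percentage; 0.0 when there was no traffic."""
    total = hits + misses
    if total == 0:
        return 0.0
    return 100.0 * hits / total

print(hit_rate(hits=9_500, misses=500))  # 95.0 (a healthy cache)
print(hit_rate(hits=600, misses=400))    # 60.0 (likely undersized for the working set)
```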

Get Commands by Cluster

The number of Get commands received by your ElastiCache cluster.

Set Commands by Cluster

The number of Set commands received by your ElastiCache cluster.

Get Commands by Node (Top 10)

The number of Get commands received by the top 10 ElastiCache nodes.

Set Commands by Node (Top 10)

The number of Set commands received by the top 10 ElastiCache nodes. Monitoring this alongside Get commands by node lets you check that all nodes are healthy and that traffic is well balanced among them.
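As one way to quantify that balance, here is a small sketch that compares per-node command counts against a perfectly even split. It assumes you have already collected a count per node (for example, the per-node Get commands shown in this widget) over the same time window; the node names and numbers are hypothetical.

```python
# Sketch: flag nodes that handle a disproportionate share of traffic.

def traffic_skew(commands_per_node: dict[str, int]) -> float:
    """Return the busiest node's traffic relative to a perfectly even split (1.0)."""
    total = sum(commands_per_node.values())
    if total == 0:
        return 1.0
    even_share = total / len(commands_per_node)
    return max(commands_per_node.values()) / even_share

gets = {"node-001": 120_000, "node-002": 118_500, "node-003": 410_000}  # hypothetical counts
if traffic_skew(gets) > 1.5:
    print("One node is handling far more Get traffic than its peers; check key distribution")
```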

Replication Lag by Node

This graph tracks the time it takes for a cache replica to apply changes made on the primary. Monitoring this metric helps ensure that you're not serving stale data.

Memory Metrics

Memory Usage by Node

Memory usage is critical for your cache's performance. If it exceeds the total available system memory, the OS will start swapping old or unused sections of memory to disk, which degrades performance.

Memory Usage by Cluster

This graph reports the same memory usage, this time aggregated per cluster.

Available System Memory by Node

This graph tracks the remaining memory on each host. It shouldn't drop too low; otherwise the host can start using swap.

Available System Memory by Cluster

This graph displays the remaining memory at the cluster level.

Swap Usage by Cluster

This graph tracks the SwapUsage host-level metric, which increases when the system runs out of memory and the operating system starts using disk to hold data that should be in memory.

Evictions by Cluster

Evictions happen when the cache memory usage limit (maxmemory with Redis) is reached and the cache engine has to remove items to make space for new writes.

Evicting a large number of keys can decrease your hit rate, leading to higher latency. If the number of evictions is growing, you should increase your cache size by migrating to a larger node type.
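To see eviction pressure from the Redis side, a sketch like the one below can help, assuming you have network access to the cluster endpoint and the redis Python package installed. The endpoint shown is hypothetical, and a TLS-enabled cluster would also need ssl=True and credentials.

```python
# Sketch: inspect eviction pressure directly from Redis via INFO.
import redis

# Hypothetical endpoint; use your cluster's primary or configuration endpoint.
r = redis.Redis(host="my-redis.abc123.use1.cache.amazonaws.com", port=6379)

stats = r.info("stats")    # includes the cumulative evicted_keys counter
memory = r.info("memory")  # includes maxmemory and maxmemory_policy

print("Evicted keys since restart:", stats["evicted_keys"])
print("maxmemory:", memory["maxmemory"], "policy:", memory["maxmemory_policy"])

# A steadily growing evicted_keys counter means the working set no longer fits
# within maxmemory; consider migrating to a larger node type, as noted above.
```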

ElastiCache Events

This list tracks ElastiCache events, such as node addition failure or cluster creation. When correlated with ElastiCache metrics, these events will help you investigate cache cluster activity.

Monitoring AWS ElastiCache with Datadog

If you’d like to see this dashboard for your ElastiCache metrics, you can try Datadog for free. This dashboard will be populated immediately after you set up the Amazon ElastiCache integration.

For a deep dive on ElastiCache metrics and how to monitor them, check out our three-part series on how to monitor ElastiCache performance metrics.