Monitor the Azure Cosmos DB integrated cache with Datadog

Author Steve Harrington
Datadog Product Manager
Author Paul Gottschling
Datadog Technical Content Writer
Author Tim Sander
Microsoft Program Manager

Published: December 7, 2021

Azure Cosmos DB is a fully managed NoSQL database that scales automatically with load and supports multiple APIs. This makes it easy to incorporate with your applications while removing the need to maintain your own database servers. Using the Cosmos DB integrated cache can help reduce costs and improve performance for Azure Cosmos DB.

The Cosmos DB integrated cache is a read-through, write-through cache that customers can place between their application and Azure Cosmos DB without needing to change their application business logic. Since Cosmos DB integrated cache hits do not consume request units (RUs), the integrated cache can offer substantial savings over using Azure Cosmos DB alone.

Datadog’s Azure Cosmos DB integration now includes metrics that can help you track the health and performance of your integrated cache and dedicated gateway. Once you enable the Azure integration, you can clone and customize the out-of-the-box Azure Cosmos DB dashboard to get deep insight into your entire storage layer.

The out-of-the-box dashboard for Azure Cosmos DB includes integrated cache metrics, and you can clone and customize it as shown here.

Optimize your cache hit rate

The Cosmos DB integrated cache is logically organized into an item cache, which stores items by ID and partition key, and a query cache, where each key is the text of a query and each value is the result set. For both the item cache and the query cache, you can track the cache hit rate to detect when Azure Cosmos DB reads tend to miss the integrated cache and go straight to the database, which adds latency and consumes RUs. You can then use related metrics to determine a course of action.

Datadog’s Azure Cosmos DB integration helps you track the hit rate for the item cache (azure.cosmosdb.integrated_cache_item_hit_rate) and query cache (azure.cosmosdb.integrated_cache_query_hit_rate). If these hit rates are too low (e.g., below 0.7–0.8), you can examine the number of items and queries evicted from the cache due to expiration (azure.cosmosdb.integrated_cache_item_expiration_count and azure.cosmosdb.integrated_cache_query_expiration_count), which happens when a cache key becomes older than the value of your MaxIntegratedCacheStaleness setting (the default is five minutes). If keys are frequently expiring due to MaxIntegratedCacheStaleness, and your application’s clients can tolerate older data, consider raising the MaxIntegratedCacheStaleness value to improve the cache hit rate.

A dashboard showing hit rate metrics for the Cosmos DB integrated cache.
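The triage described above can be sketched as a small Python function. This is a minimal illustration, not part of the Datadog integration: the function name, the expiration_threshold parameter, and its default value are hypothetical, and the 0.7 floor is simply the lower end of the range mentioned above. It assumes you have already fetched an averaged hit rate and a summed expiration count for the time window you care about.

```python
def staleness_advice(hit_rate: float,
                     expiration_count: float,
                     min_hit_rate: float = 0.7,
                     expiration_threshold: float = 100.0) -> str:
    """Suggest a next step for one cache (item or query) over a time window.

    hit_rate: averaged value of a hit rate metric such as
        azure.cosmosdb.integrated_cache_item_hit_rate, as a fraction (0.0-1.0).
    expiration_count: summed value of the matching expiration metric, e.g.
        azure.cosmosdb.integrated_cache_item_expiration_count.
    expiration_threshold: hypothetical cutoff for "frequently expiring";
        tune it to your own traffic.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be a fraction between 0 and 1")

    if hit_rate >= min_hit_rate:
        return "ok"
    if expiration_count >= expiration_threshold:
        # Only appropriate if the application's clients can tolerate older data.
        return "consider raising MaxIntegratedCacheStaleness"
    return "look at other causes, such as cache memory"


if __name__ == "__main__":
    print(staleness_advice(hit_rate=0.55, expiration_count=420))
```

The same function works for the query cache if you pass in the query hit rate and query expiration count instead.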

You may also be able to improve your cache hit rate by scaling up your dedicated gateway, which will increase the amount of memory available for caching. This allows more keys to populate the cache before it needs to start evicting keys.

Rightsize your dedicated gateway cluster

Azure gives customers control over the compute resources used by the Cosmos DB integrated cache by running the cache on a dedicated gateway. The dedicated gateway is a cluster of nodes that users can configure to scale as needed.

Rightsizing your dedicated gateway allows you to ensure that the integrated cache can process requests without running into bottlenecks. If the query and item cache hit rate metrics are low, and many items are getting evicted (azure.cosmosdb.integrated_cache_evicted_entries_size), you may need to increase the size of your dedicated gateway nodes.

Besides scaling your gateway nodes vertically, you can add more nodes to your dedicated gateway cluster. If the request throughput of the dedicated gateway (azure.cosmosdb.dedicated_gateway_requests) falls short of what your application’s traffic requires, adding nodes increases the volume of requests that the cluster can handle.

A dashboard showing Azure Cosmos DB dedicated gateway request throughput and memory utilization.
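As a rough sketch, the two rightsizing signals in this section can be expressed as a single check. Everything here is illustrative: the function name, the thresholds, and the required_requests input (the request throughput your application needs, which you would have to estimate yourself) are assumptions, not values published by Azure or Datadog. The sketch also assumes that a low hit rate on either cache counts as "low."

```python
from typing import List

SCALE_UP = "increase dedicated gateway node size"
SCALE_OUT = "add dedicated gateway nodes"


def gateway_sizing_advice(item_hit_rate: float,
                          query_hit_rate: float,
                          evicted_entries_size: float,
                          gateway_requests: float,
                          required_requests: float,
                          min_hit_rate: float = 0.7,
                          eviction_threshold: float = 0.0) -> List[str]:
    """Return rightsizing suggestions for a dedicated gateway cluster.

    evicted_entries_size: value of
        azure.cosmosdb.integrated_cache_evicted_entries_size for the window.
    gateway_requests: value of azure.cosmosdb.dedicated_gateway_requests.
    required_requests: hypothetical estimate of the throughput your
        application needs over the same window.
    """
    advice: List[str] = []

    # Assumption: either cache dropping below the floor counts as a low hit rate.
    low_hit_rate = min(item_hit_rate, query_hit_rate) < min_hit_rate
    if low_hit_rate and evicted_entries_size > eviction_threshold:
        # Entries are being evicted while reads miss: more memory per node may help.
        advice.append(SCALE_UP)

    if gateway_requests < required_requests:
        # The cluster is handling fewer requests than the application needs.
        advice.append(SCALE_OUT)

    return advice


if __name__ == "__main__":
    print(gateway_sizing_advice(0.5, 0.6, 2048.0, 800.0, 1000.0))
```

An empty list means neither signal fired for the window you checked.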

Evaluating the benefits of the Azure Cosmos DB integrated cache

The Azure Cosmos DB integrated cache can help optimize the cost of running some workloads, but not all types of workloads will see improvements. In general, read-heavy workloads are better candidates than write-heavy workloads, since the benefits of the integrated cache are entirely scoped to reads. Furthermore, workloads that need to execute many repeated point reads and queries that consume a large number of request units are likely to see the most significant improvements after adopting the integrated cache, since they can achieve a high cache hit rate.

Datadog’s Azure Cosmos DB integration can help you evaluate the cost benefits of using the Cosmos DB integrated cache. Once you’ve provisioned a dedicated gateway and updated your connection string, Datadog’s Azure Cosmos DB dashboard can help determine the associated cost savings by tracking changes in the total RU consumption (azure.cosmosdb.total_request_units). Repeated point reads and queries that hit the cache won’t use any RUs, so if you can achieve a high cache hit rate, you’ll observe a substantial drop in RU consumption. In the example below, you can see a significant reduction in the number of RUs consumed, indicating that this particular workload is probably a good candidate for the integrated cache.

A graph showing drop in Azure Cosmos DB request units consumed after deployment of the integrated cache.
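To put a number on a drop like this one, you could compare average RU consumption before and after enabling the integrated cache. The following is a minimal sketch; the function name is hypothetical, and it assumes you have exported samples of azure.cosmosdb.total_request_units for two comparable time windows.

```python
from statistics import mean
from typing import Sequence


def ru_drop_fraction(before: Sequence[float], after: Sequence[float]) -> float:
    """Fractional drop in average RU consumption between two windows.

    before / after: samples of azure.cosmosdb.total_request_units taken
    before and after the integrated cache was deployed.
    Returns 0.6 for a 60% drop; a negative value means consumption rose.
    """
    baseline = mean(before)
    if baseline <= 0:
        raise ValueError("baseline RU consumption must be positive")
    return (baseline - mean(after)) / baseline


if __name__ == "__main__":
    drop = ru_drop_fraction([1000, 1200, 800], [300, 500, 400])
    print(f"RU consumption dropped by {drop:.0%}")
```

Comparing windows with similar traffic (for example, the same weekday and hours) keeps the result from being skewed by normal load variation.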

If the drop in RU consumption allows you to reduce your provisioned throughput, the integrated cache is likely saving money. In order to calculate the precise benefits of adopting the integrated cache for your workload, you should compare the cost savings of reducing your Azure Cosmos DB provisioned throughput to the cost of provisioning dedicated gateway nodes.
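The comparison above comes down to simple arithmetic, sketched here in Python. The prices in the example are placeholders, not actual Azure rates, and the function and parameter names are hypothetical. The sketch assumes provisioned throughput is priced per 100 RU/s per hour and that each dedicated gateway node has a flat hourly price; check current Azure pricing for your region before relying on any result.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def net_monthly_savings(ru_per_sec_reduction: float,
                        price_per_100_ru_hour: float,
                        gateway_nodes: int,
                        gateway_node_price_hour: float,
                        hours: float = HOURS_PER_MONTH) -> float:
    """Estimated monthly savings from the integrated cache, net of gateway cost.

    ru_per_sec_reduction: how much provisioned throughput (RU/s) you can remove.
    price_per_100_ru_hour: hourly price per 100 RU/s (placeholder input).
    gateway_nodes / gateway_node_price_hour: size and hourly price of the
        dedicated gateway cluster (placeholder inputs).
    A positive result means the cache saves money; negative means it costs more.
    """
    throughput_savings = (ru_per_sec_reduction / 100) * price_per_100_ru_hour * hours
    gateway_cost = gateway_nodes * gateway_node_price_hour * hours
    return throughput_savings - gateway_cost


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    estimate = net_monthly_savings(
        ru_per_sec_reduction=10_000,
        price_per_100_ru_hour=0.008,
        gateway_nodes=1,
        gateway_node_price_hour=0.50,
    )
    print(f"Estimated net monthly savings: {estimate:.2f}")
```

If the result is negative, the dedicated gateway costs more than the throughput reduction saves, and the workload may not be a good fit for the integrated cache at its current size.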

Azure Cosmos DB diagnostic logs can also help you assess your application’s repeated query volume and total RUs spent on reads. You can also configure your logs to include full query text, which will enable you to analyze trends, including specific queries’ RU consumption and the number of repeated identical queries. Our embedded Azure integration provides a single-click process to send diagnostic logs to Datadog, where you can visualize, analyze, and correlate them with Azure Cosmos DB metrics and other data from your stack to get comprehensive insights.

Monitor your Azure Cosmos DB account

Datadog’s Azure Cosmos DB integration gives you full visibility into the integrated cache, helping you optimize your savings and read latency. You can also use APM and Log Management alongside this integration to get even deeper visibility into the interaction between your applications and Azure Cosmos DB. Datadog integrates with all Azure services, so you can easily monitor your entire cloud environment.

Not yet a Datadog customer? Sign up for a free trial to get all of your Azure metrics and logs flowing into Datadog in just minutes.