Snowflake is a data platform that enables users to easily store, manage, analyze, and share high volumes of structured and semi-structured data. Whereas traditional data architectures often consist of multiple databases, data warehouses, and data lakes, Snowflake’s Data Cloud breaks down the silos between your different data sources and serves as a single source of truth for a wide range of simultaneous workloads.
Datadog’s integration provides full visibility into Snowflake’s architecture so you can get the most out of what it has to offer. In this post, we’ll show you how to:
- Optimize your storage usage
- Monitor warehouse performance and compute credit consumption
- Detect misconfigurations and security threats
Snowflake offers a single, persistent storage system for all of your data, regardless of where it’s coming from, how it’s structured, and what it’s needed for. Data can be staged and loaded into Snowflake in several ways, such as with the Snowflake Web UI, the Snowflake CLI, and a variety of third-party tools. This flexibility allows data to be migrated from an existing database or data lake, or continuously ingested from a stream-processing service such as Kafka.
Snowflake has several mechanisms to protect and retain data. Whenever data is modified or deleted, it is stored in Time Travel for a specified period of time. This enables users to restore databases, schemas, or tables that were altered unintentionally. After the Time Travel retention period expires, historical data enters Fail-safe, where it is kept for seven days. Fail-safe serves as the final safety net against data corruption or loss.
Your Snowflake storage cost is determined based on the average amount of data that is stored in internal stages, database tables (including database tables in Time Travel), and Fail-safe. With Datadog, you can track the number of bytes stored in each of these locations and make adjustments as needed.
Datadog anomaly monitors make it easy to detect deviations from historical patterns and can be used to monitor fluctuations in storage usage. For instance, if an anomaly monitor detects an abnormal spike in the number of bytes used in database storage (
snowflake.storage.database.storage_bytes), you may choose to configure a particular table or set of tables to be either temporary or transient, rather than permanent. Temporary and transient tables can have a Time Travel retention period of 0 days, and they are also not stored in Fail-safe. These features ensure that temporary and transient tables never incur more than a day’s worth of storage fees.
A Snowflake data warehouse consists of at least one compute cluster that is provisioned by the cloud provider of your choice and is used to execute queries, load data, and perform other DML operations. With Snowflake, every virtual data warehouse is able to access the same storage layer without competing with other warehouses for resources, which means data can be loaded and queried at the same time.
Warehouses have several configuration options that help you optimize their performance. For example, every warehouse is assigned a size, each of which contains a specific number of nodes. The size of the warehouse dictates how quickly it’s able to process queries. If the warehouse is not able to keep up with the number of queries that are coming in, those queries are placed in a queue until they can be processed.
You can use Datadog to configure a forecast monitor for the number of queries queued due to overload for any given warehouse (
snowflake.warehouse.query.queued_load), which will intelligently predict if and when the warehouse will no longer be able to perform at the desired level. If the monitor triggers, you can resize the warehouse, or, if your account has multi-cluster warehouses enabled, increase the maximum number of clusters in that warehouse to ensure that it can autoscale to accommodate the query load.
Warehouses of different sizes also operate with different amounts of memory. If a warehouse cannot accommodate an operation in memory, it starts spilling data to local storage, and then to remote storage. This spilling can dramatically degrade query performance and is best remediated by resizing the warehouse or reducing the amount of data that a particular query must process. Datadog ingests metrics that represent the daily average of both local and remote disk spilling (
snowflake.query.bytes_spilled.remote), so you can alert on spikes and respond accordingly.
Your compute cost is determined by the size of your warehouses and how long they run. Critically, warehouses only consume compute credits when they are active.
If you notice a spike in the number of credits used by your warehouses (
snowflake.billing.virtual_warehouse.sum), you may want to consider taking a closer look at query throughput (
snowflake.query.total) over time. For instance, if you see that a warehouse is consuming credits but only processing one query every 30 minutes, you may want to adjust the warehouse’s auto-suspend value to ensure that the warehouse is only active and consuming credits when it is needed. You might also consider eliminating the warehouse entirely and directing that query to another, more active warehouse.
Additionally, Datadog tags allow you to gain deeper insight into how your Snowflake compute credits are being used across your organization. You can slice and dice the total number of credits your warehouses consume by the
warehouse_name tags in order to see which teams and team members are running the most expensive workloads. This knowledge can help you make informed decisions about how Snowflake credits should be allocated in the future.
The cloud services layer, which is managed entirely by Snowflake, is often referred to as the “brain” of the system. It handles tasks such as authentication, authorization, and query optimization, and it is responsible for ensuring that all warehouses can access the most up-to-date data without experiencing performance degradation. The cloud services layer is also home to the metadata store, which enables key Snowflake capabilities such as zero copy cloning, Time Travel, and data sharing.
Because the cloud services layer is responsible for authentication, it provides us with metrics related to login history. The graph below compares the number of successful login attempts (in blue) and failures (in red). You may also want to configure a monitor to alert you to an increase in failed login attempts from any single account, user, or type of client by using tags like
client_type. A spike in failed login attempts could be indicative of a security threat or a misconfiguration.
Datadog provides crucial insight into every layer of the Snowflake architecture so you can carefully monitor and optimize your usage. And because Datadog integrates with over 500 other technologies, including Spark, Airflow, and Kafka, you can be sure that you’ll have a complete picture of all data-related activity in your system, no matter which services you use in tandem with Snowflake.