Snowflake is a data platform that enables users to easily store, manage, analyze, and share high volumes of structured and semi-structured data. Whereas traditional data architectures often consist of multiple databases, data warehouses, and data lakes, Snowflake’s Data Cloud breaks down the barriers between siloed data sources and serves as a single source of truth for a wide range of simultaneous workloads.
Datadog’s integration provides full visibility into Snowflake’s architecture so that you can get the most out of what it has to offer, whether you’re managing a single account or overseeing Snowflake usage throughout your entire organization. In this post, we’ll show you how to:
- Optimize your storage usage
- Monitor warehouse performance and compute credit consumption
- Detect misconfigurations and security threats
Snowflake offers a single, persistent storage system for all of your data, regardless of where it’s coming from, how it’s structured, and what it’s needed for. Data can be staged and loaded into Snowflake in several ways, such as with the Snowflake Web UI, the Snowflake CLI, and a variety of third-party tools. This flexibility allows data to be migrated from an existing database or data lake, or continuously ingested from a stream-processing service such as Kafka.
Snowflake has several mechanisms to protect and retain data. Whenever data is modified or deleted, it is stored in Time Travel for a specified period of time. This preservation of data enables users to restore databases, schemas, or tables that may have been unintentionally altered. After the Time Travel retention period expires, historical data enters Fail-safe, where it is kept for seven days. Fail-safe serves as the final safety net against data corruption or loss.
Your Snowflake storage costs are based on the average amount of data that is stored in your internal stages, database tables (including database tables in Time Travel), and Fail-safe. With Datadog, you can track the number of bytes stored in each of these locations and make adjustments as needed.
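As a rough illustration of how average stored bytes translate into cost, here's a minimal sketch in Python. The per-terabyte rate is an assumption for illustration only; actual rates vary by cloud provider, region, and whether you're on on-demand or capacity pricing:

```python
def monthly_storage_cost(avg_bytes: float, usd_per_tb: float = 40.0) -> float:
    """Estimate monthly Snowflake storage cost from average bytes stored.

    usd_per_tb is an assumed on-demand rate; check your own contract
    and region for the real figure.
    """
    tb = avg_bytes / 1024**4  # Snowflake bills storage per terabyte
    return tb * usd_per_tb

# e.g., 2.5 TB average across stages, tables, Time Travel, and Fail-safe
print(round(monthly_storage_cost(2.5 * 1024**4), 2))  # → 100.0
```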
By default, our integration is tailored to monitor individual Snowflake accounts, but it can also be configured to collect metrics at an organizational level (i.e., across multiple related accounts), or at both the account and organizational levels simultaneously. Two key organizational metrics are snowflake.organization.storage.credits, which measures the total number of credits consumed for data storage across all of the Snowflake accounts in your organization, and snowflake.organization.storage.average_bytes, which measures the average usage of database storage throughout those accounts.
Datadog anomaly monitors make it easy to detect deviations from historical patterns and can be used to monitor fluctuations in storage usage. For instance, if an anomaly monitor detects an abnormal spike in the number of bytes used in database storage (snowflake.storage.database.storage_bytes), you may choose to configure a particular table or set of tables to be either temporary or transient, rather than permanent. Temporary and transient tables can have a Time Travel retention period of 0 days, and they are also not stored in Fail-safe. These features help ensure that temporary and transient tables never incur more than a day's worth of storage fees.
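If you'd rather define that anomaly monitor programmatically, a sketch like the following builds a request body you could POST to Datadog's `/api/v1/monitor` endpoint. The evaluation window, anomaly algorithm, and bounds here are illustrative defaults, not the only valid choices:

```python
def anomaly_monitor_payload(metric: str, window: str = "last_4h",
                            deviations: int = 2) -> dict:
    # Query syntax follows Datadog's anomaly monitors; 'basic' and the
    # deviation bound are illustrative choices.
    query = (f"avg({window}):anomalies(avg:{metric}{{*}}, 'basic', "
             f"{deviations}) >= 1")
    return {
        "name": f"Anomalous change in {metric}",
        "type": "query alert",
        "query": query,
        "message": "Storage usage deviated from its historical pattern.",
        "options": {"threshold_windows": {"trigger_window": window,
                                          "recovery_window": window}},
    }

payload = anomaly_monitor_payload("snowflake.storage.database.storage_bytes")
```

Submitting this payload with your API and application keys creates the monitor; the same structure works from Terraform or the official Datadog API clients.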
A Snowflake data warehouse consists of at least one compute cluster that is provisioned by a cloud provider of your choice and is used to execute queries, load data, and perform other DML operations. With Snowflake, every virtual data warehouse is able to access the same storage layer without competing with other warehouses for resources, which means data can be loaded and queried at the same time.
Warehouses have a number of configuration options that help you optimize their performance. For example, the size parameter defines the amount of compute resources available per cluster in a given warehouse and thus the speed at which that warehouse can process queries. If a warehouse is not able to keep up with the volume of queries it receives, those queries are placed in a queue until they can be processed.
The size assigned to a warehouse also dictates its memory capacity. If a warehouse cannot accommodate an operation in memory, it starts spilling data to local storage, and then to remote storage. This spilling can dramatically degrade query performance and is best remediated by resizing the warehouse or reducing the amount of data that a particular query must process. Datadog ingests metrics that represent the daily average of both local and remote disk spilling (snowflake.query.bytes_spilled.local and snowflake.query.bytes_spilled.remote), so you can alert on spikes and respond accordingly.
For any given warehouse, you can also use Datadog to configure a forecast monitor for the number of queries queued as a result of overload (snowflake.warehouse.query.queued_load). This monitor will intelligently predict if and when the warehouse will no longer be able to perform at the desired level. If the monitor triggers, you can resize the warehouse, or (if your account has multi-cluster warehouses enabled) increase the maximum number of clusters in that warehouse to ensure that it can autoscale to accommodate the query load.
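A forecast monitor can be built the same way. The sketch below assumes the integration tags this metric with a warehouse tag, and the Slack handle in the message is a placeholder; the linear algorithm and one-week horizon are illustrative:

```python
def forecast_monitor_payload(metric: str, warehouse: str,
                             threshold: float) -> dict:
    # Query syntax follows Datadog's forecast monitors; 'linear' and
    # next_1w are illustrative choices, and the warehouse tag key is
    # assumed here.
    query = (f"max(next_1w):forecast(avg:{metric}"
             f"{{warehouse:{warehouse}}}, 'linear', 1) >= {threshold}")
    return {
        "name": f"Forecasted query queueing on {warehouse}",
        "type": "query alert",
        "query": query,
        # @slack-data-eng is a hypothetical notification handle
        "message": (f"@slack-data-eng {warehouse} is predicted to exceed its "
                    "queued-query budget. Consider resizing it or raising its "
                    "cluster maximum."),
    }
```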
Your Snowflake compute cost is determined by the size of each of your warehouses and how long they run. Critically, warehouses only consume compute credits when they are active.
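Snowflake's published per-hour credit rates make this math easy to sketch. The helper below assumes standard (non-Snowpark-optimized) warehouses and Snowflake's per-second billing with a 60-second minimum each time a warehouse resumes:

```python
# Credits billed per hour by warehouse size (Snowflake's published rates
# for standard warehouses).
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16,
                    "2XL": 32, "3XL": 64, "4XL": 128}

def credits_used(size: str, active_seconds: float, clusters: int = 1) -> float:
    """Estimate credits consumed by a warehouse while it is active."""
    billable = max(active_seconds, 60)  # 60-second minimum per resume
    return CREDITS_PER_HOUR[size] * clusters * billable / 3600

print(credits_used("M", 1800))  # a Medium warehouse active for 30 minutes → 2.0
```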
If you notice a spike in the number of credits used by your warehouses (snowflake.billing.virtual_warehouse.sum), you may want to take a closer look at query throughput (snowflake.query.total) over time. For instance, if you see that a warehouse is consuming credits but only processing one query every 30 minutes, you may want to adjust the warehouse's auto-suspend value to ensure that the warehouse is only active and consuming credits when it is needed. You might also consider eliminating the warehouse entirely and directing those queries to another, more active warehouse.
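Adjusting auto-suspend is a one-line DDL change. A tiny helper like this (the function name is ours) emits the statement to run with your usual Snowflake client:

```python
def set_auto_suspend(warehouse: str, seconds: int) -> str:
    # Emits Snowflake DDL that suspends the warehouse after `seconds`
    # of inactivity, so an idle warehouse stops consuming credits sooner.
    return f"ALTER WAREHOUSE {warehouse} SET AUTO_SUSPEND = {seconds}"

print(set_auto_suspend("reporting_wh", 120))
```

Lowering the value trades a little resume latency (and a cold local cache) for fewer idle credits, so tune it to each warehouse's query cadence.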
Collecting organizational metrics allows you to monitor credit usage across any number of associated accounts. You can measure overall usage throughout your organization (snowflake.organization.storage.credits), track precisely how and where credits are being used across accounts (snowflake.organization.warehouse.total_credit.avg), and see how credits are used across warehouses (snowflake.organization.warehouse.virtual_warehouse.avg) and cloud services. You can also review your contract information (snowflake.organization.contract.amount) and track your available balance (snowflake.organization.balance.capacity) and effective rates (snowflake.organization.rate.effective_rate) based on your organization's credit usage.
Datadog tags offer even deeper insight into how your Snowflake compute credits are being used across your organization. You can slice and dice the total number of credits your warehouses consume by tags such as warehouse_name in order to see which teams and team members are running the most expensive workloads.
Monitoring at an organizational level and parsing your metrics with tags can facilitate an in-depth understanding of Snowflake at any scale, helping you make informed decisions about usage and how and where to allocate credits in the future.
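For example, if you export per-warehouse credit datapoints from the Datadog API, grouping them by the warehouse_name tag takes only a few lines. The sample datapoints below are invented for illustration:

```python
from collections import defaultdict

# Toy per-warehouse credit datapoints, shaped like what you might
# export from the Datadog API; all names and values are illustrative.
points = [
    {"warehouse_name": "etl_prod", "credits": 12.5},
    {"warehouse_name": "bi_adhoc", "credits": 3.0},
    {"warehouse_name": "etl_prod", "credits": 7.5},
]

by_warehouse = defaultdict(float)
for p in points:
    by_warehouse[p["warehouse_name"]] += p["credits"]

print(dict(by_warehouse))  # → {'etl_prod': 20.0, 'bi_adhoc': 3.0}
```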
The cloud services layer, which is managed entirely by Snowflake, is often referred to as the “brain” of the system. It handles tasks such as authentication, authorization, and query optimization, and it is responsible for ensuring that all warehouses can access the most up-to-date data without experiencing performance degradation. The cloud services layer is also home to the metadata store, which enables key Snowflake capabilities such as zero-copy cloning, Time Travel, and data sharing.
Because the cloud services layer is responsible for authentication, it provides us with metrics related to login history. The graph below compares the number of successful login attempts (in blue) and failures (in red). You may also want to configure a monitor to alert you to an increase in failed login attempts from any single account, user, or type of client by using tags like client_type. A spike in failed login attempts could be indicative of a security threat or a misconfiguration.
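Such a monitor might be sketched as follows; the metric name is our assumption based on the integration's login metrics, and the threshold and evaluation window are illustrative:

```python
def failed_login_monitor(metric: str = "snowflake.logins.fail.count",
                         threshold: int = 10) -> dict:
    # The metric name is assumed from the Datadog Snowflake integration's
    # login metrics; the 15-minute window and threshold are illustrative.
    return {
        "name": "Spike in failed Snowflake logins",
        "type": "query alert",
        "query": (f"sum(last_15m):sum:{metric}{{*}} by "
                  f"{{client_type}} >= {threshold}"),
        "message": ("Investigate possible credential misuse or a "
                    "misconfigured client."),
    }

monitor = failed_login_monitor()
```

Grouping the query by client_type means the monitor alerts separately for each type of client, so one noisy misconfigured driver doesn't mask a genuine attack elsewhere.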
Datadog provides crucial insight into every layer of the Snowflake architecture so that you can carefully monitor and optimize your usage, or that of your entire organization. And because Datadog integrates with over 500 other technologies, including Spark, Airflow, and Kafka, you can be sure that you’ll have a complete picture of all data-related activity in your system, no matter which services you use in tandem with Snowflake.