HashiCorp Vault is a tool for managing secrets: sensitive data such as passwords, certificates, and API keys. Vault allows you to encrypt your secrets, control access to them, and audit activity to see who has requested data from your Vault. Datadog already monitors the status of your Vault servers. For example, you can configure the Vault integration to automatically notify you if a Vault server is unexpectedly sealed, or if there is a leader change in your Vault cluster. And now Datadog captures Vault metrics, too, so you can visualize Vault’s performance, resource utilization, throughput, and more at a glance.
Vault emits metrics that describe its performance and resource utilization, as well as its volume of activity. To securely collect Vault metrics in Datadog, you should configure the Agent with a Vault client token. The example configuration file shows you how to use the client_token and client_token_path options to include a token or point the Agent to a file that contains one. To collect metrics without a client token, you’ll need to enable the no_token option.
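To make the three token choices concrete, here is a minimal Python sketch that builds a request for Vault’s metrics endpoint. It is not the Agent’s implementation: build_metrics_request is a hypothetical helper whose parameters only borrow the names of the integration’s options, and it assumes the token path points to a file holding the token. It relies on Vault’s sys/metrics API endpoint and the X-Vault-Token request header.

```python
import urllib.request
from pathlib import Path
from typing import Optional


def build_metrics_request(
    api_url: str,
    client_token: Optional[str] = None,
    client_token_path: Optional[str] = None,
    no_token: bool = False,
) -> urllib.request.Request:
    """Build (but do not send) a request for Vault's Prometheus-format metrics.

    Hypothetical helper: the parameter names mirror the integration's options
    for illustration only.
    """
    url = api_url.rstrip("/") + "/sys/metrics?format=prometheus"
    request = urllib.request.Request(url)

    if no_token:
        # Unauthenticated request; Vault must be configured to allow this.
        return request

    if client_token is None and client_token_path is not None:
        # Assumption: the path is a file whose content is the token.
        client_token = Path(client_token_path).read_text().strip()

    if not client_token:
        raise ValueError("provide a token, a token path, or set no_token=True")

    request.add_header("X-Vault-Token", client_token)
    return request
```

To send the request, you could pass the returned object to urllib.request.urlopen. The helper deliberately stops short of that so you can inspect the URL and headers first.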
For every secret that Vault creates, it creates a lease that contains metadata about the secret such as its time to live (TTL) and whether the secret can be renewed. Monitoring your lease metrics can help you understand how often your secrets are being used.
Vault will automatically revoke the lease on a secret when its TTL is up, and Vault operators can manually revoke a lease. When a lease is revoked, the data in the associated object is invalidated, and its secret can no longer be used. Monitoring the current number of Vault leases can help you spot trends in the overall level of activity of your Vault server.
The graph below shows the vault.vault.expire.num_leases metric, reflecting Vault’s current number of leases. A rise in this metric could signal a spike in traffic to your application, whereas an unexpected drop could mean Vault can’t access secrets from the storage backend quickly enough to serve traffic.
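If you want to reason about what counts as a rise or a drop, the following Python sketch compares the latest lease count with the average of recent samples. The function name, the 50 percent tolerance, and the sample values are all assumptions chosen for illustration, not thresholds recommended by Vault or Datadog.

```python
from statistics import mean
from typing import Sequence


def classify_lease_change(
    history: Sequence[float], latest: float, tolerance: float = 0.5
) -> str:
    """Label the latest lease count relative to the mean of recent samples.

    Returns "rise", "drop", "steady", or "unknown" (when there is no history).
    The default tolerance of 0.5 (a 50 percent change) is an arbitrary example.
    """
    if not history:
        return "unknown"

    baseline = mean(history)
    if baseline == 0:
        return "rise" if latest > 0 else "steady"

    change = (latest - baseline) / baseline
    if change >= tolerance:
        return "rise"
    if change <= -tolerance:
        return "drop"
    return "steady"
```

For example, with recent counts of 100, 110, and 90 leases, a new reading of 200 is classified as a rise and a reading of 40 as a drop.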
Whenever Vault authenticates a client (e.g., a user or a service), it issues a token. The client must provide the token with each request it makes to Vault. As long as a client provides an unexpired token (i.e., one whose TTL has not run out), Vault considers the client to be authenticated.
Datadog collects metrics that track how efficiently Vault is performing token authentication. For example, you can monitor the vault.vault.core.handle.login_request.quantile metric for a measure of how quickly Vault is authenticating incoming client requests. If your Vault clients can’t authenticate quickly, they may be slow in responding to requests, so this can be a valuable metric for troubleshooting latency in your application.
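Vault reports these quantiles for you, so you don’t need to compute them yourself. Still, it can help to see what a quantile measures. The sketch below uses the nearest-rank method on a list of login latencies; the function name and the sample values are made up for illustration, and Vault’s own aggregation may use a different method.

```python
import math
from typing import Sequence


def nearest_rank_percentile(samples_ms: Sequence[float], percentile: float) -> float:
    """Return the nearest-rank percentile of a set of latency samples."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")

    ordered = sorted(samples_ms)
    # Nearest rank: the smallest rank covering `percentile` percent of samples.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up login latencies in milliseconds, with one slow outlier.
SAMPLE_LATENCIES_MS = [12, 15, 11, 14, 250, 13, 16, 12, 15, 14]
```

With these samples the median is 14 ms while the 99th percentile is 250 ms, which is why a high quantile surfaces slow logins that an average would hide.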
You can also graph metrics that track the volume of Vault’s token activity, such as the number of token lookups and how many tokens are being created and renewed. This can help you understand how busy your system is overall.
Vault supports several storage backends for storing the encrypted data that Vault manages. The out-of-the-box Vault dashboard includes graphs visualizing the performance of a Consul backend, but if you’re using any of the other supported storage backends—such as etcd, S3, or Cassandra—you can easily clone and customize the dashboard to graph metrics from your preferred backend.
It’s important to monitor the performance of Vault’s storage backend so you know that your storage infrastructure is properly resourced and performing well.
If you see an increase in the values graphed in the Storage Backend section of the dashboard, it means Vault is spending more time accessing the backend to get, put, list, or delete items in its durable storage. This indicates that your users could be experiencing latency due to storage bottlenecks. You can create alerts to automatically notify your team if Vault’s access to the storage backend is slowing down. This can give you a chance to remediate the problem—for example, by moving to disks with higher I/O throughput—before rising latency affects the experience of your application’s users.
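One common way to keep such an alert from firing on a single slow request is to require several consecutive slow readings. The Python sketch below shows that logic only; it is not how Datadog monitors are evaluated, and the function name, threshold, and streak length are assumptions you would tune for your own backend.

```python
from typing import Sequence


def breaches_sustained(
    latencies_ms: Sequence[float], threshold_ms: float, min_consecutive: int = 3
) -> bool:
    """Return True if latency exceeds the threshold for enough readings in a row."""
    streak = 0
    for value in latencies_ms:
        # Extend the streak on a slow reading, reset it on a healthy one.
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= min_consecutive:
            return True
    return False
```

With a 25 ms threshold, the readings 5, 6, 30, 31, 32, 7 trigger the check, while 5, 30, 6, 31, 7, 32 do not, because the slow readings never occur three times in a row.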
Datadog’s Vault integration collects metrics that you can use to visualize Vault’s resource usage. In the screenshot below, GC Time shows the amount of time per sampling period Vault has paused to allow for garbage collection. The Allocated memory graph shows trends in the maximum amount of memory each host allocates to its Vault server. Correlating this with the available resources of that server’s host can inform you if Vault is at risk of running out of memory. For example, if a server’s allocated memory rises above 90 percent of its host’s available RAM, you should provision more memory.
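The 90 percent rule of thumb above is simple arithmetic, sketched here in Python. The function name and the byte values are illustrative assumptions; in practice you would read both numbers from your monitoring data.

```python
from typing import Tuple

GIB = 1024 ** 3


def memory_pressure(
    allocated_bytes: int, host_ram_bytes: int, limit: float = 0.90
) -> Tuple[float, bool]:
    """Return the share of host RAM that Vault has allocated and whether it exceeds the limit."""
    if host_ram_bytes <= 0:
        raise ValueError("host_ram_bytes must be positive")
    ratio = allocated_bytes / host_ram_bytes
    return ratio, ratio > limit
```

For example, a server that has allocated 15 GiB on a host with 16 GiB of RAM is at 93.75 percent, above the 90 percent mark, while 8 GiB on the same host is at 50 percent.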
You can gain an even deeper understanding of Vault’s activity by collecting and analyzing logs from your Vault servers. Once you’ve configured the Datadog Agent to collect Vault logs, you can click any point on a Vault metric graph to view related logs in the Datadog Log Explorer. For example, if you notice any unusual activity, you can immediately pivot to logs for more context to begin troubleshooting.
Datadog automatically tags your logs with metadata from your host and your cloud provider, making it easy for you to filter and analyze logs from Vault and any of the other technologies you’re monitoring.
The screenshot below shows how you can use the service attribute to filter your logs. In this example, we’re showing only logs related to the primaryvault service and filtering out logs from the secondaryvault service. The highlighted log indicates that the Vault is sealed, which could mean that your secrets management service is unavailable, preventing your services from communicating with one another.
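In the Log Explorer, this kind of filter is expressed as a search such as service:primaryvault. The same idea can be sketched in Python over a list of log records. The record shape and the messages below are invented for illustration and are not real Vault log output.

```python
from typing import Dict, Iterable, List


def filter_by_service(logs: Iterable[Dict[str, str]], service: str) -> List[Dict[str, str]]:
    """Keep only the log records whose service attribute matches."""
    return [log for log in logs if log.get("service") == service]


# Invented records for illustration only.
SAMPLE_LOGS = [
    {"service": "primaryvault", "message": "example: vault is sealed"},
    {"service": "secondaryvault", "message": "example: request handled"},
    {"service": "primaryvault", "message": "example: request handled"},
]
```

Calling filter_by_service(SAMPLE_LOGS, "primaryvault") returns the two primaryvault records and leaves out the secondaryvault one.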
It’s easy to customize the built-in Vault dashboard to include all the Vault metrics you need to monitor; see our documentation for a full list of available metrics.
Datadog integrates with more than 400 technologies, so you can monitor Vault side-by-side with the other technologies in your stack. If you’re not already using Datadog, start today by signing up for a free 14-day trial.