What is Chef?
Chef is an open source “infrastructure as code” framework that helps you automate your infrastructure and applications. Chef is designed to manage scalable, dynamic infrastructure in any environment.
Chef dashboard overview
As a central component in your infrastructure, Chef should be closely monitored to ensure that the automation features you depend on are working correctly and efficiently. Connecting the Chef framework to Datadog enables you to:
- receive real-time reports on Chef client runs
- track key Chef performance metrics across all your servers
- quickly identify infrastructure issues and resolve them with your team
Below is an example of the customizable Chef dashboard in Datadog, which helps you visualize Chef metrics in different ways. Even if you’re not a Datadog user, this example can act as a template when assembling your own comprehensive Chef monitoring dashboard.
Read on for a widget-by-widget breakdown of the metrics in this template Chef dashboard.
Key Chef metrics to monitor
Resources updated, past day (%)
The chef.resources.updated metric tracks how many resources are updated in each Chef run. This graph compares it with the chef.resources.total metric, which tracks the total count of managed resources, to display the percentage of resources updated per run.
If the percentage of resources updated per run starts to climb, that may point to changes rippling through your infrastructure.
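The percentage in this widget is a simple ratio of the two metrics. Here is a minimal sketch in Python, assuming you already have the per-run values of chef.resources.updated and chef.resources.total in hand. The function name and the sample numbers are hypothetical and not part of any Chef or Datadog API.

```python
def percent_resources_updated(updated: int, total: int) -> float:
    """Return the share of managed resources updated in one Chef run, as a percentage."""
    if total <= 0:
        # A run that manages no resources has nothing to update.
        return 0.0
    return 100.0 * updated / total


# Hypothetical values for chef.resources.updated and chef.resources.total from one run.
print(percent_resources_updated(12, 480))  # 2.5
```

Guarding against a zero total keeps the calculation safe for nodes with an empty run list.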
Chef runs, past day
In the Chef framework, every node that Chef manages has a locally installed agent called the chef-client. A client run is intended to bring the node to its desired state, as specified in a Chef “recipe.” Client runs and failures appear as events in Datadog.
The “Chef runs” widget tracks the number of Chef runs in the past day by aggregating successful and failed run events into a running timeline. The number of Chef runs should be fairly predictable, scaling with the number of nodes.
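If you are building a similar timeline yourself, the aggregation amounts to counting run events per time bucket and status. The sketch below assumes each event is a (unix_timestamp, status) pair. That shape, the function name, and the sample data are illustrative assumptions, not the Datadog event format.

```python
from collections import Counter
from datetime import datetime, timezone


def runs_per_hour(events):
    """Count Chef run events per (hour, status) bucket.

    `events` is an iterable of (unix_timestamp, status) pairs, where status
    is "success" or "failure". This event shape is an assumption made for
    the sketch, not the Datadog event format.
    """
    counts = Counter()
    for timestamp, status in events:
        # Truncate each timestamp to the start of its UTC hour.
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour, status)] += 1
    return counts


# Hypothetical sample: three runs ten minutes apart, one of them failed.
sample = [(1700000000, "success"), (1700000600, "success"), (1700001200, "failure")]
for (hour, status), n in sorted(runs_per_hour(sample).items()):
    print(hour.isoformat(), status, n)
```

Keeping successes and failures in separate buckets makes it easy to plot both series on the same timeline and to spot a rising failure share.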
Chef failures, past day
This event widget displays all failed Chef runs in a browsable stream. If you see an increased error rate on your Chef run timeline, you can use the stream of individual failure events to dive in quickly and find more details about why the run failed.
Avg/worst execution time, past day(s)
The chef.resources.elapsed_time metric tracks the total time elapsed during a Chef run (in seconds). This timeseries graph displays the changes in average and worst-case execution times (the time it takes to cycle through all the steps in a Chef run) over the past day. The dashboard also displays an Avg execution time widget for a more recent view of Chef run times.
Wild swings in execution time may point to network issues.
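The two series in this graph reduce to a mean and a maximum over the runs in each time window. A minimal Python sketch, assuming you have a list of chef.resources.elapsed_time samples for one window (the function name and sample values are hypothetical):

```python
from statistics import mean


def execution_time_summary(elapsed_seconds):
    """Return (average, worst) run time in seconds for a list of Chef runs."""
    if not elapsed_seconds:
        raise ValueError("need at least one run")
    return mean(elapsed_seconds), max(elapsed_seconds)


# Hypothetical chef.resources.elapsed_time samples, in seconds.
avg, worst = execution_time_summary([40.0, 50.0, 90.0])
print(avg, worst)  # 60.0 90.0
```

Tracking the worst case alongside the average helps surface a single slow node that an average alone would smooth over.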
Monitoring Chef with the Datadog Dashboard
If you’d like to start tracking your Chef metrics in this pre-built dashboard, you can try Datadog for free for 14 days. The dashboard will appear and begin populating with metrics and events as soon as you set up the Chef integration.