Centrally govern and remotely manage Datadog Agents at scale with Fleet Automation

Author Vignesh Palaniappan
Author Neha Julka

Published: November 20, 2023

As you scale to thousands of hosts and deploy increasingly complex applications, it can be difficult to ensure that every host is configured to give you the visibility you need to monitor your infrastructure and applications. To maintain that visibility across a growing number of hosts, you need to know that your observability strategy is implemented uniformly across the entire fleet of Datadog Agents installed on them. This requires a central view into Agent configuration that all users can rely on as a source of truth.

Fleet Automation allows teams to efficiently govern and manage observability components such as Datadog Agents at scale, ensuring that everyone across your organization has continuous visibility into their dynamic environments. In this post, we’ll show you how Fleet Automation centralizes Agent management and provides actionable visibility into the health of your Agents.

Centralize and simplify Agent management

In many organizations, a small team of SREs is responsible for configuring and maintaining a massive fleet of Datadog Agents. To confirm a configuration change after a deployment or to ensure consistency across the fleet, they typically need to perform the time-consuming task of logging into hosts to view Agent configurations. To avoid this bottleneck, you could provide host-level access to more engineers, but this can lead to complex permissions or insufficient security. Fleet Automation gives you a centralized view into the configuration of all of your Agents, allowing you to ensure configuration consistency without logging into each host to review its configuration.

In the screenshot below, Fleet Automation shows an overview of the fleet status, including Agents that are eligible for an upgrade, Agents with integration issues, and hosts running without an Agent. In this example, the view is filtered to show hosts owned by the containers team that have one or more unconfigured integrations. A filter like this highlights opportunities for each team to gain greater visibility by configuring all available integrations.

The Fleet Automation view shows a list of Agents that have unconfigured integrations and are managed by the containers team.

You can use facets to filter your data even further, for example to view Agents based on their Agent version or enabled integrations. You can also filter to find Agents using specific API keys, which helps you follow security best practices such as rotating keys across a fleet of Agents and disabling unused keys. And if your Agents are running custom checks, you can use the Custom Checks Status facet to find checks that are incompatible with the latest Agent version and need to be migrated before you upgrade.
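To make the idea of faceting concrete, here is a minimal Python sketch that groups a small inventory by one attribute at a time. It does not call any Datadog API. The `AgentRecord` type, its field names, and the sample values are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRecord:
    # Hypothetical inventory record; field names are illustrative only.
    hostname: str
    agent_version: str
    api_key_suffix: str  # last characters of the key, never the full secret


def group_by(records, field):
    """Group hostnames by the value of one attribute, much like a facet."""
    groups = defaultdict(list)
    for record in records:
        groups[getattr(record, field)].append(record.hostname)
    return dict(groups)


# Made-up sample data.
fleet = [
    AgentRecord("web-1", "7.48.0", "a1b2"),
    AgentRecord("web-2", "7.49.1", "a1b2"),
    AgentRecord("db-1", "7.48.0", "c3d4"),
]

by_version = group_by(fleet, "agent_version")
by_key = group_by(fleet, "api_key_suffix")
print(by_version)
print(by_key)
```

Grouping by key suffix rather than the full key keeps secrets out of reports while still showing which hosts would be affected by rotating or disabling a given key.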

Enhance observability coverage

Understanding how your Agents are configured—for example, which integrations are enabled and whether they’re hindered by incomplete or faulty configurations—is key to maintaining visibility into your environment. As your organization grows, the new hosts in your environment may not be properly configured to provide the visibility you expect—for example, if a new team spins up hosts without installing the Agent or enabling the required integrations. It’s critical that you can quickly detect this gap and work with the affected team to ensure uniform observability across all teams and applications.

Fleet Automation makes it easy to find Agents with configuration issues and hosts that are running an older version of the Agent. To further help you identify observability gaps, it can even leverage host information collected through our integrations with cloud providers to detect hosts where the Agent isn’t installed. This key information can help you quickly spot any Agents that are missing or need to be reconfigured to provide full visibility into the performance of the host, its integrations, and the applications it supports.
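Conceptually, detecting hosts without an Agent is a set difference between the hosts a cloud provider knows about and the hosts that report an Agent. The Python sketch below illustrates that comparison under simplifying assumptions. The function name, the case-insensitive matching, and the host identifiers are hypothetical, and this is not a description of how Fleet Automation is implemented.

```python
def find_coverage_gaps(cloud_hosts, agent_hosts):
    """Return hosts known to the cloud provider that report no Agent."""
    # Assumption: identifiers match up to letter case.
    reporting = {host.lower() for host in agent_hosts}
    return sorted(h for h in set(cloud_hosts) if h.lower() not in reporting)


# Made-up sample identifiers.
cloud_hosts = ["i-0a1", "i-0b2", "i-0c3"]
agent_hosts = ["i-0a1", "I-0C3"]

missing = find_coverage_gaps(cloud_hosts, agent_hosts)
print(missing)
```

In a real environment the hard part is usually reconciling identifiers (instance IDs, hostnames, aliases) between the two sources, which this sketch reduces to a case-insensitive match.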

Troubleshoot configuration issues faster

If you need to troubleshoot the Agent’s configuration—for example, if it reports a broken integration—you can often find the cause of the problem by exploring that Agent’s YAML files. Reviewing YAML can reveal a simple explanation for a gap in your visibility, such as a typo or improper indentation. But exploring Agent configuration files via the command line can be a slow and error-prone process.
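As an illustration of the kind of indentation mistake that can hide in a YAML file, the following Python sketch flags tabs used for indentation (which YAML does not allow) and indents that are not a multiple of an expected width. It is a rough heuristic, not a YAML parser. The function name and the two-space convention are assumptions made for this example.

```python
def lint_indentation(text, indent=2):
    """Return (line_number, message) pairs for suspicious indentation."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            problems.append((number, "tab used for indentation"))
        elif len(leading) % indent != 0:
            problems.append(
                (number, f"indent of {len(leading)} is not a multiple of {indent}")
            )
    return problems


# Sample with a three-space indent on line 3 and a tab on line 4.
sample = (
    "logs_enabled: true\n"
    "logs_config:\n"
    "   use_http: true\n"
    "\tuse_compression: true\n"
)

issues = lint_indentation(sample)
for number, message in issues:
    print(f"line {number}: {message}")
```

An odd indent width is not invalid YAML on its own, so treat the second check as a hint to look closer rather than as an error.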

To allow you to efficiently and safely review an Agent’s configuration, Fleet Automation surfaces each Agent’s YAML configuration file so you can review it without logging into the affected host. In the screenshot below, the Configuration tab shows an error in the Agent’s Log Management (logs_config) configuration—it doesn’t have an API key, which is required for forwarding this host’s logs to Datadog.

The Fleet Automation view shows a host's YAML configuration. The logs configuration section contains an empty API key.
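A check like the one in this screenshot can be approximated in a few lines once the configuration has been parsed into a dictionary. The Python sketch below assumes the YAML is already loaded, and it reports settings that are missing or empty. The function name, the dotted-path convention, and the sample values are hypothetical.

```python
def find_empty_settings(config, required_paths):
    """Return the dotted paths whose value is missing or an empty string."""
    missing = []
    for path in required_paths:
        node = config
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None or (isinstance(node, str) and not node.strip()):
            missing.append(path)
    return missing


# Made-up configuration mirroring the situation described above:
# the top-level key is set, but the logs-specific key is empty.
config = {
    "api_key": "<redacted>",
    "logs_enabled": True,
    "logs_config": {"api_key": ""},
}

empty = find_empty_settings(config, ["api_key", "logs_config.api_key"])
print(empty)
```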

Fleet Automation makes it easy to get help troubleshooting Agent behavior that you can’t resolve through updating or reconfiguring the Agent. For any host that has Remote Configuration enabled, you can send a flare from within the Datadog app to quickly create or update a support ticket without logging into the affected host. The screenshot below shows a form that allows you to send a flare to a new or existing support ticket.

From the Support tab in the Fleet Automation view, the Send Ticket button launches a form to send a flare for an existing or new support ticket.

Manage your Datadog Agents with Fleet Automation

By bringing configuration from your entire fleet of Agents into a single view, Fleet Automation helps SREs remain focused on core observability by making it more efficient to manage and configure Agents at scale. Fleet Automation is now available in public beta. See the documentation for information on getting started. If you’re not already using Datadog, you can start today with a 14-day free trial.