Pursuing Full Observability to Keep a Stable Platform
Hosted in Google Cloud (GCP) environments in Europe and the United States, Dust’s platform complies with current regulatory requirements. It uses standard Google technologies such as Kubernetes clusters, Cloud SQL, Memorystore for Redis, and Google Cloud Storage. To provide the right data to its AI agents, Dust relies on semantic search, a technology that understands the context and intent behind user queries.
To ensure a stable infrastructure for the thousands of teams using the platform—about 10,000 monthly active users—Dust invested in full observability from the very beginning.
“At our previous company, Stripe, we experienced the benefits of moving from Splunk to Datadog as part of a massive observability effort to build the most stable platform possible. When we created Dust, Datadog naturally became the obvious choice. We were especially impressed by the performance of its advanced log access, management, and analytics capabilities.”
The Advanced Observability Needed for Generative AI Models
Datadog’s features stand apart from the less intuitive tools offered by cloud providers. Beyond monitoring and optimization, Datadog enables fast ingestion and querying of massive log volumes—capabilities Dust relies on heavily in its development process.
Dust uses Datadog Infrastructure Monitoring, which provides metrics, visualizations, and alerts that help the R&D team maintain, optimize, and secure their cloud environment. A user-friendly interface and detailed security insights support effective team communication and faster problem-solving.
“When issues arise, On-Call instantly aligns the team with the right context for faster resolution, better incident control, and better collaboration. Critical information and data are easy to access within a single platform, eliminating the need to switch environments.”
Using large language models creates long-running server interactions because the models generate tokens, the units of text that generative AI uses to encode information for efficient processing. Server calls and responses are often streamed over long-lived open connections, which increases the need for advanced observability and creates significant resource-consumption challenges; Datadog enables constant monitoring of the instances involved. Additionally, the nature of language model interactions means error rates are typically higher than in traditional SaaS applications.
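As a minimal sketch of what such a long-lived interaction looks like, the generator below simulates a model streaming tokens one at a time while the server holds the connection open until the final token. All names here are hypothetical stand-ins, not Dust's actual code.

```python
import time
from typing import Iterator

def stream_completion(prompt: str) -> Iterator[str]:
    """Hypothetical stub: a model emitting its response token by token."""
    for token in ["Observ", "ability", " matters", "."]:
        time.sleep(0.01)  # per-token model latency keeps the connection open
        yield token

def handle_request(prompt: str) -> str:
    """Server-side handler: the connection lives until the last token arrives."""
    chunks = []
    for token in stream_completion(prompt):
        chunks.append(token)  # in production, each chunk would be flushed to the client
    return "".join(chunks)
```

Because each request can hold a connection (and its server resources) open for the full duration of the generation, per-instance connection and memory metrics become as important to watch as request counts.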
From Anomaly Detection to Infrastructure Control
Another key characteristic of Dust is the heavy work involved in retrieving enterprise-specific context and indexing data from platforms like Slack, Notion, or GitHub. This results in near-real-time processing of large volumes of customer data. Datadog monitoring is essential here as well: this ingestion pipeline is complex and error-prone, especially when credentials are revoked or a service API misbehaves.
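The two failure modes named above behave very differently: a revoked credential will never succeed on retry, while a misbehaving upstream API often will. A sketch of how an ingestion pipeline might separate them, with entirely hypothetical error types and helper names (not Dust's connector code):

```python
class CredentialsRevokedError(Exception):
    """A connector's OAuth credentials are no longer valid (not retryable)."""

class TransientAPIError(Exception):
    """An upstream API (e.g. Slack, Notion) is temporarily misbehaving."""

def sync_document(fetch, max_retries: int = 3):
    """Ingest one document, retrying only transient upstream failures."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except CredentialsRevokedError:
            # Not retryable: surface to the workspace admin and fire an alert.
            raise
        except TransientAPIError:
            if attempt == max_retries - 1:
                raise  # exhausted retries: this is now a real incident
```

Distinguishing the two classes in metrics is what lets monitoring tell a routine retry storm apart from an outage that needs a human.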
“Zero error isn’t possible for us, so precise monitoring is essential to understand whether an error rate is nominal or an indication of a real issue.”
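One simple way to operationalize that distinction is to compare the observed error rate against a baseline band and alert only above it. The numbers below are purely illustrative; a monitor such as Datadog's would learn the baseline from historical metrics rather than hard-code it.

```python
def error_rate_status(errors: int, requests: int,
                      nominal_rate: float = 0.02,
                      tolerance: float = 2.0) -> str:
    """Classify an observed error rate as 'nominal' or 'anomalous'.

    nominal_rate and tolerance are made-up illustrative values: errors
    are expected, so only rates well above the baseline should page anyone.
    """
    if requests == 0:
        return "nominal"  # no traffic, nothing to judge
    rate = errors / requests
    return "anomalous" if rate > nominal_rate * tolerance else "nominal"
```

The point of the sketch is the shape of the check, not the thresholds: errors are part of normal operation, and only a deviation from the expected rate signals a real issue.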
Most of Dust’s services are monitored through Datadog metrics, with alerts that flag when certain instances need to scale up, ensuring proper infrastructure control. While Dust doesn’t host the AI models it uses, it does monitor their resource consumption through Datadog, performing anomaly detection via token counts, the unit that determines AI cost.
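To illustrate the idea of anomaly detection on token counts, the sketch below flags a day's usage when it deviates strongly from recent history using a plain z-score. This is a deliberately simple stand-in: Datadog's anomaly monitors use more sophisticated seasonal models, and every number here is illustrative.

```python
import statistics

def is_token_anomaly(history: list[int], today: int,
                     z_threshold: float = 3.0) -> bool:
    """Flag today's token count if it sits far outside recent history.

    A basic z-score check over a window of daily totals; the threshold
    is illustrative, not a recommendation.
    """
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return today != mean  # flat history: any change stands out
    return abs(today - mean) / stdev > z_threshold
```

Because tokens map directly to AI cost, a spike caught this way is both a reliability signal and a budget signal.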
When investigating an issue with a user request, Dust uses Datadog APM, which provides full execution tracing and correlation with infrastructure events or logs.
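Conceptually, execution tracing works by wrapping each unit of work in a span that shares the request's trace ID, so every step can later be correlated with logs and infrastructure events. The toy stand-in below illustrates the mechanism only; it is not the ddtrace API, and all names are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

TRACES: list[dict] = []  # a real APM setup ships spans to an agent instead

@contextmanager
def span(name: str, trace_id: str):
    """Record a named, timed span keyed by the request's trace ID."""
    start = time.monotonic()
    try:
        yield
    finally:
        TRACES.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.monotonic() - start,
        })

def handle_user_request() -> str:
    trace_id = str(uuid.uuid4())  # one ID correlates every span of this request
    with span("handle_request", trace_id):
        with span("semantic_search", trace_id):
            pass  # retrieval step would run here
        with span("model_call", trace_id):
            pass  # LLM call would run here
    return trace_id
```

Filtering the collected spans by a single trace ID is what turns "a user reported an error" into a full, ordered picture of that one request's execution.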
In addition to seamlessly integrating with GCP, Datadog also supports Dust’s multi-region cloud strategy by making it simple to assign dashboards and monitors by region. This enables highly effective, fully transparent global monitoring. Datadog’s ecosystem of libraries and tools for deployment and integration is extremely mature, an important asset for Dust as it rapidly builds and enriches its platform. The next Datadog products under evaluation at Dust will focus on security.
“Datadog is the single best partner to simplify visibility and control over a global infrastructure without having to switch between tools.”