Innovating AI Without Compromising Latency or Reliability
AssemblyAI is the leading platform for building Voice AI applications, delivering state-of-the-art speech-to-text and speech understanding APIs. Thousands of developers and enterprises rely on AssemblyAI to power meeting notetakers, contact center analytics, and voice agents that automate business workflows while delivering exceptional customer experiences.
As an applied research company, AssemblyAI builds its own in-house models and ships improvements continuously. Teams deploy dozens of updates every week while processing more than 30 million hours of audio per month, and up to 2 million hours in a single day, across a multi-cloud environment powered by thousands of GPUs. Reliability, latency, and cost efficiency are core product requirements.
AssemblyAI made an early decision to invest in observability as part of its platform foundation. The team began with the Datadog for Startups program, which made it easy to get up and running while the business scaled quickly. The program enabled AssemblyAI to engage with Datadog during its early growth stages and ultimately evolve that relationship into a long-term partnership.
“When you are building AI at this scale, visibility is a prerequisite,” says Ben Gotthold, Staff Software Engineer at AssemblyAI. “We made observability part of how we build from the very beginning.”
As AssemblyAI’s customer base grew, so did the complexity of its AI infrastructure. The platform runs large-scale inference pipelines across multiple cloud providers using GPUs and TPUs. Model performance, infrastructure efficiency, and customer experience are tightly linked.
Latency is a core product metric for AssemblyAI. Even small regressions in inference time or system behavior can directly affect customer outcomes and operating costs. “We care deeply about milliseconds because that is what our customers feel,” Gotthold explains. “If latency drifts, the product experience degrades.”
To continue shipping new models at a high pace, AssemblyAI needed a clear, consistent way to understand performance across every stage of its inference pipeline and respond quickly when something changed.
How AssemblyAI builds AI at scale
AssemblyAI designed its AI platform with deep instrumentation from day one. Engineers use Datadog Custom Metrics to track every stage of the inference pipeline, including customer-perceived latency, internal processing time, and GPU utilization. These metrics give the team a clear understanding of where performance and cost tradeoffs exist as workloads scale. “At our scale, observability is what lets us operate thousands of GPUs efficiently,” says Gotthold.
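AssemblyAI's exact instrumentation isn't shown in this story, but per-stage custom metrics of this kind are typically emitted to a local Datadog Agent over the plain-text DogStatsD protocol. The sketch below is illustrative only: the metric name, stage names, and model tag are hypothetical, not AssemblyAI's real schema.

```python
import socket
import time
from contextlib import contextmanager

DOGSTATSD_ADDR = ("127.0.0.1", 8125)  # default DogStatsD agent address
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def format_metric(name, value, metric_type, tags):
    """Render a metric in the plain-text DogStatsD wire format:
    metric.name:value|type|#tag1:v1,tag2:v2"""
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    return f"{name}:{value}|{metric_type}|#{tag_str}"

def send_metric(name, value, metric_type="d", tags=None):
    payload = format_metric(name, value, metric_type, tags or {})
    _sock.sendto(payload.encode("utf-8"), DOGSTATSD_ADDR)

@contextmanager
def timed_stage(stage, model):
    """Time one pipeline stage and emit its duration as a distribution metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        send_metric("inference.stage.duration_ms", round(elapsed_ms, 2),
                    tags={"stage": stage, "model": model})

# Hypothetical usage: wrap each stage of an inference request
with timed_stage("preprocess", model="asr-v2"):
    pass  # audio decoding / feature extraction would run here
```

Tagging every sample with the stage and model makes it possible to slice latency by pipeline step, which is what surfaces the performance and cost tradeoffs described above.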
To support this at scale, AssemblyAI relies on Datadog Infrastructure Monitoring to maintain real-time visibility across its multi-cloud GPU fleet. This helps ensure deployments remain reliable and cost-effective as traffic grows month over month.
When issues arise, engineers turn to Datadog Log Management to investigate and resolve problems quickly. Logs are correlated with metrics so teams can move from detection to root cause without manually stitching together data across systems. Customer support teams use the same logs to debug issues and help customers faster.
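One common way to make logs joinable with metrics, sketched here as an assumption rather than AssemblyAI's actual implementation, is to emit structured JSON log lines that carry the same identifying fields (request ID, stage, model) used as metric tags. The field names below are hypothetical.

```python
import json
import logging
import sys

logger = logging.getLogger("inference")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler(sys.stdout))

def log_event(message, **fields):
    """Emit one JSON log line. Shared fields (request_id, stage, model)
    let a log pipeline correlate these records with the matching metrics."""
    record = {"message": message, **fields}
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line

# Hypothetical usage: same tag values as the emitted metrics
line = log_event("stage complete",
                 request_id="req-123", stage="transcode", duration_ms=48.2)
```

Because the log record and the metric share the same keys and values, a single filter pivots from a latency spike to the exact requests behind it.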
“Datadog gives us the visibility we need to understand how our models behave in production and keep improving them,” Gotthold shares.
Shipping fast with confidence
AssemblyAI ships continuously and uses staged deployments to manage risk. Datadog Synthetic Monitoring is used to validate critical API paths and customer workflows, helping teams catch regressions early and confirm releases behave as expected in production-like environments. “We move fast by design, and clear signals let us take that speed into production safely,” Gotthold explains.
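At its core, a synthetic check like the ones described above boils down to assertions on a probed endpoint: did it return a healthy status, and did it respond within the latency budget? The toy sketch below illustrates that assertion logic only; it is not Datadog's Synthetic Monitoring API, and the thresholds are made up.

```python
def evaluate_check(status_code, elapsed_ms, latency_budget_ms=500):
    """Mimic a synthetic test's assertions: the endpoint must return a 2xx
    status and respond within the latency budget. Returns a pass/fail
    result plus the list of failed assertions."""
    failures = []
    if not 200 <= status_code < 300:
        failures.append(f"unexpected status {status_code}")
    if elapsed_ms > latency_budget_ms:
        failures.append(
            f"latency {elapsed_ms}ms exceeds {latency_budget_ms}ms budget")
    return {"passed": not failures, "failures": failures}

# Hypothetical usage: evaluate one probe of a critical API path
result = evaluate_check(status_code=200, elapsed_ms=120)
```

Running assertions like these continuously against staging and production endpoints is what lets a staged rollout be halted before a regression reaches customers.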
As the organization scaled, AssemblyAI also adopted Datadog Workflow Automation to standardize incident response. Automated workflows help route alerts and trigger consistent actions, reducing manual coordination and keeping response times low.
Performance that powers growth
As AssemblyAI scaled its AI platform, Datadog’s tooling helped the team operate more efficiently and respond faster.
- 2X throughput with 50% lower infrastructure costs
- 40% reduction in MTTR, enabling faster incident resolution
- 50% reduction in investigation and postmortem time
- ~$750K/year in avoided engineering costs
“Instead of spending time maintaining monitoring systems, we spend that time improving our models,” Gotthold explains. “That focus has been critical to our growth.”
AssemblyAI continues to expand its AI platform and push the boundaries of Voice AI. The observability foundation the team built early lets it scale traffic, deploy new models, and adopt new infrastructure with confidence.
“Datadog is a core part of how we operate at scale,” says Gotthold. “It gives us the visibility we need to keep building and scaling AI without slowing down.”