Throughput and Availability Threatened by Poor Visibility
SpringServe’s business hinges on its ability to quickly and consistently deliver video advertisements during varying levels of demand. Failure to respond to an advertising opportunity in a timely manner translates to lost revenue for SpringServe and the publisher, a missed opportunity for the advertiser, and a likely win for a competing ad platform. Yet as SpringServe advanced their infrastructure to meet performance demands, their open source monitoring tools could no longer provide the visibility they needed and ultimately became an impediment to growth.
“We were hitting that point when Graphite was being really, really slow. Our Nagios setup wasn’t great either,” says David Buonasera, SpringServe’s Chief Technology Officer, adding that these self-hosted tools created a maintenance burden. “You’d need Nagios to monitor Nagios in other regions,” he says. SpringServe’s infrastructure was becoming more dynamic and distributed in order to provide better service around the world. But significant blind spots hampered those efforts: they could not track application performance across regions, nor could they correlate metrics between systems to uncover the source of issues.
“The general expectation is that an ad server is up 99.9% of the time.”
SpringServe needed a reliable, real-time monitoring solution that could keep pace with its auto-scaling infrastructure, allow them to adopt innovative technologies, and keep growing quickly—without sacrificing the speed or consistency their customers depend on.
Trustworthy Monitoring for a Dynamic Environment
SpringServe turned to Datadog for real-time, granular data that allows them to monitor every layer of their infrastructure and custom applications, identifying issues before they affect the business. “It took away pretty much every use case we had for Nagios and Graphite almost immediately,” says Buonasera.
“Datadog is significantly more reliable than our previous setup was, and allows us to monitor things we couldn’t before.”
From their initial Datadog deployment, “it just worked,” Buonasera says, allowing SpringServe engineers to easily correlate metrics across their infrastructure and across regions to avoid costly downtime. Additionally, SpringServe can now analyze business metrics to identify areas of the platform that are ripe for improvement, an analysis that would have been prohibitively time-consuming with their old setup. With reduced MTTR, and without the burden of managing a self-hosted monitoring stack, SpringServe engineers are now able to focus on enhancing their platform’s value-producing offerings. Moreover, they have been able to embrace a fully dynamic infrastructure to respond to traffic spikes, boost performance, and meet publishers’ expectations of near-immediate response times in a cost-effective manner.
The breadth and depth of visibility provided by Datadog’s built-in integrations surpassed what SpringServe previously had access to, allowing Buonasera and his team to readily prevent performance and capacity issues. “The most important thing we do is monitor AWS Kinesis,” says Buonasera, referring to a streaming data platform in the Amazon cloud that is fundamental to SpringServe’s functionality. Kinesis delivers event data from SpringServe’s ad and pixel-tracking servers to their aggregation system. However, as Kinesis reaches its provisioned capacity, it begins to throttle this event data, which then backs up on the sending servers. If capacity is not promptly increased, the servers quickly become overloaded, data may be lost, and a crash may be imminent. Buonasera and his team now receive Datadog alerts as Kinesis approaches its provisioned capacity in order to avoid this “catastrophic” scenario.
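To make the capacity reasoning concrete, here is a minimal sketch of how write utilization for a provisioned Kinesis stream could be estimated. It is not SpringServe's or Datadog's implementation. The per-shard limits are round-number assumptions based on AWS's published quotas, and the function names and the 80% alert threshold are hypothetical.

```python
# Per-shard write limits for Kinesis Data Streams in provisioned mode.
# Assumption: round numbers taken from AWS's published quotas; verify
# against the current documentation before relying on them.
SHARD_WRITE_BYTES_PER_SEC = 1_000_000
SHARD_WRITE_RECORDS_PER_SEC = 1_000


def write_utilization(shards: int, bytes_per_sec: float, records_per_sec: float) -> float:
    """Return the fraction of provisioned write capacity in use."""
    if shards <= 0:
        raise ValueError("a stream needs at least one shard")
    byte_share = bytes_per_sec / (shards * SHARD_WRITE_BYTES_PER_SEC)
    record_share = records_per_sec / (shards * SHARD_WRITE_RECORDS_PER_SEC)
    # Whichever limit is closer to exhaustion decides when throttling starts.
    return max(byte_share, record_share)


def should_alert(utilization: float, threshold: float = 0.8) -> bool:
    """Hypothetical policy: alert once utilization reaches the threshold."""
    return utilization >= threshold


if __name__ == "__main__":
    usage = write_utilization(shards=4, bytes_per_sec=3_400_000, records_per_sec=1_200)
    print(f"utilization={usage:.0%} alert={should_alert(usage)}")
```

Alerting below 100% utilization leaves time to add capacity before throttling begins, which is the point of the early warning described above.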
Monitoring Amazon Elastic Load Balancing (ELB) also allows SpringServe to proactively detect issues and keep transactions flowing. High ELB latency means that SpringServe’s backend servers are not responding to publishers’ ad requests promptly, and SpringServe may lose these revenue-generating opportunities. What’s worse, high ELB latency also signals that backend servers may be overburdened and liable to crash. “ELB latency is immensely important,” says Buonasera. “Anytime we see it go over about fifteen milliseconds, we don’t have a ton of time to respond, so alerting on it is super important.” Creating alerts based on out-of-the-box latency metrics from Datadog’s ELB integration allows SpringServe to investigate and resolve the underlying issue before it produces ripple effects on revenue.
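An alert like the one Buonasera describes could be expressed as a monitor definition along the following lines. This is an illustrative sketch only: the payload shape follows the general form of Datadog's monitor API, the metric name `aws.elb.latency` and its unit (seconds) are assumptions about the ELB integration, and the five-minute window, monitor name, and message are hypothetical. Only the fifteen-millisecond threshold comes from the quote above.

```python
import json


def elb_latency_monitor(threshold_ms: float = 15.0, window: str = "last_5m") -> dict:
    """Build a monitor definition that fires when average ELB latency is too high."""
    # Assumption: the integration reports latency in seconds, so convert from ms.
    threshold_s = threshold_ms / 1000.0
    query = f"avg({window}):avg:aws.elb.latency{{*}} > {threshold_s:g}"
    return {
        "name": "ELB latency above threshold",  # hypothetical name
        "type": "metric alert",
        "query": query,
        "message": "Backend servers may be overloaded; investigate now.",
        "options": {"thresholds": {"critical": threshold_s}},
    }


if __name__ == "__main__":
    # Print the definition so it can be reviewed before it is submitted anywhere.
    print(json.dumps(elb_latency_monitor(), indent=2))
```

Building the definition as plain data keeps the threshold in one place and makes the alert easy to review or version alongside application code.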
End-to-End Visibility Optimizes Platform Performance
While Datadog’s built-in integrations enabled SpringServe to gain immediate visibility into their most critical systems, collecting and monitoring metrics from across the business has strengthened their ability to make data-driven decisions. By tracking every stage of the ad-serving process in Datadog, SpringServe is able to proactively reduce risk and latency throughout their platform and increase the value they provide to customers over time.
Engineers and business teams alike monitor SpringServe DirectConnect, a proprietary channel in which a majority of the transactions between demand- and supply-side partners take place. By collecting custom application metrics and graphing them in a Datadog dashboard, engineers ensure that DirectConnect is available and healthy, while executives can extrapolate revenue. Moreover, by tracking impressions, requests, and fill rates for individual parties, SpringServe’s business team is able to identify their most productive partners and optimize ad delivery cadence accordingly.
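As a rough illustration of the per-partner metrics mentioned above, the sketch below computes a fill rate and formats it as a gauge in the DogStatsD text format. It assumes fill rate means impressions divided by requests, which is a common ad-tech definition but is not spelled out here. The metric name, partner names, and counts are hypothetical, and nothing is sent over the network.

```python
def fill_rate(impressions: int, requests: int) -> float:
    """Assumed definition: share of ad requests that resulted in an impression."""
    return impressions / requests if requests else 0.0


def gauge_datagram(metric: str, value: float, tags: dict) -> str:
    """Format a gauge as a DogStatsD line: name:value|g|#key:value,..."""
    tag_part = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    return f"{metric}:{value}|g|#{tag_part}"


def rank_partners(stats: dict) -> list:
    """Return (partner, fill_rate) pairs, most productive partner first."""
    rates = {
        partner: fill_rate(s["impressions"], s["requests"])
        for partner, s in stats.items()
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical partners and counts, for illustration only.
    stats = {
        "partner_a": {"impressions": 50, "requests": 200},
        "partner_b": {"impressions": 90, "requests": 120},
    }
    for partner, rate in rank_partners(stats):
        print(gauge_datagram("springserve.directconnect.fill_rate", rate, {"partner": partner}))
```

Tagging each value with the partner is what lets a single dashboard graph be broken down by party.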
“The dashboarding thing is amazing—the fact that we can put whatever we want on here and have our own custom view. I have this open all day long, and I know other engineers do, too.”
Comprehensive visibility from custom metrics also allows SpringServe to provide better service to their customers and ensure quality throughout their ecosystem. Datadog prevents unintentional ad spend by helping SpringServe identify partners who inadvertently place repeated ad requests with runaway scripts. “Our clients are not only able to adjust to the market quicker, but when they make a mistake, they find it out faster and are able to mitigate risk,” says Buonasera. “That’s an advantage.”
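One simple way to spot a runaway script is to compare each partner's latest request count with its own recent baseline. The sketch below shows that idea under stated assumptions. It is not a description of how SpringServe or Datadog detects the problem, and the function name, the 5x factor, and the minimum request count are all hypothetical.

```python
from statistics import median


def runaway_partners(history: dict, current: dict,
                     factor: float = 5.0, min_requests: int = 1000) -> list:
    """Flag partners whose current request count far exceeds their usual level.

    history maps partner -> request counts from earlier windows;
    current maps partner -> request count in the latest window.
    """
    flagged = []
    for partner, count in current.items():
        past = history.get(partner, [])
        # Skip partners with no baseline or with too little traffic to matter.
        if not past or count < min_requests:
            continue
        baseline = median(past)
        if baseline > 0 and count > factor * baseline:
            flagged.append(partner)
    return sorted(flagged)


if __name__ == "__main__":
    history = {"partner_a": [100, 120, 110], "partner_b": [1000, 900, 1100]}
    current = {"partner_a": 5000, "partner_b": 1200}
    print(runaway_partners(history, current))
```

Using the median rather than the mean keeps a single earlier spike from inflating the baseline.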
Enabling an Agile Future
Clients turn to SpringServe for quick and reliable video ad serving, real-time analytics with insight into revenue, and fast feedback to optimize campaigns. For SpringServe and their customers, “it’s all about using data to inform performance,” says Buonasera.
To ensure the stability of their platform and business, SpringServe turned to Datadog, an investment that continues to produce returns as they adopt new technologies and deliver innovative products. “We never have to think, ‘Does this work with Datadog?’ before we select a new technology. That’s just a default—it’s going to work with Datadog,” says Buonasera.
“It just seems to work no matter what we are doing—that’s the killer feature.”