Flaky tests and sluggish pipelines create developer friction
Betterment is a wealth management platform that provides modern, technology-driven solutions for investing, saving, and retirement. For a fintech company that helps over one million users manage $60+ billion in assets under management (AUM), reliability, accuracy, and speed are paramount. Behind the scenes, Betterment’s engineering team deploys code frequently, safeguarded by a large suite of tests running on a shared CI/CD pipeline. Yet as the company scaled, the cost of keeping its monorepo-based CI infrastructure reliable began to rise dramatically. “With a Rails app like ours handling real finances, our CI pipelines can’t be taken lightly,” says Devin Burnette, Senior Staff Software Engineer, Developer Experience at Betterment.
Despite robust testing practices, Betterment’s CI pipeline became a source of friction. Flaky tests—those that fail and pass unpredictably—eroded developer confidence and wasted time. “Developers would rerun their pipeline with their fingers crossed hoping that it passes the second time around,” Burnette explains.
The stakes were high. Their largest monorepo pipeline ran for nearly 40 minutes, and the ripple effects of one flaky test could block unrelated teams. Build failures grew, confidence in test results plummeted, and valuable engineering hours were lost in reruns. As Burnette recounts, “Our main branch build success rate was at one point lower than 50%. I’m embarrassed to say that out loud.”
A data-driven CI overhaul
Betterment’s team initiated a comprehensive, phased strategy to tackle the growing CI instability. They began by measuring their CI performance to understand the depth of the issue. According to Burnette, “We decided to integrate Datadog’s CI Visibility product in order to collect more detailed information about our builds. This let us track how often builds failed and see our overall build success rate.”
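As a rough illustration of the baseline metric described above, the following minimal sketch computes a build success rate from a list of build records. The `Build` type and its field names are hypothetical and do not reflect Datadog's CI Visibility schema or Betterment's actual tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """One CI run. Field names are hypothetical, not a Datadog schema."""
    branch: str
    passed: bool


def success_rate(builds: list[Build], branch: str = "main") -> float:
    """Return the fraction of builds on `branch` that passed (0.0 if none)."""
    relevant = [b for b in builds if b.branch == branch]
    if not relevant:
        return 0.0
    return sum(b.passed for b in relevant) / len(relevant)


# Made-up sample data: three of the four main-branch builds passed.
builds = [
    Build("main", True),
    Build("main", False),
    Build("main", True),
    Build("main", True),
    Build("feature-x", False),
]
print(f"main success rate: {success_rate(builds):.0%}")  # main success rate: 75%
```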
Once they established a baseline, they focused on sharing this data organization-wide. “We built a Datadog dashboard. It shows our main CI health statistics including where we’re trending toward our big 95% goal. We checked this dashboard daily,” Burnette says. “Thanks to Datadog these reports were live, so no more scrolling through weekly Slack messages. Teams could see at a glance how yesterday’s commits affected today’s numbers.”
The team then moved into the improvement phase. This phase included eliminating flaky tests, speeding up pipelines, and rethinking how they used open-source tools such as RSpec Retry, which automatically reruns failing tests. “We realized that by having this tool blindly retrying failing tests we were actually digging ourselves an even deeper hole,” Burnette notes. To address this, they forked the tool and added early flake detection, buying time to fix underlying issues.
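The difference between blind retries and retries with flake detection can be sketched as follows. This is not Betterment's fork of RSpec Retry (which is a Ruby tool); it is a hypothetical Python illustration, with all names assumed, of the general idea: a test that fails and then passes on retry is recorded as flaky instead of being silently counted as a pass.

```python
from __future__ import annotations

from typing import Callable, Optional


def run_with_flake_detection(
    name: str,
    test: Callable[[], None],
    max_attempts: int = 3,
    flaky_log: Optional[list[str]] = None,
) -> bool:
    """Run `test`, retrying on failure, and record a flake if a retry passes.

    Returns True if any attempt passed. A test that failed at least once and
    then passed is appended to `flaky_log`, so the retry is never silent.
    """
    failures = 0
    for _ in range(max_attempts):
        try:
            test()
        except AssertionError:
            failures += 1
            continue
        if failures and flaky_log is not None:
            flaky_log.append(name)
        return True
    return False


# Hypothetical test that fails on its first call only.
calls = {"n": 0}


def sometimes_fails() -> None:
    calls["n"] += 1
    assert calls["n"] > 1


flaky: list[str] = []
passed = run_with_flake_detection("sometimes_fails", sometimes_fails, flaky_log=flaky)
print(passed, flaky)  # True ['sometimes_fails']
```

The build still goes green, but the flake is now visible and can be handed to an owner rather than hidden behind the retry.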
They also used Datadog’s Test Optimization product. “This allowed us to identify more common patterns for tests that were flaky or slow and address them accordingly,” says Burnette. For each category of flaky test, they dug into the code and implemented fixes.
Ownership was another critical pillar. “We assigned every test or suite of tests to the team that knows that area of the codebase best,” Burnette explains. “Now it’s always clear who will handle a given test failure.”
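One simple way to express this kind of ownership is a mapping from path prefixes to teams, where the most specific prefix wins. The sketch below is an assumption for illustration only; the paths and team names are invented, and the source does not describe how Betterment actually records ownership.

```python
from __future__ import annotations

# Hypothetical prefix-to-team mapping; paths and team names are invented.
OWNERS = {
    "spec/models/billing/": "payments-team",
    "spec/models/": "platform-team",
    "spec/features/onboarding/": "growth-team",
}


def owner_for(
    test_path: str,
    owners: dict[str, str] = OWNERS,
    default: str = "unowned",
) -> str:
    """Return the team whose prefix matches `test_path` most specifically."""
    matches = [prefix for prefix in owners if test_path.startswith(prefix)]
    if not matches:
        return default
    # The longest matching prefix is the most specific owner.
    return owners[max(matches, key=len)]


print(owner_for("spec/models/billing/invoice_spec.rb"))  # payments-team
print(owner_for("spec/models/user_spec.rb"))             # platform-team
print(owner_for("spec/lib/util_spec.rb"))                # unowned
```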
Finally, to enforce accountability and prevent regressions, they used Datadog monitors and automated workflows. “We set up Datadog monitors to help us prevent backsliding in our progress,” says Burnette. “If the success rate dipped or a new flaky test appeared, it triggered a Slack notification. We automated the creation of low severity incidents using Datadog’s workflow automation product.”
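The two alert conditions Burnette describes (a dip in success rate, or a newly flaky test) can be sketched as a plain function. This is a hypothetical illustration of the logic only: it does not use Datadog's monitor or Workflow Automation APIs, and the function and parameter names are assumptions. The 95% default reflects the goal mentioned earlier.

```python
from __future__ import annotations


def ci_health_alerts(
    success_rate: float,
    known_flaky: set[str],
    observed_flaky: set[str],
    threshold: float = 0.95,
) -> list[str]:
    """Return alert messages for a low success rate or newly flaky tests."""
    alerts = []
    if success_rate < threshold:
        alerts.append(
            f"Build success rate {success_rate:.0%} is below the {threshold:.0%} target"
        )
    # Only tests not already tracked as flaky count as new.
    for name in sorted(observed_flaky - known_flaky):
        alerts.append(f"New flaky test detected: {name}")
    return alerts


for message in ci_health_alerts(0.92, {"a_spec"}, {"a_spec", "b_spec"}):
    print(message)
# Build success rate 92% is below the 95% target
# New flaky test detected: b_spec
```

In a real setup, each message would be routed to a notification channel or incident workflow rather than printed.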
Speed, savings, and cultural change
Betterment built a robust system to identify and take action on issues in CI, using Datadog as its foundational tool. Through strategic implementation of Datadog CI Visibility and Test Optimization, Betterment successfully tracked flakiness trends, drove transparency, and reinforced healthy engineering habits. “This constant feedback loop meant that improvements or regressions in test stability were immediately apparent to all,” says Burnette. “The data was impossible to ignore.”
The results of these efforts were dramatic. “We improved our success rate, taking it from below 50% all the way up to 95% as an org-wide average,” says Burnette. “Our main branch builds now consistently pass on the first run at least 95% of the time.”
Build duration also improved significantly. “From almost 40 minutes to under 10 minutes as an org-wide average. That’s a 75% reduction in build times,” says Burnette. Compute usage dropped by more than half. Developers spent less time rerunning jobs and more time building features.
Beyond metrics, the transformation was cultural. “That’s really the power of accountability. It turns a project into habit,” Burnette emphasizes. CI health became a shared responsibility, and developers began to take pride in keeping the build green. “Developers can now trust that when CI is red, it’s for a good reason,” he adds.
“The hardest part wasn’t finding flaky tests, it was building the habits and incentives to keep CI healthy over time. Datadog gave us the clarity and the tools to do both,” says Andrew Allred, engineering leader at Betterment. “True reliability isn’t just a one-time initiative; it’s now part of our engineering DNA, supported every day by Datadog’s constant feedback loops.”
Test better, build better
Betterment’s journey underscores the impact of combining strong engineering discipline with actionable observability tools. By partnering with Datadog, they not only solved a technical challenge but also unlocked a new standard of excellence in software delivery. “Every minute saved on build time is a minute sooner our customers get new features, security enhancements, and improvements,” says Allred. “It’s a competitive advantage enabled by reliable CI.”
“In the end we truly managed to test better and build better,” says Burnette. “Our CI pipeline went from a source of frustration to a source of confidence. Developers can merge code faster and with peace of mind. Ultimately, that means we can deliver improvements to our customers more reliably.”