
Ryan Lucht
Technical teams want to know the newest, most cutting-edge tools they can implement to give themselves a competitive advantage, whether it’s the latest developer framework or modern CI/CD practices that boost velocity. But there’s one tool from all the way back in the 1920s that can improve any organization, no matter its scale: the randomized controlled trial—or, simply put, the experiment.
Even at a time when technologies and technical practices are changing faster than ever before, experiments never go out of style. Enterprise companies like Meta, Google, Amazon, and many more might run upward of 100,000 experiments every year, because experiments provide the gold standard of scientific evidence: a way to use data and statistics to understand causation, not just correlation, between the changes we make and the metrics they affect.
Experiments are often referred to colloquially as A/B tests, a term typically associated with growth and product teams looking to measure how users respond to changes in product features or UI designs. But experimentation helps us evaluate any decision we face, including whether new code is safe to deploy or how to configure AI models in production. By selectively enabling changes for a random sample, we can monitor potential impacts safely, quantify successes, and swiftly roll back failures. With this approach, you can run an experiment anywhere you have changes to make and metrics to measure.
In this post, we’ll look at a few common situations in which most teams, not just growth and product, should be A/B testing.
Safe deployments, even for bug fixes
At Datadog and many other companies, every code deployment happens behind a feature flag and is canary tested. Canary testing is a deployment method that treats every code change as a hypothesis: “This change will improve the system without causing regression.” Instead of deploying to 100% of the fleet or 100% of users simultaneously, you route a small percentage of traffic to the new version while the majority continues to receive the old version. If telemetry signals a problem, the rollout halts automatically, before things turn into a tricky incident.
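As a minimal sketch of the routing half of this pattern, the snippet below deterministically buckets users into a canary group by hashing their ID; the `in_canary` helper, the salt, and the version labels are hypothetical illustrations, not Datadog's implementation:

```python
import hashlib

def in_canary(user_id: str, rollout_percent: float, salt: str = "checkout-v2") -> bool:
    """Deterministically bucket a user into the canary group.

    Hashing (salt, user_id) gives each user a stable position in [0, 1],
    so the same user keeps the same experience as the rollout grows.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return bucket < rollout_percent / 100

def handle_request(user_id: str) -> str:
    # Route ~5% of users to the new build; everyone else stays on stable.
    return "new-version" if in_canary(user_id, rollout_percent=5) else "stable-version"
```

If telemetry from the canary group regresses, dropping `rollout_percent` to zero rolls everyone back without a redeploy.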
This approach is popular because traditional pre-production testing can’t catch everything. Unit tests verify logic, but production environments are nondeterministic and user behavior is unpredictable. Race conditions, edge-case data, and gradual memory leaks only surface under real load over real time. At best, you can make an informed guess about how your product changes will actually impact customer responses and business metrics until you collect actual data.
When 86 Google Cloud products went down for more than 7 hours in 2025, Google’s official postmortem pointed out that the real cause of the blast radius wasn’t the null pointer crash loop that had been introduced, but the fact that the change had been deployed without an experiment. A canary test would have surfaced the crash before the change propagated across the fleet.
Microsoft’s Azure team also recognizes the value of canary tests, which is why they built Gandalf, an analytics service that monitors every deployment—guest OS patches, host agent updates, firmware changes—and correlates fault signals with ongoing rollouts. Over 18 months of study, Gandalf achieved 92.4% precision and 100% recall for data-plane rollouts. That 100% recall means zero high-impact service outages were caused by bad rollouts during the study period.
Monzo, a UK bank, uses Argo Rollouts on Kubernetes to automatically canary test hundreds to thousands of deployments a day, stepping them from 1% to 5%, 25%, and finally 100%, pausing at each stage to gather metrics. Their engineering team reports this system saved them from a number of bad deployments to production.
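The stepped pattern Monzo describes can be sketched as a simple control loop; `get_canary_error_rate` and `set_traffic_percent` are hypothetical hooks into a metrics store and traffic router, not Argo Rollouts APIs:

```python
import time

CANARY_STEPS = [1, 5, 25, 100]  # percent of traffic, as in Monzo's pipeline
ERROR_BUDGET = 0.01             # abort if the canary's error rate exceeds 1%

def run_canary(get_canary_error_rate, set_traffic_percent, pause_seconds=300):
    """Step a rollout through increasing traffic shares, pausing to check metrics.

    Advances to the next step only while the canary stays within its error
    budget; otherwise routes all traffic back to the stable version.
    """
    for percent in CANARY_STEPS:
        set_traffic_percent(percent)
        time.sleep(pause_seconds)  # gather telemetry at this stage
        if get_canary_error_rate() > ERROR_BUDGET:
            set_traffic_percent(0)  # roll back: all traffic to stable
            return "rolled_back"
    return "promoted"
```

In practice, Argo Rollouts drives these steps declaratively from a manifest, but the logic is the same: advance only while the canary's metrics stay within budget.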
In modern software, if it hasn’t been canaried, it hasn’t been tested.
Drive improvements to key metrics
Back around the turn of the millennium, Amazon engineer Greg Linden ran a strange series of A/B tests to see how much page load time impacted customer behavior. He intentionally added delay to pages in increments of 100 ms and found that even very small delays resulted in substantial drops in revenue: Every 100 ms decreased sales by 1%. At Amazon’s scale back then, that translated to over $100 million in lost revenue annually. Today, it would be billions.
Plenty of other businesses have found similar results with experiments of their own. Retailer Zalando found that every 100 ms of improvement they could make equaled 0.7% more revenue per session. These experiments proved the return on investment of improving latency and directly demonstrated the value of real user monitoring.
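Linden’s exact setup isn’t public, but a delay-injection experiment like the ones described here can be sketched as follows; the arm values, helper names, and logging step are illustrative:

```python
import time
import zlib

DELAY_ARMS_MS = [0, 100, 200, 300]  # artificial delay arms, in 100 ms increments

def assign_delay_ms(user_id: str) -> int:
    """Stable assignment: each user always lands in the same delay arm."""
    bucket = zlib.crc32(user_id.encode()) % len(DELAY_ARMS_MS)
    return DELAY_ARMS_MS[bucket]

def serve_page(user_id: str, render):
    delay_ms = assign_delay_ms(user_id)
    time.sleep(delay_ms / 1000)  # inject this arm's artificial latency
    page = render()
    # In a real experiment, log (user_id, delay_ms) alongside revenue
    # events so each arm's revenue per session can be compared.
    return page, delay_ms
```

Comparing revenue per session across arms then gives a direct estimate of what each 100 ms of latency costs.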
Not every experiment validates expectations. Etsy spent months re-architecting their search results page to support infinite scroll, commonly thought to be a best practice for increasing user engagement. The hypothesis was sound: Removing pagination friction should lead to more items viewed, favorited, and purchased. When they A/B tested it, the results were the opposite. Users clicked on fewer search results and favorited fewer items. The postmortem revealed why: Users lost their mental landmarks. In a paginated interface, users remember that a specific item was in a specific place; for example, at the top of page 2. Infinite scroll erases these waypoints. Etsy’s experiment prevented a permanent degradation of their search experience, one the team would otherwise have shipped and celebrated as a launch.
On Duolingo’s quarterly earnings calls with investors, CEO Luis von Ahn commonly mentions the importance of the company’s experimentation program. Duolingo runs hundreds of A/B tests every quarter, which contributed to one quarter’s jaw-dropping 71% year-over-year increase in paid subscriptions, as reported on their Q2 2022 earnings call. Many of their biggest wins came from testing engagement mechanics around their “streak” feature. Making streaks visible increased daily active users by 3%. Adding virtual currency wagers on streaks increased two-week retention by 5% and in-app purchase revenue by 600%. Each test built on the last, compounding understanding over time.
Experimentation generates value in multiple ways. It helps teams correctly identify treatments that work, supports failing cheaply to avoid costly mistakes, and enables them to build the measurement muscle to discover what specific users actually respond to.
Tune, orchestrate, and evaluate AI models
Experimentation is also a crucial tool in the age of AI. Traditional software has unit tests that either pass or fail. Machine learning has accuracy metrics, such as precision and recall. Large language models in production have a more complex evaluation problem. An LLM is nondeterministic; it might answer correctly nine times and hallucinate on the tenth. And offline benchmarks almost always prove to be poor predictors of production success. A model that excels at generic reasoning may fail on your specific customer support queries or proprietary codebase. This is why AI engineering is shifting toward online evaluation, which treats production interactions as experiments.
GitHub is constantly A/B testing the models and configurations that power Copilot. Their primary metric isn’t whether a developer accepts a suggestion, but whether the code is retained. A user might press tab and then immediately backspace to delete a wrong suggestion. GitHub trains dozens of model candidates per month. Each goes through offline testing, internal dogfooding, and then online A/B experiments on a small percentage of real-world requests.
When they tested a new custom model architecture for Copilot, they measured multiple metrics simultaneously, including acceptance rate, retained characters, and latency. The results were a 12% higher acceptance rate, 20% more accepted and retained characters, and a 35% reduction in latency. They ship only when improvements are statistically significant across real developer workloads, not just benchmark scores.
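GitHub’s internal tooling isn’t public, but the bookkeeping behind an online model experiment can be sketched like this; the variant names, traffic split, and metric fields are illustrative:

```python
import hashlib
from collections import defaultdict

# Hypothetical registry of model configurations under test, with traffic shares.
VARIANTS = {"baseline-model": 0.95, "candidate-model": 0.05}

metrics = defaultdict(lambda: {"shown": 0, "accepted": 0, "retained_chars": 0})

def assign_variant(request_id: str) -> str:
    """Hash each request into a variant according to the traffic split."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest()[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, share in VARIANTS.items():
        cumulative += share
        if bucket < cumulative:
            return name
    return "baseline-model"

def record_outcome(variant: str, accepted: bool, retained_chars: int) -> None:
    """Track not just acceptance but how much suggested code survives edits."""
    m = metrics[variant]
    m["shown"] += 1
    m["accepted"] += int(accepted)
    m["retained_chars"] += retained_chars
```

Comparing acceptance rate and retained characters per variant, alongside latency, is what lets a team ship only statistically significant wins rather than benchmark improvements.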
Spotify faced a cold-start problem with their audiobooks product that they solved with AI and experimentation. Users would search for books, but if the book’s metadata didn’t exactly match the user query, retrieval failed. They tested a system they called AudioBoost, using LLMs to generate synthetic descriptions and query variations for audiobooks. They ran a 3-week A/B test comparing standard metadata retrieval against LLM-augmented metadata, tracking impressions, clicks, and exploratory query completions. The LLM-augmented group saw a 1.22% increase in audiobook clicks and a 1.82% boost in exploratory completions. After validating these improvements in production, they rolled out AudioBoost globally.
Every knob in an AI system—the model, the prompt, the retrieval strategy, the serving infrastructure—is a hypothesis waiting to be tested. Experimentation helps not only prove that LLMs are working as intended but that they are yielding the best possible improvements in efficiency and output for your product.
Build competitive advantage
Safe deployments, improved metrics, and optimized AI are valuable individually. Together, they compound. Organizations that run experiments continuously across infrastructure, product, and AI build a learning velocity that is hard for their competitors to match.
Researchers from Microsoft, Booking.com, and Outreach.io built a model to describe this called the A/B Testing Flywheel, showing how experimentation programs scale from “crawl” to “fly.” Each turn of the flywheel—investment in infrastructure, scaling the number of experiments run, demonstrating value, and securing more investment in experimentation—accelerates the next. Organizations need to evolve both technically and culturally in an iterative way, but once they do, they can learn faster, find key product and system improvements, and avoid incidents and counterproductive features.
It’s something of a maxim that the team that learns fastest wins. In her book Radical Focus, Christina Wodtke writes, “The most important advantage a company can have is its speed of learning. The rate of change in the market is only growing.” Harvard Business School professor Stefan Thomke echoes this in his book Experimentation Works: “Agility comes from responding more quickly than the rate of change in the competitive environment.”
It’s intuitive to think that experimentation only drives value when a test produces a clear result: when we find a successful improvement to our metrics or avoid shipping a change that would cause a degradation. But we still learn valuable information when a test doesn’t demonstrate any measurable change at all. We can use that information to better allocate our resources and focus our work on areas that are more likely to move the needle. Every experiment we run generates learnings and creates value that compounds and builds competitive advantage.
Start experimenting now
The term “A/B test” has narrowed how we think about experimentation. It sounds like something for marketing teams testing button colors. But the scientific method doesn’t care what team you’re on or what you’re changing. If you deploy code, you can canary it. If you serve AI, you can measure it. If you affect user behavior, you can run a controlled experiment.
The tooling that enables experimentation is ready for you to use in platforms like Datadog. Feature flags decouple deployment from release. Statistical methods separate signal from noise. Observability platforms surface the impact of changes in real time. What’s left is the decision to use them, and the discipline to trust data over intuition.
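As one concrete example of those statistical methods, a two-proportion z-test is a common way to check whether a difference in conversion rates between control and treatment is signal or noise; this sketch uses only the standard library:

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for a difference in conversion rates.

    Returns the z statistic and approximate p-value. A p-value below your
    significance threshold (commonly 0.05) means the observed difference
    is unlikely to be random noise.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # p-value from the normal CDF: 2 * (1 - Phi(|z|))
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, 100 conversions out of 1,000 sessions in control versus 150 out of 1,000 in treatment yields a p-value well below 0.05, so that lift is very unlikely to be noise.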
Start with one experiment. Pick a change you’re uncertain about, expose it to a fraction of your traffic, and measure what happens. You might be surprised at what you learn. Read our documentation to learn more about Feature Flags and Experiments with Datadog. If you’re new to Datadog, get started with a 14-day free trial.





