What are Canary Tests and How do They Work?

Explore canary testing to deploy updates safely and improve user experience. Learn how this method minimizes risks for seamless software releases.

Deploying new code to your application or rolling out brand-new features can feel daunting.

Fortunately, there’s a way to validate new features and changes before they are released to the wider public. It’s called canary testing — and it’s a process we follow for every single code deployment here at Datadog.

Here’s what we’ll cover in this primer:

  • What is canary testing?
  • How does canary testing compare to A/B testing?
  • How does canary testing compare to blue-green testing?
  • Advantages and disadvantages of canary testing
  • How canary testing works
  • How canary testing and feature flags work together

What is canary testing?

Canary testing is a method of slowly rolling out software changes to a small group of users in a live environment to minimize risk. It allows you to test new features or updates on a small scale without the danger of exposing all your users to potential issues.

To fully answer the question “What are canary tests?”, we’ll embark on a little history lesson.

It’s called “canary” testing because coal miners used to bring canaries (the birds) into mines. Canaries have a lower tolerance to toxic gases than humans, so they would show signs of distress and alert miners that gas levels inside the mine were becoming dangerous before the miners themselves noticed.

Just like coal miners would use canaries to detect dangerous gasses in mines, software teams use canary testing to detect issues before exposing all users to new changes.

What is a canary in software? What role do they play?

The “canaries” are typically a small group of users, though they can also be other units such as regions or data centers. Whichever unit you choose, a small segment of it becomes the “canary group” and receives the update first.

This canary group is closely monitored to ensure no issues arise before rolling out the change to the entire user base.
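In practice, canary membership is often computed deterministically, for example by hashing a stable identifier into a bucket so each user stays in the same group across sessions. Here’s a minimal Python sketch; the `is_in_canary` helper and the 5% threshold are illustrative, not any particular product’s API:

```python
import hashlib

def is_in_canary(user_id: str, rollout_percent: float) -> bool:
    """Deterministically map a user to a bucket in [0, 100).

    Hashing a stable identifier keeps each user in the same group
    across sessions, so the canary experience is consistent.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10000 / 100  # 0.00 .. 99.99
    return bucket < rollout_percent

# Example: check whether a user falls in a 5% canary group
print(is_in_canary("user-12345", 5.0))
```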

How does canary testing compare to A/B testing?

These two concepts are closely related. Canary testing and A/B testing both aim to de-risk deployments, refine user experiences, and measure the impact of changes on observability and business metrics using real user feedback, but they differ in execution and purpose.

You could think of A/B testing as a more advanced form of canary testing.

Here’s the main difference between the two:

  • Canary testing focuses solely on risk mitigation before wider release, ensuring that new updates don’t compromise system stability. A small, specific portion of the user base is exposed to these updates.
  • A/B testing can also provide risk mitigation, but it further seeks to validate improvements to metrics like engagement by comparing feature versions. A/B testing deliberately randomizes which users receive each version, enabling both risk mitigation and statistical inference (see the sketch below).

To put it simply: while canary testing introduces changes to a small audience for safety, A/B testing applies changes to intentionally sized, randomized user groups for statistical analysis, emphasizing optimization in addition to immediate risk management.
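To make the randomization difference concrete, here’s a hypothetical sketch of A/B-style assignment. The in-memory `_assignments` store and `assign_variant` helper are invented for illustration; a real system would persist assignments:

```python
import random

# Hypothetical in-memory store; a real system would persist assignments.
_assignments: dict[str, str] = {}

def assign_variant(user_id: str, variants=("control", "treatment")) -> str:
    """Randomly assign a user to a variant once, then reuse it.

    Random assignment is what makes the later statistical
    comparison between the two groups valid.
    """
    if user_id not in _assignments:
        _assignments[user_id] = random.choice(variants)
    return _assignments[user_id]

print(assign_variant("user-12345"))  # same answer on every call
```

Contrast this with the canary bucketing sketch above: a canary gates exposure behind a rollout percentage, while A/B assignment builds two comparable groups.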

How does canary testing compare to blue-green testing?

We know what canary testing is, but what about blue-green testing?

Blue-green testing involves keeping two separate production environments, blue and green. You release the new version to green, and — if it works — you direct all your traffic to the green environment.
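Conceptually, the cutover amounts to repointing traffic from one environment to the other while the old one stays warm for rollback. A toy Python sketch, with `Router` standing in for whatever load balancer or DNS layer actually directs traffic:

```python
class Router:
    """Toy stand-in for a load balancer routing to one environment."""

    def __init__(self):
        self.environments = {"blue": "v1.0 (current)", "green": "v1.1 (new)"}
        self.live = "blue"

    def switch_to(self, env: str) -> None:
        # All traffic moves at once; the other environment stays
        # running, so a rollback is just another switch.
        self.live = env

router = Router()
router.switch_to("green")  # cut over after green passes its checks
router.switch_to("blue")   # instant rollback if something breaks
```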

Both techniques aim to enable safe deployments: each introduces new features or updates in a controlled way, gathers feedback from real users, and preserves an easy path back to the previous version.

However, they differ mainly in:

  • Deployment strategy: Canary testing introduces the new version to a small user group first, expanding to all if successful. Blue-green testing switches all traffic from an old version (blue) to a new one (green) in a separate environment upon success.
  • Resource requirements: Canary testing is more resource-efficient, affecting a segment of users, whereas blue-green testing requires duplicating the entire production environment.
  • Risk management: Canary testing minimizes risk by initially impacting a small group, allowing quick rollback. Blue-green testing offers easy rollback to a stable environment, reducing downtime risk.

In other words, canary testing suits a wider array of use cases, especially smaller teams or projects with fewer resources, with a focus on performance validation. Blue-green testing is ideal for critical applications that require stable releases, where minimal service disruption is the priority.

What are the benefits of using canary tests?

So, what are canary tests actually good for?

The key benefits of using canary tests can be divided into three main pillars:

Quick feedback

  1. Rapid insight collection: Canary tests deliver immediate feedback on new features. This quick glimpse into user reactions and metric impacts allows you to take swift action.
  2. Agile response: If an issue does surface, it’s easy to halt the canary test, sparing the broader user base from potential frustrations. This feedback loop makes updating and improving new features much faster.

Reduced risk

  1. Controlled exposure: Launching updates to everyone at once can introduce significant risks, such as widespread user dissatisfaction. Canary testing confines these risks to a manageable group of users.
  2. Safe production testing: By limiting the initial release, canary testing provides a safety net. If the test reveals any problems, stopping it promptly prevents issues from spreading into other areas. You always want to preserve your system’s overall integrity.

Data-driven decisions

  1. Authentic user feedback: Canary tests are grounded in actual user data and behavior, offering insights into how new updates perform in terms of engagement, conversion, and error rates.
  2. Evidence-based launches: Analyzing data from canary tests helps you know whether a feature is ready for a wider release or needs further refinement. However, this is only a first step in moving decision-making from speculative to empirical. Truly data-driven decision making requires valid inference from a more complete A/B test. (A simple guardrail check of the kind used in that first step is sketched below.)
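As an illustration of that first step, you might compare a guardrail metric such as error rate between the canary group and the baseline before deciding to expand. A minimal Python sketch, with the 10% allowed-regression margin as an arbitrary example rather than a recommendation:

```python
def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   max_relative_increase: float = 0.10) -> bool:
    """Return True if the canary's error rate hasn't degraded
    more than the allowed margin versus the baseline."""
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests
    return canary_rate <= baseline_rate * (1 + max_relative_increase)

# e.g., 12 errors in 1,000 canary requests vs. 100 in 10,000 baseline
print(canary_healthy(12, 1000, 100, 10000))  # False: 1.2% exceeds 1.0% * 1.1
```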

What are the challenges of using canary tests?

Running canary tests can still bring some complications into the mix:

Navigating mobile apps’ single environment: It’s tricky to segment updates in mobile apps because each user’s device acts as a standalone environment, complicating targeted canary testing.

Solution: Use feature flags to enable or disable features for specific users remotely, overcoming the single-environment hurdle.
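The pattern here is that the mobile app ships with the feature dark and asks a remote flag service at runtime whether to show it. Here’s a hypothetical client-side sketch in Python; the `flags.example.com` endpoint and payload shape are invented for illustration:

```python
import json
import urllib.request

FLAG_SERVICE = "https://flags.example.com/evaluate"  # hypothetical endpoint

def feature_enabled(flag_key: str, user_id: str) -> bool:
    """Ask the remote flag service whether this user sees the feature.

    Failing closed (False) means a flag-service outage simply hides
    the new feature instead of breaking the app.
    """
    try:
        req = urllib.request.Request(
            FLAG_SERVICE,
            data=json.dumps({"flag": flag_key, "user": user_id}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=2) as resp:
            return json.load(resp).get("enabled", False)
    except OSError:
        return False  # fail closed

if feature_enabled("new-checkout", "user-12345"):
    pass  # render the new flow
else:
    pass  # fall back to the existing flow
```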

Juggling multiple features at once: When you’re rolling out lots of new features quickly, it can be tough to keep track and ensure each one works as expected without overwhelming your testing process.

Solution: Deploy a refined feature flag system to manage and monitor each feature separately, making it easier to handle rapid releases.

Dealing with multiple production machines: Coordinating canary tests across several machines can feel like a logistical nightmare, requiring tight synchronization to maintain consistency.

Solution: Embrace automation through infrastructure as code (IaC) and CI/CD tools to simplify deployment and keep all machines in lockstep.
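In spirit, that automation is a loop that promotes the new version batch by batch and verifies health before moving on. A simplified Python sketch, where `deploy_to` and `healthy` are placeholders for the hooks your IaC or CI/CD tooling would actually provide:

```python
import time

MACHINES = ["web-1", "web-2", "web-3", "web-4"]

def deploy_to(machine: str, version: str) -> None:
    print(f"deploying {version} to {machine}")  # placeholder for a real deploy

def healthy(machine: str) -> bool:
    return True  # placeholder for a real health check

def staged_rollout(version: str, batch_size: int = 1) -> bool:
    """Deploy to one batch at a time, halting on the first failure."""
    for i in range(0, len(MACHINES), batch_size):
        batch = MACHINES[i:i + batch_size]
        for machine in batch:
            deploy_to(machine, version)
        time.sleep(1)  # let metrics settle (much longer in practice)
        if not all(healthy(m) for m in batch):
            print("health check failed; halting rollout")
            return False
    return True

staged_rollout("v2.4.0")
```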

Providing a consistent experience for everyone: With so many different devices and operating systems out there, it’s a challenge to make sure everyone has a smooth experience during testing.

Solution: Use tools that allow you to test across a broad range of devices and OSes, ensuring that every user gets the same quality experience.

Getting the most out of your test data: Canary testing throws a lot of data at you, and it can be tricky to sift through it all to find the insights you need.

Solution: Integrate your tests with a powerful analytics platform that can help you sort, analyze, and act on your data quickly and accurately.

Step-by-step process of how canary tests are done

Once you’ve identified what you want to test, you’re ready to run a canary test. Here’s a five-step process to get started:

  1. Select your canary group: Choose a small percentage (ideally 1-5%) of your user base who’ll receive the new feature first. These users will serve as the early indicators of how the new update performs in a live environment.
  2. Gradually roll out the features: Increase the percentage of users exposed to the new features over time by adjusting your feature flags. Aim to increase exposure by about 5% every few days, and carefully monitor the impact at each stage.
  3. Never stop monitoring performance and user feedback: Keep a close watch on key metrics such as revenue, signups, and page views. Also, pay attention to customer feedback, including comments and complaints, to spot negative impacts early on.
  4. Respond quickly to any issues: Should any problems arise, immediately adjust the feature flag to roll back the feature for the canary group. This swift action minimizes impact and protects the broader user base while you address the issue.
  5. Decide whether to expand or roll back: If the canary test goes well, gradually roll out the new features to your entire user base. If not, roll back the feature for all canary users and plan a re-test once you’ve fixed the underlying problem. (A sketch of the control loop behind steps 2 through 5 follows this list.)
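Tying those steps together, the rollout logic is essentially a ramp-and-check loop. Here’s a minimal Python sketch, where `set_rollout_percent` and `metrics_look_good` are hypothetical stand-ins for your feature-flag service and your monitoring, respectively:

```python
import time

def set_rollout_percent(flag: str, percent: int) -> None:
    print(f"{flag}: now at {percent}% of users")  # hypothetical flag API

def metrics_look_good(flag: str) -> bool:
    return True  # hypothetical check against your monitoring

def run_canary(flag: str, start: int = 5, step: int = 5) -> bool:
    """Ramp from `start`% to 100% in `step`% increments,
    rolling back to 0% the moment metrics regress."""
    percent = start
    while percent <= 100:
        set_rollout_percent(flag, percent)
        time.sleep(1)  # stand-in for 'every few days' of observation
        if not metrics_look_good(flag):
            set_rollout_percent(flag, 0)  # immediate rollback (step 4)
            return False
        percent += step
    return True

run_canary("new-dashboard")
```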

Canary testing with feature flags

In order to successfully complete the steps above, you’ll need to use feature flags.

Feature flags are the infrastructure that allows you to toggle features on and off for specific users. For instance, you can enable a new feature for just 1% of your users to start, monitor how it performs, and then gradually roll it out to more people.

You can see how this goes hand in hand with canary testing. It minimizes risk because, if something goes wrong, you only impact a small group.

Using Datadog Feature Flags makes canary testing a breeze. Our feature flagging tools allow you to allocate traffic and test across environments in a matter of seconds. Plus, you get granular control over who sees which features.

Whether you’re using simple on/off flags, gradual rollouts, or full-blown enterprise-wide experiments, Datadog Feature Flags is the out-of-the-box solution for adding feature flags to your canary testing workflow quickly and with little to no manual intervention.
