Best practices for monitoring dark launches

Author: Paul Gottschling

Published: May 12, 2021

What is a dark launch?

A dark launch is a deployment strategy for testing new versions of a service in production. When running a dark launch, you deploy a new version of a service and route a copy of production traffic to it without returning responses to users. This lets you see how a new version of a service handles production load, watch for errors, and compare performance between the old and the new versions—without affecting users. Once you determine that a dark launch is successful, you can roll it out to users with, for example, a canary release. Since changes become less risky, organizations can run more experiments and introduce new features more quickly.

When running a dark launch, monitoring is essential, above all because you need to determine whether the new version of a service is ready for real use. You will also want to ensure that your infrastructure can handle the dark launch without any impact on your users. In this post, we will explore best practices for monitoring dark launches, including how to:

  • Test your dark launch
  • Prevent infrastructure issues

We’ll illustrate this post with a hypothetical deployment of a SaaS application that enables users to book meeting rooms in the city of their choice. Our api-service receives API traffic from users and forwards it to upstream services for processing. One of these services, officebooking, processes POST requests to the /api/v1/booking endpoint. While the first version of officebooking registers meeting attendees in sequence, we have rolled out a change that improves performance by registering attendees in parallel. We deploy the new version of officebooking as a dark launch, using Traefik (a reverse proxy) to mirror requests between the released version of officebooking and the new version.
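
To make the mirroring step concrete, here is a minimal sketch (not our exact configuration) that renders a Traefik v2 dynamic configuration in which the released version of officebooking answers every request while the dark launch receives a copy of each one; the service names, ports, and output path are placeholders:

    # render_traefik_mirroring.py -- writes a Traefik v2 dynamic configuration
    # (file provider) that routes all production traffic to the released
    # officebooking service and mirrors a copy of every request to the dark
    # launch. Hostnames, ports, and the output path are placeholders.
    import yaml  # pip install pyyaml

    dynamic_config = {
        "http": {
            "services": {
                # Point the HTTP router for /api/v1/booking at this service.
                "officebooking-mirrored": {
                    "mirroring": {
                        "service": "officebooking-v1",  # only v1's responses reach users
                        "mirrors": [
                            # Send a copy of 100% of requests to the dark launch.
                            {"name": "officebooking-v2", "percent": 100},
                        ],
                    }
                },
                "officebooking-v1": {
                    "loadBalancer": {"servers": [{"url": "http://officebooking-v1:8080/"}]}
                },
                "officebooking-v2": {
                    "loadBalancer": {"servers": [{"url": "http://officebooking-v2:8080/"}]}
                },
            }
        }
    }

    with open("traefik-dynamic.yml", "w") as f:
        yaml.safe_dump(dynamic_config, f, sort_keys=False)

Traefik discards the mirrored responses, so users only ever receive responses from the released version.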

Test your dark launch

Your monitoring setup needs to verify the success or failure of your dark launch by tracking the data that best summarizes your service’s health and performance, while catching any issues before they reach users. In this section, we’ll explain how to:

  • Follow your SLIs
  • Spot unexpected responses

An overview of the officebooking service and dark launch within Datadog.

Follow your SLIs

You will need to establish explicit criteria to determine if your dark launch is successful. If you haven’t already done so, we recommend setting service level objectives (SLOs) for any service that you run, and identifying service level indicators (SLIs) to track your SLOs. This way, when you deploy a dark launch, you can use these SLIs to track performance. If the SLIs you set for your service begin to show unacceptable values during your dark launch, you’ll have a good idea of what changes you should prioritize.

All of your monitoring data should be tagged by version so you can easily visualize the health and performance of your dark launch alongside that of your latest release.

We design our officebooking SLOs around the SLIs that most immediately affect users:

  • the uptime of the service
  • the percentage of requests to /api/v1/booking that result in internal server errors (HTTP 500)
  • the p95 response latency, which should stay below 500 ms (a stricter latency target than the one for the released version of the service)
Datadog's SLO Summary Widgets for the service we are testing with a dark launch.

Next, you should set automated alerts on your service’s SLIs, grouped by the version tag you set. That way, if the dark launch performs below expectations, your team can get notified automatically. For example, one alert for our officebooking service tracks the number of HTTP 500 errors for both the released (version:v1) service and the dark launch (version:v2), and notifies us if there are more than 100 such errors in a five-minute period.
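
As a sketch of what creating such an alert programmatically might look like, the snippet below uses Datadog’s Python client to define a metric monitor on 500-error counts, grouped by version; the metric name, threshold, and notification handle are assumptions you would adapt to your own instrumentation:

    # create_error_monitor.py -- a hedged sketch using the "datadog" Python
    # client. It creates a monitor that notifies us when either version of
    # officebooking returns more than 100 HTTP 500s in a five-minute period.
    # The metric name and the Slack handle are assumptions for illustration.
    from datadog import initialize, api

    initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

    api.Monitor.create(
        type="metric alert",
        name="officebooking: HTTP 500s by version",
        query=(
            "sum(last_5m):sum:trace.http.request.errors"
            "{service:officebooking} by {version}.as_count() > 100"
        ),
        message=(
            "{{version.name}} of officebooking returned more than 100 HTTP 500s "
            "in the last five minutes. @slack-officebooking-oncall"
        ),
        tags=["service:officebooking"],
    )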

You’ll want to specify in advance how long you should let your dark launch run without triggering an alert before you consider it successful. Google recommends running the dark launch for at least one day to ensure that it can handle realistic levels of load variance.

An alert set on 500 errors from the officebooking service, including the dark launch.

While alerting on SLIs is critical to spotting issues and ensuring your dark launch “works,” you’ll also want to set up a dashboard that can help you diagnose any issues that may arise (or use a monitoring service that can produce such a dashboard automatically). The dashboard should visualize SLIs alongside other key metrics to provide context for troubleshooting. You can use the version tag in these graphs to compare performance between pre- and post-release deployments more easily. When running the dark launch of our officebooking service, we can follow the service’s error rate alongside other performance metrics, broken down by version.

The Deployment Tracking view in Datadog shows key application metrics for the officebooking service, including the dark launch.

Spot unexpected responses

As you track your dark launch’s SLIs, you’ll also want to watch for unexpected responses to client requests. An application can appear successful in terms of SLIs—e.g., it could be available and error-free—but still run into bugs while processing data. Tracking your dark launch’s responses will allow you to identify and fix unexpected behavior before you risk a degraded user experience.

You should track response bodies from your dark launch by:

  • Analyzing your logs
  • Running automated tests

Analyze your logs

To spot unexpected behavior from your dark launch, you should log data from your service’s responses. Your service may be generating too many payloads for you to examine all of them, but you will still need to surface as much data as you can from individual responses to investigate bugs.

To see both high-level trends and low-level context, you should design your service to emit structured log messages (e.g., in JSON) that include key information from your response payloads. It’s important that you tag your logs by version so you can compare responses from your dark launch with responses from the released version of your service. To do so, you can use the structured logging library for your programming language (e.g., Java, Go, C#, and Python)—or use a monitoring agent to tag logs with the version of your service. (Since your dark launch may be logging a high volume of responses while handling production load, you should have a policy in place for managing these logs.)

For example, the officebooking service validates payloads before writing to the persistence layer, and emits a log after validation that includes the data it will write. This way, we can track the service’s validation and storage behavior during the dark launch.
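
As an illustration, here is a minimal sketch of the kind of structured, version-tagged log the officebooking service might emit after validation; the field names, payload shape, and the DD_VERSION environment variable are assumptions for this example:

    # booking_logging.py -- emits a structured (JSON) log line after a booking
    # payload passes validation, tagged with the service version so responses
    # from the dark launch can be compared with those from the released
    # version. Field names and the payload shape are illustrative.
    import json
    import logging
    import os

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger = logging.getLogger("officebooking")

    # e.g., "v1" for the released service, "v2" for the dark launch
    SERVICE_VERSION = os.environ.get("DD_VERSION", "v2")

    def log_validated_booking(booking: dict) -> None:
        logger.info(json.dumps({
            "service": "officebooking",
            "version": SERVICE_VERSION,
            "event": "booking.validated",
            "room_id": booking.get("room_id"),
            "city": booking.get("city"),
            "attendee_count": len(booking.get("attendees", [])),
        }))

    # The data we are about to write to the persistence layer
    log_validated_booking({
        "room_id": "nyc-5th-floor-a",
        "city": "New York",
        "attendees": ["ana@example.com", "li@example.com"],
    })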

Once you’ve set up logging, a monitoring platform can help you aggregate and graph your logs by the values of your log fields, so you can determine whether your service is handling user requests as expected. For example, the officebooking service registers meeting attendees by writing to a persistence layer. Because we have set up our service to log these writes, we can use a monitoring platform to graph the number of attendees that would be registered by service version. This way, we can tell if our dark launch would register attendees as expected.

The Log Analytics view for the released service and dark launch.

Run automated tests

While analyzing logs from your dark launch gives you an overall sense of how it responds to requests, you’ll also want to verify that your application returns the expected HTTP responses to typical requests. You should use automated tests to check whether your dark launch returns the expected results for a predefined set of user interactions, and label these tests by version so you can isolate failures to either the dark launch or the released version of your service. You can then use these tests, along with your SLI-based alerting and dashboards, to evaluate whether your dark launch is ready for real users.

If your application has a browser-based GUI, your CI/CD environment should run end-to-end tests using a browser automation tool like Puppeteer or Selenium WebDriver—or you can use a centralized platform that executes browser tests for you. And if your dark launch implements an API, you’ll want to ensure that your new code hasn’t violated your service’s API definition. You should set up tests within your CI/CD environment to query your service and evaluate the response.

To make sure our officebooking dark launch meets the expectations we set for our API definition, we set up synthetic tests that send valid POST requests to the /api/v1/booking endpoint (as shown below) as well as invalid ones, and ensure that the service returns the expected responses. To set up something similar, you will need to make sure that requests from your synthetic test runner will reach your dark launch—if they hit your proxy instead, you will receive responses from the released version of the service.

An API test for the officebooking dark launch.
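
A hand-rolled equivalent of these checks might look like the following pytest sketch, which sends one valid and one invalid POST directly to the dark launch and asserts on the responses; the base URL, payload shape, and expected status codes are assumptions about our hypothetical API:

    # test_booking_api.py -- a minimal pytest sketch of the synthetic checks
    # described above. Point OFFICEBOOKING_URL directly at the dark launch;
    # requests sent to the mirroring proxy would be answered by the released
    # version instead.
    import os
    import requests

    BASE_URL = os.environ.get("OFFICEBOOKING_URL", "http://officebooking-v2:8080")

    def test_valid_booking_is_accepted():
        resp = requests.post(f"{BASE_URL}/api/v1/booking", json={
            "room_id": "nyc-5th-floor-a",
            "city": "New York",
            "attendees": ["ana@example.com", "li@example.com"],
        }, timeout=5)
        assert resp.status_code == 201  # booking created
        assert resp.json().get("room_id") == "nyc-5th-floor-a"

    def test_invalid_booking_is_rejected():
        # Missing required fields: we expect a validation error, not a 500.
        resp = requests.post(f"{BASE_URL}/api/v1/booking", json={
            "room_id": "nyc-5th-floor-a",
        }, timeout=5)
        assert resp.status_code == 400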

Prevent infrastructure issues

Dark launches require deploying a whole new set of compute resources for your service, along with any dependencies needed to route traffic and interact with persistent data. If you’re multiplying your mirrored traffic to test load—as Google recommends—you’ll likely need to scale up your production infrastructure for the dark launch. You should monitor your infrastructure to ensure that your dark launch doesn’t degrade your production environment—and unintentionally affect users. To do so, you should:

  • Ensure that your infrastructure has ample capacity to handle the dark launch
  • Set up monitoring to surface unintended interactions between your dark launch and other parts of your infrastructure

Ensure proper capacity

To check whether your infrastructure has the proper capacity to handle your dark launch, you should use dashboards and automated alerts to monitor key resource metrics. You should determine what levels of resource utilization on your service’s compute infrastructure correspond with warning signs, such as OOM errors, full storage devices, or failed health checks. Make sure you have created automated alerts on these thresholds, tagged by version. You should also have dashboards set up for any metrics you use for automated alerts, and include links to relevant dashboards in your alert notification messages, so you can get context into any alerts you receive.

Aside from the infrastructure running your dark launch, you’ll also need to monitor the infrastructure you use to manage the deployment. If you’re using a feature flag store (e.g., Consul or etcd) or reverse proxy to manage the dark launch, make sure it has adequate CPU and memory to handle the volume of requests you expect.

We monitor the infrastructure we use to run our officebooking service with a dashboard that tracks key resource metrics, and use the version tag to distinguish between the released and dark launch versions of the service. The dashboard tracks resource utilization across the containers running service instances as well as those that run our proxy server (Traefik).

A dashboard for metrics from the officebooking dark launch infrastructure.

Spot unintended service interactions

If your dark launch needs to interact with a persistence layer (e.g., a cache or a database), you have two options for protecting your user data: either configure your dark launch to have read-only access to your persistence layer or deploy a separate instance of the persistence layer for the sole use of your dark launch. Either way, you should monitor these interactions to make sure your dark launch isn’t modifying customer data.
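
For the first option, here is a minimal sketch, assuming a Postgres persistence layer and the psycopg2 driver, of forcing the dark launch’s database sessions into read-only mode; the connection details are placeholders:

    # readonly_db.py -- the dark launch opens its Postgres sessions in
    # read-only mode, so an accidental INSERT or UPDATE raises an error
    # instead of modifying production data. Connection details are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="postgres-main",
        dbname="officebooking",
        user="officebooking_dark",
        password="<password>",
    )
    conn.set_session(readonly=True)  # writes now fail with a read-only transaction error

    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM bookings;")
        print(cur.fetchone())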

First, you should look out for unexpected interactions between your dark launch and your production persistence layer. Your dark launch should not be connecting to the production database if it is designed to use a copy, for example, nor should your production database be committing write transactions for your dark launch.

The easiest way to identify unexpected interactions with your persistence layer is to instrument your application code for tracing, then use a visualization and analytics platform like Zipkin to display a map of requests between services. You can also use a monitoring platform to collect network flow logs and build a graph of relationships between services, helping you discover unintended traffic between services you have not yet instrumented for tracing.
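
As one way to do this, the sketch below wraps a persistence-layer call in a manually created span, here using Datadog’s ddtrace library (a Zipkin-compatible tracer would follow the same pattern); the span, service, and resource names are illustrative:

    # traced_db_write.py -- manual trace instrumentation around a database
    # write, using Datadog's ddtrace library. A service map built from these
    # spans shows exactly which version of officebooking talks to which copy
    # of Postgres. Span, service, and resource names are illustrative.
    from ddtrace import tracer

    def register_attendees(attendees):
        with tracer.trace(
            "postgres.query",
            service="postgres-canary",        # the dark launch's copy of the database
            resource="INSERT INTO attendees",
        ) as span:
            span.set_tag("attendee.count", len(attendees))
            # ... execute the INSERT against the dark launch's database here ...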

In our dark launch, for example, the Network Map below helps us confirm that our proxy is routing traffic between the released version and the dark launch, and that each version uses its own copy of the database (postgres-main and postgres-canary).

The Network Map showing a dark launch.

Second, you should monitor work metrics for your dark launch’s persistence layer, such as the number of commits per second, and correlate them with resource metrics such as CPU, disk, and RSS memory utilization. You should tag these metrics to see whether any unusually heavy reads or writes come from your released service or from the dark launch.

The tags you use for these work metrics will depend on how you have deployed your dark launch, as well as the infrastructure you use for your persistence layer. If the persistence layer is shared between the released and dark launch versions of your service, your database metrics may not be able to distinguish between connections from your dark launch and connections from your released version. In that case, you can correlate your database metrics with the aggregate log data we mentioned earlier—grouped by version—to understand where demand on your database tends to originate.
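
If you can add instrumentation to the service itself, one way to get that per-version attribution is to emit your own version-tagged work metrics alongside the database’s built-in ones, as in this sketch using DogStatsD; the metric names are hypothetical:

    # db_work_metrics.py -- a sketch of custom, version-tagged work metrics
    # emitted from the service via DogStatsD, so heavy reads or writes can be
    # attributed to the released service or the dark launch. Metric names are
    # hypothetical.
    import os

    from datadog import statsd

    SERVICE_VERSION = os.environ.get("DD_VERSION", "v2")
    TAGS = [f"version:{SERVICE_VERSION}", "service:officebooking"]

    def record_booking_write(attendee_count: int) -> None:
        statsd.increment("officebooking.db.writes", tags=TAGS)
        statsd.histogram("officebooking.db.attendees_per_write", attendee_count, tags=TAGS)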

If you have deployed a copy of your persistence layer for your dark launch, these interactions are much easier to understand: tag each copy of the database with enough information to identify the deployment it belongs to. In our case, we are graphing resource metrics for our Postgres hosts alongside work metrics. Since we have deployed a copy of the database for the dark launch, we have grouped our graphs using the service tag: postgres-main and postgres-canary.

A dashboard for the persistence layer of the officebooking dark launch infrastructure.

Get comprehensive visibility into your dark launches

In this post, we’ve covered how to monitor dark launches to confirm that new versions of a service work as expected before you release them to users.

Datadog provides a unified visualization, alerting, and analysis platform for monitoring all data from your dark launches. Getting complete visibility across pre- and post-release versions of a service is easy with Unified Service Tagging and Deployment Tracking, and you can use Datadog’s Synthetic API tests and browser tests—along with SLO management—to get detailed insight into whether your dark launches are functioning as expected.

While dark launches can involve multiple services and teams, Datadog helps reduce the complexity. Built-in collaboration tools, such as Notebooks and Incident Management, make it easier to reach out to teams that own your service’s dependencies. What’s more, Datadog integrates with services that help manage your dark launch, such as Consul and LaunchDarkly for feature flags, as well as traffic routing technologies like HAProxy and Istio—plus all major CI/CD technologies.

If you’re not yet a Datadog user, you can get started monitoring your dark launch with a free trial.