
Anthony Rindone

Paige Andrews

Ryan Lucht
As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure. Even if your application wasn’t hosted on AWS, integral parts of your environment—like your feature flagging service—might have been affected. Losing those services can be just as debilitating as if your application itself went down.
In this post, we’ll look at how Datadog Feature Flags continued to function during this outage thanks to an architecture designed for resilience. By distributing configuration data globally via edge content delivery networks (CDNs) and evaluating feature flags locally, Datadog’s system maintained performance and consistency even as parts of the broader cloud ecosystem experienced instability. Localized evaluation and distributed availability also ensure that Datadog Feature Flags never becomes a single point of failure for customers, even in the unlikely event of a Datadog or CDN outage.
An architecture built for resilience
The reliability of Datadog Feature Flags hinges on a simple principle: evaluate flags locally using a cached configuration object distributed globally. This architecture provides two key layers of protection during incidents like the recent AWS outage:
- Independence for evaluation: Since the flag evaluation itself is local, it has no dependency on cloud providers like AWS, Google Cloud, or Azure, or on any other regional service. As long as your application is running, the flags can be evaluated.
- Independence for distribution: The flag configuration files are served by Fastly, a global CDN provider, which is resilient to specific cloud provider outages. Applications initializing during an outage can still fetch their configurations, and applications that are already running simply use their cached versions.
Instead of calling a server for every feature flag decision, the Datadog flagging SDK operates in two main steps:
- Initialization: When your application starts, the SDK makes a single request to fetch a JSON configuration object that contains either all precomputed assignments or all the rules for your feature flags. This object is not hosted on a traditional application server but is distributed across a global edge network (Fastly).
- Local evaluation: Once this configuration is fetched, it’s cached by the SDK. From that point on, every time your code checks a feature flag, the decision happens instantly and locally within your own application. No network call is required to determine if a feature should be on or off for a given user.
These design choices mean that even if our primary servers were to go down, the CDN’s cached configurations would remain available, insulating your flag evaluations from the failure and ensuring that your application remains operational.
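The same idea applies inside a running application: if a refresh fails, the last good configuration stays in use. The Python sketch below illustrates that behavior under stated assumptions. `CachedConfigStore`, its `fetch` callable, and the flat `flags` mapping are hypothetical names for illustration, not part of the Datadog SDK.

```python
import json
from typing import Any, Callable, Optional


class CachedConfigStore:
    """Keeps the last successfully fetched configuration in memory."""

    def __init__(self, fetch: Callable[[], str]):
        # `fetch` stands in for whatever retrieves the JSON from the CDN.
        self._fetch = fetch
        self._config: Optional[dict] = None

    def refresh(self) -> bool:
        """Try to fetch a new configuration; keep the old one on failure."""
        try:
            # The assignment only happens if both fetch and parse succeed.
            self._config = json.loads(self._fetch())
            return True
        except (OSError, ValueError):
            # Network errors (OSError) or malformed JSON (ValueError):
            # the cached configuration is left untouched.
            return False

    def get_flag(self, key: str, default: Any) -> Any:
        # With no configuration at all, fall back to the caller's default.
        if self._config is None:
            return default
        return self._config.get("flags", {}).get(key, default)


# Simulated upstream: one good response, then an outage.
_responses = ['{"flags": {"dark-mode": true}}']


def flaky_fetch() -> str:
    if not _responses:
        raise ConnectionError("simulated outage")
    return _responses.pop()


store = CachedConfigStore(flaky_fetch)
store.refresh()                             # succeeds and caches the config
store.refresh()                             # fails; cache is preserved
print(store.get_flag("dark-mode", False))   # still served from the cache
```

The design choice worth noting is that a failed refresh is treated as a non-event: the store reports the failure to the caller but never clears what it already holds, so flag checks keep returning the last known values.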
Deploy safely even during cloud provider outages
Feature flagging is a vital component of modern applications. Datadog Feature Flags is built for resilience: it combines a distributed architecture, local evaluation, and fallback rules so that you can launch and deploy safely, without incidents, even if external systems falter.
Datadog Feature Flags is now in preview. See our documentation for more information and to request access. Also, see how you can use the free Updog.ai service to detect service outages early. For example, Updog.ai detected the Amazon DynamoDB degradation 32 minutes before AWS updated its own official status page. This same technology powers Datadog’s in-app External Provider Status feature, helping engineering teams respond faster and with more context. If you’re not a customer, get started now with a 14-day free trial.




