No service is foolproof. Even the most reliable ones, like Amazon Web Services, can experience outages. In these cases, any websites and applications that use AWS can be severely affected and even become inaccessible. Maybe you have been affected, too.
Even though you are not responsible for AWS outages, your company may lose revenue, and your users will probably blame you. So you want to be immediately alerted when AWS is down in order to make sure you limit the impact as much as possible. That’s why today we are releasing AWS Outage Alerts on Datadog. Thanks to this new feature, your team will be notified right away whenever there is a change in AWS status.
Datadog constantly checks the AWS Service Health Dashboard. This status page is updated by AWS and shows whether each service is operating normally. So, whenever one of them is having an outage or any problem, Datadog knows immediately.
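To make the idea of "a change in AWS status" concrete, here is a minimal sketch of how two consecutive readings of a status page could be compared. It is an illustration only, not Datadog's implementation: the snapshot shape, the status strings, and the `status_changes` function are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot shape: (service, region) -> status string.
Snapshot = Dict[Tuple[str, str], str]


def status_changes(previous: Snapshot, current: Snapshot) -> List[Tuple[str, str, str, str]]:
    """Return (service, region, old_status, new_status) for each entry that changed."""
    changes = []
    for key in sorted(current):
        new_status = current[key]
        # Entries missing from the previous snapshot are treated as "unknown".
        old_status = previous.get(key, "unknown")
        if old_status != new_status:
            changes.append((key[0], key[1], old_status, new_status))
    return changes


previous = {("ec2", "us-east-1"): "ok", ("s3", "us-east-1"): "ok"}
current = {("ec2", "us-east-1"): "critical", ("s3", "us-east-1"): "ok"}
print(status_changes(previous, current))
# [('ec2', 'us-east-1', 'ok', 'critical')]
```

Each tuple returned is one status change that could trigger or resolve an alert.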
Finally, select Amazon Web Services, and you will be able to set up the conditions of the alert in the Integration Status page:
AWS outage alerts are full-featured Datadog alerts, so you can:
- Choose a scope so you can trigger different alerts depending on the AWS services and the availability zones impacted by the outage
- Set the alert conditions you want (we recommend you trigger and resolve the alert after one check reports an AWS status change)
- Customize the alert message that will be sent to your teams so you can specify what’s happening and suggest what can be done to limit the damage during an AWS outage
- Select who should be notified (specific people, only engineers on call, etc.) and via which communication channels (PagerDuty, email, Slack, HipChat…)
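If you prefer to keep alert definitions in code, the options above map roughly onto a monitor definition. The sketch below only builds the definition as a Python dictionary and sends nothing. The `aws.status` check name, the `service` and `region` tag keys, the query syntax, and the notification handles are assumptions to verify against your own Datadog account and the Monitors API documentation.

```python
import json
from typing import Dict, List


def build_outage_monitor(service: str, region: str, notify: List[str]) -> Dict:
    """Build a hypothetical service-check monitor definition for one service/region scope."""
    # Scope: which AWS service and region this alert covers (tag keys are assumed).
    scope = f'"service:{service}","region:{region}"'
    # Conditions: trigger and resolve after a single check reports a status change.
    query = f'"aws.status".over({scope}).last(2).count_by_status()'
    # Message: what is happening, what to do, and who to notify via @-handles.
    message = (
        f"AWS reports a status change for {service} in {region}. "
        "Check the AWS Service Health Dashboard and limit the impact where possible. "
        + " ".join(notify)
    )
    return {
        "name": f"AWS outage: {service} in {region}",
        "type": "service check",
        "query": query,
        "message": message,
        "options": {"thresholds": {"critical": 1, "ok": 1}},
    }


# Hypothetical handles; replace them with the channels configured in your account.
monitor = build_outage_monitor("ec2", "us-east-1", ["@pagerduty", "@slack-ops"])
print(json.dumps(monitor, indent=2))
```

Building one definition per scope keeps separate alerts for the services and regions that matter most to you.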
Datadog’s out-of-the-box integration dashboards already give you deep insight into your AWS infrastructure. Now you can also see AWS status changes and outages at a glance by adding a Check Status widget.
You can then select which service's status you want to monitor and which regions matter to you:
A Single Check monitors only one service/region combination, while a Cluster of Checks lets you monitor a service across all of its regions. For example, if one out of three regions is down for the selected service, you will see a red 1 and a green 2.
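The two numbers in a Cluster of Checks can be thought of as a simple tally of per-region statuses. The sketch below illustrates that tally. The region names, the status strings, and the `summarize_cluster` helper are hypothetical, not part of the widget itself.

```python
from typing import Dict, Tuple


def summarize_cluster(region_status: Dict[str, str]) -> Tuple[int, int]:
    """Return (regions_down, regions_ok) for one service across its regions."""
    # Any status other than "ok" counts as down in this sketch.
    down = sum(1 for status in region_status.values() if status != "ok")
    ok = len(region_status) - down
    return down, ok


statuses = {"us-east-1": "critical", "us-west-2": "ok", "eu-west-1": "ok"}
print(summarize_cluster(statuses))
# (1, 2)
```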
You may be able to limit the impact on your applications while waiting for AWS to resolve the outage. Depending on the services you use, their configuration, and the volume of traffic you are receiving, you may be able to keep your applications up and responsive.
For example, your DynamoDB tables replicate data across multiple availability zones so that they remain accessible if the service goes down in a specific AZ. Your load balancers should be able to distribute incoming requests to viable availability zones thanks to the cross-zone load balancing feature. If DynamoDB went out in one AZ, you might want to consider adding more capacity in the remaining viable AZs in order to avoid overloading your remaining instances and having their requests throttled.
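As a rough back-of-the-envelope aid, the sketch below estimates how many instances to add in each surviving AZ to restore pre-outage capacity. It assumes instances were spread evenly across AZs and that load is shared evenly between them. The function name and parameters are hypothetical, and real capacity planning should rely on your own metrics.

```python
import math


def extra_instances_per_az(total_azs: int, instances_per_az: int, failed_azs: int = 1) -> int:
    """Instances to add in each surviving AZ to replace the capacity lost to failed AZs."""
    remaining_azs = total_azs - failed_azs
    if remaining_azs <= 0:
        raise ValueError("no availability zones left to absorb the load")
    lost_instances = instances_per_az * failed_azs
    # Round up so total capacity is never below the pre-outage level.
    return math.ceil(lost_instances / remaining_azs)


# Three AZs with four instances each; one AZ becomes unavailable.
print(extra_instances_per_az(total_azs=3, instances_per_az=4))
# 2
```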