AWS outage? Datadog alerts you
No service is foolproof. Even the most reliable ones, like Amazon Web Services, can experience outages. You might have heard about the DynamoDB service disruption that happened last month. Many websites and applications such as Netflix, Reddit, Medium, Pocket, Buffer, and Product Hunt were affected and became inaccessible. Maybe you were affected, too.
Even though you are not responsible for AWS outages, your company may lose revenue, and your users will probably blame you. So you want to be alerted immediately when AWS is down so that you can limit the impact as much as possible. That’s why today we are releasing AWS Outage Alerts on Datadog. Thanks to this new feature, your team will be notified right away whenever any AWS service is having an issue.
Datadog constantly checks the AWS Service Health Dashboard. This status page is updated by AWS and shows whether each service is operating normally. So, whenever one of them is having an outage or any problem, Datadog knows immediately.
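To make the idea concrete, here is a minimal sketch of how a status feed could be turned into a check result. It is not Datadog's implementation. The feed layout, the sample titles, the newest-first ordering, and the `latest_status` helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed for one service/region pair.
# The real status feed layout and wording may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example service status</title>
    <item>
      <title>Service is operating normally: [RESOLVED] Increased error rates</title>
    </item>
    <item>
      <title>Service disruption: Increased error rates</title>
    </item>
  </channel>
</rss>"""


def latest_status(feed_xml: str) -> str:
    """Return "ok" or "critical" based on the most recent feed item.

    Assumes items are listed newest first, and that an empty feed
    means no known issue.
    """
    root = ET.fromstring(feed_xml)
    items = root.findall("./channel/item")
    if not items:
        return "ok"
    title = (items[0].findtext("title") or "").lower()
    # Assumption: a resolved or healthy post mentions "operating normally".
    if "operating normally" in title:
        return "ok"
    return "critical"


print(latest_status(SAMPLE_FEED))
```

Running such a function on a schedule, once per service and region, would produce the stream of status changes that an alert can react to.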
Set it up in 1 minute
Finally, select Amazon Web Services, and you will be able to set up the conditions of the alert in the Integration Status page:
All the power of Datadog alerts
AWS outage alerts are full-featured Datadog alerts, so you can:
- Choose a scope so you can trigger different alerts depending on the AWS services and the availability zones impacted by the outage
- Set the alert conditions you want (we recommend you trigger and resolve the alert after one check reports a status change)
- Customize the alert message that will be sent to your teams so you can specify what’s happening and suggest what can be done to limit the damage during an AWS outage
- Select who should be notified (specific people, only engineers on call, etc.) and via which communication channels (PagerDuty, email, Slack, HipChat…)
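If you prefer to manage alerts as code, the options above could be captured in a monitor definition along these lines. This is a sketch under assumptions: the check name `aws.status`, the `service` and `region` tag names, the query string shape, and the notification handles are illustrative and should be verified against the current Datadog API reference before use. The function only builds the definition. It does not call any API.

```python
from typing import Sequence


def build_outage_monitor(service: str, region: str, notify: Sequence[str]) -> dict:
    """Build a hypothetical service check monitor definition as a plain dict."""
    # Scope: one alert per service/region combination (tag names are assumptions).
    scope = f'"service:{service}","region:{region}"'
    handles = " ".join(notify)
    return {
        "type": "service check",
        # Assumed query shape for a service check monitor.
        "query": f'"aws.status".over({scope}).last(2).count_by_status()',
        "name": f"AWS outage: {service} in {region}",
        # Custom message: say what is happening and what to do about it.
        "message": (
            f"{service} is reporting a problem in {region}. "
            f"Check the AWS status page and follow your outage runbook. {handles}"
        ),
        # Trigger and resolve after a single check reports a status change.
        "options": {"thresholds": {"critical": 1, "ok": 1}},
    }


monitor = build_outage_monitor("dynamodb", "us-east-1", ["@pagerduty", "@slack-ops"])
print(monitor["query"])
```

Keeping the definition in a function makes it easy to generate one alert per service and region you depend on, each with its own message and recipients.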
Spot issues directly from your dashboards
You can also see AWS outages at a glance by adding a Check Status widget to your screenboards.
You can then select which service's status you want to monitor and which regions matter to you:
A Single Check will monitor only one service/region combination, while a Cluster of Checks allows you to monitor any service globally. For example, if 1 out of 3 regions is down for the selected service, you will see a red 1 and a green 2.
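The counting behind a cluster view can be sketched in a few lines. The region names, the status strings, and the `summarize_cluster` helper are hypothetical, and this sketch assumes that any status other than "ok" counts as red.

```python
from collections import Counter


def summarize_cluster(statuses: dict) -> dict:
    """Count how many regions are red (not ok) and green (ok)."""
    counts = Counter("green" if status == "ok" else "red" for status in statuses.values())
    return {"red": counts["red"], "green": counts["green"]}


# One of three regions is down for the selected service.
regions = {"us-east-1": "critical", "us-west-2": "ok", "eu-west-1": "ok"}
print(summarize_cluster(regions))  # {'red': 1, 'green': 2}
```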
Riding out the storm
You may be able to limit the impact on your applications while waiting for AWS to resolve the outage. Depending on the services you use, their configuration, and the volume of traffic you are receiving, you may be able to keep your applications up and responsive.
For example, your DynamoDB tables replicate data across multiple availability zones so that they remain accessible if the service goes down in a specific AZ. Your load balancers should be able to distribute incoming requests to viable availability zones thanks to the cross-zone load balancing feature. If DynamoDB goes down in one AZ, you might want to consider adding more DynamoDB instances in the remaining viable AZs in order to avoid overloading your remaining instances and having their requests throttled.
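A quick back-of-the-envelope calculation shows why the remaining zones can become overloaded. The sketch below assumes traffic is spread evenly across viable zones. The zone names, the request rate, and the `load_per_viable_zone` helper are made up for illustration.

```python
def load_per_viable_zone(total_rps: float, zone_health: dict) -> dict:
    """Split total requests per second evenly across healthy zones."""
    viable = sorted(zone for zone, healthy in zone_health.items() if healthy)
    if not viable:
        raise RuntimeError("no viable availability zone left")
    share = total_rps / len(viable)
    return {zone: share for zone in viable}


all_up = {"us-east-1a": True, "us-east-1b": True, "us-east-1c": True}
one_down = {"us-east-1a": True, "us-east-1b": False, "us-east-1c": True}

print(load_per_viable_zone(900, all_up))    # 300.0 requests/s per zone
print(load_per_viable_zone(900, one_down))  # 450.0 requests/s per zone
```

Under these assumptions, losing one of three zones raises the load on each remaining zone by 50 percent, which is the kind of jump that can push requests into throttling unless capacity is added.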