End-to-end reliability testing with PagerDuty & Datadog

Author: Ashwin Jiwane

Published: July 30, 2014

This is a guest post by Ashwin Jiwane, Software Engineer at PagerDuty. You can connect with Ashwin on Twitter @ashwin1287.

Data isn’t only for spreadsheets and executive meetings. People on the frontlines can also make use of data to improve their systems’ reliability. At PagerDuty, we use Datadog to measure the effectiveness of the third-party services we rely on to deliver SMS alerts to our customers.

Because our customers depend on us to send them alerts when their systems are down, having an outage of our own can have a ripple effect. We put extra emphasis on our notification pipeline to ensure that all alerts are sent and received by our customers.

Reliability meets third-party providers

Reliability and reliability testing get a little trickier when you use third-party services. At PagerDuty, we can’t send SMS notifications on our own. We have to rely on providers and carrier networks to deliver alerts. But just because we can’t control our third parties doesn’t mean we can point the finger at them when they’re down. Their failure is our failure, and our customers hold us accountable for any downtime.

In response, we combined Datadog and PagerDuty to create a practice we call End-to-End Third-Party Provider testing. This process proactively discovers outages occurring in any of our providers’ systems and quickly finds a replacement to minimize or avoid customer impact.

We set up three phones, each on a different mobile carrier network, and use an internally built mobile app that is configured to send a PagerDuty SMS alert to each of the phones in a round-robin rotation.
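To make the rotation concrete, here is a minimal sketch of round-robin scheduling across three test phones. The phone identifiers and the `round_robin_targets` helper are hypothetical names used for illustration; the post does not describe how the internal app implements its rotation.

```python
from itertools import cycle

# Hypothetical identifiers, one test phone per carrier network.
PHONES = ["phone-carrier-a", "phone-carrier-b", "phone-carrier-c"]


def round_robin_targets(phones, rounds):
    """Return the order in which test alerts go out, cycling through the phones."""
    rotation = cycle(phones)  # endlessly repeats the list in its original order
    return [next(rotation) for _ in range(rounds)]


# Seven test alerts: every phone is hit in turn before any phone repeats.
schedule = round_robin_targets(PHONES, 7)
```

Cycling in a fixed order means each carrier is probed equally often, so a problem on one network shows up within a few test alerts.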

Using Datadog, we calculate the time each SMS takes to reach the designated phone and how long our testing app takes to reply to us. Based on these measurements, we determine whether a provider is down or degraded, then take the appropriate action.
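The two measurements above, receive latency and reply latency, can be sketched as follows. This is only an illustration under stated assumptions: the threshold values, the `SmsProbe` record, and the `classify` function are all hypothetical, and the post does not say what thresholds PagerDuty actually uses or how the check is implemented in Datadog.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed thresholds in seconds; the real values are not given in the post.
DEGRADED_AFTER_S = 30.0
DOWN_AFTER_S = 120.0


@dataclass
class SmsProbe:
    sent_at: float                       # when the test SMS was handed to the provider
    received_at: Optional[float] = None  # when the test phone got it (None: not yet seen)
    replied_at: Optional[float] = None   # when the app's reply reached us (None: not yet seen)


def classify(probe: SmsProbe) -> str:
    """Label one test SMS as 'ok', 'degraded', or 'down' from its timestamps."""
    if probe.received_at is None or probe.replied_at is None:
        return "down"  # the message or its reply never completed the round trip
    receive_latency = probe.received_at - probe.sent_at
    reply_latency = probe.replied_at - probe.received_at
    worst = max(receive_latency, reply_latency)  # judge by the slower leg
    if worst >= DOWN_AFTER_S:
        return "down"
    if worst >= DEGRADED_AFTER_S:
        return "degraded"
    return "ok"
```

Judging by the slower of the two legs is a design choice made for this sketch; tracking each leg against its own threshold would work just as well.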

When receiving an SMS or replying to it takes too long, the boxes change color to show that there is a problem.

Color-coded results of SMS latency (in seconds)

When a provider exceeds our acceptable thresholds, it is considered degraded, and a PagerDuty alert is sent to the on-call engineer on our Real-Time team through the Datadog and PagerDuty integration. The engineer then switches the priority levels of our providers so that the most functional provider is used first and the remaining providers are tried only if it fails.
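The priority-and-fallback idea can be sketched in a few lines. Note the assumptions: in the post an on-call engineer switches priorities by hand, whereas this sketch orders providers automatically by recent average latency. The function names, the provider names, and the `send` callback are hypothetical and do not come from PagerDuty’s code.

```python
from typing import Callable, Dict, List


def reorder_by_health(providers: List[str], avg_latency_s: Dict[str, float]) -> List[str]:
    """Put the provider with the lowest recent average latency first.

    Providers with no measurement sort last; ties keep their existing order.
    """
    return sorted(providers, key=lambda name: avg_latency_s.get(name, float("inf")))


def send_with_failover(providers: List[str],
                       send: Callable[[str, str], None],
                       message: str) -> str:
    """Try providers in priority order and return the name of the one that worked."""
    for name in providers:
        try:
            send(name, message)  # hypothetical callback; raises if the provider fails
            return name
        except Exception:
            continue  # fall through to the next provider in the list
    raise RuntimeError("all providers failed")
```

With latencies of 40, 3, and 9 seconds for three providers, `reorder_by_health` moves the 3-second provider to the front, and `send_with_failover` only reaches the others if that one raises an error.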

Average SMS latency over last 10 minutes

Reliability testing - test everything that matters

Any system will ultimately fail. Having visibility into your systems and testing them consistently is imperative to limit the impact a failure will have on your customers.

If, like PagerDuty, you rely on third-party vendors to power customer-facing parts of your product or service, you must test them and have redundant services you can fail over to when you need them.

It’s not always enough to make sure your code quality is 100%. With many interchangeable parts from third-party vendors, taking extra initiative will make sure your customers aren’t impacted.

If you’ve experienced outages in the third-party tools you rely on and would like to know as soon as they occur, try PagerDuty for free together with Datadog. These two services working together will help keep your systems available for your customers.