Understand Your Nagios Alerts With Datadog


Step 1: Cut through the Nagios alert noise

Things failed and you’re getting alerts. But instead of a barrage of obscure Nagios notifications, Datadog:

aggregates Nagios alerts (60 of them in this example)
lets you see the successive states the check has gone through
gives you additional info with tags, such as your AWS availability zone
shows the alert in context with other events, such as AWS downtime or Chef runs
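To make the first two points concrete, here is a minimal sketch of what aggregating alerts and tracking successive states could look like. It is not Datadog's implementation: the record layout, the host and service names, and the `aggregate` helper are all made up for illustration. Only the state names (OK, WARNING, CRITICAL) are standard Nagios.

```python
from collections import defaultdict

# Hypothetical alert records: (timestamp, host, service, state).
# These values are invented for illustration only.
alerts = [
    (100, "web-1", "check_http", "WARNING"),
    (160, "web-1", "check_http", "CRITICAL"),
    (220, "web-1", "check_http", "CRITICAL"),
    (280, "web-1", "check_http", "OK"),
    (130, "db-1", "check_disk", "CRITICAL"),
]


def aggregate(alerts):
    """Group alerts by (host, service); keep a count and the state transitions."""
    groups = defaultdict(lambda: {"count": 0, "states": []})
    # Sorting the tuples orders them by timestamp first.
    for _, host, service, state in sorted(alerts):
        group = groups[(host, service)]
        group["count"] += 1
        # Record a state only when it differs from the previous one,
        # so repeated notifications collapse into a single transition.
        if not group["states"] or group["states"][-1] != state:
            group["states"].append(state)
    return dict(groups)


summary = aggregate(alerts)
for (host, service), info in summary.items():
    print(f"{host}/{service}: {info['count']} alerts, {' -> '.join(info['states'])}")
```

Four notifications for the same check collapse into one entry with its history of states, which is a lot easier to read than four separate emails.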

Step 2: Fix it, with the help of your team

Since you saw the alert first, better fix it!

Besides the additional context I just mentioned, Datadog recalls what was done the last time a similar alert took place and brings it right back to you, in context. It also tells you who worked on it in the past, so you know who to ask. This is particularly useful if you have a large ops team or want to give developers operational responsibilities.

Once you’ve fixed the issue, be sure to share what you did. This way Datadog will remember it the next time you or someone else gets an alert.

Step 3: Post-mortem. Understand what really happened

OK, so you’ve stabilized the problem. Phones stopped ringing and everyone’s blood pressure went down a notch. In many cases you’ll want to look back at the alert, understand what combination of factors led to it, and identify which code or systems need a lasting fix.

To that end, Datadog lets you easily correlate events and metrics across tools and services: all events can be searched and overlaid on metric graphs.

In this picture, we’re showing Nagios alerts related to our faulty process as red bars (darker means “more alerts”), overlaid on a cache hit rate metric sourced from our Cassandra integration. Looks like big waves of cache misses correlate pretty strongly with alerts here.
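If you want to put a number on "correlate pretty strongly", one option is a plain Pearson correlation between alert counts and the metric, bucketed over the same time intervals. The sketch below assumes exactly that; the two series are made-up numbers, not data from the graph, and `pearson` is a small helper written here rather than anything Datadog provides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one value per time bucket, invented for illustration.
alerts_per_bucket = [0, 1, 8, 9, 1, 0, 7, 10]
cache_hit_rate = [0.95, 0.93, 0.40, 0.35, 0.92, 0.96, 0.45, 0.30]

r = pearson(alerts_per_bucket, cache_hit_rate)
print(f"correlation between alert count and cache hit rate: {r:.2f}")
```

A value close to -1 means alerts pile up when the hit rate drops. It says nothing about which one causes the other, so treat it as a pointer for where to dig, not a verdict.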

Step 4: Trend analysis. See the big picture, improve what matters.

Last but not least, you need to step back on a regular basis, assess your overall situation, and verify that you are—indeed—improving as time goes by, not just knocking down one alert after another.

Datadog sends you weekly reports identifying notable alerting trends. And because there’s more than one way to slice your data, you can explore it all interactively!

[Figure: alerting trends report]

Wait, there’s more.

Nagios is only one of the many tools and services that Datadog integrates with. A number of them, such as Chef, Puppet, and PagerDuty, have interesting interactions with Nagios, but I’ll leave those for another day.

If you found this interesting, use Nagios, and want to do better than your inbox for alert management, do create your Datadog account now. It only takes a few minutes.
