Monitoring Best Practices | Datadog


Author Jay Hotta
Published: February 24, 2015

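Alexis Lê-Quôc, co-founder and CTO of Datadog, talked about “monitoring best practices” in an interview with PagerDuty, and Vivian Au of PagerDuty summarized the conversation in a post. I would like to translate the important parts into Japanese and publish them on Datadog's Japanese blog.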


From “Monitoring Best Practices Learned from IT Outages” (original post)

Monitoring goals

Why would you spend time getting better monitoring?

  1. To know about an issue before your customers or your boss
  2. To know how your systems & applications are performing
  3. To minimize your stress level



  1. To detect problems before your customers or your boss notice them
  2. To know whether your systems and applications are running normally
  3. To keep stress to a minimum


Classifying metrics

What kind of metrics does your monitoring tool track? Examples are: CPU utilization, memory utilization, database or web requests. That’s a lot of different types of metrics and they can be divided into two fundamental classifications of metrics – work and resource.



Work metrics

A work metric measures how much useful stuff your system or application is producing. For instance, we could look at the number of queries that a database is responding to or the number of pages that a web server is serving per second. The purpose of a database is to answer queries. The purpose of a web server is to serve pages. So these are appropriate work metrics.

Another work metric is how much money your application is producing. That’s a very useful work metric for tracking availability and understanding the effectiveness of your application and infrastructure.




Resource metrics

The other class is resource metrics. A resource is something that is used to produce something useful. You use a resource to produce some work. So a resource metric measures how much of something is consumed to produce work. When you ask the question, “how much CPU am I consuming in the database?” the answer doesn’t really say much about whether that consumption is useful or not. It just says, “Well, I have more CPU available” or “My CPU is completely maxed out.” Same for memory, disk, network and so on. In general, I’ve used resource metrics for capacity planning rather than for availability management.



Optimizing your monitoring

Now that we’ve defined work and resource metrics, we can move to best practices.



1. Classify key metrics as work or resource

Look at your key metrics, specifically the ones you really care about, and figure out whether they’re work metrics or resource metrics.

1. Classify your key metrics into work metrics and resource metrics.
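As a rough illustration of this classification step, here is a minimal Python sketch. The metric names and the `MetricKind` and `Metric` types are hypothetical, chosen only to mirror the examples mentioned earlier (queries, pages served, revenue, CPU, memory, disk). They are not part of any Datadog API.

```python
from dataclasses import dataclass
from enum import Enum


class MetricKind(Enum):
    WORK = "work"          # measures useful output the system produces
    RESOURCE = "resource"  # measures something consumed to produce work


@dataclass(frozen=True)
class Metric:
    name: str
    kind: MetricKind


# Hypothetical key metrics, classified by hand (names are assumptions).
KEY_METRICS = [
    Metric("db.queries_per_second", MetricKind.WORK),
    Metric("web.pages_served_per_second", MetricKind.WORK),
    Metric("app.revenue_per_hour", MetricKind.WORK),
    Metric("db.cpu_utilization", MetricKind.RESOURCE),
    Metric("db.memory_utilization", MetricKind.RESOURCE),
    Metric("host.disk_used_percent", MetricKind.RESOURCE),
]


def group_by_kind(metrics):
    """Return a dict mapping each MetricKind to the metric names of that kind."""
    groups = {kind: [] for kind in MetricKind}
    for metric in metrics:
        groups[metric.kind].append(metric.name)
    return groups


if __name__ == "__main__":
    for kind, names in group_by_kind(KEY_METRICS).items():
        print(kind.value, names)
```

Keeping the classification in one explicit list makes it easy to revisit when the team reviews its metrics.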


2. Only alert on work metrics

Once you’ve done this classification – and it’s really important to spend time doing this – you need to identify what you want to get alerted on. You only want to get alerted on work metrics.

In other words, you want to get alerted on things that measure how useful your system is.

I should mention that it’s useful to alert on some resource metrics if they’re a leading indicator of a failure. For instance, disk space is a resource metric. However, when you run out of disk space, the whole show stops so it’s also important to alert on these metrics. But in general, alerting on resource metrics should be rare.

2. Set alerts on work metrics.
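The rule above, including the disk space exception, could be sketched roughly as follows. The `Metric` type, the `leading_indicator` flag, and the metric names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    is_work: bool                    # True for work metrics, False for resource metrics
    leading_indicator: bool = False  # resource metric that predicts a failure (e.g. disk space)


def should_alert(metric: Metric) -> bool:
    """Alert on work metrics, plus the rare resource metric that is a leading indicator."""
    return metric.is_work or metric.leading_indicator


# Hypothetical metrics (names are assumptions).
METRICS = [
    Metric("web.pages_served_per_second", is_work=True),
    Metric("db.cpu_utilization", is_work=False),
    Metric("host.disk_used_percent", is_work=False, leading_indicator=True),
]

alertable = [m.name for m in METRICS if should_alert(m)]
print(alertable)  # ['web.pages_served_per_second', 'host.disk_used_percent']
```

CPU utilization stays visible for capacity planning but does not page anyone in this sketch.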




3. Only alert on actionable work metrics

The tweak to the previous best practice is that you really only want to alert on actionable work metrics. In other words, you want to alert on work metrics that you can do something about.

For instance, an actionable work metric for a web server is how many webpages you serve without errors per second. That’s a work metric because if you’re serving zero pages, your website is not running at all – it’s down.

A non-actionable work metric could be how many 404s I’m serving per second. This isn’t an actionable work metric because this will entirely depend on what people are doing on your site. If they are browsing to URLs that don’t exist, then you’re going to get a lot of 404s. This doesn’t mean it’s bad, but rather that they’re doing something that’s not expected. So you should not alert on non-actionable work metrics.

3. Set alerts on work metrics that you can act on.


For example, “the number of web pages served per second without errors” is a work metric you can act on. If you are serving zero pages, the web server is not running at all; in other words, the server is down.
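To make the web server example concrete, here is a small Python sketch. It assumes that a response with an HTTP status below 400 counts as a page served “without errors”, and the `min_rate` threshold is a made-up value. The 404 responses do not count as served pages, but on their own they never trigger the alert.

```python
def successful_pages_per_second(status_codes, window_seconds):
    """Rate of responses served without errors (status below 400) over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    ok = sum(1 for code in status_codes if code < 400)
    return ok / window_seconds


def should_page(status_codes, window_seconds, min_rate=1.0):
    """Alert only when the actionable work metric drops below min_rate (an assumed threshold)."""
    return successful_pages_per_second(status_codes, window_seconds) < min_rate


if __name__ == "__main__":
    # Many 404s, but pages are still being served: no alert.
    busy = [200] * 120 + [404] * 300
    print(should_page(busy, window_seconds=60))  # False

    # Nothing is served successfully: alert.
    down = [404] * 300 + [500] * 5
    print(should_page(down, window_seconds=60))  # True
```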


4. Review metrics and alerts periodically

The fourth, and maybe one of the hardest best practices, is to actually do a review and iterate on this process on a regular basis. Maybe it’s a weekly, bi-weekly or monthly thing, but you really want to carve out some time in your busy schedule and do a review with your team.

4. Review your metrics and optimize your alerts.
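One possible input to such a review is a list of alerts that rarely led to any action. The sketch below assumes the team keeps a simple history of `(alert_name, action_taken)` pairs. That record format, the alert names, and the `min_actionable_ratio` threshold are all hypothetical.

```python
from collections import defaultdict


def review_alerts(history, min_actionable_ratio=0.5):
    """Return alert names whose share of firings that led to action is below the ratio.

    history: iterable of (alert_name, action_taken) pairs, one per firing.
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for name, action_taken in history:
        fired[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name for name in fired
        if acted[name] / fired[name] < min_actionable_ratio
    )


if __name__ == "__main__":
    history = [
        ("pages_served_low", True),
        ("pages_served_low", True),
        ("404_rate_high", False),
        ("404_rate_high", False),
        ("404_rate_high", True),
        ("disk_almost_full", True),
    ]
    print(review_alerts(history))  # ['404_rate_high']
```

Alerts that show up in this list are candidates for removal or retuning at the next review.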


Back to goals

Now, let’s tie these best practices back to the initial goals of monitoring that I mentioned. Classifying key metrics as work or resource is a prerequisite for everything.



a. To know about an issue before your customers or your boss

Only alert on work metrics so you know that you won’t be alerting on stuff that’s not useful and therefore have a much better result.

a. To detect problems before your customers or your boss notice them


b. To minimize your stress level

Only alert on actionable work metrics because you’re not going to get alerted on things over which you have no control.

b. To keep stress to a minimum


c. To know how your systems & applications are performing

Review metrics and alerts periodically so you have a good sense of how your systems are performing, trending and how you can change things.

c. To know whether your systems and applications are running normally


Use these best practices to improve your monitoring strategy and when you’re ready to implement, try a 14-day free trial of Datadog to graph and alert on your actionable work metrics and any other metrics and events from over 80 common infrastructure tools.

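Try using these best practices to improve your monitoring strategy. When you implement the new strategy, use Datadog's 14-day free trial to try graphing and alerting on actionable work metrics and events, along with more than 80 integrations.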