Monitoring Best Practices

Jay Hotta

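Alexis Lê-Quôc, co-founder and CTO of Datadog, talked about monitoring best practices in an interview with PagerDuty, and Vivian Au of PagerDuty summarized the conversation in a post. I would like to translate the key parts into Japanese and publish them on the Datadog Japanese blog.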

From "Monitoring Best Practices Learned from IT Outages" (original post)

Monitoring goals
----------------

Why would you spend time getting better monitoring? The goals include:

1. To know about an issue before your customers or your boss
2. To know how your systems and applications are performing
3. To minimize your stress level

Classifying metrics
-------------------

What kind of metrics does your monitoring tool track? Examples are CPU utilization, memory utilization, and database or web requests. That is a lot of different types of metrics, but they can all be divided into two fundamental classes: work and resource.

### Work metrics

A work metric measures how much useful output your system or application is producing. For instance, we could look at the number of queries that a database is responding to or the number of pages that a web server is serving per second. The purpose of a database is to answer queries, so the number of queries answered is its useful output. The purpose of a web server is to serve pages, so the number of pages served is its useful output. These are therefore appropriate work metrics.

Another work metric would be something like how much money your application is producing. That is a very useful work metric for tracking availability and understanding the effectiveness of your application and the infrastructure that runs it.

### Resource metrics

The other class is resource metrics. A resource is something that is used to produce something useful: you use a resource to produce work. So a resource metric measures how much of something is consumed to produce work. When you ask the question, "How much CPU am I consuming in the database?", the answer does not really say much about whether that consumption is useful or not. It only says, "I have more CPU available" or "My CPU is completely maxed out." The same goes for memory, disk, network, and so on. In general, I have used resource metrics for capacity planning rather than for availability management.
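The work/resource split described above can be written down as a small catalog. The following is a minimal sketch, not a feature of any particular monitoring tool: the metric names, the `MetricKind` enum, and the `names_of` helper are all hypothetical and exist only to illustrate the classification.

```python
from dataclasses import dataclass
from enum import Enum


class MetricKind(Enum):
    WORK = "work"          # measures useful output (queries answered, pages served)
    RESOURCE = "resource"  # measures what is consumed to produce that output


@dataclass(frozen=True)
class Metric:
    name: str  # hypothetical metric name, not tied to any real tool
    kind: MetricKind


# Illustrative catalog built from the examples in the text.
CATALOG = [
    Metric("db.queries_per_second", MetricKind.WORK),
    Metric("web.pages_per_second", MetricKind.WORK),
    Metric("app.revenue_per_hour", MetricKind.WORK),
    Metric("db.cpu_utilization", MetricKind.RESOURCE),
    Metric("host.memory_utilization", MetricKind.RESOURCE),
    Metric("host.disk_free_bytes", MetricKind.RESOURCE),
]


def names_of(kind: MetricKind) -> list[str]:
    """Return the names of all catalog metrics of the given kind."""
    return [m.name for m in CATALOG if m.kind is kind]


print(names_of(MetricKind.WORK))
print(names_of(MetricKind.RESOURCE))
```

Keeping the kind next to the metric name makes the later decisions (what to alert on, what to use for capacity planning) a simple filter over the catalog.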

Optimizing your monitoring
--------------------------

Now that we have defined work and resource metrics, we can move on to best practices.

### 1. Classify key metrics as work or resource

Look at your key metrics, specifically the ones you really care about, and figure out whether they are work metrics or resource metrics.

### 2. Only alert on work metrics

Once you have done this classification, and it is really important to spend time doing it, you need to identify what you want to get alerted on. You only want to get alerted on work metrics. In other words, you want to get alerted on things that measure how useful your system is.

That said, it is useful to alert on some resource metrics if they are a leading indicator of a failure. For instance, disk space is a resource metric. However, when you run out of disk space, the whole show stops, so it is also important to alert on metrics like this one. In general, though, alerting on resource metrics should be rare.
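The rule above, including its disk-space exception, can be sketched as a single predicate. This is an illustration under stated assumptions: the `Metric` fields, the `leading_indicator` flag, and the metric names are hypothetical, and marking a resource metric as a leading indicator is a judgment your team makes, not something the code can infer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str                        # hypothetical metric name
    is_work: bool                    # True for work metrics, False for resource metrics
    leading_indicator: bool = False  # resource metric that predicts an outage


def should_alert(metric: Metric) -> bool:
    """Alert on work metrics, plus the rare resource metric that is a leading indicator."""
    return metric.is_work or metric.leading_indicator


metrics = [
    Metric("web.pages_per_second", is_work=True),
    Metric("db.cpu_utilization", is_work=False),
    Metric("host.disk_free_bytes", is_work=False, leading_indicator=True),
]

alertable = [m.name for m in metrics if should_alert(m)]
print(alertable)  # ['web.pages_per_second', 'host.disk_free_bytes']
```

CPU utilization stays out of the alert list and remains available for capacity planning, while free disk space gets in only because it was explicitly flagged.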

### 3. Only alert on actionable work metrics

The tweak to the previous best practice is that you really only want to alert on actionable work metrics. In other words, you want to alert on work metrics that you can do something about.

For instance, an actionable work metric for a web server is how many web pages you serve without errors per second. That is a work metric because if you are serving zero pages, your website is not running at all: it is down.

A non-actionable work metric could be how many 404s you are serving per second. It is not actionable because it depends entirely on what people are doing on your site. If they are browsing to URLs that do not exist, you are going to get a lot of 404s. That does not necessarily mean something is wrong with the system, but rather that visitors are doing something you did not expect. So you should not alert on non-actionable work metrics.
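The web server example can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the 60-second window, and the `min_rate` threshold are hypothetical, and treating 2xx and 3xx responses as "served without errors" is an assumption made here for the example.

```python
from collections import Counter


def error_free_pages_per_second(status_counts: Counter, window_seconds: float) -> float:
    """Rate of responses with a 2xx or 3xx status over the window (assumed 'error-free')."""
    ok = sum(n for status, n in status_counts.items() if 200 <= status < 400)
    return ok / window_seconds


def should_page(status_counts: Counter, window_seconds: float, min_rate: float) -> bool:
    """Page someone only when the actionable work metric drops below the threshold.

    The 404 count is deliberately ignored: it depends on what visitors request.
    """
    return error_free_pages_per_second(status_counts, window_seconds) < min_rate


# Many 404s, but pages are still being served: no alert.
busy = Counter({200: 1200, 404: 900})
# Nothing is served successfully: alert.
down = Counter({500: 300, 404: 20})

print(should_page(busy, window_seconds=60, min_rate=1.0))  # False
print(should_page(down, window_seconds=60, min_rate=1.0))  # True
```

The first window has 900 responses with status 404 and still does not page anyone, because the site is serving 20 error-free pages per second. The second window pages because the error-free rate is zero.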

### 4. Review metrics and alerts periodically

The fourth, and maybe one of the hardest best practices, is to actually do a review and iterate on this process on a regular basis. It might be weekly, bi-weekly, or monthly, but you really want to carve out some time in your busy schedule and do a review with your team.

Back to goals
-------------

Now, let's tie these best practices back to the initial goals of monitoring mentioned earlier. Classifying key metrics as work or resource is a prerequisite for everything.

### a. To know about an issue before your customers or your boss

Only alert on work metrics. That way you will not be alerted on things that are not useful, every alert you receive is one that needs attention, and you can respond to alerts properly.

### b. To minimize your stress level

Only alert on actionable work metrics, because then you will not get alerted on things over which you have no control.

### c. To know how your systems and applications are performing

Review metrics and alerts periodically. Repeating this review gives you a good sense of how your systems are performing, how they are trending, and how you can change things.

Use these best practices to improve your monitoring strategy, and when you are ready to implement it, try a 14-day free trial of Datadog to graph and alert on your actionable work metrics and any other metrics and events from over 80 common infrastructure tools.
