Managing Datadog with Terraform

Published: April 7, 2017

What is Terraform?

Terraform is an increasingly popular infrastructure-as-code tool for teams that manage cloud environments spanning many service providers. New users are often drawn to Terraform’s ability to quickly provision compute instances and similar resources from infrastructure providers, but Terraform can also manage platform-as-a-service and software-as-a-service resources.

What Datadog calls integrations, Terraform calls providers; Terraform’s Heroku provider, for example, is all the code within Terraform that interacts with Heroku. This article introduces the Datadog provider, which can manage your timeboards, alerts, users, and scheduled downtime. We’ll use it to set up and edit a Datadog alert, and we’ll see how to create an AWS EC2 instance and an associated Datadog alert rule with a single command. But first, a brief pitch for Terraform. If you’re already a user or are already convinced to try Terraform, feel free to skip past the next few sections.

Why Terraform?

Maybe you’re the type who feels a twinge of guilt when you’ve spent too much time fumbling around in your favorite cloud provider’s web console. You’re keen on automating your environment as much as possible, and you’re familiar with Datadog’s API libraries, like datadogpy and its CLI counterpart, dog.

If you have only a small cloud footprint, rarely provision new services, or you’re a one-person team, interacting with our API from ad hoc scripts and shell one-liners may be as much automation as you need. Or, if you’re already using the Datadog libraries alongside those of other cloud providers to develop in-house tooling that fits neatly into your team’s workflow, great! We won’t try to convince you to ditch your own tooling. But if you’re a growing team without a well-established workflow, or your scale has outgrown your existing tooling, Terraform is worth a look.

Its design principles will be familiar to users of configuration management tools like Puppet. You’ll model your environment with declarative templates. Repeated application of the templates is idempotent (i.e. a template calling for three DNS records will not, when applied twice, create six records). And of course, though Terraform doesn’t require it, you will collaborate with teammates on your templates using a version control system like Git. Once you have adopted it, you should use Terraform—and only Terraform—to manage your entire cloud environment. Any scaling up, down, or out should start with a pull request and team review. No more tribal knowledge or out-of-date runbooks. It’s infrastructure as code.

Get Terraform

If you’re on a Mac and use Homebrew, a quick brew install terraform will get you started. If you don’t use Homebrew, or you want to ensure you get the latest version and a stable build, download the Terraform package for your system directly from HashiCorp and follow their instructions to complete the install.

Like most Go programs, Terraform is delightfully easy to install; all of its third-party dependencies are built into the terraform executable. The Datadog provider has been built in since it was introduced in version 0.6.12 (released in early 2016).

Before writing your first Datadog resource, make sure Terraform is installed:

$ which terraform
/usr/local/bin/terraform
$ terraform version
Terraform v0.9.2

Your first Datadog resource

Let’s create a simple monitor resource to set an alert condition in Datadog. We’ll start with a bare resource and fill in the minimum requirements soon thereafter.

In any directory, create a file main.tf with the following contents:

resource "datadog_monitor" "cpumonitor" {
}

Resource types always follow the pattern "<provider_name>_<resource>". In this case, the type is "datadog_monitor". The resource name is up to us, and we've chosen "cpumonitor". Resource names only serve to uniquely identify the resource within Terraform templates; we'll give the monitor its Datadog name in a moment.

Let’s apply the bare template:

$ terraform apply
provider.datadog.api_key
  Enter a value: 

You can enter your Datadog credentials via CLI, but obviously you won’t want to do this every time you run Terraform. Cancel this apply by entering blank credentials or by interrupting Terraform (Ctrl-C), and let’s provide credentials a better way.

There are two options: set them within a "datadog" provider block, or in shell environment variables. You should never hardcode secrets in a provider block, since they would be checked into source control. Instead, HashiCorp recommends a provider block that, for credentials and other secrets, references variables you've defined in a separate, source-control-exempt file, terraform.tfvars (a sketch of that approach appears a little further below). We won't get into input variables here, so let's just set credentials in the Datadog provider's special environment variables:

$ export DATADOG_API_KEY="<your_api_key>"
$ export DATADOG_APP_KEY="<your_app_key>"

See your Datadog account settings if you don’t know your API key, and create an application key at the same page if you don’t already have one.
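
If you do want to go the variables route later, a minimal sketch might look like the following. The file and variable names are just illustrative; only the provider's api_key and app_key arguments are fixed:

# variables.tf
variable "datadog_api_key" {}
variable "datadog_app_key" {}

# main.tf
provider "datadog" {
  api_key = "${var.datadog_api_key}"
  app_key = "${var.datadog_app_key}"
}

# terraform.tfvars (keep this file out of source control)
datadog_api_key = "<your_api_key>"
datadog_app_key = "<your_app_key>"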

Then, apply the template again:

$ terraform apply
4 error(s) occurred:

* datadog_monitor.cpumonitor: "message": required field is not set
* datadog_monitor.cpumonitor: "name": required field is not set
* datadog_monitor.cpumonitor: "query": required field is not set
* datadog_monitor.cpumonitor: "type": required field is not set

The monitor resource requires four fields. If you’ve ever created a monitor via our API, this won’t be news to you.

Let’s fill in the minimum fields required to create a ‘metric alert’ type monitor. Edit main.tf as follows:

resource "datadog_monitor" "cpumonitor" {
  name = "cpu monitor"
  type = "metric alert"
  message = "CPU usage alert"
  query = "avg(last_1m):avg:system.cpu.system{*} by {host} > 60"
}

And apply the template once more:

$ terraform apply
datadog_monitor.cpumonitor: Creating...
  include_tags:        "" => "true"
  message:             "" => "CPU usage alert"
  name:                "" => "cpu monitor"
  new_host_delay:      "" => "<computed>"
  notify_no_data:      "" => "false"
  query:               "" => "avg(last_1m):avg:system.cpu.system{*} by {host} > 60"
  require_full_window: "" => "true"
  type:                "" => "metric alert"
datadog_monitor.cpumonitor: Creation complete (ID: 1852924)

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.

Success! Terraform created the metric monitor and didn’t touch anything else. Every terraform apply compares your template to the state of the same resources, if any, that Terraform knows it has managed previously. Since this is your first apply, Terraform couldn’t have modified any of your existing monitors (unless you had first imported them).
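
As an aside: if you ever want Terraform to adopt a monitor it didn't create, the terraform import command can bring an existing resource under management, provided your provider version supports importing that resource type. For example, using the numeric ID Datadog shows for the existing monitor:

$ terraform import datadog_monitor.cpumonitor <existing_monitor_id>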

Terraform knows which resources fall within its purview by keeping its own copy of the state of the infrastructure it manages. Notice that your first apply created a new file terraform.tfstate. Terraform uses this file to know which resources in your templates map to which resources in reality. If you’ll be collaborating with teammates on your templates, you should store state in a remote backend rather than having each collaborator use his or her own local terraform.tfstate.
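
Setting up remote state is beyond the scope of this article, but as a rough sketch under Terraform 0.9's backend system (the bucket and key names here are hypothetical), an S3 backend looks something like this, followed by a terraform init to migrate your existing local state:

terraform {
  backend "s3" {
    bucket = "yourteam-terraform-state"   # hypothetical bucket name
    key    = "datadog/terraform.tfstate"
    region = "us-west-2"
  }
}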

Check out the new monitor in Datadog:

CPU Monitor

If some of your hosts are busy right now (i.e. hovering above 60% system CPU usage), the new monitor may start flapping between OK and ALERT. You won’t get spammed with alerts, though, since we didn’t add a notification channel to the monitor’s message.

As mentioned earlier, Terraform operations are idempotent; if we immediately apply the same template, Terraform ought to read the attributes of the new monitor, see that they still match those specified in the template, and do nothing:

$ terraform apply
datadog_monitor.cpumonitor: Refreshing state... (ID: 1852924)

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Now let’s add some optional fields to the monitor to illustrate how Terraform updates resources. It’s common to set tiered thresholds on metric monitors, so add some thresholds to the resource in a block-style field:

resource "datadog_monitor" "cpumonitor" {
  name = "cpu monitor"
  type = "metric alert"
  message = "CPU usage alert"
  query = "avg(last_1m):avg:system.cpu.system{*} by {host} > 60"

  thresholds {
    ok = 20
    warning = 50
    critical = 60
  }
}

This time, let’s not be so quick to apply the change. We’re about to modify an existing resource, and that can be nerve-racking, especially in a production environment. Fortunately, Terraform allows you to plan changes and review them before making them live:

$ terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

datadog_monitor.cpumonitor: Refreshing state... (ID: 1852924)
The Terraform execution plan has been generated and is shown below.
Resources are shown in alphabetical order for quick scanning. Green resources
will be created (or destroyed and then created if an existing resource
exists), yellow resources are being changed in-place, and red resources
will be destroyed. Cyan entries are data sources to be read.

Note: You didn't specify an "-out" parameter to save this plan, so when
"apply" is called, Terraform can't guarantee this is what will execute.

~ datadog_monitor.cpumonitor
    thresholds.%:        "0" => "3"
    thresholds.critical: "" => "60"
    thresholds.ok:       "" => "20"
    thresholds.warning:  "" => "50"


Plan: 0 to add, 1 to change, 0 to destroy.

By now you’ve noticed that Terraform can be quite verbose in its output. Here, it warns us we’re not saving this plan with the -out parameter. We won’t get into the particulars of terraform plan here, but if the no-guarantees disclaimer raises your eyebrow, read more about plans to understand how and why to use them.
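
For example, to guarantee that the plan you reviewed is exactly what gets executed, save it to a file (the filename is arbitrary) and hand that file to apply:

$ terraform plan -out=cpumonitor.tfplan
$ terraform apply cpumonitor.tfplan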

Near the end of the plan output, we’re shown how Terraform will change the monitor if we run terraform apply. It looks how we expect, so go ahead and apply the template to add the thresholds:

$ terraform apply
datadog_monitor.cpumonitor: Refreshing state... (ID: 1852924)
datadog_monitor.cpumonitor: Modifying... (ID: 1852924)
  thresholds.%:        "0" => "3"
  thresholds.critical: "" => "60"
  thresholds.ok:       "" => "20"
  thresholds.warning:  "" => "50"
datadog_monitor.cpumonitor: Modifications complete (ID: 1852924)

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

You’ll see the new thresholds in the monitor’s page in Datadog, of course, but let’s inspect the monitor with Terraform:

$ terraform show
datadog_monitor.cpumonitor:
  id = 1852924
  escalation_message = 
  include_tags = true
  locked = false
  message = CPU usage alert
  name = cpu monitor
  new_host_delay = 300
  no_data_timeframe = 0
  notify_audit = false
  notify_no_data = false
  query = avg(last_1m):avg:system.cpu.system{*} by {host} > 60
  renotify_interval = 0
  require_full_window = true
  silenced.% = 0
  tags.# = 0
  thresholds.% = 3
  thresholds.critical = 60.0
  thresholds.ok = 20.0
  thresholds.warning = 50.0
  timeout_h = 0
  type = metric alert

Here we see a full list of the monitor’s attributes, including those we didn’t set. They’ve been set to default values, but note that these defaults were applied by the Terraform provider, not by the Datadog API. For any optional fields missing from a resource declaration, Terraform inserts those fields—using its own defaults—into its request to the Datadog API, so be mindful that Terraform’s defaults may occasionally differ from the cloud providers’ defaults.

Before moving on to a more interesting example, let’s finish our tour of your first Terraform Datadog resource with the last CRUD operation, delete:

$ terraform destroy
Do you really want to destroy?
  Terraform will delete all your managed infrastructure.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

datadog_monitor.cpumonitor: Refreshing state... (ID: 1852924)
datadog_monitor.cpumonitor: Destroying... (ID: 1852924)
datadog_monitor.cpumonitor: Destruction complete

Destroy complete! Resources: 1 destroyed.

Any time you’re squeamish about destroying resources, first run terraform plan -destroy.

Linking Datadog resources to other providers

Terraform’s value becomes especially obvious when you start to snap resources together, particularly those from different providers. You could have easily created the monitor in the previous section with dog, and you could further recruit awscli to provision the new resource below, but with Terraform, you can get it all done with one power tool.

Let’s create a datadog_monitor alongside one of the most time-tested and popular resource types, aws_instance. If you have an AWS account and don’t mind potentially spending a few cents, feel free to follow along. Otherwise, the example is still illuminating to read.

Wipe out the template from the previous section and start a new main.tf:

provider "aws" {
  region = "us-west-2" # or your favorite region
}

resource "aws_instance" "base" {
  ami = "ami-ff7cade9" # choose your own AMI — this one isn't real
  instance_type = "t2.micro"
}

resource "datadog_monitor" "cpumonitor" {
  name = "cpu monitor ${aws_instance.base.id}"
  type = "metric alert"
  message = "CPU usage alert"
  query = "avg(last_1m):avg:system.cpu.system{host:${aws_instance.base.id}} by {host} > 10"
  new_host_delay = 30 # just so we can generate an alert quickly
}

Let’s unpack this block by block.

This time we are using a provider block, but to configure a region, not account credentials. You can provide AWS credentials in this block, too, or in environment variables as we did earlier for the Datadog provider. Terraform will examine the provider block and environment variables to find everything it needs to manage AWS resources. See the AWS provider reference to find the field names for credentials.
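
If you prefer environment variables here too, the AWS provider reads the standard AWS variables, for example:

$ export AWS_ACCESS_KEY_ID="<your_access_key_id>"
$ export AWS_SECRET_ACCESS_KEY="<your_secret_access_key>"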

Our aws_instance resource minimally provides an AMI ID and an instance size. Pick any AMI that’s available in your AWS account in us-west-2, preferably one that has the Datadog Agent pre-baked in. If you don’t have such an AMI handy, no problem, but if you’d like to actually test alerting for this instance later on, you’ll need to configure SSH access to it so you can install the Agent. See the aws_instance reference for a full list of optional fields, including those related to SSH access (security_groups and key_name).
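
As a rough sketch, SSH-related fields on the instance might look like the following. The key pair and security group names are hypothetical and must already exist in your account; in a non-default VPC, you would use vpc_security_group_ids with group IDs instead:

resource "aws_instance" "base" {
  ami             = "ami-ff7cade9"   # choose your own AMI
  instance_type   = "t2.micro"
  key_name        = "my-ssh-key"     # hypothetical: an existing EC2 key pair
  security_groups = ["allow-ssh"]    # hypothetical: a group permitting inbound SSH
}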

Finally, although the datadog_monitor above resembles the one from earlier, there’s a crucial difference here: it references an attribute from the aws_instance. This is a powerful feature in Terraform. The attribute aws_instance.base.id is called a computed attribute because AWS chose it for us. It’s an output of the instance resource, not an input, but once it’s known (i.e. once it’s been “computed”) it can become an input for any other resource. Since the datadog_monitor references it, Terraform cannot create the monitor until the instance is done provisioning and its ID is known. Terraform implicitly orders the creation of all resources in your templates based on these resource-to-resource references (though you can also make resource ordering explicit by using the depends_on metaparameter with any resource). Resources that don’t depend on one another will be provisioned concurrently, each in its own Goroutine.
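
Explicit ordering is rarely necessary once you reference computed attributes, but as a contrived sketch, depends_on on the monitor would look like this (redundant here, since the interpolations already imply the ordering):

resource "datadog_monitor" "cpumonitor" {
  name       = "cpu monitor ${aws_instance.base.id}"
  type       = "metric alert"
  message    = "CPU usage alert"
  query      = "avg(last_1m):avg:system.cpu.system{host:${aws_instance.base.id}} by {host} > 10"
  depends_on = ["aws_instance.base"]   # redundant: the references above already create this dependency
}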

After you’ve set your AWS credentials, chosen an AMI, and optionally configured a security group and/or SSH key for the instance, apply the template:

$ terraform apply
aws_instance.base: Creating...
  ami:                         "" => "ami-ff7cade9"
  associate_public_ip_address: "" => "<computed>"
  availability_zone:           "" => "<computed>"
  ebs_block_device.#:          "" => "<computed>"
  ephemeral_block_device.#:    "" => "<computed>"
  instance_state:              "" => "<computed>"
  instance_type:               "" => "t2.micro"
  ipv6_addresses.#:            "" => "<computed>"
  key_name:                    "" => "<computed>"
  network_interface_id:        "" => "<computed>"
  placement_group:             "" => "<computed>"
  private_dns:                 "" => "<computed>"
  private_ip:                  "" => "<computed>"
  public_dns:                  "" => "<computed>"
  public_ip:                   "" => "<computed>"
  root_block_device.#:         "" => "<computed>"
  security_groups.#:           "" => "<computed>"
  source_dest_check:           "" => "true"
  subnet_id:                   "" => "<computed>"
  tenancy:                     "" => "<computed>"
  vpc_security_group_ids.#:    "" => "<computed>"
aws_instance.base: Still creating... (10s elapsed)
aws_instance.base: Still creating... (20s elapsed)
aws_instance.base: Creation complete (ID: i-0d09c60dae94dc6fd)
datadog_monitor.cpumonitor: Creating...
  include_tags:        "" => "true"
  message:             "" => "CPU usage alert"
  name:                "" => "cpu monitor i-0d09c60dae94dc6fd"
  new_host_delay:      "" => "30"
  notify_no_data:      "" => "false"
  query:               "" => "avg(last_1m):avg:system.cpu.system{host:i-0d09c60dae94dc6fd} by {host} > 10"
  require_full_window: "" => "true"
  type:                "" => "metric alert"
datadog_monitor.cpumonitor: Creation complete (ID: 1853221)

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

Observe the instance’s ID within the name and query fields of the monitor. As intended, the monitor will only apply to this one host.

Before testing the new monitor, append an email address (e.g. “@you@example.com”) or other notification channel to the monitor’s message field and apply again:

$ terraform apply
aws_instance.base: Refreshing state... (ID: i-0d09c60dae94dc6fd)
datadog_monitor.cpumonitor: Refreshing state... (ID: 1853221)
datadog_monitor.cpumonitor: Modifying... (ID: 1853221)
  message: "CPU usage alert" => "CPU usage alert @you@example.com"
datadog_monitor.cpumonitor: Modifications complete (ID: 1853221)

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

Finally, log in to the instance and run any command that will push system CPU usage above 10% (e.g. while true; do lsof; done). Brace yourself for an alert!

CPU Alert!

Don’t forget to clean up with a terraform destroy when you’re finished.

Conclusion

The previous examples, though simple, should give you an idea of what’s possible with Terraform. It’s capable of managing extremely large and complex infrastructure; it doesn’t matter how many instances, resource types, providers, or regions make up your environment. Datadog is an important piece of your environment, and now you can manage it alongside the rest of your cloud services with this increasingly adopted and thoughtfully designed devops tool.

The Datadog provider (and Terraform in general) is under very active development, but it doesn’t yet support everything the Datadog API exposes. If you’re a Gopher, we encourage you to contribute!

If you're already a Datadog customer, you can start managing your alerts, timeboards, and more with Terraform today. Otherwise, you're welcome to sign up for a free trial of Datadog.

Happy Terraforming!

Acknowledgments

Many thanks to Otto Jongerius for contributing the Datadog provider to Terraform, and to Seth Vargo from HashiCorp for providing feedback on a draft of this article.

