Monitor Apache Airflow with Datadog | Datadog

Monitor Apache Airflow with Datadog

Author Jordan Obey

Published: February 24, 2020

Apache Airflow is an open source system for programmatically creating, scheduling, and monitoring complex workflows including data processing pipelines. Originally developed by Airbnb in 2014, Airflow is now a part of the Apache Software Foundation and has an active community of contributing developers.

Airflow represents workflows as Directed Acyclic Graphs (DAGs), which are made up of tasks written in Python. This allows Airflow users to programmatically build and modify their workflows.

If you use Airflow to orchestrate your workflows, you’ll want to keep track of the status of your scheduled tasks and ensure that Airflow executes them as expected. We are pleased to announce Datadog’s new integration with Apache Airflow, which takes advantage of Airflow’s StatsD plugin to collect metrics with our DogStatsD service.
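To make the metric path more concrete, here is a minimal sketch of what a StatsD-style client sends over the wire: a short text datagram over UDP. The helper names (`format_dogstatsd`, `send_metric`) and the metric name are hypothetical and chosen for illustration. The datagram layout follows the DogStatsD text format, and port 8125 is the conventional StatsD port, which may differ in your Agent configuration.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None):
    """Build a DogStatsD-style datagram: <name>:<value>|<type>[|#k1:v1,k2:v2]."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the output is deterministic.
        tag_str = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        datagram += f"|#{tag_str}"
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local StatsD listener (assumed address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical timer metric, in milliseconds, with one tag.
payload = format_dogstatsd("airflow.example.duration", 1200, "ms", {"dag_id": "etl_daily"})
print(payload)
```

Calling `send_metric(payload)` would hand the datagram to whatever is listening on that port. Because UDP is connectionless, the sender does not block on, or learn about, delivery.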

As soon as you enable our Airflow integration, you will see key metrics like DAG duration and task status populating an out-of-the-box dashboard, so you can get immediate insight into your Airflow-managed workloads.

Ensure your DAGs don’t drag

Airflow represents each workflow as a series of tasks collected into a DAG. DAGs define the relationships and dependencies between tasks. An Airflow scheduler monitors your DAGs and initiates them based on their schedule. The scheduler then attempts to execute every task within an instantiated DAG (referred to as a DAG Run) in the appropriate order based on each task’s dependencies.
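The idea of running tasks "in the appropriate order based on each task's dependencies" can be sketched with a topological sort. This is not Airflow's scheduler code, only an illustration using Python's standard-library `graphlib` (Python 3.9+). The task names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean_users": {"extract"},
    "clean_orders": {"extract"},
    "load": {"clean_users", "clean_orders"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which mirrors why a workflow graph has to be acyclic.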

Ideally, the scheduler will execute tasks on time and without delay. The higher the latency of your DAG Runs, the more likely it is that subsequent DAG Runs will start before previous ones have finished executing. As concurrent DAG Runs pile up, Airflow may reach the max_active_runs limit, at which point it stops scheduling new DAG Runs and currently scheduled workflows may time out.
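A rough back-of-the-envelope model shows how run duration and schedule interval interact with max_active_runs. The sketch below makes simplifying assumptions: a fixed schedule interval and a constant run duration. The function names are hypothetical.

```python
import math


def peak_concurrent_runs(interval_s, duration_s):
    """Peak number of overlapping runs when a new run starts every
    interval_s seconds and each run takes duration_s seconds."""
    return max(1, math.ceil(duration_s / interval_s))


def would_block_scheduling(interval_s, duration_s, max_active_runs):
    """True if the workload needs more simultaneous runs than the limit allows."""
    return peak_concurrent_runs(interval_s, duration_s) > max_active_runs


# Hypothetical DAG scheduled every 10 minutes whose runs take 25 minutes.
print(peak_concurrent_runs(600, 1500))
print(would_block_scheduling(600, 1500, max_active_runs=2))
```

In this model, a DAG whose runs take longer than its schedule interval always has overlapping runs, so a rising duration metric is an early signal that the limit is getting closer.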

You can use the airflow.dag.task.duration.avg metric to monitor the average time it takes to complete a task and help you determine if your DAG runs are lagging or close to timing out. To add context to incoming duration metrics, Datadog’s DogStatsD Mapper feature tags your DAG duration metrics with task_id and dag_id so you can surface particularly slow tasks and DAGs.
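To illustrate the kind of transformation the mapper performs, the sketch below turns a dotted metric name that embeds a DAG and task into a generic name plus tags. It is an illustration in plain Python, not the Agent's implementation. The raw metric naming scheme, the regular expression, and the `map_metric` helper are assumptions made for this example.

```python
import re

# Assumed raw naming scheme: airflow.dag.<dag_id>.<task_id>.duration
PATTERN = re.compile(r"airflow\.dag\.(?P<dag_id>.*)\.(?P<task_id>[^.]*)\.duration")


def map_metric(raw_name):
    """Return (mapped_name, tags); names that do not match pass through untagged."""
    match = PATTERN.fullmatch(raw_name)
    if match is None:
        return raw_name, {}
    tags = {"dag_id": match.group("dag_id"), "task_id": match.group("task_id")}
    return "airflow.dag.task.duration", tags


# Hypothetical DAG and task names.
print(map_metric("airflow.dag.etl_daily.load_users.duration"))
```

Moving the identifiers out of the metric name and into tags is what lets you group or filter one duration metric by dag_id or task_id instead of managing a separate metric per task.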

For further insight into workflow performance, you can also track metrics from the Airflow scheduler. For instance, the metric airflow.dagrun.schedule_delay reports the time between when a DAG Run is supposed to start and when it actually starts. Delayed DAG Runs slow down your workflows and can cause you to miss service level agreements (SLAs). If you notice unusually long delays, the Airflow documentation recommends improving scheduler latency by increasing the scheduler’s max_threads and scheduler_heartbeat_sec settings in your Airflow configuration.
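The schedule delay described above is simply the difference between two timestamps. As a small sketch, with hypothetical helper names and made-up sample runs, you could compute it and flag runs that start too late like this:

```python
from datetime import datetime, timedelta


def schedule_delay(expected_start, actual_start):
    """Time between when a run should have started and when it did."""
    return actual_start - expected_start


def delayed_runs(runs, threshold):
    """Return (run_id, delay) for every run whose delay exceeds the threshold.

    runs is an iterable of (run_id, expected_start, actual_start) tuples.
    """
    return [
        (run_id, schedule_delay(expected, actual))
        for run_id, expected, actual in runs
        if schedule_delay(expected, actual) > threshold
    ]


# Hypothetical sample data: one run starts 5 seconds late, another 4 minutes late.
runs = [
    ("run_1", datetime(2020, 2, 24, 0, 0, 0), datetime(2020, 2, 24, 0, 0, 5)),
    ("run_2", datetime(2020, 2, 24, 1, 0, 0), datetime(2020, 2, 24, 1, 4, 0)),
]
late = delayed_runs(runs, threshold=timedelta(minutes=1))
print(late)
```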

Keep tabs on queued tasks

Before an Airflow task completes successfully, it goes through a series of stages including scheduled, queued, and running. After tasks have been scheduled and added to a queue, they will remain idle until they are run by an Airflow worker. Large and complex workflows risk reaching the limit of Airflow’s concurrency parameter, which dictates how many tasks Airflow can run at once. When that happens, your queue can balloon with backed-up tasks. With Datadog, you can create an alert to notify you if the number of tasks running in a DAG is about to surpass your concurrency limit, which would inflate your queue and potentially slow down workflow execution.
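The logic behind such an alert is a threshold comparison. The sketch below is not a Datadog monitor definition, only the underlying check. The `should_alert` helper and the 90 percent warning ratio are assumptions chosen for illustration.

```python
def should_alert(running_tasks, concurrency_limit, warn_ratio=0.9):
    """True once running tasks reach warn_ratio of the concurrency limit."""
    if concurrency_limit <= 0:
        raise ValueError("concurrency_limit must be positive")
    return running_tasks >= warn_ratio * concurrency_limit


# Hypothetical DAG with a concurrency limit of 16 tasks.
print(should_alert(running_tasks=15, concurrency_limit=16))
print(should_alert(running_tasks=8, concurrency_limit=16))
```

Alerting below the hard limit, rather than at it, leaves time to react before tasks start backing up in the queue.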

Dig deeper with Datadog APM

Airflow can rely on the background job manager Celery, through its Celery executor, to distribute tasks across multi-node clusters. Datadog APM supports the Celery library, so you can easily trace your tasks. This gives you visibility into the performance of your distributed workflows, for example with flame graphs that trace tasks executed by Celery workers as they propagate across your infrastructure, and helps surface where your DAG Runs are experiencing latency. Together, our Airflow and Celery integrations can help you gain a complete picture of your workflow performance as you monitor Airflow metrics and distributed traces.

Datadog goes with the flow

Datadog is pleased to include Apache Airflow in our growing list of over 400 integrations, so that you can get comprehensive visibility into your managed workflows. If you’re currently a Datadog user, make sure you have Datadog Agent version 7.17+ so that you can implement DogStatsD Mapper–enabled tagging and get the most out of your Airflow metrics.

If you are not already using Datadog, sign up today for a 14-day