The Ultimate Guide to Apache Airflow® DAGs
Everything a Data Engineer needs to know about Airflow DAGs
Table of Contents

1. Introduction to DAGs
   Key Concepts
   What is a DAG?
   What is a DAG run?
   Writing a simple DAG
   DAG-level parameters

2. DAG Building Blocks
   Key Concepts
   Airflow Operators
   Airflow Decorators
   Airflow Sensors
   Deferrable Operators
   Hooks

3. Airflow Connections
   Key Concepts
   Connection options
   Defining connections in the Airflow UI
   Defining connections with environment variables
   Masking sensitive information

4. Writing DAGs
   Key Concepts
   DAG writing Best Practices in Apache Airflow
   Task dependencies
   Branching in Airflow
   Task Groups
   XCom

5. Scheduling DAGs
   Key Concepts
   Scheduling in Airflow
   Cron-based schedules
   Data-driven Scheduling
   Timetables

6. Notifications and Alerts
   Key Concepts
   Overview
   Airflow Callbacks

7. Write dynamic and adaptable DAGs
   Key Concepts
   Dynamic Task concepts
   Mapping over sets of keyword arguments
   Dynamically map task groups

8. Programmatically generate DAGs
   Key Concepts
   dag-factory

9. Testing Airflow DAGs
   Key Concepts
   Types of Tests

10. Scaling Airflow DAGs
    Key Concepts
    Environment-level settings
    DAG-level settings
    Task-level settings

11. Debugging Airflow DAGs
    Key Concepts
    General Airflow debugging approach
    Issues running Airflow locally
    Common DAG issues
    Common Task issues
    Missing Logs
    Troubleshooting connections
    I need more help

Conclusion
Editor’s Note
Welcome to The Ultimate Guide to Apache Airflow® DAGs, brought to you by the
Astronomer team. This eBook covers everything you need to know to work with
DAGs, from the building blocks they consist of to best practices for writing them,
making them dynamic and adaptable, testing and debugging them, and more. It’s
a guide written by practitioners, for practitioners.
KEY CONCEPTS
What is a DAG?
In Airflow, a DAG is a data pipeline or workflow. DAGs are the main organizational unit in Airflow; they
contain a collection of tasks and dependencies that you want to execute on a schedule.
A DAG is defined in Python code and visualized in the Airflow UI. DAGs can be as simple as a single
task or as complex as hundreds or thousands of tasks with complicated dependencies.
The following screenshot shows a complex DAG graph in the Airflow UI. After reading this chapter,
you’ll be able to understand the elements in this graph, as well as know how to define DAGs and use
DAG parameters.
There are two requirements for the structure of a DAG in Airflow: there needs to be a clear direction
between tasks and there can be no circular dependencies. The DAG in the screenshot below fulfills
these two requirements:
● Clear direction: The direction of DAGs is from left to right by default. In this DAG, first
task_1 and task_2 run in parallel; then, after task_2 finishes successfully, task_3 runs.
task_4 and task_5 depend on task_3 and start running as soon as task_3 has completed
successfully. Lastly, task_6 runs after the successful completion of task_5 and task_4.
● No circular dependencies: No task is downstream of itself; there is no way to follow the
directed edges between tasks and arrive back at a task that has already run.
Figure 2: Screenshot of the Airflow UI showing a DAG run graph with 6 tasks.
The term DAG (Directed Acyclic Graph) originated in mathematics, where it describes a
structure consisting of nodes and edges that is both directed (there is a clear flow of direction
between nodes) and acyclic (there are no circles in the flow between nodes).
Aside from these two requirements, a DAG can be as simple or as complicated as you need! You can
define tasks that run in parallel or sequentially, implement conditional branches, and visually group
tasks together in task groups.
Each task in a DAG performs one unit of work. Tasks can be anything from a simple Python function
to a complex data transformation or a call to an external service.
In Airflow, DAGs:
● are defined using operators: Python classes (including sensors, decorators, and deferrable
operators) that encapsulate the work each task performs.

By default, DAGs are read from left to right and all upstream tasks need to be successful for a
downstream task to run; for example, the DAG run shown in Figure 4 below contains three sequential tasks.
Data orchestration sits at the heart of any modern data stack and provides elaborate automation of
data pipelines. With orchestration, actions in your data pipeline become aware of each other and
your data team has a central location to monitor, edit, and troubleshoot their workflows.
Key benefits of Airflow include:

● Data pipelines as code: code-based pipelines are dynamic and extensible, and you can manage
them using software engineering best practices like version control and CI/CD.
● Tool agnosticism: Airflow can connect to any application in your data ecosystem that allows
connections through an API. Prebuilt operators exist to connect to many common data tools.
● High extensibility: Since Airflow pipelines are written in Python, you can build on top of the
existing codebase and extend the functionality of Airflow to meet your needs. Anything you
can do in Python, you can do in Airflow.
● Infinite scalability: Given enough computing power, you can orchestrate as many processes
as you need, no matter the complexity of your pipelines.
● Active OSS community: With millions of users and thousands of contributors, Airflow is here
to stay and grow. Join the Airflow Slack to become part of the community.
● Observability: The Airflow UI provides an immediate overview of all your data pipelines and
can provide the source of truth for workflows in your whole data ecosystem.
Each DAG run has a unique dag_run_id and contains one or more task instances. The history of
previous DAG runs is stored in the Airflow metadata database.
In the Airflow UI, you can view previous runs of a DAG in the Grid view and select individual DAG
runs by clicking on their respective duration bar. A DAG run graph looks similar to the DAG graph, but
includes additional information about the status of each task instance in the DAG run.
Figure 4: Screenshot of the Airflow UI showing a DAG run graph with 3 tasks.
DAG RUN PROPERTIES
A DAG run graph in the Airflow UI contains information about the DAG run, as well as the status of
each task instance in the DAG run. The following screenshot shows the same DAG as in the previous
section, but with annotations explaining the different elements of the graph.
● logical_date: The point in time for which the DAG run is scheduled. This date and time is
not necessarily the same as the actual moment the DAG run is executed. See Scheduling
DAGs for more information.
● task state: The status of the task instance in the DAG run. Possible states are running,
success, failed, skipped, restarting, up_for_retry, upstream_failed, queued, scheduled, none,
removed, deferred, and up_for_reschedule. Each state causes the border of the node to be colored
differently. See the OSS documentation on task instances for an explanation of each state.
The previous screenshot also shows the four common ways you can trigger a DAG run:
● Scheduled: The second DAG run, which is currently selected in the screenshot, was created
by the Airflow scheduler based on the DAG’s defined schedule. This is the default method
for running DAGs.
● Manual: The third DAG run was triggered manually by a user using the Airflow UI or the
Airflow CLI. Manually triggered DAG runs include a play icon on the DAG run duration bar.
● Dataset triggered: The fourth DAG run was started by a dataset. DAG runs triggered by a
dataset include a dataset icon on their DAG run duration bar.
● Backfill: The first DAG run was created using a backfill. Backfilled DAG runs include a curved
arrow on their DAG run duration bar.
A DAG run itself is in one of the following states:

● Queued: The time after which the DAG run can be created has passed, but the scheduler has
not created task instances for it yet.
● Running: At least one task instance of the DAG run is still scheduled to run or currently running.
● Success: All task instances are in a terminal state (success, skipped, failed, or
upstream_failed) and no leaf task is in the state failed or upstream_failed.
● Failed: All task instances are in a terminal state and at least one leaf task is in the state
failed or upstream_failed.
COMPLEX DAG RUNS
When you start writing more complex DAGs, you will see additional Airflow features that are visualized
in the DAG run graph. The following screenshot shows the same complex DAG as in the overview but
with annotations explaining the different elements of the graph. Don’t worry if you don’t know about all
these features yet - you will learn about them as you become more familiar with Airflow.
Figure 6: Screenshot of the Airflow UI showing an annotated complex DAG run graph.
● Edge labels: Edge labels appear on the edge between two tasks. These labels are often
helpful to annotate branch decisions in a DAG graph.
● Task groups: A task group is a tool to logically and visually group tasks in an Airflow DAG.
You can learn more about how to set complex dependencies between tasks and task groups in the
Task dependencies chapter.
There are two types of syntax you can use to structure your DAG. Which one to use is a matter of
personal preference:
● TaskFlow API: The TaskFlow API contains the @dag and @task decorators.
○ All tasks that are instantiated/called within the context of the DAG function are part of
the DAG.
○ Note that you need to call the decorated function for Airflow to register the DAG or task.
● Traditional syntax:
○ You can create a DAG by instantiating a DAG context using the DAG class and defining
tasks within that context.
○ You can create Airflow tasks using traditional Airflow operators (including sensors and
deferrable operators), which are Python classes simplifying common actions in Airflow.
○ All tasks that are instantiated/called within the DAG context are part of the DAG.
Any custom Python function can be turned into an Airflow task by decorating it with @task or
providing it to the python_callable parameter of the PythonOperator. Note that DAGs can contain
different types of building blocks that can depend on each other: you can freely mix the TaskFlow API
and traditional syntax!
This simple DAG is written using the TaskFlow API and consists of two tasks running sequentially.
from airflow.decorators import dag, task
from airflow.models.baseoperator import chain
from datetime import datetime
import time


@dag(
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
)
def taskflow_dag():
    @task
    def my_task_1():
        time.sleep(5)
        print(1)

    @task
    def my_task_2():
        print(2)

    chain(my_task_1(), my_task_2())


taskflow_dag()
The same DAG can be written using traditional syntax:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import time


def my_task_1_func():
    time.sleep(5)
    print(1)


with DAG(
    dag_id="traditional_syntax_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    my_task_1 = PythonOperator(
        task_id="my_task_1",
        python_callable=my_task_1_func,
    )

    my_task_2 = PythonOperator(
        task_id="my_task_2",
        python_callable=lambda: print(2),
    )

    # Define dependencies
    my_task_1 >> my_task_2
DAG-level parameters
In Airflow, you can configure when and how your DAG runs by setting parameters in the DAG ob-
ject. DAG-level parameters affect how the entire DAG behaves, as opposed to task-level parameters
which only affect a single task.
The DAGs in the previous section have the following basic parameters defined. It’s best practice to
always define these parameters for any DAGs you create:
● dag_id: The name of the DAG. This must be unique for each DAG in the Airflow environment.
When using the @dag decorator and not providing the dag_id parameter name, the function
name is used as the dag_id.
● start_date: The date and time after which the DAG starts being scheduled.
● schedule: The schedule for the DAG. There are many different ways to define a schedule,
see Scheduling in Airflow for more information.
● catchup: Whether the scheduler should backfill all missed DAG runs between the current
date and the start date when the DAG is unpaused. It is a best practice to always set it to
False unless you specifically want to backfill missed DAG runs. By default catchup is set to
True.
OTHER DAG PARAMETERS

Besides the four DAG parameters above, additional parameters are available that help you modify
how your DAG behaves in the UI, interacts with Jinja templates, scales, notifies you about its state,
and more.

● description: A short string that is displayed in the Airflow UI next to the DAG name.
● doc_md: A string that is rendered as DAG documentation in the Airflow UI. Tip: use __doc__
to use the docstring of the Python file. It is a best practice to give all your DAGs descriptive
DAG documentation.
● owner_links: A dictionary with the key being the DAG owner and the value being a URL to link to
when clicking on the owner in the Airflow UI. Commonly used as a mailto link to the owner's
email address. Note that the owner parameter is set at the task level, usually by defining it in
the default_args dictionary.
● tags: A list of tags shown in the Airflow UI to help with filtering DAGs.
● default_view: The default view of the DAG in the Airflow UI. Defaults to grid.
● orientation: The orientation of the DAG graph in the Airflow UI. Defaults to LR (left to right).
● template_searchpath: A list of folders where Jinja looks for templates. The path of the DAG
file is included by default.

For more information on Jinja templating in Airflow, see the Use Airflow templates guide.
CALLBACK PARAMETERS

DAG-level callback parameters let you run custom code when a DAG run reaches a certain state;
see the Notifications and Alerts chapter for details on Airflow callbacks.
● params: A dictionary of DAG-level Airflow params. See Airflow params for more information.
● dagrun_timeout: How long a DAG run of this DAG can run before it times out and is marked
as failed.
● is_paused_upon_creation: Whether the DAG is paused when it is created. When not set, the
Airflow config core.dags_are_paused_at_creation is used, which defaults to True.
● auto_register: Defaults to True and can be set to False to prevent DAGs using a with
context from being automatically registered, which can be relevant in some advanced dynamic
DAG generation use cases. See Registering dynamic DAGs.
● fail_stop: In Airflow 2.7+, you can set this parameter to True to stop DAG execution as soon
as one task in this DAG fails. Any tasks that are still running are marked as failed, and any
tasks that have not run yet are marked as skipped. Note that you cannot have any trigger rule
other than all_success in a DAG with fail_stop set to True.
● dag_display_name: Airflow 2.9+ allows you to override the dag_id to display a different
DAG name in the Airflow UI. This parameter allows special characters.
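As a minimal sketch, a DAG definition might combine several of these parameters; the dag_id, description, tags, and owner values below are placeholders, not taken from the eBook's own examples:

from datetime import datetime, timedelta

from airflow.decorators import dag


@dag(
    dag_id="example_params_dag",  # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    description="Shows common DAG-level parameters",  # placeholder description
    doc_md=__doc__,  # renders this file's docstring as DAG documentation
    tags=["example"],  # placeholder tag
    dagrun_timeout=timedelta(hours=1),
    default_args={"owner": "data_team", "retries": 2},  # placeholder owner
)
def example_params_dag():
    pass


example_params_dag()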
● Learn from the experts: how to write data pipelines with Airflow Introductory Webinar

DAG Building Blocks

KEY CONCEPTS
● Decorators: Python decorators that turn Python functions into Airflow tasks.
The base decorator is @task.
● Sensors: operators that wait for a specific state to be reached before running.
● Deferrable Operators: operators that leverage the Python asyncio library to run
tasks waiting for an external resource to finish. This frees up your workers and
allows you to use resources more effectively.
● Hooks: Python classes that connect to external tools using an Airflow connection ID.
The class methods are used to interact with the external tool. Hooks cannot
create Airflow tasks by themselves; they are used in operators or in @task
decorated tasks.
Some use cases require isolated environments to run different tasks, for example due
to Python package conflicts. See the Run tasks in an isolated environment in Apache Airflow
guide in our documentation for more information.
Airflow Operators
Operators are one of the fundamental building blocks of Airflow DAGs. They contain the logic of how
data is processed in a pipeline.
There are many different types of operators available in Airflow. Some operators such as the Pytho-
nOperator execute general code provided by the user, while other operators perform very specific
actions such as transferring data from one system to another. Two important subgroups of operators
are Sensors and Deferrable Operators.
You can find a comprehensive list of available operators in the Astronomer Registry.
Operators are Python classes that encapsulate logic to do a unit of work. They can be viewed as a
wrapper around each unit of work that defines the actions that will be completed and abstract the
majority of code you would typically need to write. When you create an instance of an operator in a
DAG and provide it with its required parameters, it becomes a task.
All operators inherit from the abstract BaseOperator class, which contains the logic to execute the
work of the operator within the context of a DAG.
The following are some of the most frequently used Airflow operators:
def _say_hello():
    return "hello"


say_hello = PythonOperator(
    task_id="say_hello",
    python_callable=_say_hello,
)

bash_task = BashOperator(
    task_id="bash_task",
    bash_command="echo $MY_VAR",
)

example_kpo = KubernetesPodOperator(
    kubernetes_conn_id="k8s_conn",
    image="hello-world",
    name="airflow-test-pod",
    task_id="task-one",
    is_delete_operator_pod=True,
    get_logs=True,
)
● The core Airflow package includes basic operators such as the PythonOperator and
BashOperator. These operators are automatically available in your Airflow environment. All
other operators are part of provider packages, which you must install separately. For ex-
ample, the CopyFromExternalStageToSnowflakeOperator is part of the Snowflake provider
package.
● If an operator exists for your specific use case, you should use it instead of your own Python
functions. This makes your DAGs easier to read and maintain.
● Deferrable Operators are a type of operator that releases their worker slot while waiting for
their work to be completed. This can result in cost savings and greater scalability. Astron-
omer recommends using deferrable operators whenever one exists for your use case and
your task takes longer than a minute. You must have a triggerer running to use deferrable
operators.
● Any operator that interacts with a service external to Airflow typically requires a connection
so that Airflow can authenticate to that external system. See Airflow connections.
If an operator doesn’t exist for your use case, you can create a custom operator. See Custom
hooks and operators in our documentation.
Airflow Decorators
Airflow Decorators are part of the TaskFlow API, which is a functional API for using decorators
to define DAGs and tasks, simplifying the process for passing data between tasks and defining
dependencies.
You can use TaskFlow decorator functions (for example, @task) to pass data between tasks by
providing the output of one task as an argument to another task. Decorators are a simpler, cleaner
way to define your tasks and DAGs and can be used in combination with traditional operators.
In general, whether you use the TaskFlow API is a matter of your own preference and style. In
most cases, a TaskFlow decorator and the corresponding traditional operator will have the same
functionality.
A decorator is a regular Python function that wraps another function to modify its behavior. For
example, the following decorator multiplies the result of the decorated function by 100:

def multiply_by_100_decorator(decorated_function):
    def wrapper(num1, num2):
        result = decorated_function(num1, num2) * 100
        return result
    return wrapper


@multiply_by_100_decorator
def add(num1, num2):
    return num1 + num2
HOW TO USE AIRFLOW DECORATORS
The TaskFlow API allows you to write your Python tasks with decorators. It handles passing data
between tasks using XCom and infers task dependencies automatically.
Using decorators to define your Python functions as tasks is easy: just add @task on top of your
Python function:

@task
def say_hello():
    return "hello"
The task that the code above creates is equivalent to using the PythonOperator with the say_hello
function as python_callable:
def _say_hello():
    return "hello"


say_hello = PythonOperator(
    task_id="say_hello",
    python_callable=_say_hello,
)
The following two DAGs, while functionally equivalent, show the advantages of using decorators
over operators.
@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def task_decorator_dag():
    @task
    def t1():
        return "Hi"

    @task
    def t2(input_string):
        print(input_string)

    t2(input_string=t1())


task_decorator_dag()
def _t1():
    return "Hi"


def _t2(ti):
    # explicitly pulling information from the XCom table in the Airflow metadata database
    return_value_t1 = ti.xcom_pull(task_ids="t1")
    print(return_value_t1)


@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def python_operator_dag():
    t1 = PythonOperator(task_id="t1", python_callable=_t1)
    t2 = PythonOperator(task_id="t2", python_callable=_t2)

    chain(t1, t2)


python_operator_dag()
Here are some other things to keep in mind when using decorators:
● You must call all decorated functions in your DAG file so that Airflow can register the task or
DAG. For example, task_decorator_dag() is called at the end of the previous example to
call the DAG function and t1() and t2() are called inside the DAG function to register the
respective Airflow tasks.
● When you define a task, the task_id defaults to the name of the function you decorated. If you
want to change this behavior, you can pass a task_id to the decorator. Similarly, other
BaseOperator task-level parameters, such as retries or pool, can be defined within the decorator:
@task(
    task_id="say_hello_world",
    retries=3,
    pool="my_pool",
)
def taskflow_func():
    ...
If you call the same task multiple times and do not override the task_id, Airflow creates multiple
unique task IDs by appending a number to the end of the original task ID (for example, say_hello,
say_hello__1, say_hello__2, etc). You can see the result of this in the following example:
# task definition
@task
def say_hello(dog):
    print(f"Hello {dog}!")


# this task will have the id `say_hello` and print "Hello Avery!"
say_hello("Avery")

# this task will have the id `greet_dog` and print "Hello Piglet!"
say_hello.override(task_id="greet_dog")("Piglet")

# this task will have the id `say_hello__1` and print "Hello Peanut!"
say_hello("Peanut")

# this task will have the id `say_hello__2` and print "Hello Butter!"
say_hello("Butter")
You can also call a Python function that is defined outside of the DAG file from within a decorated task:

@task
def taskflow_func():
    # my_function is defined elsewhere, for example in your include folder
    my_function()
This is recommended in cases where you have lengthy Python functions since it will make your DAG
file easier to read.
You can assign the output of a called decorated task to a Python object to be passed as an argu-
ment into another decorated task. This is helpful when the output of one decorated task is needed in
several downstream functions.
@task
def get_fruit_options():
    return ["mango", "pear", "banana"]


@task
def eat_a_fruit(list):
    print(f"I ate a {list[0]}!")


@task
def gift_a_fruit(list):
    print(f"I gifted a {list[-1]}!")


my_fruits = get_fruit_options()
eat_a_fruit(my_fruits)
gift_a_fruit(my_fruits)
There are several decorators available to use with Airflow. Some of the most commonly used deco-
rators are:
● Python Virtual Env decorator (@task.virtualenv()), which runs your Python task in a vir-
tual environment.
● Branch decorator (@task.branch()), which creates a branch in your DAG based on an eval-
uated condition.
See create your own custom task decorator in the Airflow documentation to learn how to
create custom decorators.
Airflow Sensors
Sensors are a special kind of operator that are designed to wait for something to happen. When
sensors run, they check to see if a certain condition is met before they are marked successful and
let their downstream tasks execute.
All sensors inherit from the BaseSensorOperator and have the following parameters:
● mode: How the sensor operates. There are two types of modes:
○ poke: This is the default mode. When using poke, the sensor occupies a worker slot
for the entire execution time and sleeps between pokes. This mode is best if you
expect a short runtime for the sensor.
○ reschedule: When using this mode, if the criteria is not met then the sensor releases
its worker slot and reschedules the next check for a later time. This mode is best if
you expect the sensor to wait for a longer period of time.
● poke_interval: When using poke mode, this is the time in seconds that the sensor waits
before checking the condition again. The default is 60 seconds.
● exponential_backoff: When set to True, this setting creates exponentially longer wait
times between pokes in poke mode.
● timeout: The maximum amount of time in seconds that the sensor checks the condition. If
the condition is not met within the specified period, the task fails.
● soft_fail: If set to True, the task is marked as skipped if the condition is not met by the
timeout.
Many sensors have a deferrable parameter, which when set to True will turn the sensor into a de-
ferrable operator. This functionality is implemented on a per-sensor basis and may not be available
for all sensors.
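As a sketch of how these parameters fit together, the following S3KeySensor configuration (the bucket, key, and connection ID are placeholders) waits for a file in Amazon S3 using reschedule mode:

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_file = S3KeySensor(
    task_id="wait_for_file",
    bucket_name="my-bucket",         # placeholder bucket
    bucket_key="data/{{ ds }}.csv",  # placeholder key, templated with the logical date
    aws_conn_id="aws_default",
    mode="reschedule",               # release the worker slot between checks
    poke_interval=300,               # check every 5 minutes
    timeout=60 * 60 * 6,             # fail after 6 hours of waiting
    soft_fail=False,
)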
Many Airflow provider packages contain sensors that wait for various criteria in different source
systems. The following are some examples of commonly used sensors:
● @task.sensor decorator: Allows you to turn any Python function that returns a
PokeReturnValue into an instance of the BaseSensorOperator class. This way of creating a sensor is
useful when checking for complex logic or if you are connecting to a tool via an API that has
no specific sensor available.
● S3KeySensor: Waits for a key (file) to appear in an Amazon S3 bucket. This sensor is useful
if you want your DAG to process files from Amazon S3 as they arrive.
● HttpSensor: Waits for an API to be available. This sensor is useful if you want to ensure your
API requests are successful.
● SqlSensor: Waits for data to be present in a SQL table. This sensor is useful if you want your
DAG to process data as it arrives in your database.
If no sensor exists for your use case, you can create your own using either the @task.sensor decorator
or the PythonSensor. The @task.sensor decorator returns a PokeReturnValue as an instance of the
BaseSensorOperator. The PythonSensor takes a python_callable that returns True or False.
"""
This DAG showcases how to create a custom sensor using the @task.sensor decorator
"""
from airflow.decorators import dag, task
from airflow.sensors.base import PokeReturnValue
import requests


@dag(start_date=None, schedule=None, catchup=False)
def sensor_decorator():
    # poll the API until it responds successfully
    @task.sensor(poke_interval=30, timeout=3600, mode="poke")
    def check_dog_availability() -> PokeReturnValue:
        r = requests.get("https://random.dog/woof.json")
        print(r.status_code)
        if r.status_code == 200:
            condition_met = True
            operator_return_value = r.json()
        else:
            condition_met = False
            operator_return_value = None
        return PokeReturnValue(is_done=condition_met, xcom_value=operator_return_value)

    @task
    def print_dog_picture_url(url):
        print(url)

    print_dog_picture_url(check_dog_availability())


sensor_decorator()
The optional xcom_value parameter in PokeReturnValue defines what data will be pushed to XCom
once is_done=True. You can use the data that was pushed to XCom in any downstream tasks.
When using sensors, keep the following in mind to avoid potential performance issues:
● Always define a meaningful timeout parameter for your sensor. The default for this param-
eter is seven days, which is a long time for your sensor to be running. When you implement a
sensor, consider your use case and how long you expect the sensor to wait and then define
the sensor’s timeout accurately.
● Whenever possible and especially for long-running sensors, use deferrable operators in-
stead. If no deferrable operator is available for your use case and you do not want to create
your own, use the Sensor in reschedule mode so it is not constantly occupying a worker slot.
This helps avoid deadlocks in Airflow where sensors take all of the available worker slots.
● If your poke_interval is very short (less than about 5 minutes), use the poke mode. Using
reschedule mode in this case can overload your scheduler.
● Define a meaningful poke_interval based on your use case. There is no need for a task to
check a condition every 60 seconds (the default) if you know the total amount of wait time
will be 30 minutes.
SENSOR FAILURE MODES
When using sensors, there are different options to define their behavior in case of an exception raised
within the sensor.
● silent_fail=True: If an exception is raised in the poke method that is not one of
AirflowSensorTimeout, AirflowTaskTimeout, AirflowSkipException, or AirflowFailException, the
sensor will log the error but continue its execution.
● never_fail=True: (Airflow 2.10+) If the poke method raises any exception, the sensor task
is skipped. This parameter is mutually exclusive with soft_fail.
Deferrable Operators
Deferrable operators leverage the Python asyncio library to efficiently run tasks waiting for an exter-
nal resource to finish. This frees up your workers and allows you to use resources more effectively.
In this guide, you’ll review deferrable operator concepts and learn how to use deferrable operators
in your DAGs. Tasks that are in a deferred state are shown as purple in the Airflow UI.
● asyncio: A Python library used as the foundation for multiple asynchronous frameworks. This
library is core to deferrable operator functionality, and is used when writing triggers.
● Triggers: Small, asynchronous sections of Python code. Due to their asynchronous nature,
they coexist efficiently in a single process known as the triggerer.
● Triggerer: An Airflow service similar to a scheduler or a worker that runs an asyncio event
loop in your Airflow environment. Running a triggerer is essential for using deferrable
operators.
● Deferred: An Airflow task state indicating that a task has paused its execution, released the
worker slot, and submitted a trigger to be picked up by the triggerer process.
The terms deferrable, async, and asynchronous are used interchangeably and have the same mean-
ing in the context of Airflow.
With traditional operators, a task submits a job to an external system such as a Spark cluster and
then polls the job status until it is completed. Although the task isn’t doing significant work, it still
occupies a worker slot during the polling process. As worker slots are occupied, tasks are queued
and start times are delayed. The following image illustrates this process:
With deferrable operators, worker slots are released when a task is polling for the job status. When
the task is deferred, the polling process is offloaded as a trigger to the triggerer, and the worker slot
becomes available. The triggerer can run many asynchronous polling tasks concurrently, and this
prevents polling tasks from occupying your worker resources. When the terminal status for the job
is received, the operator resumes the task, taking up a worker slot while it finishes. The following
image illustrates the process:
Figure 4: Graph showing a deferrable operator releasing its worker slot while polling a Spark cluster for a job status.
USE DEFERRABLE OPERATORS
Deferrable operators should be used whenever you have tasks that occupy a worker slot while
polling for a condition in an external system. For example, using deferrable operators for sensor tasks
can provide efficiency gains and reduce operational costs.
Start a triggerer
To use deferrable operators, you must have a triggerer running in your Airflow environment. If you
are running Airflow on Astro or using the Astro CLI, the triggerer runs automatically if you are on
Astro Runtime 4.0 and later.
As tasks are raised into a deferred state, triggers are registered in the triggerer. You can set the
number of concurrent triggers that can run in a single triggerer process with the default_capacity
configuration setting in Airflow. This config can also be set with the
AIRFLOW__TRIGGERER__DEFAULT_CAPACITY environment variable. The default value is 1000.
Triggers are designed to be highly available. You can implement this by starting multiple triggerer
processes. Similar to the HA scheduler, Airflow ensures that they co-exist with correct locking and
high availability. See High Availability in the Airflow documentation for more information on this topic.
Many Airflow operators, for example the TriggerDagRunOperator and the WasbBlobSensor, can be
set to run in deferrable mode using the deferrable parameter. You can check if the operator you
want to use has a deferrable parameter in the Astronomer Registry.
To always use the deferrable version of an operator if it’s available in Airflow 2.7+, set the Airflow
config operators.default_deferrable to True. You can do so by defining the following
environment variable in your Airflow environment:
AIRFLOW__OPERATORS__DEFAULT_DEFERRABLE=True
After you set the variable, all operators with a deferrable parameter will run as their deferrable
version by default. You can override the config setting at the operator level using the deferrable
parameter directly:
trigger_dag_run = TriggerDagRunOperator(
    task_id="task_in_downstream_dag",
    trigger_dag_id="downstream_dag",
    wait_for_completion=True,
    poke_interval=20,
    deferrable=False,  # turns off deferrable mode just for this operator instance
)
Previously, before the deferrable parameter was available in regular operators, deferrable operators
were implemented as standalone operators, usually with an -Async suffix. Some of these operators
are still available. For example, the DateTimeSensor does not have a deferrable parameter, but has a
deferrable version called DateTimeSensorAsync.
You can also create your own deferrable operator. See Create a deferrable operator in our
documentation for template code.
Hooks
A hook is an abstraction over a specific API that allows Airflow to interact with an external system.
Hooks are built into many operators, but they can also be used directly in DAG code, either inside
functions decorated with Airflow decorators or in the python_callable provided to a
PythonOperator. Hooks standardize how Airflow interacts with external systems and using them makes your
DAG code cleaner, easier to read, and less prone to errors.
Over 300 hooks are available in the Astronomer Registry. If a hook isn’t available for your use case,
you can write your own and share it with the community.
To instantiate a hook, you typically only need a connection ID to connect with an external system.
All hooks inherit from the BaseHook class, which contains the logic to set up an external connection
with a connection ID. On top of making the connection to an external system, individual hooks can
contain additional methods to perform various actions within the external system. These methods
might rely on different Python libraries for these interactions. For example, the S3Hook relies on the
boto3 library to interact with Amazon S3.
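As a brief sketch of using a hook directly (assuming an existing Airflow connection with the ID aws_default and a placeholder bucket name), an S3Hook can be called inside a @task decorated function:

from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


@task
def list_bucket_keys():
    # the hook is instantiated with the ID of an existing Airflow connection
    hook = S3Hook(aws_conn_id="aws_default")
    # hook methods wrap boto3 calls to interact with S3
    keys = hook.list_keys(bucket_name="my-bucket")  # placeholder bucket
    print(keys)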
● Learn from the experts: how to write data pipelines with Airflow Introductory webinar
Airflow Connections

KEY CONCEPTS
An Airflow connection is a set of configurations that send requests to the API of an external tool. In
most cases, a connection requires login credentials or a private key to authenticate Airflow to the
external tool.
There are several ways to define connections in Airflow:

● Airflow UI
● Environment variables
● Airflow REST API
● Airflow CLI
Each connection has a unique conn_id which can be provided to operators and hooks that require a
connection.
There are a couple of ways to find the information you need to provide for a particular connection type:
● Open the relevant provider page in the Astronomer Registry and go to the first link under
Helpful Links to access the Apache Airflow documentation for the provider. Most commonly
used providers will have documentation on each of their associated connection types. For
example, you can find information on how to set up different connections to Azure in the
Azure provider docs.
● Check the documentation of the external tool you are connecting to and see if it offers guid-
ance on how to authenticate.
● Refer to the source code of the hook that is being used by your operator.
Figure 1: Screenshot of the Airflow UI showing how to navigate to the Connections view.
Airflow doesn’t provide any preconfigured connections. To create a new connection, click the blue
+ button.
As you update the Connection Type field, notice how the other available fields change. Each con-
nection type requires different kinds of information. Specific connection types are only available in
the dropdown list when the relevant Airflow provider package is installed in your environment.
You don’t have to specify every field for most connections. For example, to set up a connection to
a PostgreSQL database, you can reference the PostgreSQL Airflow provider documentation to learn
that the connection requires a Host, a user name as login, and a password in the password field.
Any parameters that don’t have specific fields in the connection form can be defined in the Extra
field as a JSON dictionary. For example, you can add the sslmode or a client sslkey in the Extra
field of your PostgreSQL connection.
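For example, the Extra field of a PostgreSQL connection could contain a JSON dictionary like the following; the values shown are placeholders:

{
    "sslmode": "require",
    "sslkey": "/path/to/client.key"
}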
Note: If you are synchronizing your project to a remote repository, don’t save sensitive in-
formation in your Dockerfile. In this case, use either a secrets backend, Airflow connections
defined in the UI, or the Astro Environment Manager.
The environment variable used for the connection must be formatted as AIRFLOW_CONN_MYCONNID
and can be provided as a Uniform Resource Identifier (URI) or in JSON.
URI is a format designed to contain all necessary connection information in one string, starting with
the connection type, followed by login, password, and host. In many cases a specific port, schema,
and additional parameters must be added.
# the general format of a URI connection defined as an environment variable,
# for example in your Dockerfile or .env file
AIRFLOW_CONN_MYCONNID='my-conn-type://login:password@host:port/schema?param1=val1&param2=val2'

# an example of a connection to Snowflake defined as a URI
AIRFLOW_CONN_SNOWFLAKE_CONN='snowflake://LOGIN:PASSWORD@/?account=xy12345&region=eu-central-1'
# the same connection defined in JSON format
AIRFLOW_CONN_MYCONNID='{
    "conn_type": "my-conn-type",
    "login": "my-login",
    "password": "my-password",
    "host": "my-host",
    "port": 1234,
    "schema": "my-schema",
    "extra": {
        "param1": "val1",
        "param2": "val2"
    }
}'
Writing DAGs

KEY CONCEPTS

● Top-level code: Code that is not contained in a task and is executed every time
the DAG file is parsed. You should avoid having long-running top-level code,
especially code that connects to external tools.
● Three core best practices: keep your tasks atomic, idempotent and modular.
● Automatic retries: a set of configurations that make individual Airflow tasks run
again automatically after a failure.
● Software development best practices: DAG code should be treated like any
software code: keep it in version control and use CI/CD best practices.
● Task dependencies: In Airflow you can set dependencies between tasks using
chain functions or bitshift operators.
● Trigger rules: By default an Airflow task only runs when all tasks it depends on
have completed successfully. Trigger rules allow you to modify this behavior.
● Task groups: Airflow tasks can be assigned to groups to make DAGs easier
to read.
THREE CORE DAG WRITING BEST PRACTICES
There are three best practices to follow, no matter the purpose of your Airflow pipeline:
● Atomicity: Atomicity refers to designing each task to complete one specific function in the
pipeline. In an ETL or ELT pipeline, each task in your DAG ideally accomplishes one step:
extracting, transforming, loading, or an ETL-adjacent task such as sending a report. This atomic
structure gives you full observability over your pipeline and enables you to rerun only failed
tasks without having to repeat earlier steps, saving time and compute.
● Idempotency: This is just a fancy way of saying “Same in - same out”. When creating an
Airflow pipeline, you ideally have idempotent tasks and DAGs, meaning with the same input
your task always has the same output. In ETL pipelines it is common to use timestamps to
partition your data. When using Airflow you can pull different timestamps relating to the DAG
run time from the Airflow context, for example the logical_date, to partition your data, ensuring
that the partition stays the same even if the DAG run is rerun months later.
● Modularity: Don’t repeat yourself - this software engineering best practice holds true for
Airflow DAGs. Since DAGs are defined in Python code, you can modularize your functions to
be imported into several DAGs. And if you’d like to standardize how your team interacts with
tools you can write your own custom operator classes.
A best practice closely related to modularity is to treat your DAGs like config files. Try to avoid having
supporting code like long SQL statements or Python scripts inside your DAG code. Instead,
modularize them in a separate location like an include folder to make the DAG easier to read and enable
reuse of your scripts across DAGs.
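As a sketch of this pattern (the module, variable, and connection names are hypothetical), a long SQL statement kept in the include folder can be imported into the DAG file:

# include/sql_statements.py (hypothetical helper module)
LONG_SQL_QUERY = "SELECT * FROM my_table;"

# dags/my_dag.py
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

from include.sql_statements import LONG_SQL_QUERY

create_report = SQLExecuteQueryOperator(
    task_id="create_report",
    conn_id="postgres_default",  # placeholder connection ID
    sql=LONG_SQL_QUERY,
)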
AVOID TOP-LEVEL CODE
By default Airflow parses all DAG files every 30 seconds. During this parse all top-level code, meaning
all code outside of Airflow tasks, is executed. If you have long-running code at the top level, it
can cause Airflow to slow down, and if you make connections and calls to external systems you can
even start incurring costs!
...

# Bad practice: the hook is instantiated, and a database connection made,
# every time the DAG file is parsed
hook = PostgresHook("postgres_conn")


@dag(
    ...
)
def bad_dag():
    ...
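A minimal sketch of the fix, assuming the same postgres_conn connection: moving the hook call inside a task means the connection is only made when the task actually runs, not at every parse.

from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def good_dag():
    @task
    def query_postgres():
        # the hook is only instantiated when the task executes
        hook = PostgresHook("postgres_conn")
        return hook.get_records("SELECT 1;")  # placeholder query

    query_postgres()


good_dag()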
A common reason users resort to top-level code when connecting to databases is the need to
dynamically generate tasks. For instance, they may want to create one loading task per table in a
schema but do not know the number or names of the tables in advance. A best practice for creating
pipelines that can adapt to such dynamic data at runtime is to use Dynamic Task Mapping, which
avoids top-level code and provides full visibility into the task copies created during each DAG run.
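As an illustrative sketch (the task and table names are hypothetical), Dynamic Task Mapping lets a task expand over a list that is only known at runtime:

from airflow.decorators import dag, task


@dag(start_date=None, schedule=None, catchup=False)
def dynamic_mapping_sketch():
    @task
    def get_tables():
        # in practice this list would be fetched from the database at runtime
        return ["customers", "orders", "products"]

    @task
    def load_table(table_name):
        print(f"Loading {table_name}")

    # one mapped task instance is created per table in the list
    load_table.expand(table_name=get_tables())


dynamic_mapping_sketch()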
AUTOMATIC RETRIES
Airflow allows you to define how often a task should be retried automatically before it fails, as well
as how long to wait between retries by using the retries and retry_delay task parameters. Retries are
especially important to make ETL and ELT pipelines that might run into concurrency limitations of
databases more resilient.
You can set retries for your whole Airflow environment by using a config variable:
AIRFLOW__CORE__DEFAULT_TASK_RETRIES=5
If you want all tasks in one of your DAGs to retry a different number of times you can override
the config at the DAG level using the default_args dictionary and at the task level by using task
parameters.
@dag(
    ...
    default_args={
        "retries": 3,
        "retry_delay": timedelta(seconds=30),
    },
)
def my_dag():
    # task-level parameters override the DAG-level default_args
    @task(retries=2, retry_delay=timedelta(minutes=10))
    def extract():
        # your code
        return "hi"
If you are not familiar with version control systems, see git’s entry level explanation videos. In short,
version control allows you to track changes to your code, making it possible to reverse specific edits
or merge updates from different developers working on separate branches.
When creating ETL and ELT pipelines with Airflow you should have at least two branches for your
code, the development (dev) branch where you make changes to test in a local development envi-
ronment and the production (prod) branch where your code gets promoted to after thorough testing.
Deploying from the production branch to the production environment should be automatic. Astro
customers can use the GitHub integration or our CI/CD template scripts.
As soon as you are working with a larger number of DAGs, in a team, and/or deploy business-critical
DAGs, we recommend having at least three environments:

● Development: a local or dedicated cloud environment where you build and test new DAGs and changes.
● Staging: a cloud environment that mirrors production, where changes are validated before release.
● Production: mapped to a cloud environment that powers your data products in production.

When your code changes are promoted from a lower to a higher environment they go through a
CI/CD process including automated testing.
Task dependencies
In Airflow DAGs, each task is a node in the graph and dependencies are the directed edges that de-
termine how to move through the graph.
● Upstream task: A task that must reach a specified state before a dependent task can run.
● Downstream task: A dependent task that cannot run until an upstream task reaches a
specified state.
The focus of this section is dependencies between tasks in the same DAG. If you need to
implement dependencies between DAGs, see the Cross-DAG dependencies guide in our
documentation.
For example, if you have a DAG with four sequential tasks (t0 ,t1, t2, t3) the dependencies can be
set in the following ways:
● Using set_downstream():
t0.set_downstream(t1)
t1.set_downstream(t2)
t2.set_downstream(t3)
● Using set_upstream():
t3.set_upstream(t2)
t2.set_upstream(t1)
t1.set_upstream(t0)
● Using <<:

t3 << t2
t2 << t1
t1 << t0
Figure 2: Screenshot of the Airflow DAG generated using the above dependency setting methods.
To set a dependency where two downstream tasks are dependent on the same upstream task, use
lists. For example:

t0 >> [t1, t2]
Figure 3: Screenshot of the Airflow DAG generated using the above dependency setting methods.
CHAIN FUNCTIONS
Chain functions (chain and chain_linear) are utilities that let you set dependencies between
several tasks or lists of tasks. A common reason to use chain functions over bit-shift operators is to
create dependencies for tasks that were created in a loop and are stored in a list.
list_of_tasks = []

for i in range(5):
    if i % 3 == 0:
        ta = EmptyOperator(task_id=f"ta_{i}")
        list_of_tasks.append(ta)
    else:
        ta = EmptyOperator(task_id=f"ta_{i}")
        tb = EmptyOperator(task_id=f"tb_{i}")
        tc = EmptyOperator(task_id=f"tc_{i}")
        list_of_tasks.append([ta, tb, tc])

chain(*list_of_tasks)
Figure 4: Screenshot of the Airflow DAG generated from the code above.
To set parallel dependencies between tasks and lists of tasks of the same length, use the chain
function:
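A minimal sketch of such a dependency definition, with t0 through t5 standing in for EmptyOperator tasks:

from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

t0 = EmptyOperator(task_id="t0")
t1 = EmptyOperator(task_id="t1")
t2 = EmptyOperator(task_id="t2")
t3 = EmptyOperator(task_id="t3")
t4 = EmptyOperator(task_id="t4")
t5 = EmptyOperator(task_id="t5")

chain(t0, [t1, t2], [t3, t4], t5)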
Figure 5: Screenshot of the Airflow DAG generated from the dependency definition shown above.
When you use the chain function, any lists or tuples that are set to depend directly on each other
need to be the same length.
chain([t0, t1], [t2, t3, t4]) # this code will cause an error
chain([t0, t1], t2, [t3, t4, t5]) # this code will work
To set interconnected dependencies between tasks and lists of tasks, use the chain_linear()
function. This function is available in Airflow 2.7+, in older versions of Airflow you can set similar
dependencies between two lists at a time using the cross_downstream() function.
Replacing chain in the previous example with chain_linear creates dependencies where each
element in the downstream list will depend on each element in the upstream list.
Figure 6: Screenshot of the Airflow DAG generated from the dependency definition shown above.
The chain_linear function can accept lists of any length in any order. For example, the following
arguments are valid:
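A sketch of such a call, with t0 through t6 standing in for previously defined tasks:

chain_linear(t0, [t1, t2, t3], [t4, t5], t6)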
Figure 7: Screenshot of the Airflow DAG generated from the dependency definition shown above.
When you use the TaskFlow API, passing the output of one task into another automatically sets the
dependency between them:

@task
def get_num():
    return 42


@task
def add_one(num):
    return num + 1


@task
def add_two(num):
    return num + 2


num = get_num()
add_one(num)
add_two(num)
Figure 8: Screenshot of the Airflow DAG generated from the code above.
You can also set dependencies between tasks created with the TaskFlow API and tasks created with
traditional operators:

@task
def t0():
    return "hi"


t1 = BashOperator(
    task_id="t1",
    bash_command="echo 'hello'",
)

_t0 = t0()

chain(_t0, t1)
TRIGGER RULES
In Airflow, trigger rules are used to determine when a task should run in relation to the previous task.
By default, Airflow runs a task when all directly upstream tasks are successful. However, you can
change this behavior by setting the trigger_rule parameter in the task definition, as in the following
two equivalent examples:
@task
def upstream_task():
    return "Hello..."


@task(trigger_rule="all_success")
def downstream_task():
    return " World!"


chain(upstream_task(), downstream_task())
upstream_task = EmptyOperator(task_id="upstream_task")

downstream_task = EmptyOperator(
    task_id="downstream_task",
    trigger_rule="all_success",
)

chain(upstream_task, downstream_task)
Airflow supports the following trigger rules:

● all_success: (default) The task runs only when all upstream tasks have succeeded.
● all_failed: The task runs only when all upstream tasks are in a failed or upstream_failed
state.
● all_done: The task runs once all upstream tasks are done with their execution.
● all_skipped: The task runs only when all upstream tasks have been skipped.
● one_failed: The task runs when at least one upstream task has failed.
● one_success: The task runs when at least one upstream task has succeeded.
● one_done: The task runs when at least one upstream task has either succeeded or failed.
● none_failed: The task runs only when all upstream tasks have not failed or upstream_failed,
meaning they have succeeded or been skipped.
● none_failed_min_one_success: The task runs only when all upstream tasks have not
failed or upstream_failed, and at least one upstream task has succeeded.
● none_skipped: The task runs only when no upstream task is in a skipped state.
Trigger rules are especially important when using Airflow branching to ensure that tasks
downstream of the branching task run correctly.
Branching in Airflow
When designing your data pipelines, you may encounter use cases that require more complex task
flows than “Task A > Task B > Task C”. For example, you may have a use case where you need to
decide between multiple tasks to execute based on the results of an upstream task. Or you may
have a case where part of your pipeline should only run under certain external conditions. This can
be achieved using branching tasks.
The most common way to implement branching in Airflow is to use the @task.branch decorator,
which is the decorator version of the BranchPythonOperator. @task.branch accepts any Python
function as an input as long as the function returns a list of valid IDs for Airflow tasks that the DAG
should run after the function completes.
The following example uses the choose_branch task that returns one set of task IDs if the input
provided by the upstream task is greater than 0.5 and a different set if the result is less than or
equal to 0.5. Note that all task IDs returned by the branching task need to be valid IDs of tasks in the
same DAG.
@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def branching_example():
    @task
    def upstream():
        return 1

    @task.branch
    def choose_branch(decision_input):
        if decision_input > 0.5:
            return ["task_a", "task_b"]
        return ["task_c"]

    @task
    def task_a():
        print("Running task_a")

    @task
    def task_b():
        print("Running task_b")

    @task
    def task_c():
        print("Running task_c")

    @task(trigger_rule="none_failed")
    def downstream():
        print("Running downstream")

    _upstream = upstream()
    _choose_branch = choose_branch(decision_input=_upstream)

    chain(_choose_branch, [task_a(), task_b(), task_c()], downstream())


branching_example()
Whether you want to use the decorated version or the traditional operator is a question of personal
preference.
Branching often leads to one or more tasks being skipped, which can cause downstream tasks
to be skipped as well. To prevent accidental skipping, adjust the trigger rule of downstream
tasks to run even if some of their upstream tasks were skipped. In the DAG example above this
is accomplished for the downstream task by setting its trigger rule to none_failed.
@TASK.RUN_IF / @TASK.SKIP_IF
Airflow 2.10 added the possibility to run or skip individual @task decorated Airflow tasks at runtime,
based on whether a function provided to an additional decorator returns True or False.
The code snippet below shows how to use the @task.skip_if decorator with the skip_decision
function to skip a task based on its task_id.

def skip_decision(context):
    task_id_ending_to_skip = "_skip_me"
    return context["task_instance"].task_id.endswith(task_id_ending_to_skip)


@task.skip_if(skip_decision)
@task(task_id="say_bye_skip_me")
def say_bye():
    return "hello!"
The @task.run_if decorator functions analogously, in the below example only running tasks
whose task_id ends with _do_run.

@task.run_if(lambda context: context["task_instance"].task_id.endswith("_do_run"))
@task(task_id="say_hello_do_run")
def say_hello():
    return "hello!"
OT H E R B R A N C H O P E R ATO R S
Airflow offers a few other branching operators that work similarly to the BranchPythonOperator but
for more specific contexts:
● BranchSQLOperator: Branches based on whether a given SQL query returns true or false.
● BranchDayOfWeekOperator: Branches based on whether the current day is a specified day of the week.
● BranchDateTimeOperator: Branches based on whether the current time falls between two specified datetimes.
Task Groups

Task groups let you:

● Organize complicated DAGs, visually grouping tasks that belong together in the Airflow UI.
● Dynamically map over groups of tasks, enabling complex dynamic patterns. See Dynamically
map task groups.
● Turn task patterns into modules that can be reused across DAGs or Airflow instances.
Figure 10: Screenshot of a DAG with 4 tasks, two of which (task_1, task_2) are part of a task group (my_task_group).
WHEN TO USE TASK GROUPS
Task groups are most often used to visually organize complicated DAGs. For example, you might use
task groups:
● In big ELT/ETL DAGs, where you have a task group per table or schema.
● In MLOps DAGs, where you have a task group per model being trained.
● In DAGs owned by several teams, where you have task groups to visually separate the tasks
that belong to each team. Although in this case, it might be better to separate the DAG into
multiple DAGs and use Datasets to connect them.
● When you are using the same patterns of tasks in multiple DAGs and want to create a reus-
able module.
● When you have an input of unknown length, for example an unknown number of files in a di-
rectory. You can use task groups to dynamically map over the input and create a task group
performing sets of actions for each file. This is the only way to dynamically map sequential
tasks in Airflow.
You can create task groups either with the TaskGroup context manager or with the @task_group
decorator. In most cases, it is a matter of personal preference which method you use. The only exception
is when you want to dynamically map over a task group; this is possible only when using @task_group.
The following code shows how to instantiate a simple task group containing two sequential tasks.
You can set dependencies both within and between task groups in the same way that you can with
individual tasks.
In the Airflow UI you can collapse and expand task groups as well as clear them to rerun all tasks
they contain.
t0 = EmptyOperator(task_id='start')


@task_group(group_id='my_task_group')
def tg1():
    t1 = EmptyOperator(task_id='task_1')
    t2 = EmptyOperator(task_id='task_2')

    t1 >> t2


t3 = EmptyOperator(task_id='end')

t0 >> tg1() >> t3
t0 = EmptyOperator(task_id='start')

with TaskGroup(group_id='my_task_group') as tg1:
    t1 = EmptyOperator(task_id='task_1')
    t2 = EmptyOperator(task_id='task_2')

    t1 >> t2

t3 = EmptyOperator(task_id='end')
t0 >> tg1 >> t3
Figure 11: Screenshot of the Airflow DAG generated from the code above.
You can use parameters to customize individual task groups. The two most important parameters
are the group_id which determines the name of your task group, as well as the default_args
which will be passed to all tasks in the task group. The following examples show task groups with
some commonly configured parameters:
@task_group(
    group_id="task_group_1",
    default_args={"conn_id": "postgres_default"},
    prefix_group_id=True,
    # parent_group=None,
    # dag=None,
)
def tg1():
    t1 = EmptyOperator(task_id="t1")


tg1()
TASK_ID IN TASK GROUPS
When your task is within a task group, your callable task_id will be group_id.task_id. This en-
sures the task_id is unique across the DAG. It is important that you use this format when referring
to specific tasks when working with XComs or branching. You can disable this behavior by setting
the task group parameter prefix_group_id=False.
For example, the task_1 task in the following DAG has a task_id of
my_outer_task_group.my_inner_task_group.task_1.
@task_group(group_id="my_outer_task_group")
def my_outer_task_group():
    @task_group(group_id="my_inner_task_group")
    def my_inner_task_group():
        EmptyOperator(task_id="task_1")

    my_inner_task_group()


my_outer_task_group()
When you use the @task_group decorator, you can pass data through the task group just like with
regular @task decorators:
import json
def task_group_example():
@task
def extract_data():
order_data_dict = json.loads(data_string)
return order_data_dict
@task
total_order_value = 0
total_order_value += value
@task
total_order_value = 0
total_order_value += value
@task_group
The Ultimate Guide to Apache Airflow® DAGs 4 Writing DAGs 67
def transform_values(order_data_dict):
return {
“avg”: transform_avg(order_data_dict),
“total”: transform_sum(order_data_dict),
@task
print(
load(transform_values(extract_data()))
task_group_example()
Figure 12: Screenshot of the Airflow DAG generated from the code above.
There are a few things to consider when passing information into and out of task groups:
● If downstream tasks require the output of tasks that are in the task group decorator, then the
task group function must return a result. In the previous example, a dictionary with two values
was returned, one from each of the tasks in the task group, which is then passed to the
downstream load() task.
NEST TASK GROUPS
You can nest task groups by defining a task group indented within another task group. There is no
limit to how many levels of nesting you can have.
groups = []
for g_id in range(1, 3):

    @task_group(group_id=f"group{g_id}")
    def tg1():
        t1 = EmptyOperator(task_id="task1")
        t2 = EmptyOperator(task_id="task2")

        sub_groups = []
        for s_id in range(1, 3):

            @task_group(group_id=f"sub_group{s_id}")
            def tg2():
                st1 = EmptyOperator(task_id="task1")
                st2 = EmptyOperator(task_id="task2")

            sub_groups.append(tg2())

    groups.append(tg1())
Figure 13: Screenshot of the Airflow DAG generated from the code above.
TASK GROUP DEPENDENCIES
This section will explain how to set dependencies between task groups.
Dependencies can be set both inside and outside of a task group. For example, in the following DAG
code there is a start task, a task group with two dependent tasks, and an end task. All of these tasks
need to happen sequentially. The dependencies between the two tasks in the task group are set
within the task group’s context (t1 >> t2). The dependencies between the task group and the start
and end tasks are set within the DAG’s context (t0 >> tg1() >> t3).
t0 = EmptyOperator(task_id="start")


@task_group(
    group_id="group1"
)
def tg1():
    t1 = EmptyOperator(task_id="task1")
    t2 = EmptyOperator(task_id="task2")

    t1 >> t2
# End task group definition


t3 = EmptyOperator(task_id="end")

t0 >> tg1() >> t3
Figure 14: Screenshot of the Airflow DAG generated from the code above.
You can also set dependencies between task groups, between tasks inside and out of task groups,
and even between tasks in different (nested) task groups.
The image below shows types of dependencies that can be set between tasks and task groups.
You can find the code that created this DAG in a GitHub repository both for the TaskFlow API and
traditional version.
Figure 15: Screenshot of the Airflow DAG with complex dependencies between tasks inside and outside of task groups.
MODULARIZED TASK GROUPS
If you have a common pattern of tasks, consider creating a Python class representing that pattern
inside a task group that can be instantiated with just a few parameters.
For example, imagine you have the same extract, transform, load pattern as a part of many of your
DAGs, and only a few parameters like a URL or key of interest change. You can standardize the pat-
tern in a custom class like the MyETLTaskGroup shown in the code below.
from airflow.decorators import task
from airflow.utils.task_group import TaskGroup


class MyETLTaskGroup(TaskGroup):
    """Reusable extract -> transform -> load pattern as a task group."""

    # the constructor signature is reconstructed; group_id, url, and key_of_interest
    # are the parameters that change between DAGs
    def __init__(self, group_id: str, url: str, key_of_interest: str, **kwargs):
        super().__init__(group_id=group_id, **kwargs)
        self.url = url

        @task(task_group=self)
        def extract(url: str):
            """Fetch JSON data from the given URL."""
            import requests

            response = requests.get(url)
            return response.json()

        @task(task_group=self)
        def transform(api_response: dict, key: str):
            """Return the value of interest from the API response."""
            return api_response[key]

        @task(task_group=self)
        def load(transformed_data):
            """Load the transformed data; here it is simply printed."""
            print(transformed_data)

        load(transform(api_response=extract(self.url), key=key_of_interest))
This class can be imported and used in several DAGs to create a three task pattern in just one line
of code.
@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def simple_etl_task_group():
    @task
    def get_url():
        return "https://catfact.ninja/fact"

    # instantiate the reusable pattern in one line (group_id and key are illustrative)
    MyETLTaskGroup(group_id="my_etl_task_group", url=get_url(), key_of_interest="fact")


simple_etl_task_group()
You can learn more about task groups, including custom modularized task groups in this guide and
this webinar.
XCom
XCom stands for “cross-communication” and is Airflow’s default way of passing data between tasks.
XComs are defined by a key, value, and timestamp.
XComs can be “pushed”, meaning sent by a task, or “pulled”, meaning received by a task. When an
XCom is pushed, it is stored in the Airflow metadata database and made available to all other tasks.
Any time a task returns a value (for example, when your Python callable for your PythonOperator
has a return value), that value is automatically pushed to XCom. Tasks can also be configured to
push XComs by calling the xcom_push() method. Similarly, xcom_pull() can be used in a task to
receive an XCom.
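As a minimal sketch of both patterns (the DAG and task names are illustrative), the first task below pushes an XCom explicitly under a custom key and the second task pulls it:

from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2024, 1, 1), schedule=None, catchup=False)
def xcom_example():
    @task
    def push_value(**context):
        # push an XCom explicitly with a custom key
        context["ti"].xcom_push(key="my_key", value=42)
        return "this return value is pushed to XCom automatically"

    @task
    def pull_value(**context):
        # pull the explicitly pushed XCom from the upstream task
        value = context["ti"].xcom_pull(task_ids="push_value", key="my_key")
        print(value)

    push_value() >> pull_value()


xcom_example()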
You can view your XComs in the Airflow UI by going to Admin > XComs.
The standard XCom backend stores all data that is being passed between tasks in a table in the Air-
flow metadata database.
Figure 18: Screenshot of a DAG pushing data to XCom and pulling data from XCom using a standard XCom backend.
When you use the standard XCom backend, the size-limit for an XCom is determined by your meta-
data database. Common sizes are:
● Postgres: 1 GB
● SQLite: 2 GB
● MySQL: 64 MB
The second limitation of the standard XCom backend is that only certain types of data can be serialized.
By default, only JSON-serializable data can be passed through XCom.
If you need to serialize other data types, you can do so using a custom XCom backend.
When using a custom XCom backend, the data is stored in a system external to Airflow, typically an
object storage solution like Amazon S3, GCS, or Azure blob storage. The Airflow metadata database
only saves a string referencing the data location, typically a URI.
Custom XCom backends allow data to be saved between tasks in any location and format, removing
limitations on data size and type.
Figure 19: Custom XCom backend. Data is serialized and written to external storage; the Airflow metadata database only stores a reference to the data's location.
There are two main ways you can create a custom XCom backend:
● Create your own custom XCom backend class, see our documentation for a code example.
● Use the Object Storage XCom backend, the easy way to set up a custom XCom backend
using only environment variables which was added in Airflow 2.9 (experimental).
The other common pattern is to explicitly write your data to external storage from within an Airflow
task using any tool to interact with the external storage solution. This pattern gives you complete
flexibility over what data is saved where and in which format from each task.
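For example, a task could write its result to object storage itself and only return the path. The following is a minimal sketch using the Object Storage feature (Airflow 2.8+); the aws_default connection and my-bucket bucket are assumptions:

from airflow.decorators import task
from airflow.io.path import ObjectStoragePath

# assumes an existing "aws_default" connection and a writable "my-bucket" bucket
base_path = ObjectStoragePath("s3://my-bucket/intermediate-data/", conn_id="aws_default")


@task
def write_intermediate_result():
    # write the (potentially large) data to external storage directly...
    path = base_path / "result.csv"
    with path.open("w") as f:
        f.write("a,b\n1,2\n")
    # ...and only pass the small string reference through XCom
    return str(path)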
It is a best practice to use a custom XCom backend in production environments if any significant
amount of data is passed through XCom. This has become simple to set up with the Object Storage
XCom backend.
For example, the following environment variables configure the Object Storage XCom backend to store any XCom larger than 1000 bytes in an S3 bucket as a zip-compressed file (the connection and bucket names are illustrative):

AIRFLOW_CONN_MY_AWS_CONN='{
    "conn_type": "aws"
}'
AIRFLOW__CORE__XCOM_BACKEND="airflow.providers.common.io.xcom.backend.XComObjectStorageBackend"
AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_PATH="s3://my_aws_conn@my-bucket"
AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_THRESHOLD="1000"
AIRFLOW__COMMON_IO__XCOM_OBJECTSTORAGE_COMPRESSION="zip"
You can learn more about the Airflow Object Storage feature in this tutorial and about the Object
Storage custom XCom backend in this tutorial.
Pass the .output values of traditional Operator-based tasks to the callable of TaskFlow tasks to
automatically create a relationship between the two tasks. For example:
def first_task_callable():
    return "hello"  # the return value is an illustrative assumption

first_task = PythonOperator(
    task_id="first_task",
    python_callable=first_task_callable,
)

# .output is a reference to the XCom pushed by first_task
first_task_result = first_task.output

@task
def second_task(first_task_result_value):
    print(first_task_result_value)

second_task(first_task_result)
TaskFlow tasks, when called, return a reference (referred to internally as an XComArg) that can be
passed into templateable fields of traditional operators to automatically create a relationship be-
tween the two tasks.
The list of templateable fields varies by operator. Astronomer maintains a searchable registry of
operators at the Astronomer Registry that details which fields are templateable by default and users
can control which fields are templateable.
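For example, the reference returned by a @task-decorated function can be passed directly to a templated field such as the BashOperator's bash_command. This is a minimal sketch; the task and command are illustrative:

from airflow.decorators import task
from airflow.operators.bash import BashOperator


@task
def get_command():
    return "echo hello from a TaskFlow task"


run_command = BashOperator(
    task_id="run_command",
    # bash_command is a templated field, so the XComArg returned by get_command()
    # is resolved at runtime and the dependency is created automatically
    bash_command=get_command(),
)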
Here’s how you could pass the result of a TaskFlow function to a traditional PythonOperator’s call-
able via an argument:
@task
def first_task():
    return "hello"

first_task_result = first_task()

def second_task_callable(x):
    uppercase_text = x.upper()
    print(uppercase_text)

second_task = PythonOperator(
    task_id="second_task",
    python_callable=second_task_callable,
    # op_args is a templated field, so the XComArg is resolved at runtime
    op_args=[first_task_result],
)
You can find more examples of passing data between tasks here.
5
Scheduling DAGs
KEY CONCEPTS
To gain a better understanding of DAG scheduling, it’s important that you become familiar with the
following terms and parameters:
● Data Interval: A property of each DAG run that represents the period of data that each task
should operate on. For example, for a DAG scheduled hourly, each data interval begins at
the top of the hour (minute 0) and ends at the close of the hour (minute 59). The DAG run is
typically executed at the end of the data interval.
● Logical Date: The start of the data interval. It does not represent when the DAG will be exe-
cuted. Prior to Airflow 2.2, this was referred to as the execution date.
● Timetable: A DAG property that dictates the data interval and logical date for each DAG run
and determines when a DAG is scheduled.
● Run After: The earliest time the DAG can be scheduled. This date is shown in the Airflow UI,
and may be the same as the end of the data interval depending on your DAG’s timetable.
● Backfilling and Catchup: Related to scheduling. To learn more, see DAG Runs.
PARAMETERS
The following parameters ensure your DAGs run at the correct time:
● data_interval_start: Defines the start date and time of the data interval. A DAG’s time-
table will return this parameter for each DAG run. This parameter is created automatically by
Airflow, or is specified by the user when implementing a custom timetable.
● data_interval_end: Defines the end date and time of the data interval. A DAG’s timetable
will return this parameter for each DAG run. This parameter is created automatically by Air-
flow, or is specified by the user when implementing a custom timetable.
● schedule: Defines when a DAG will be run. This value is set at the DAG configuration level.
It accepts cron expressions, timedelta objects, timetables, and lists of datasets. The default
schedule is timedelta(days=1), which runs the DAG once per day if no schedule is de-
fined. If you trigger your DAG externally, set the schedule to None.
● start_date: The timestamp after which the first data interval for this DAG can start. Make
sure this is set to be a date in the past, at least one full data interval before the first intended
DAG run. For example if your DAG runs daily, the start_date should be at least one full day
before the first intended run.
● end_date: The last date your DAG will be executed. This parameter is optional.
Note: In Airflow 2.3 and earlier, schedule_interval is used instead of the schedule
parameter and it only accepts cron expressions or timedelta objects.
Cron-based schedules
For pipelines with straightforward scheduling needs, you can define a schedule in your DAG using:
● A cron expression.
● A cron preset.
● A timedelta object.
Cron expressions
You can pass any cron expression as a string to the schedule parameter in your DAG. For example, if
you want to schedule your DAG at 4:05 AM every day, you would use schedule='5 4 * * *'.
If you need help creating the correct cron expression, see crontab guru.
Cron presets
Airflow can utilize cron presets for common, basic schedules. For example, schedule='@hourly'
will schedule the DAG to run at the beginning of every hour. For the full list of presets, see Cron Pre-
sets. If your DAG does not need to run on a schedule and will only be triggered manually or external-
ly triggered by another process, you can set schedule=None.
Timedelta objects
If you want to schedule your DAG on a particular cadence (hourly, every 5 minutes, etc.) rather
than at a specific time, you can pass a timedelta object imported from the datetime package to the
schedule parameter. For example, schedule=timedelta(minutes=30) will run the DAG every thir-
ty minutes, and schedule=timedelta(days=1) will run the DAG every day.
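As a minimal sketch, the same DAG could use any of these three options (the DAG name and values are illustrative):

from datetime import timedelta

from airflow.decorators import dag
from pendulum import datetime


@dag(
    start_date=datetime(2024, 1, 1),
    schedule="5 4 * * *",  # cron expression: every day at 4:05 AM
    # schedule="@hourly",              # cron preset: at the start of every hour
    # schedule=timedelta(minutes=30),  # timedelta object: every thirty minutes
    catchup=False,
)
def my_scheduled_dag():
    pass


my_scheduled_dag()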
LIMITATIONS OF CRON-BASED SCHEDULES
The relationship between a DAG’s schedule and its logical_date leads to particularly unintuitive
results when the spacing between DAG runs is irregular. The most common example of irregular
spacing is when DAGs run only during business days from Monday to Friday. In this case, the DAG
run with a Friday logical_date will not run until Monday, even though the data from Friday is avail-
able on Saturday. A DAG that summarizes results at the end of each business day can’t be set using
only a cron-based schedule. In Airflow 2.2 and earlier, you must schedule the DAG to run every day
(including Saturday and Sunday) and include logic in the DAG to skip all tasks on the days the DAG
doesn’t need to run.
Cron-based schedules also cannot handle use cases such as the following:
● Schedule a DAG at different times on different days. For example, 2:00 PM on Thursdays and
4:00 PM on Saturdays.
● Schedule a DAG at multiple times daily with uneven intervals. For example, 1:00 PM and 4:30 PM.
If you want to implement more complex time-based schedules like these, use a custom timetable;
see Timetables in our documentation for an example.
Data-driven Scheduling
With Airflow Datasets, DAGs that access the same data can have explicit, visible relationships, and
DAGs can be scheduled based on updates to these datasets. This feature helps make Airflow da-
ta-aware and expands Airflow scheduling capabilities beyond time-based methods such as cron.
Datasets can help resolve common issues. For example, consider a data engineering team with a
DAG that creates a dataset and a machine learning team with a DAG that trains a model on the data-
set. Using datasets, the machine learning team’s DAG runs only when the data engineering team’s
DAG has produced an update to the dataset.
Datasets also allow you to:
● Standardize communication between teams. Datasets can function like an API to communicate
when data in a specific location has been updated and is ready for use.
● Reduce the amount of code necessary to implement cross-DAG dependencies. Even if your
DAGs don’t depend on data updates, you can create a dependency that triggers a DAG after
a task in another DAG updates a dataset.
● Get better visibility into how your DAGs are connected and how they depend on data. The
Datasets tab in the Airflow UI shows a graph of all dependencies between DAGs and data-
sets in your Airflow environment.
● Reduce costs, because datasets do not use a worker slot in contrast to sensors or other im-
plementations of cross-DAG dependencies.
● Create cross-deployment dependencies using the Airflow REST API. Astronomer customers
can use the Cross-deployment dependencies best practices documentation for guidance.
● (Airflow 2.9+) Create complex data-driven schedules using Conditional Dataset Scheduling
and Combined Dataset and Time-based Scheduling.
DATASET CONCEPTS
You can define datasets in your DAG code and use them to create cross-DAG dependencies. Airflow
uses the following terms related to the datasets feature:
● Dataset: an object that is defined by a unique URI. Airflow parses the URI for validity and
there are some constraints on how you can define it. If you want to avoid validity parsing,
prefix your dataset name with x- for Airflow to treat it as a string. See What is a valid URI? for
detailed information.
● Dataset event: an event that is attached to a dataset and created whenever a producer task
updates that particular dataset. A dataset event is defined by being attached to a specif-
ic dataset plus the timestamp of when a producer task updated the dataset. Optionally, a
dataset event can contain an extra dictionary with additional information about the dataset or
dataset event.
● Dataset schedule: the schedule of a DAG that is triggered as soon as dataset events for
one or more datasets are created. All datasets a DAG is scheduled on are shown in the DAG
graph in the Airflow UI, as well as reflected in the dependency graph of the Datasets tab.
● Producer task: a task that produces updates to one or more datasets provided to its outlets
parameter, creating dataset events when it completes successfully.
● Queued dataset event: It is common to have DAGs scheduled to run as soon as a set of
datasets have received at least one update each. While there are still dataset events missing
to trigger the DAG, all dataset events for other datasets the DAG is scheduled on are queued
dataset events. A queued dataset event is defined by its dataset, timestamp and the DAG it
is queuing for. One dataset event can create a queued dataset event for several DAGs. As of
Airflow 2.9, you can access queued Dataset events for a specific DAG or a specific dataset
programmatically, using the Airflow REST API.
● DatasetAlias (Airflow 2.10+): an object that can be associated to one or more datasets and
used to create schedules based on datasets created at runtime, see Use dataset aliases. A
dataset alias is defined by a unique name.
● Metadata (Airflow 2.10+): a class to attach extra information to a dataset from within the pro-
ducer task. This functionality can be used to pass dataset-related metadata between tasks,
see Attaching information to a dataset event.
Two parameters relating to Airflow datasets exist in all Airflow operators and decorators:
● Outlets: a task parameter that contains the list of datasets a specific task produces updates
to, as soon as it completes successfully. All outlets of a task are shown in the DAG graph in
the Airflow UI, as well as reflected in the dependency graph of the Datasets tab as soon as
the DAG code is parsed, i.e. independently of whether or not any dataset events have oc-
curred. Note that Airflow is not yet aware of the underlying data. It is up to you to determine
which tasks should be considered producer tasks for a dataset. As long as a task has an
outlet dataset, Airflow considers it a producer task even if that task doesn’t operate on the
referenced dataset.
● Inlets: a task parameter that contains the list of datasets a specific task has access to, typ-
ically to access extra information from related dataset events. Defining inlets for a task does
not affect the schedule of the DAG containing the task and the relationship is not reflected in
the Airflow UI.
To summarize, tasks produce updates to datasets given to their outlets parameter, and this action
creates dataset events. DAGs can be scheduled based on dataset events created for one or more
datasets, and tasks can be given access to all events attached to a dataset by defining the dataset
as one of their inlets. A dataset is defined as an object in the Airflow metadata database as soon as
it is referenced in either the outlets parameter of a task or the schedule of a DAG.
Keep the following considerations in mind when working with datasets:
● Dataset events are only registered by DAGs or listeners in the same Airflow environment.
If you want to create cross-Deployment dependencies with datasets, you will need to use
the Airflow REST API to create a dataset event in the Airflow environment where your down-
stream DAG is located. See the Cross-deployment dependencies for an example implemen-
tation on Astro.
● Airflow monitors datasets only within the context of DAGs and tasks. It does not monitor
updates to datasets that occur outside of Airflow. For example, Airflow will not notice if you
manually add a file to an S3 bucket referenced by a dataset. To create Airflow dependencies
based on outside events, use Airflow sensors.
● The Datasets tab in the Airflow UI provides an overview of recent dataset events and existing
datasets, as well as a graph showing all dependencies between DAGs containing producing
tasks, datasets, and consuming DAGs. See Datasets tab for more information.
As of Airflow 2.8, you can use listeners to enable Airflow to run any code when certain
dataset events occur anywhere in your Airflow instance. There are two listener hooks for the
following events:
- on_dataset_created
- on_dataset_changed
For examples, refer to our Create Airflow listeners tutorial. Dataset Events listeners are an
experimental feature.
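A minimal sketch of such a listener, assuming the file and plugin names shown in the comments, could look like this:

# include/listeners/dataset_listeners.py (path is illustrative)
from airflow.datasets import Dataset
from airflow.listeners import hookimpl


@hookimpl
def on_dataset_changed(dataset: Dataset):
    # runs whenever a dataset event is created for any dataset in this Airflow instance
    print(f"A change to {dataset.uri} was recorded.")


# plugins/dataset_listener_plugin.py (path is illustrative)
from airflow.plugins_manager import AirflowPlugin

from include.listeners import dataset_listeners


class DatasetListenerPlugin(AirflowPlugin):
    name = "dataset_listener_plugin"
    listeners = [dataset_listeners]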
DATASET DEFINITION
A dataset is defined as an object in the Airflow metadata database as soon as it is referenced in ei-
ther the outlets parameter of a task or the schedule of a DAG. Airflow 2.10 added the ability to create
dataset aliases, see Use Dataset Aliases.
The simplest dataset schedule is one DAG scheduled based on updates to one dataset which is pro-
duced to by one task. In this example we define that the my_producer_task task in the my_produc-
er_dag DAG produces updates to the s3://my-bucket/my-key/ dataset, creating attached dataset
events, and schedule the my_consumer_dag DAG to run once for every dataset event created.
@dag(
    start_date=None,
    schedule=None,
    catchup=False,
)
def my_producer_dag():
    @task(outlets=[Dataset("s3://my-bucket/my-key/")])
    def my_producer_task():
        pass

    my_producer_task()

    # the same outlet can be set on a traditional operator instead:
    # BashOperator(
    #     task_id="my_producer_task",
    #     bash_command="echo 'hello'",
    #     outlets=[Dataset("s3://my-bucket/my-key/")],
    # )


my_producer_dag()
Figure 1: Screenshot of the Airflow UI Datasets tab showing the DAG containing the producing task connected to the dataset it updates.
In Airflow 2.9+ the graph view of the my_producer_dag shows the dataset as well.
Figure 2: Screenshot of the Airflow UI showing a DAG graph with a producing task connected to the dataset it updates.
Next, schedule the my_consumer_dag to run as soon as a new dataset event is produced to the
s3://my-bucket/my-key/ dataset.
@dag(
schedule=[Dataset(“s3://my-bucket/my-key/”)],
catchup=False,
def my_consumer_dag():
EmptyOperator(task_id=”empty_task”)
my_consumer_dag()
You can see the relationship between the DAG containing the producing task (my_producer_dag),
the consuming DAG my_consumer_dag and the dataset in the Dependency Graph located in the
Datasets tab of the Airflow UI. Note that this screenshot is using Airflow 2.10 and the UI might look
different in previous versions.
Figure 4: Screenshot of the Airflow UI showing the DAG graph of a DAG scheduled on one dataset.
Figure 5: Screenshot of the Airflow UI showing the two DAGs my_consumer_dag and my_producer_dag after the producer DAG has completed a run.
In Airflow 2.10+ you have the option to create dataset aliases to schedule DAGs based on datasets
with URIs generated at runtime. A dataset alias is defined by a unique name string and can be used
in place of a regular dataset in outlets and schedules. Any number of dataset events updating differ-
ent datasets can be attached to a dataset alias.
my_alias_name = "my_alias"


@task(outlets=[DatasetAlias(my_alias_name)])
def attach_event_to_alias_metadata():
    bucket_name = "my-bucket"  # illustrative; in practice often determined at runtime
    yield Metadata(
        Dataset(f"s3://{bucket_name}/my-task"),
        extra={"k": "v"},  # extra is optional
        alias=my_alias_name,
    )


attach_event_to_alias_metadata()

Alternatively, you can attach a dataset event to an alias from within the producer task using the
outlet_events context variable:

my_alias_name = "my_alias"


@task(outlets=[DatasetAlias(my_alias_name)])
def attach_event_to_alias_context(**context):
    bucket_name = "my-bucket"  # illustrative; in practice often determined at runtime
    outlet_events = context["outlet_events"]
    outlet_events[my_alias_name].add(
        Dataset(f"s3://{bucket_name}/my-task"), extra={"k": "v"}
    )  # extra is optional


attach_event_to_alias_context()
In the consuming DAG you can use a dataset alias in place of a regular dataset.
my_alias_name = "my_alias"


@dag(
    start_date=datetime(2024, 8, 1),
    schedule=[DatasetAlias(my_alias_name)],
    catchup=False,
)
def my_consumer_dag():
    EmptyOperator(task_id="empty_task")


my_consumer_dag()
Figure 6: Screenshot of the Airflow UI showing the my_consumer_dag having a Schedule of Unresolved DatasetAlias.
Figure 7: Screenshot of the Airflow UI showing the my_consumer_dag being scheduled on the s3://my-bucket/my-task
dataset after its event has been attached to the dataset alias.
Any further dataset event for the s3://my-bucket/my-task dataset will now trigger the my_con-
sumer_dag. If you attach dataset events for several datasets to the same dataset alias, a DAG
scheduled on that dataset alias will run as soon as any of the datasets that were ever attached to the
dataset alias receive an update.
See Dynamic data events emitting and dataset creation through DatasetAlias for more information
and examples of using dataset aliases.
To use Dataset Aliases with traditional operators, you need to attach the dataset event to the alias
inside the operator logic. If you are using operators besides the PythonOperator, you can either do
so in a custom operator’s .execute method or by passing a post_execute callable to existing
operators (experimental). Use outlet_events when attaching dataset events to aliases in tradition-
al or custom operators. Note that for deferrable operators, attaching a dataset event to an alias is
only supported in the execute_complete or post_execute method.
my_alias_name = "my_alias"

def add_dataset_event_to_alias(context, result):  # illustrative post_execute callable
    uri = "s3://my-bucket/my_file.txt"
    context["outlet_events"][my_alias_name].add(Dataset(uri))

BashOperator(
    task_id="t2",
    bash_command="echo hi",
    outlets=[DatasetAlias(my_alias_name)],
    post_execute=add_dataset_event_to_alias,
)
UPDATE A DATASET
As of Airflow 2.9+ there are three ways to update a dataset:
● A task with an outlet parameter that references the dataset completes successfully.
● You manually create a dataset event for the dataset in the Airflow UI, for example from the Datasets dependency graph.
● You create a dataset event for the dataset using the Airflow REST API.
Figure 8: Screenshot of the Airflow UI showing the button to manually update a Dataset in the Dependency Graph
SCHEDULE A DAG ON DATASETS
Any number of datasets can be provided to the schedule parameter. Keep the following behaviors
in mind when scheduling a DAG on datasets:
● Consumer DAGs that are scheduled on a dataset are triggered every time a task that updates
that dataset completes successfully. For example, if task1 and task2 both produce data-
set_a, a consumer DAG of dataset_a runs twice - first when task1 completes, and again
when task2 completes.
● Consumer DAGs scheduled on a dataset are triggered as soon as the first task with that
dataset as an outlet finishes, even if there are downstream producer tasks that also operate
on the dataset.
● Consumer DAGs scheduled on multiple datasets run as soon as their expression is fulfilled
by at least one dataset event per dataset in the expression. This means that it does not mat-
ter to the consuming DAG whether a dataset received additional updates in the meantime,
it consumes all queued events for one dataset as one input. See Multiple Datasets for more
information.
● DAGs that are triggered by datasets do not have the concept of a data interval. If you need
information about the triggering event in your downstream DAG, you can use the parameter
triggering_dataset_events from the context. This parameter provides a list of all the
triggering dataset events with the parameters [timestamp, source_dag_id, source_
task_id, source_run_id, source_map_index]. See Retrieving information in a down-
stream task for an example, or see the sketch below.
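The following is a minimal sketch of a consumer DAG reading the triggering dataset events; the dataset URI matches the earlier examples:

from airflow.datasets import Dataset
from airflow.decorators import dag, task
from pendulum import datetime


@dag(
    start_date=datetime(2024, 1, 1),
    schedule=[Dataset("s3://my-bucket/my-key/")],
    catchup=False,
)
def read_triggering_events_dag():
    @task
    def print_triggering_dataset_events(triggering_dataset_events=None):
        # maps each triggering dataset to the list of dataset events that caused this run
        for dataset, dataset_event_list in triggering_dataset_events.items():
            print(dataset, dataset_event_list)

    print_triggering_dataset_events()


read_triggering_events_dag()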
CONDITIONAL DATASET SCHEDULING
In Airflow 2.9 and later, you can use logical operators to combine any number of datasets provided
to the schedule parameter. The logical operators supported are | for OR and & for AND.
For example, to schedule a DAG on an update to either dataset1 or dataset2 and either dataset3
or dataset4, you can use the following syntax. Note that the full statement is wrapped in ().
@dag(
    start_date=datetime(2024, 3, 1),
    schedule=(
        (Dataset("dataset1") | Dataset("dataset2"))
        & (Dataset("dataset3") | Dataset("dataset4"))
    ),
    catchup=False,
)
def downstream2_one_in_each_group():
    EmptyOperator(task_id="empty_task")


downstream2_one_in_each_group()
The DAG shown below runs on a time-based schedule defined by the 0 0 * * * cron expression,
which is every day at midnight. The DAG also runs when either dataset3 or dataset4 is updated.
@dag(
    start_date=datetime(2024, 3, 1),
    schedule=DatasetOrTimeSchedule(
        timetable=CronTriggerTimetable("0 0 * * *", timezone="UTC"),
        datasets=(Dataset("dataset3") | Dataset("dataset4")),
    ),
    catchup=False,
)
def toy_downstream3_dataset_and_time_schedule():
    EmptyOperator(task_id="empty_task")


toy_downstream3_dataset_and_time_schedule()
Timetables
Custom timetables can be registered as part of an Airflow plugin. They must be a subclass of Time-
table, and they should contain the following methods, both of which return a DataInterval with a
start and an end:
● next_dagrun_info: Returns the data interval for the DAG’s regular schedule
● infer_manual_data_interval: Returns the data interval when the DAG is manually triggered
Tip: You can find an example of a custom time table in our documentation.
CONTINUOUS TIMETABLE
You can run a DAG continuously with a pre-defined timetable. To use the ContinuousTimetable, set
the schedule of your DAG to “@continuous” and set max_active_runs to 1.
@dag(
    start_date=datetime(2023, 4, 18),
    schedule="@continuous",
    max_active_runs=1,
    catchup=False,
)
def my_continuous_dag():  # the DAG name is illustrative
    ...
This schedule will create one continuous DAG run, with a new run starting as soon as the previous
run has completed, regardless of whether the previous run succeeded or failed. Using a Continu-
ousTimetable is especially useful when sensors or deferrable operators are used to wait for highly
irregular events in external data tools.
6
Notifications and Alerts
KEY CONCEPTS
● Astro alerts: Advanced alerts including for Task Duration limits and Timeliness that Astro
customers can define in the Astro UI across DAGs and Deployments.
Open-source Airflow offers the option to define Airflow callbacks which are functions that execute
when a DAG or task enters a specific state, for example when it fails (on_failure_callback) or
completes successfully (on_success_callback).
Tip: Astro customers have the option to define complex alerting rules on all their DAGs across
deployments using Astro alerts and the Astro Observe feature.
Airflow Callbacks
In Airflow, you can define actions to be taken when a DAG or task reaches a specific state by
providing functions to *_callback parameters:
● on_skipped_callback: Invoked when a task is skipped. Added in Airflow 2.9, this callback
only exists at the task level, and is only invoked when an AirflowSkipException is raised,
not when a task is skipped due to other reasons, like a trigger rule. See Callback Types.
● on_execute_callback: Invoked right before a task begins executing. This callback only
exists at the task level.
● on_retry_callback: Invoked when a task is retried. This callback only exists at the task
level.
● sla_miss_callback: Invoked when a task or DAG misses its defined Service Level Agree-
ment (SLA). This callback is defined at the DAG level for DAGs with defined SLAs and will
be applied to every task. Note that SLAs can be counter-intuitive in open-source Airflow
and Astronomer recommends our customers use Task Duration and Timeliness Astro Alerts
instead.
The DAG code below shows a simple function, my_callback_function being executed for differ-
ent callback parameters at the DAG and task level. Note that you can set callbacks for all tasks in a
DAG by using the DAG-level parameter default_args.
from airflow.decorators import dag, task
from airflow.exceptions import AirflowSkipException
from pendulum import datetime, duration


def my_callback_function(context):
    """Print the task ID and state of the task instance that triggered the callback."""
    t_id = context["ti"].task_id
    t_state = context["ti"].state
    print(f"Task {t_id} finished in state {t_state}.")


@dag(
    start_date=datetime(2023, 4, 18),
    schedule=None,
    catchup=False,
    on_success_callback=[my_callback_function],
    on_failure_callback=[my_callback_function],
    sla_miss_callback=[my_callback_function],
    # callbacks provided in the default_args are given to all tasks in the DAG
    default_args={
        "on_execute_callback": [my_callback_function],
        "on_retry_callback": [my_callback_function],
        "on_success_callback": [my_callback_function],
        "on_failure_callback": [my_callback_function],
        "on_skipped_callback": [my_callback_function],
        "retries": 2,
        "retry_delay": duration(seconds=5),
    },
)
def callbacks_overview():
    @task(
        on_execute_callback=[my_callback_function],
        on_retry_callback=[my_callback_function],
        on_success_callback=[my_callback_function],
        on_failure_callback=[my_callback_function],
        on_skipped_callback=[my_callback_function],
    )
    def task_success():
        return 10

    @task
    def task_failing():
        return 10 / 0

    @task
    def task_skip():
        # skip this task on purpose to trigger the on_skipped_callback
        raise AirflowSkipException("This task is intentionally skipped.")

    task_success()
    task_failing()
    task_skip()


callbacks_overview()
You can provide any Python callable to the *_callback parameters or use Airflow notifiers. Notifi-
ers are pre-built classes to send alerts to various tools, for example to Slack using the SlackNotifier.
To execute multiple functions when a DAG or task reaches a certain state, you can provide several
callback items to the same callback parameter in a list.
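For example, a notifier can be passed to a callback parameter at the DAG level. The following is a minimal sketch using the SlackNotifier from the Slack provider package; the connection ID and channel are assumptions:

from airflow.decorators import dag
from airflow.operators.bash import BashOperator
from airflow.providers.slack.notifications.slack import SlackNotifier
from pendulum import datetime


@dag(
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
    on_failure_callback=[
        SlackNotifier(
            slack_conn_id="slack_api_default",  # assumed Slack connection ID
            text="The DAG {{ dag.dag_id }} failed!",
            channel="#alerts",  # assumed channel
        )
    ],
)
def notifier_example():
    BashOperator(task_id="bash_task", bash_command="echo hello")


notifier_example()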
The OSS notification library Apprise contains modules to send notifications to many services.
You can use Apprise with Airflow by installing the Apprise Airflow provider which contains
the AppriseNotifier. See the Apprise Airflow provider documentation and the Apprise example
DAG for more information and examples.
● How to monitor your pipelines with Airflow and Astro alerts webinar
7
Write Dynamic &
Adaptable DAGs
KEY CONCEPTS
Note that you can use dynamic tasks within dynamic DAGs.
Airflow tasks have two methods available to implement the map portion of dynamic task mapping.
For the task you want to map, you must pass all operator parameters through one of the following
functions.
● .expand(): This method passes the parameters that you want to map. A separate parallel
task is created for each input. For some instances of mapping over multiple parameters,
.expand_kwargs() is used instead.
● .partial(): This method passes any parameters that remain constant across all mapped
tasks which are generated by .expand().
In the following example, the add task uses both .partial() and .expand() to dynamically generate
three task runs.
from airflow.operators.python import get_current_context


@task(
    # optionally, you can set a custom index to display in the UI (Airflow 2.9+)
    map_index_template="{{ my_custom_map_index }}",
)
def add(x: int, y: int):
    # get the current context and define the custom map index variable
    context = get_current_context()
    context["my_custom_map_index"] = f"Input x={x}"
    return x + y


# .partial() sets the constant argument y, .expand() maps over three values of x
added_values = add.partial(y=10).expand(x=[1, 2, 3])
Example using a traditional operator (PythonOperator):
def add_function(x: int, y: int):
    return x + y


added_values = PythonOperator.partial(
    task_id="add",
    python_callable=add_function,
    op_kwargs={"y": 10},
    # optionally, you can set a custom index to display in the UI (Airflow 2.9+)
    map_index_template="{{ task.op_args }}",
).expand(op_args=[[1], [2], [3]])
Figure 1: Screenshot of the [] Mapped Tasks tab for the dynamic task add showing 3 dynamically mapped task instances
Defining the map_index_template parameter (Airflow 2.9+) is optional. If you don’t set it, the
default map index is used, which is an integer index starting from 0. It is a best practice to set
a custom index to make it easier to identify the mapped task instances in the Airflow UI. For
example, if you are mapping over a list of files, you can display the name of the file as the Map
Index in the Airflow UI.
The .expand method creates three mapped add tasks, one for each entry in the x input list.
The .partial method specifies a value for y that remains constant in each task.
When you work with mapped tasks, keep the following in mind:
● You can use the results of an upstream task as the input to a mapped task. The upstream
task must push a value in a dict or list form to XCom using the return_value key, which is
the default key for values returned from a @task decorated task.
● You can use the results of a mapped task as input to a downstream mapped task.
● You can have a mapped task that results in no task instances. For example, when your
upstream task that generates the mapping values returns an empty list. In this case, the
mapped task is marked skipped, and downstream tasks are run according to the trigger rules
you set. By default, downstream tasks are also skipped.
● Some parameters can’t be mapped. For example, task_id, pool, and many other BaseOp-
erator arguments.
● .expand() only accepts keyword arguments, i.e. you need to specify the parameter you are
mapping over.
● You can limit the number of mapped task instances for a particular task that run in parallel by
setting the following parameters in your dynamically mapped task:
○ Set a limit for parallel runs of a task across all DAG runs with the max_active_tis_
per_dag parameter.
○ Set a limit for parallel runs of a task within a single DAG run with the max_active_
tis_per_dagrun parameter.
● XComs created by mapped task instances are stored in a list and can be accessed by using
the map index of a specific mapped task instance. For example, to access the XComs
created by the third mapped task instance (map index of 2) of my_mapped_task, use
ti.xcom_pull(task_ids=['my_mapped_task'])[2]. The map_indexes parameter in the
.xcom_pull() method allows you to specify a list of map indexes of interest
(ti.xcom_pull(task_ids=['my_mapped_task'], map_indexes=[2])).
For additional examples of how to apply dynamic task mapping functions, see Dynamic Task Map-
ping in the official Airflow documentation.
In the Airflow UI, dynamic tasks are identified with a set of brackets [] following the task ID. All
mapped task instances are combined into one row on the grid view and one node in the graph view.
The number in the brackets shown in the DAG run graph is updated for each DAG run to reflect how
many mapped instances were created. The following screenshot shows a DAG run graph with two
tasks, the latter having 49 dynamically mapped task instances.
Figure 2: Screenshot of the graph of a DAG run with two tasks, the upstream get_fruits task as well as the dynamic task
map_fruits with [49] dynamically mapped task instances for this DAG run.
To see the logs and XCom pushed by each dynamically mapped task instance, click on the dynam-
ically mapped task, either in the DAG run graph or in the grid. Then click on [] Mapped Tasks and
select the mapped task instance you want to inspect.
MAPPING OVER SETS OF KEYWORD ARGUMENTS
To map a task over sets of keyword arguments, use the .expand_kwargs() method and pass it a list
of dictionaries, where each dictionary contains the keyword arguments for one mapped task instance:

t1 = BashOperator.partial(task_id="t1").expand_kwargs(
    [
        {"bash_command": "echo $WORD", "env": {"WORD": "hello"}},
        {"bash_command": "echo `expr length $WORD`", "env": {"WORD": "tea"}},
        {"bash_command": "echo ${WORD//e/X}", "env": {"WORD": "goodbye"}},
    ]
)

The task t1 will have three mapped task instances printing their results into the logs:
● Map index 0: hello
● Map index 1: 3
● Map index 2: goodbyX
DYNAMICALLY MAP TASK GROUPS

# creating a task group using the decorator with the dynamic input my_num
@task_group(group_id="group1")
def tg1(my_num):
    @task
    def print_num(num):
        return num

    @task
    def add_42(num):
        return num + 42

    print_num(my_num)
    add_42(my_num)


# creating 6 mapped task group instances of the task group group1 (inputs are illustrative)
tg1_object = tg1.expand(my_num=[19, 23, 42, 8, 7, 108])
This DAG dynamically maps over the task group group1 with different inputs for the my_num
parameter. 6 mapped task group instances are created, one for each input. Within each mapped task
group instance, two tasks run using that instance's value for my_num as an input.
8
Programmatically
Generate DAGs
KEY CONCEPTS
dag-factory
A popular way to generate DAG code is using dag-factory. This package allows you to simplify code
and reduce redundancy by defining DAGs in YAML format, as shown in the example below.
my_dag:
  schedule_interval: '0 0 * * *'
  default_args:
    owner: 'Astro'
    start_date: '2024-10-01'
  catchup: False
  tasks:
    extract:
      operator: airflow.operators.python.PythonOperator
      python_callable_name: extract
      python_callable_file: /usr/local/airflow/include/python_func.py
    transform:
      operator: airflow.operators.python.PythonOperator
      python_callable_name: transform
      python_callable_file: /usr/local/airflow/include/python_func.py
      dependencies: [extract]
    load:
      operator: airflow.operators.python.PythonOperator
      python_callable_name: load
      python_callable_file: /usr/local/airflow/include/python_func.py
      dependencies: [transform]
When dag-factory is installed in an Airflow environment, you can generate DAGs from YAML files
like this one with the following DAG generation code:
import dagfactory

# path to the YAML file above (the path shown is illustrative)
config_file = "/usr/local/airflow/dags/my_dag.yaml"
dag_factory = dagfactory.DagFactory(config_file)

dag_factory.clean_dags(globals())
dag_factory.generate_dags(globals())
There are other methods to programmatically create DAG code that allow for more customization
than dag-factory, see this guide for more information.
9
Testing
Airflow DAGs
KEY CONCEPTS
● Unit Tests: All custom Python code should be unit tested as you would in any software
application code.
● DAG validation tests: You can define Airflow-specific tests to ensure your DAGs meet organi-
zational standards, such as requiring tags on DAGs or restricting the use of certain operators.
● Integration Tests: Since Airflow interacts with many tools, it is wise to have end-to-end
integration tests; these typically run as pipelines in a staging environment.
Note that all these tests are about the DAG code itself, the data flowing through your pipeline is test-
ed by specialized tasks in your pipeline using Data Quality Checks.
If you use the Astro CLI to run your DAGs, you can add all of your tests to the tests folder and run them
with the command astro dev pytest (no matter which Python framework you are using for your tests).
UNIT TESTS
In Airflow you can create unit tests the same way as you would for any Python code with any Python
testing framework.
You should test all your custom operators, as well as all Python functions used in your tasks. If you
use operators from provider packages you don’t need to write unit tests for them, since these opera-
tors are already tested by the community.
This is an example of a unit test for a custom operator doing basic math created with the unittest
framework. The test makes sure that if given the inputs 2 and 3 as well as the operation +, the oper-
ator will return 5. You can find the full test suite for this basic custom operator in the corresponding
testing file here.
import unittest
from include.custom_operators import MyBasicMathOperator
class TestMyBasicMathOperator(unittest.TestCase):
def test_addition(self):
operator = MyBasicMathOperator(
task_id="basic_math_op", first_number=2, second_number=3, operation="+"
)
result = operator.execute(None)
self.assertEqual(result, 5)
DAG VALIDATION TESTS
DAG validation tests examine your DAGs according to your defined standards. You import the DAGs
using the Airflow DagBag class and create one test for each property you want to enforce, using any
Python testing framework.
The following is an example of using pytest to allow only one decorator (@task) and one operator
(SQLExecuteQueryOperator) in all DAGs.
import logging
import os
from contextlib import contextmanager

import pytest
from airflow.models import DagBag


@contextmanager
def suppress_logging(namespace):
    logger = logging.getLogger(namespace)
    old_value = logger.disabled
    logger.disabled = True
    try:
        yield
    finally:
        logger.disabled = old_value


def get_dags():
    """
    Generate a list of (dag_id, DAG object, fileloc) tuples for all DAGs in the DagBag.
    """
    with suppress_logging("airflow"):
        dag_bag = DagBag(include_examples=False)

    def strip_path_prefix(path):
        return os.path.relpath(path, os.environ.get("AIRFLOW_HOME"))

    return [(k, v, strip_path_prefix(v.fileloc)) for k, v in dag_bag.dags.items()]


ALLOWED_OPERATORS = [
    "_PythonDecoratedOperator",  # tasks created with the @task decorator
    "SQLExecuteQueryOperator",
]


@pytest.mark.parametrize(
    "dag_id,dag,fileloc", get_dags(), ids=[x[2] for x in get_dags()]
)
def test_dag_uses_only_allowed_operators(dag_id, dag, fileloc):
    """
    Test that a DAG only uses operators from the allowed list.
    """
    assert all(
        task.task_type in ALLOWED_OPERATORS for task in dag.tasks
    ), f"{dag_id} in {fileloc} uses operators that are not in the allowed list"

For more resources on testing Airflow DAGs, see:
● How to easily test your Airflow DAGs with the new dag.test() function webinar
● Use frameworks to test Apache Airflow data pipelines | Conf42 Python 2024
10
Scaling
Airflow DAGs
KEY CONCEPTS
If you’re running Airflow on Astronomer, you should modify these parameters with Astronomer envi-
ronment variables. For more information, see Environment Variables on Astronomer.
Environment-level settings
You should modify environment-level settings if you want to tune performance across all of the
DAGs in your Airflow environment. This is particularly relevant if you want your DAGs to run well on
your supporting infrastructure.
CORE SETTINGS
Core settings control the number of processes running concurrently and how long processes run
across an entire Airflow environment. The associated environment variables for all parameters in this
section are formatted as AIRFLOW__CORE__PARAMETER_NAME.
● parallelism: The maximum number of tasks that can run concurrently on each scheduler
within a single Airflow environment. For example, if this setting is set to 32, and there are two
schedulers, then no more than 64 tasks can be in a running or queued state at once across
all DAGs. If your tasks remain in a scheduled state for an extended period, you might want to
increase this value. The default value is 32.
On Astro, this value is set automatically based on your maximum worker count, meaning that
you don’t have to configure it.
● dagbag_import_timeout: How long, in seconds, the dagbag can spend importing DAG objects
before timing out. This value must be lower than the value set for dag_file_processor_timeout.
If your DAG processing logs show timeouts, or if your DAG is not showing up in the DAGs list
or in the import errors, try increasing this value. You can also try increasing this value if your
tasks aren't executing, since workers need to fill up the dagbag when tasks execute. The default
value is 30 seconds.
SCHEDULER SETTINGS
Scheduler config settings control how the scheduler parses DAG files and creates DAG runs. The as-
sociated environment variables for all parameters in this section are formatted as AIRFLOW__SCHED-
ULER__PARAMETER_NAME.
● min_file_process_interval: The frequency that each DAG file is parsed, in seconds. Up-
dates to DAGs are reflected after this interval. A low number increases scheduler CPU usage.
If you have dynamic DAGs created by complex code, you can increase this value to improve
scheduler performance. The default value is 30 seconds.
● dag_dir_list_interval: The frequency that the DAGs directory is scanned for new files,
in seconds. The lower the value, the faster new DAGs are processed and the higher the CPU
usage. The default value is 300 seconds (5 minutes).
It’s helpful to know how long it takes to parse your DAGs (dag_processing.total_parse_
time) to know what values to choose for min_file_process_interval and dag_dir_
list_interval. If your dag_dir_list_interval is less than the amount of time it takes to
parse each DAG, performance issues can occur.
If you have fewer than 200 DAGs in a Deployment on Astro, it’s safe to set AIRFLOW__
SCHEDULER__DAG_DIR_LIST_INTERVAL=30 (30 seconds) as a Deployment-level environment
variable.
● parsing_processes: How many processes the scheduler can run in parallel to parse DAGs.
Astronomer recommends setting a value that is twice your available vCPUs. Increasing this
value can help serialize a large number of DAGs more efficiently. If you are running multiple
schedulers, this value applies to each of them. The table below lists the default number of
parsing processes for each Astro Hosted deployment size.
[Table: default number of parsing processes per Astro Hosted Deployment size]
● file_parsing_sort_mode: Determines how the scheduler lists and sorts DAG files to
determine the parsing order. Set to one of: modified_time, random_seeded_by_host and
alphabetical. The default value is modified_time.
● scheduler_heartbeat_sec: Defines how often the scheduler should run (in seconds) to
trigger new tasks. The default value is 5 seconds.
● max_tis_per_query: Changes the batch size of queries to the metastore in the main sched-
uling loop. A higher value allows more task instances to be processed per query, but your
query may become too complex and cause performance issues. The default value is 16.
Note that the scheduler.max_tis_per_query value needs to be lower than the core.
parallelism value.
An example of setting these values through environment variables follows below.
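As a sketch, a few of the core and scheduler settings above could be set as environment variables; the values shown are illustrative, not recommendations:

AIRFLOW__CORE__PARALLELISM=64
AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=60
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=60
AIRFLOW__SCHEDULER__PARSING_PROCESSES=4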
DAG-level settings
DAG-level settings apply only to specific DAGs and are defined in your DAG code. You should modify
DAG-level settings if you want to performance tune a particular DAG, especially in cases where that
DAG is hitting an external system such as an API or a database that might cause performance issues
if hit too frequently. When a setting exists at both the DAG-level and environment-level, the DAG-lev-
el setting takes precedence.
There are three primary DAG-level Airflow settings that you can define in code:
● max_active_runs: The maximum number of active DAG runs allowed for the DAG. When
this limit is exceeded, the scheduler won’t create new active DAG runs. If this setting is not
defined, the value of the environment-level setting max_active_runs_per_dag is assumed.
If you’re utilizing catchup or backfill for your DAG, consider defining this parameter to ensure
that you don’t accidentally trigger a high number of DAG runs.
● max_active_tasks: The total number of tasks that can run at the same time for a given DAG
run. It essentially controls the parallelism within your DAG. If this setting is not defined, the
value of the environment-level setting max_active_tasks_per_dag is assumed.
● concurrency: The maximum number of task instances allowed to run concurrently across all
active DAG runs for a given DAG. This lets you, for example, allow one DAG to run 32 tasks at
once while another DAG is limited to 16 tasks at once. If this setting is not defined, the value of
the environment-level setting max_active_tasks_per_dag is assumed.
You can define any DAG-level settings within your DAG definition. For example:

@dag(
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    max_active_runs=2,      # values shown are illustrative
    max_active_tasks=16,
)
def my_dag():
    ...
Task-level settings
Task-level settings are defined by task operators that you can use to implement additional perfor-
mance adjustments. Modify task-level settings when specific types of tasks are causing perfor-
mance issues.
There are two primary task-level Airflow settings users can define in code:
● max_active_tis_per_dag: The maximum number of times that the same task can run
concurrently across all DAG runs. For instance, if a task pulls from an external resource, such
as a data table, that should not be modified by multiple tasks at once, then you can set this
value to 1.
● pool: Defines which pool the task runs in. Pools are a way to limit the number of concurrent
instances of an arbitrary group of tasks. This setting is useful if you have a lot of workers or
DAG runs in parallel, but you want to avoid an API rate limit or otherwise don't want to
overwhelm a data source or destination. For more information, see the Airflow Pools Guide.
The parameters above are inherited from the BaseOperator, so you can set them in any operator
definition. For example:
@task(
    pool="my_custom_pool",
    max_active_tis_per_dag=14,
)
def t1():
    pass
Depending on how you run Airflow, you may also need to adjust your executor settings when
scaling up your environment. See the documentation for your executor and deployment platform for details.
11
Debugging
Airflow DAGs
KEY CONCEPTS
General Airflow debugging approach
When you encounter an issue, start by asking yourself the following questions:
● Is the problem with Airflow, or is it with an external system connected to Airflow? Test if the
action can be completed in the external system without using Airflow.
● What is the state of your Airflow components? Inspect the logs of each component and re-
start your Airflow environment if necessary.
● Does Airflow have access to all relevant files? This is especially relevant when running Air-
flow in Docker or when using the Astro CLI.
● Are your Airflow connections set up correctly with correct credentials? See Troubleshooting
connections.
● Can you collect the relevant logs? For more information on log location and configuration,
see the Airflow logging guide.
● Which versions of Airflow and Airflow providers are you using? Make sure that you’re using
the correct version of the Airflow documentation.
● Can you reproduce the problem in a new local Airflow instance using the Astro CLI?
Answering these questions will help you narrow down what kind of issue you’re dealing with and
inform your next steps.
You can debug your DAG code with IDE debugging tools using the dag.test() method. See De-
bug interactively with dag.test().
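A minimal sketch of making a DAG file debuggable with dag.test() (the DAG itself is illustrative):

from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2024, 1, 1), schedule="@daily", catchup=False)
def my_debug_dag():
    @task
    def say_hello():
        print("hello")

    say_hello()


dag_object = my_debug_dag()

if __name__ == "__main__":
    # runs all tasks of this DAG in a single process, which you can attach a debugger to
    dag_object.test()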
Issues running Airflow locally
The most common issues related to the Astro CLI are:
● The Astro CLI was not correctly installed. Run astro version to confirm that you can suc-
cessfully run Astro CLI commands. If a newer version is available, consider upgrading.
● There are errors caused by custom commands in the Dockerfile, or dependency conflicts
with the packages in packages.txt and requirements.txt.
● Airflow components are in a crash-loop because of errors in custom plugins or XCom back-
ends. View scheduler logs using astro dev logs -s to troubleshoot.
To troubleshoot infrastructure issues when running Airflow on other platforms, for example in
Docker, on Kubernetes using the Helm Chart or on managed services, please refer to the relevant
documentation and customer support.
You can learn more about testing and troubleshooting locally with the Astro CLI in the Astro
documentation.
DAGS DON'T APPEAR IN THE AIRFLOW UI
If a DAG isn’t appearing in the Airflow UI, it’s typically because Airflow is unable to parse the DAG. If
this is the case, you’ll see an Import Error in the Airflow UI.
Figure 1: Screenshot of the Airflow UI showing a DAG import error due to a missing task_id.
The message in the import error can help you troubleshoot and resolve the issue.
To view import errors in your terminal, run astro dev run dags list-import-errors with the
Astro CLI, or run airflow dags list-import-errors with the Airflow CLI.
If you don’t see an import error message but your DAGs still don’t appear in the UI, try these
debugging steps:
● Make sure all of your DAG files are located in the dags folder.
● Airflow scans the dags folder for new DAGs every dag_dir_list_interval, which defaults to 5
minutes but can be modified. You might have to wait until this interval has passed before a
new DAG appears in the Airflow UI or restart your Airflow environment.
● Ensure that you have permission to see the DAGs, and that the permissions on the DAG file
are correct.
● Run astro dev run dags list with the Astro CLI or airflow dags list with the Airflow CLI to make
sure that Airflow has registered the DAG in the metadata database. If the DAG appears in the
list but not in the UI, try restarting the Airflow webserver.
● If you see an error in the Airflow UI indicating that the scheduler is not running, check the
scheduler logs to see if an error in a DAG file is causing the scheduler to crash. If you are
using the Astro CLI, run astro dev logs -s and then try restarting.
Figure 2: Screenshot of the Airflow UI showing a message indicating that the scheduler has been down for 33 seconds.
Also confirm that each DAG file:
● Contains the word airflow. The scheduler only parses files fulfilling this condition.
● Has its dag function called when defined with the @dag decorator. See Writing a simple DAG.
Note that in Airflow 2.10+, you can configure an Airflow listener as a plugin to run any Python code,
either when a new import error appears (on_new_dag_import_error) or when the dag processor
finds a known import error (on_existing_dag_import_error). See Airflow listeners for more
information.
DAGS NOT RUNNING CORRECTLY
If your DAGs are either not running or running differently than you intended, consider checking the
following common causes:
● DAGs need to be unpaused in order to run on their schedule. You can unpause a DAG by
clicking the toggle on the left side of the Airflow UI or by using the Airflow CLI. As of Airflow
2.2 creating a manual DAG run with the play button automatically unpauses a DAG.
Figure 3: Screenshot of the Airflow UI DAGs view with the toggle to unpause a DAG highlighted.
● If you want all DAGs unpaused by default, you can set dags_are_paused_at_creation=False in
your Airflow config. If you do this, remember to set catchup=False in your DAGs to prevent
automatic backfilling of DAG runs. Paused DAGs are unpaused automatically when you
manually trigger them.
● Double check that each DAG has a unique dag_id. If two DAGs with the same id are present
in one Airflow instance the scheduler will pick one at random every 30 seconds to display.
● Make sure your DAG has a start_date in the past. A DAG with a start_date in the future will
result in a successful DAG run with no task runs. Do not use datetime.now() as a start_
date.
● Test the DAG using astro dev dags test <dag_id>. With the Airflow CLI, run airflow
dags test <dag_id>.
● If no DAGs are running, check the state of your scheduler using astro dev logs -s.
● If too many runs of your DAG are being scheduled after you unpause it, you most likely need
to set catchup=False in your DAG’s parameters.
If your DAG is running, but not on the schedule you expected, review the Scheduling DAGs section
of this eBook. If you are using a custom timetable, ensure that the data interval for your DAG run
does not precede the DAG start date.
Common Task issues
This section covers common issues related to individual tasks you might encounter. If your entire
DAG is not working, see the DAGs are not running correctly section above.
TASKS ARE NOT RUNNING CORRECTLY
It is possible for a DAG to start but its tasks to be stuck in various states or to not run in the desired
order. If your tasks are not running as intended, try the following debugging methods:
● Double check that your DAG’s start_date is in the past. A future start_date will result in a
successful DAG run even though no tasks ran.
● If your tasks stay in a scheduled or queued state, ensure your scheduler is running
properly. If needed, restart the scheduler or increase scheduler resources in your Airflow
infrastructure.
● If your tasks have the depends_on_past parameter set to True, newly added tasks
won't run until you set a state for the corresponding task instances in prior DAG runs.
● When running many instances of a task or DAG, be mindful of scaling parameters and
configurations. Airflow has default settings that limit the amount of concurrently running
DAGs and tasks. See Scaling DAGs to learn more.
● If you are using task decorators and your tasks are not showing up in the Graph and Grid
view, make sure you are calling your tasks inside your DAG function.
● Check your task dependencies and trigger rules. Consider recreating your DAG structure
with EmptyOperators to ensure that your dependencies are structured as expected.
● The task_queued_timeout configuration controls how long tasks can be in queued state
before they are either retried or marked as failed. The default is 600 seconds.
● If you are using the CeleryExecutor in an Airflow version earlier than 2.6 and tasks get stuck
in the queued state, consider turning on stalled_task_timeout.
If tasks are failing, the task logs will provide you with more information to help troubleshoot. Most
task failure issues fall into one of 3 categories:
● Issues with operator parameter inputs. For example, passing a string to a parameter that
only accepts lists.
● Issues within the operator. For example, exceptions caused by outdated logic in a custom
operator, like using deprecated methods of an SDK when connecting to an external system.
● Issues in the external system the task interacts with. For example, an API that is
unreachable or credentials that are no longer valid.
Missing Logs
When you check your task logs to debug a failure, you may not see any logs. On the log page in the
Airflow UI, you may see a spinning wheel, or you may just see a blank file.
Generally, logs fail to appear when a process dies in your scheduler or worker and communication is
lost. The following are some debugging steps you can try:
● Try rerunning the task by clearing the task instance to see if the logs appear during the
rerun.
● Increase your log_fetch_timeout_sec configuration to greater than the 5 second default. This
parameter controls how long the webserver waits for the initial handshake when fetching
logs from the worker machines, and having extra time here can sometimes resolve issues.
● Increase the resources available to your workers (if using the CeleryExecutor) or scheduler
(if using the LocalExecutor).
● If you’re using the KubernetesExecutor and a task fails very quickly (in less than 15 seconds),
the pod running the task spins down before the webserver has a chance to collect the logs
from the pod. If possible, try building in some wait time to your task depending on which
operator you’re using. If that isn’t possible, try to diagnose what could be causing a
near-immediate failure in your task. This is often related to either lack of resources or an
error in the task configuration.
● Ensure that your logs are retained until you need to access them.
● Check your scheduler and webserver logs for any errors that might indicate why your task
logs aren’t appearing.
Astronomer customers can check the View Airflow component and task logs for a Deployment
page of our documentation for more information on how to find logs for Deployments running
on Astro. To evaluate potential resource bottlenecks, Astro also lets you view detailed
metrics about your Airflow components in the Astro UI.
Troubleshooting connections
Typically, Airflow connections are needed to allow Airflow to communicate with external systems.
Most hooks and operators expect a defined connection parameter. Because of this, improperly
defined connections are one of the most common issues Airflow users have to debug when first
working with their DAGs.
While the specific error associated with a poorly defined connection can vary widely, you will
typically see a message with “connection” in your task logs. If you haven’t defined a connection,
you’ll see a message such as ‘connection_abc’ is not defined.
● Make sure you have the necessary provider packages installed to be able to use a specific
connection type.
● Change the <external tool>_default connection to use your connection details or define a
new connection with a different name and pass the new name to the hook or operator.
● Define connections using Airflow environment variables instead of adding them in the Airflow
UI. Make sure you’re not defining the same connection in multiple places. If you do, check
the Connection options section to review the order of precedence for connections.
● Test if your credentials work when used in a direct API call to the external tool.
To find information about what parameters are required for a specific connection:
● Read provider documentation in the Astronomer Registry to access the Apache Airflow
documentation for the provider. Most commonly used providers will have documentation on
each of their associated connection types. For example, you can find information on how to
set up different connections to Azure in the Azure provider docs.
● Check the documentation of the external tool you are connecting to and see if it offers
guidance on how to authenticate.
● View the source code of the hook that is being used by your operator.
You can also test connections from within your IDE by using the dag.test() method. See Debug
interactively with dag.test() and How to test and debug your Airflow connections.
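Below is a minimal sketch of that workflow, assuming Airflow 2.5+ (where dag.test() was introduced) and the apache-airflow-providers-http package. The connection id, credentials, and endpoint are hypothetical, and the snippet also demonstrates the environment-variable approach to defining connections mentioned above.

import os

import pendulum
from airflow.decorators import dag, task
from airflow.providers.http.hooks.http import HttpHook

# Define the connection as an environment variable so no entry in the metadata
# database or the Airflow UI is needed (hypothetical credentials and host).
os.environ["AIRFLOW_CONN_MY_API"] = "http://username:password@api.example.com"


@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def connection_debugging():
    @task
    def call_api():
        # The hook picks up the "my_api" connection defined above.
        response = HttpHook(method="GET", http_conn_id="my_api").run(endpoint="/status")
        return response.status_code

    call_api()


dag_object = connection_debugging()

if __name__ == "__main__":
    # Runs the whole DAG in a single local process, so you can set breakpoints
    # in your IDE and inspect connection behavior interactively.
    dag_object.test()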
I need more help
The information provided here should help you resolve the most common issues. If your issue was
not covered in this guide, try the following resources:
● Join the Apache Airflow Slack and open a thread in #user-troubleshooting. The Airflow Slack
is the best place to get answers to more complex, Airflow-specific questions.
● Post your question to Stack Overflow, tagged with airflow and other relevant tools you are
using. Using Stack Overflow is ideal when you are unsure which tool is causing the error,
since experts for different tools will be able to see your question.
● If you found a bug in Airflow or one of its core providers, please open an issue in the Airflow
GitHub repository. For bugs in Astronomer open source tools like Cosmos, please open an
issue in the relevant Astronomer repository.
To get more specific answers to your question, include the following information in your question
or issue:
● Your method for running Airflow (Astro CLI, standalone, Docker, managed services).
● Your Airflow version and the versions of any relevant provider packages.
● The full error message and stack trace, as well as the relevant DAG code, if applicable.
Conclusion
Congratulations! You’ve learned all you need to know about how to write
Airflow DAGs and are now officially an Airflow magician. Welcome to the
coven!
What’s next? Beyond using your magic to write amazing DAGs delivering the
world’s data, you could consider deploying your DAGs to a free Astro trial.