Apache Airflow Certification - Study Guide For DAG Authoring
This study guide covers the Astronomer Certification DAG Authoring for Apache
Airflow. Apache Airflow is the leading orchestrator for authoring, scheduling, and
monitoring data pipelines. The exam consists of 75 questions, and you have 60
minutes to write it. The study guide below covers everything you need to know for it.
The exam includes scenarios (both text and images of Python code) where you need
to determine what the output will be, if any at all. To study for this exam I watched the
official Astronomer preparation course, which I highly recommend.
According to Astronomer, this exam will test the following:
“You have to show your capabilities of understanding the different features that
Airflow brings to create DAGs. What are the pros and cons of each one as well as their
limitations. You should be comfortable for recommending settings and design choices
for data pipelines according to different use cases. You should know the most
common operators as well as the specificities of others allowing to define DAG
dependencies, choose different branches, wait for events and so on”
Study Guide
• Variables
◦ Variables are a generic way to store and retrieve arbitrary content or settings as a
simple key-value store within Airflow. Variables can be listed, created, updated, and
deleted from the UI. Airflow uses Fernet to encrypt variables stored in the
metastore database, which guarantees that the content cannot be read or
manipulated without the encryption key. For information on configuring Fernet, see
the Fernet documentation.
Variables in Airflow UI
◦ Best practice: fetch variables inside your tasks (e.g. in the callable), not in top-level
DAG code, to avoid opening a useless connection to the metastore every time the
scheduler parses the file (every 30 seconds by default). A sketch of this follows the
snippet below.
◦ You can store a JSON value in a single variable and fetch several values with one
call, using the Jinja template engine. You can create the JSON variable in the Airflow UI
extract = PythonOperator(
    task_id="extract",
    python_callable=_extract,
    op_args=["{{ var.json.my_dag_partner.name }}"])
• Pools
◦ Airflow pools are used to limit the execution parallelism on arbitrary sets of tasks.
Each time a task is running, a slot is given to that task throughout its execution.
Once the task is finished, the slot is free again and ready to be given to another
task. A slot is given regardless of the resources a task needs; it's really just a slot, 1
task = 1 slot. If no slots are available, the tasks are queued and the number of
queued slots increases. By default, every task runs in the default_pool, which has 128
slots, so you can execute at most 128 tasks at the same time.
◦ The concurrency, i.e. the maximum number of running tasks for a given DAG, is set
to 16 by default (the dag_concurrency setting / the concurrency DAG argument). In
other words, you can run at most 16 tasks at the same time within the same DAG.
◦ There are three ways you can create and manage a pool in Airflow.
▪ Create a pool through the Airflow UI
▪ Go to Admin → Pools and add a new record. You can define a name, the
number of slots, and a description.
▪ Create a pool using the Airflow CLI with the airflow pools command
▪ Create a pool using the Airflow REST API
◦ By default, all tasks in Airflow get assigned to the default_pool which has 128 slots.
You can modify this value, but you can’t remove the default pool. Tasks can be
assigned to any other pool by updating the pool parameter. This parameter is part
of the BaseOperator, so it can be used with any operator.
task_a = PythonOperator(
    task_id='task_a',
    python_callable=sleep_function,
    pool='single_task_pool'
)
• Trigger Rules
◦ A trigger rule defines the condition on which a task gets triggered. By default, all
tasks have the trigger rule all_success, which means the task is triggered only if all
of its parents (upstream tasks) succeed. Only one trigger rule at a time can be
specified for a given task (a sketch of setting a trigger rule follows this list)
▪ all_done: the task is triggered once all upstream tasks (parents) are done with
their execution, whatever their state. Useful for notifications such as emails or
Slack messages that should be sent regardless of success or failure
▪ none_skipped: the task is triggered if no upstream task is skipped, i.e. they are
all either in success or failed. A typical use case is when you use the
BranchPythonOperator
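A minimal sketch of setting a trigger rule, assuming a hypothetical _notify callable that
should run whatever the outcome of its parents:

notify = PythonOperator(
    task_id='notify',
    python_callable=_notify,   # hypothetical callable
    trigger_rule='all_done'    # runs once every parent is done, success or failure
)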
• DAG Dependencies
◦ ExternalTaskSensor lets a DAG wait for the completion of a task in another DAG (a
minimal sketch follows the TriggerDagRunOperator example below). Its main
parameters are:
▪ external_task_id: the task_id of the task you want to wait for. If None, the
sensor waits for the whole DAG
▪ execution_delta: time difference with the previous execution to look at, the
default is the same execution_date as the current task or DAG. For yesterday, use
positive datetime.timedelta(days=1). Either execution_delta or
execution_date_fn can be passed to ExternalTaskSensor, but not both.
▪ allowed_states: expects a list of states. For example, with ['success'] the
sensor only succeeds once the task you are waiting for has succeeded
▪ reset_dag_run (a TriggerDagRunOperator parameter): whether or not to clear an
existing DAG run if it already exists. This is useful when backfilling or rerunning
an existing DAG run. When reset_dag_run=False and the DAG run exists,
DagRunAlreadyExists will be raised. When reset_dag_run=True and the DAG run
exists, the existing DAG run will be cleared so it can rerun.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator


def print_task_type(**kwargs):
    """
    Dummy function to call before and after the dependent DAG.
    """
    print(f"The {kwargs['task_type']} task has completed.")


with DAG('trigger-dagrun-dag',
         start_date=datetime(2021, 1, 1),
         max_active_runs=1,
         schedule_interval='@daily',
         default_args=default_args,
         catchup=False
         ) as dag:

    start_task = PythonOperator(
        task_id='starting_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'starting'}
    )

    trigger_dependent_dag = TriggerDagRunOperator(
        task_id="trigger_dependent_dag",
        trigger_dag_id="dependent-dag",
        wait_for_completion=True
    )

    end_task = PythonOperator(
        task_id='end_task',
        python_callable=print_task_type,
        op_kwargs={'task_type': 'ending'}
    )

    start_task >> trigger_dependent_dag >> end_task
▪ set_downstream / set_upstream are the method equivalents of the >> and <<
bitshift operators
▪ If you want to create dependencies between two lists, you need to use cross
dependencies (see the combined sketch below)
▪ cross_downstream([t1,t2,t3],[t4,t5,t6])
▪ You cannot chain anything onto this call; further dependencies must be defined
on a new line
▪ [t4,t5,t6] >> t7
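Putting those two lines together, a minimal sketch (in Airflow 2, cross_downstream lives
in airflow.models.baseoperator):

from airflow.models.baseoperator import cross_downstream

# Every task in the first list becomes upstream of every task in the second list:
# t1 >> t4, t1 >> t5, t1 >> t6, t2 >> t4, ..., t3 >> t6
cross_downstream([t1, t2, t3], [t4, t5, t6])

# cross_downstream() returns None, so further dependencies go on a new line
[t4, t5, t6] >> t7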
◦ Chain Dependencies
▪ What if you want to set multiple parallel cross-dependencies? Unfortunately,
Airflow can't parse dependencies between two lists (e.g. [t0, t1] >> [t2, t3] throws
an error). If you need to set dependencies in this manner, you can use Airflow's
chain function:
from datetime import datetime
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.dummy import DummyOperator

with DAG('dependencies', start_date=datetime(2021, 1, 1)) as dag:
    t0 = DummyOperator(task_id='t0')
    t1 = DummyOperator(task_id='t1')
    t2 = DummyOperator(task_id='t2')
    t3 = DummyOperator(task_id='t3')
    t4 = DummyOperator(task_id='t4')
    t5 = DummyOperator(task_id='t5')
    t6 = DummyOperator(task_id='t6')

    # t0 >> [t1, t2]; t1 >> t3 and t2 >> t4 (pairwise); [t3, t4] >> t5 >> t6
    chain(t0, [t1, t2], [t3, t4], t5, t6)
▪ depends_on_past: when set to True, a task instance is not triggered unless the
same task succeeded (or was skipped) in the previous DAG run. If the previous run
executed its tasks successfully, depends_on_past is not a factor and does not
affect the current run at all. It does not prevent the next DAG run from starting;
the blocked task instances simply remain with no status.
• Idempotency
◦ When designing data pipelines, always aim for idempotence and determinism
▪ Deterministic: when you execute your task with a certain input, you always get
the same output
▪ Idempotent: when you execute your task multiple times with the same input, it
always produces the same result, with no additional side effects
▪ PostgresOperator(task_id='create_table', sql='CREATE TABLE
my_table;') is not idempotent: the second run fails because the table already
exists (see the sketch below)
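A minimal sketch of an idempotent version of that task, using IF NOT EXISTS so rerunning
it leaves the database in the same state (the column definition is hypothetical):

create_table = PostgresOperator(
    task_id='create_table',
    # safe to run any number of times
    sql='CREATE TABLE IF NOT EXISTS my_table (id SERIAL PRIMARY KEY);'
)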
• Dynamic DAGs
◦ In Airflow, DAGs are defined as Python code. Airflow executes all Python code in
the dags_folder and loads any DAG objects that appear in globals(). The simplest
way of creating a DAG is to write it as a static Python file. However, sometimes
manually writing DAGs isn’t practical. Maybe you have hundreds or thousands of
DAGs that do similar things with just a parameter changing between them. Or
maybe you need a set of DAGs to load tables, but don’t want to manually update
DAGs every time those tables change. In these cases, and others, it can make more
sense to dynamically generate DAGs.
◦ Whenever you have multiple DAGs that share the same tasks and only the inputs
change between them, it can be better to generate those DAGs dynamically.
▪ Single-File Method: one Python file in the dags_folder generates every DAG.
▪ Benefits:
▪ It’s simple and easy to implement
▪ Drawbacks:
▪ Since a DAG file isn’t actually being created, your visibility into the code
behind any specific DAG is limited.
▪ Since this method requires a Python file in the dags_folder, the generation
code will be executed every time the dag is parsed
▪ Process
def create_dag(dag_id,
               schedule,
               dag_number,
               default_args):

    def hello_world_py(*args):
        print('Hello World')
        print('This is DAG: {}'.format(str(dag_number)))

    dag = DAG(dag_id,
              schedule_interval=schedule,
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id='hello_world',
            python_callable=hello_world_py)

    return dag
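The single-file method then needs a loop that builds the DAG objects and registers them
in globals() so the scheduler picks them up; a minimal sketch assuming three numbered
DAGs:

from datetime import datetime

# One entry in globals() per generated dag_id
for n in range(1, 4):
    dag_id = 'hello_world_{}'.format(str(n))
    default_args = {'owner': 'airflow', 'start_date': datetime(2021, 1, 1)}
    globals()[dag_id] = create_dag(dag_id, '@daily', n, default_args)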
▪ Multi-File Method:
▪ This time, instead of having one single Python file in charge of generating your
DAGs, you use a script that creates a file for each generated DAG. At the end you
get one Python file per generated DAG.
▪ Benefits:
▪ It’s more scalable than single-file methods. Because the DAG files aren’t
being generated by parsing code in the dags_folder, the DAG generation
code isn’t executed on every scheduler heartbeat.
▪ Since DAG files are being explicitly created before deploying to Airflow, you
have full visibility into the DAG code, including from the Code button in the
Airflow UI.
▪ Drawbacks:
▪ It can be complex to set up.
▪ Process
▪ To start, we will create a DAG ‘template’ file that defines the DAG’s structure.
This looks just like a regular DAG file, but we have added specific variables
where we know information is going to be dynamically generated, namely
the dag_id, scheduletoreplace, and querytoreplace.
dag = DAG(dag_id,
          schedule_interval=scheduletoreplace,
          default_args=default_args,
          catchup=False)

with dag:
    t1 = PostgresOperator(
        task_id='postgres_query',
        postgres_conn_id=connection_id,
        sql=querytoreplace)
▪ Next we create a dag-config folder that will contain a JSON config file for
each DAG. The config file should define the parameters that we noted above,
the DAG Id, schedule interval, and query to be executed.
{
    "DagId": "dag_file_1",
    "Schedule": "'@daily'",
    "Query": "'SELECT * FROM table1;'"
}
▪ Finally, we create a Python script that will create the DAG files based on the
template and the config files. The script loops through every config file in the
dag-config/ folder, makes a copy of the template in the dags/ folder, and
overwrites the parameters in that file with the ones from the config file.
import json
import os
import shutil
import fileinput

config_filepath = 'include/dag-config/'
dag_template_filename = 'include/dag-template.py'

for filename in os.listdir(config_filepath):
    config = json.load(open(config_filepath + filename))
    new_filename = 'dags/' + config['DagId'] + '.py'
    shutil.copyfile(dag_template_filename, new_filename)
    # Overwrite the placeholders in the copied template with the config values
    for line in fileinput.input(new_filename, inplace=True):
        line = line.replace("dag_id", "'" + config['DagId'] + "'")
        line = line.replace("scheduletoreplace", config['Schedule'])
        line = line.replace("querytoreplace", config['Query'])
        print(line, end="")
▪ Now to generate our DAG files, we can either run this script ad-hoc or as
part of our CI/CD workflow. After running the script, our final directory would
look like the example below, where the include/ directory contains the files
shown above, and the dags/ directory contains the two dynamically
generated DAGs:
dags/
├── dag_file_1.py
├── dag_file_2.py
include/
├── dag-template.py
├── generate-dag-files.py
└── dag-config
├── dag1-config.json
└── dag2-config.json
• Versioning
◦ The issue is that if you remove a task from a DAG, you will not be able to view that
task's logs in past DAG runs
◦ A temporary way around this is to add a version suffix to your dag_id, such as
_1_0_0
• DAG Scheduling
◦ Important Parameters
▪ start_date: the date from which your DAG starts being scheduled; the first DAG
run is triggered once start_date + schedule_interval has elapsed
◦ Timedelta vs CRON
▪ CRON expressions are stateless: runs fire at the absolute times the expression
specifies, whereas a timedelta is applied relative to the latest execution date
▪ Use case: when you want to trigger your DAG every three days, timedelta is
simpler because you don't need to work out the actual dates (see the sketch
below)
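A minimal sketch of the two styles (the DAG id every_three_days is hypothetical):

from datetime import datetime, timedelta

from airflow import DAG

# timedelta schedule: each run is triggered 3 days after the previous execution date
with DAG('every_three_days',
         start_date=datetime(2021, 1, 1),
         schedule_interval=timedelta(days=3)) as dag:
    ...

# CRON alternative: schedule_interval='0 0 */3 * *' fires at fixed absolute times,
# which does not behave identically around month boundaries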
• Templating
◦ Airflow takes advantage of the power of Jinja templating, which is a powerful tool
to use in combination with macros. Jinja templating allows providing dynamic
content, using Python code, to otherwise static objects such as strings. In other
words, it allows you to fetch a specific chunk of data based on a parameter, rather
than fetching that exact same data every time. Since Airflow macros are evaluated
when the task runs, it is possible to provide parameters that can change during
execution, for example passing the result of one operator to another one that runs
after it.
◦ Also, parameters such as execution dates can be passed to templated fields. Each
operator defines which of its fields are template-able (its template_fields), and
only those fields can take macros as inputs.
◦ The argument must be compatible with templating; you can check the operator's
documentation to see which fields are templated (see the sketch below)
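A minimal sketch of a templated field: the BashOperator's bash_command is template-able,
and {{ ds }} is a built-in macro that renders to the execution date:

from airflow.operators.bash import BashOperator

process = BashOperator(
    task_id='process_file',
    # {{ ds }} is rendered at runtime as YYYY-MM-DD
    bash_command="echo 'processing data for {{ ds }}'"
)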
• TaskFlow API
◦ TaskFlow takes care of moving inputs and outputs between your Tasks using
XComs for you, as well as automatically calculating dependencies - when you call
a TaskFlow function in your DAG file, rather than executing it, you will get an
object representing the XCom for the result (an XComArg), that you can then use
as inputs to downstream tasks or operators.
Taskflow API
from typing import Dict
import logging
import requests

from airflow.decorators import task

@task
def extract_bitcoin_price() -> Dict[str, float]:
    return requests.get(API).json()['bitcoin']

@task(multiple_outputs=True)
def process_data(response: Dict[str, float]) -> Dict[str, float]:
    logging.info(response)
    return {'usd': response['usd'], 'change': response['usd_24h_change']}

@task
def store_data(data: Dict[str, float]):
    logging.info(f"Store: {data['usd']} with change {data['change']}")
◦ You cannot use the TaskFlow API to share data between a parent DAG and a
SubDAG. TaskFlow tries to create the dependencies automatically for you, which
would mean creating dependencies both from your DAG and from your SubDAG,
which is not possible.
◦ multiple_outputs: if you want one XCom per value, without calling xcom_push
more than once, you can set this argument to True. By setting it to True, you are
saying that the result is not one XCom with the dictionary as its value, but one
XCom per key of the dictionary, each with its own key and value (see the sketch
below).
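A minimal sketch of that behaviour, with a hypothetical get_rates task:

@task(multiple_outputs=True)
def get_rates() -> Dict[str, float]:
    # Pushes one XCom per key ('usd', 'eur') instead of a single XCom
    # holding the whole dictionary
    return {'usd': 1.0, 'eur': 0.85}

rates = get_rates()
usd_rate = rates['usd']   # an XComArg referencing only the 'usd' XCom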
• XCOMs
▪ Limited in size,
▪ 2 GB for SQLite,
▪ 1 GB for PostgreSQL,
▪ 64 KB for MySQL
◦ Use Case:
▪ This data pipeline trains different machine learning models based on a dataset
and the last task selects the model having the highest accuracy. The question is,
how can we get the accuracy of each model in the task Choosing Model to
choose the best one? Using XCOMs!
▪ You can think of an XCOM as an object that is stored in the metadata database
of Airflow with the following fields:
▪ The key is the identifier of your XCom. It does not need to be unique and is used
to get the XCom back from a given task.
▪ The value is the value of your XCom. What you want to share. Keep in mind
that your value must be serializable in JSON or pickable. Notice that serializing
with pickle is disabled by default to avoid RCE exploits/security issues.
▪ The execution date corresponds to the execution date of the DagRun that
generated the XCom. That's how Airflow avoids fetching an XCom coming from
another DagRun
from random import uniform

def _training_model(ti):
    accuracy = uniform(0.1, 10.0)
    print(f"model's accuracy: {accuracy}")
    # Push the accuracy under an explicit key so downstream tasks can fetch it
    ti.xcom_push(key='model_accuracy', value=accuracy)

def _choose_best_model(ti):
    # Pull the accuracy pushed by a single training task
    fetched_accuracy = ti.xcom_pull(key='model_accuracy',
                                    task_ids=['training_model_A'])
    print(f'choose best model: {fetched_accuracy}')

def _choose_best_model(ti):
    # Or pull the accuracies of all three training tasks at once
    fetched_accuracies = ti.xcom_pull(key='model_accuracy',
                                      task_ids=['training_model_A',
                                                'training_model_B',
                                                'training_model_C'])
    print(f'choose best model: {fetched_accuracies}')
• SubDAGs / TaskGroups
◦ If you have a lot of tasks in your DAG, it might be hard to understand what's going
on. You can group the tasks that belong together using SubDAGs and TaskGroups
◦ SubDAGs
▪ When a SubDAG is triggered, the SubDAG and child tasks take up worker slots
until the entire SubDAG is complete. This can delay other task processing and,
depending on your number of worker slots, can lead to deadlocking.
▪ SubDAGs have their own parameters, schedule, and enabled settings. When
these are not consistent with their parent DAG, unexpected behavior can occur.
◦ TaskGroups
▪ Unlike SubDAGs, Task Groups are just a UI grouping concept. Starting in Airflow
2.0, you can use Task Groups to organize tasks within your DAG’s graph view in
the Airflow UI. This avoids the added complexity and performance issues of
SubDAGs, all while using less code
▪ You can use the dependency operators (<< and >>) on Task Groups in the same
way that you can with individual tasks. Dependencies applied to a Task Group are
applied across its tasks. In the following code, we add dependencies from t0 and
to t3 around the Task Group, which automatically applies the same dependencies
across t1 and t2
t0 = DummyOperator(task_id='start')
# TaskGroup is imported from airflow.utils.task_group
with TaskGroup(group_id='group1') as tg1:
    t1 = DummyOperator(task_id='task1')
    t2 = DummyOperator(task_id='task2')
    t1 >> t2
# End Task Group definition
t3 = DummyOperator(task_id='end')
t0 >> tg1 >> t3   # applied across t1 and t2 as well
• Branching
◦ You are able to choose one task or another based on a condition. The
BranchPythonOperator allows you to choose one branch among the branches of
your DAG
▪ If the condition is true, you return the task_id corresponding to the task you want
to execute next
▪ For example, we can pass a function that returns one set of task IDs if the result
is greater than 0.5 and a different set if the result is less than or equal to 0.5 (a
sketch follows this list)
▪ When the BranchPythonOperator runs, the task whose task_id is returned is
triggered next and the others are skipped.
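A minimal sketch of such a branching callable (the upstream and downstream task ids here
are hypothetical):

from airflow.operators.python import BranchPythonOperator

def _choose_branch(ti):
    accuracy = ti.xcom_pull(task_ids='training_task')   # hypothetical upstream task
    if accuracy > 0.5:
        return 'accurate'      # task_id (or list of task_ids) to run next
    return 'inaccurate'

branching = BranchPythonOperator(
    task_id='branching',
    python_callable=_choose_branch
)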
• SLAs
◦ Note: If you trigger your DAG manually, your SLAs won't be checked (a sketch of
the sla parameter follows)
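SLAs are attached to tasks with the sla parameter (part of the BaseOperator); a minimal
sketch, assuming a hypothetical _check_data callable:

from datetime import timedelta

check = PythonOperator(
    task_id='check_data',
    python_callable=_check_data,    # hypothetical callable
    # an SLA miss is recorded if the task hasn't finished within 10 minutes
    # of the scheduled DAG run time
    sla=timedelta(minutes=10)
)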
• Miscellaneous
◦ By default, the scheduler only parses Python files whose content contains the
strings "dag" and "airflow" (DAG discovery safe mode)
◦ If you change the start_date of an existing DAG to an earlier date, the DAG runs
for that new past period won't be triggered automatically