Chapter 2
Mike Metzger
Data Engineer
Operators
Represent a single task in a workflow.
Run independently (usually).
DummyOperator(task_id='example', dag=dag)
bash_task = BashOperator(task_id='clean_addresses',
                         bash_command='cat addresses.txt | awk "NF==10" > cleaned.txt',
                         dag=dag)
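To make the filter inside bash_command concrete, here is a rough Python equivalent of awk "NF==10": it keeps only lines that have exactly ten whitespace-separated fields. This is a minimal sketch, not part of the DAG; the function name keep_complete_rows is hypothetical, and it works on in-memory lines instead of addresses.txt and cleaned.txt.

```python
def keep_complete_rows(lines, expected_fields=10):
    """Keep lines with exactly `expected_fields` whitespace-separated fields,
    mirroring awk's NF==10 test with its default field splitting."""
    return [line for line in lines if len(line.split()) == expected_fields]


sample = [
    "a b c d e f g h i j",      # 10 fields: kept
    "a b c",                    # 3 fields: dropped
    "1 2 3 4 5 6 7 8 9 10 11",  # 11 fields: dropped
]
cleaned = keep_complete_rows(sample)
```

Only the first sample line survives the filter, which is the behavior the BashOperator task relies on to discard incomplete address rows.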
Tasks
Tasks are:
Instances of operators
example_task = BashOperator(task_id='bash_example',
                            bash_command='echo "Example!"',
                            dag=dag)
Task dependencies:
Are not required for a given workflow, but are present in most
Are referred to as upstream or downstream tasks
In Airflow 1.8 and later, are defined using the bitshift operators:
>>, the upstream operator (a >> b runs a before b)
<<, the downstream operator (a << b runs b before a)
task2 = BashOperator(task_id='second_task',
                     bash_command='echo 2',
                     dag=example_dag)
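Mixed dependencies:
The upstream and downstream operators can be combined in a single statement, or the same dependencies can be written as separate statements.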
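The two ways of writing mixed dependencies can be illustrated without Airflow installed. The Task class below is a hypothetical toy stand-in, not Airflow's own implementation; it only records which tasks come before and after each other, under the assumption that each operator returns its right-hand operand so that statements can be chained.

```python
class Task:
    """Toy stand-in for an Airflow task (hypothetical, not Airflow's class)."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()    # task_ids that must finish first
        self.downstream = set()  # task_ids that run afterwards

    def __rshift__(self, other):
        # self >> other: self runs before other
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)
        return other  # returning the right-hand task allows chaining

    def __lshift__(self, other):
        # self << other: other runs before self
        other >> self
        return other


task1, task2, task3 = Task("task1"), Task("task2"), Task("task3")

# Mixed form: task1 and task3 both run before task2
task1 >> task2 << task3

# Equivalent form written as two separate statements
a, b, c = Task("a"), Task("b"), Task("c")
a >> b
c >> b
```

In both forms the middle task ends up with two upstream tasks, so it waits for both to finish before it runs.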
PythonOperator
Executes a Python function / callable
Operates similarly to the BashOperator, with more options
Supports keyword arguments to the callable, passed through the op_kwargs dictionary
sleep_task = PythonOperator(
    task_id='sleep',
    python_callable=sleep,
    op_kwargs={'length_of_time': 5},
    dag=example_dag
)
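The sleep callable itself is not shown. A minimal sketch, assuming it takes a single length_of_time argument, shows how the keys of op_kwargs reach the function as keyword arguments. The return value and the short duration are additions for illustration only.

```python
import time


def sleep(length_of_time):
    """Pause for the given number of seconds, then report the duration."""
    time.sleep(length_of_time)
    return length_of_time


# The same shape of dictionary that the task passes as op_kwargs
# (a tiny value is used here so the sketch runs quickly).
op_kwargs = {'length_of_time': 0.01}

# Unpacking the dictionary is equivalent to sleep(length_of_time=0.01)
result = sleep(**op_kwargs)
```

Because the dictionary keys must match the parameter names of the callable, a misspelled key would raise a TypeError when the task runs.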
EmailOperator
Sends an email
Can include attachments
Does require the Airflow system to be configured with email server details
email_task = EmailOperator(
    task_id='email_sales_report',
    to='[email protected]',
    subject='Automated Sales Report',
    html_content='Attached is the latest sales report',
    files=['latest_sales.xlsx'],  # attachments are given as a list of file paths
    dag=example_dag
)
DAG Runs
A specific instance of a workflow at a point in time
Can be run manually or via schedule_interval
Each run has a state, such as failed or success
1 https://airflow.apache.org/docs/stable/scheduler.html
end_date - Optional attribute for when to stop running new DAG instances
In a cron expression, an asterisk * means every value of that field (e.g., every minute, every day)
Preset      cron equivalent
@hourly     0 * * * *
@daily      0 0 * * *
@weekly     0 0 * * 0
@monthly    0 0 1 * *
@yearly     0 0 1 1 *
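The presets listed above are shorthand for five-field cron expressions. As a rough sketch, the mapping can be written as a dictionary, with each field labeled in the standard cron order. The names CRON_PRESETS, CRON_FIELDS, and describe_schedule are hypothetical helpers, not Airflow APIs.

```python
# Preset names and their cron equivalents, as listed above
CRON_PRESETS = {
    '@hourly': '0 * * * *',
    '@daily': '0 0 * * *',
    '@weekly': '0 0 * * 0',
    '@monthly': '0 0 1 * *',
    '@yearly': '0 0 1 1 *',
}

# Order of the five cron fields
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def describe_schedule(schedule):
    """Expand a preset if needed and label each cron field."""
    expression = CRON_PRESETS.get(schedule, schedule)
    values = expression.split()
    if len(values) != len(CRON_FIELDS):
        raise ValueError(f'expected 5 cron fields, got {len(values)}')
    return dict(zip(CRON_FIELDS, values))


weekly = describe_schedule('@weekly')
```

Reading the labeled fields for '@weekly' shows minute 0, hour 0, any day of the month, any month, and day-of-week 0, which is midnight on Sunday.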
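When scheduling a DAG, Airflow waits at least one schedule_interval beyond the start_date before the first run. For example, with a start_date of February 25th, 2020 and a daily schedule_interval, the earliest starting time to run the DAG is February 26th, 2020.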
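The arithmetic behind that date can be sketched with the standard datetime module. The start date of February 25th, 2020 and the daily interval are assumed values for illustration; the calculation mirrors the scheduler's behavior of triggering the first run once a full interval has passed.

```python
from datetime import datetime, timedelta

# Assumed values for illustration
start_date = datetime(2020, 2, 25)
schedule_interval = timedelta(days=1)  # a daily schedule

# The first run is triggered once one full interval has elapsed
earliest_run = start_date + schedule_interval
```

With an hourly interval instead, the same calculation would give a first run one hour after start_date.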