Building Data Pipelines - 4
Workflow management
Building Data Engineering Pipelines in Python
Oliver Willekens
Data Engineer at Data Minded
What is a workflow?
A workflow:
a sequence of tasks, scheduled to run at a certain time or triggered by an event.
Cron is a dinosaur: it can start jobs on a schedule, but it knows nothing about dependencies between tasks, retries or monitoring.
Apache Airflow is a modern workflow manager that addresses these needs and scales horizontally.
from datetime import datetime
from airflow import DAG

my_dag = DAG(
    dag_id="publish_logs",
    schedule_interval="* * * * *",
    start_date=datetime(2010, 1, 1),
)
task1.set_downstream(task2)
task3.set_upstream(task2)
# Even clearer:
# task1 >> task2 >> task3
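Putting these fragments together, a minimal but complete DAG file could look like the sketch below. The placeholder DummyOperator tasks, their task_ids and the daily schedule are illustrative assumptions, not part of the original slides; the import paths follow Airflow 1.x, matching the airflow initdb command used later.

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id="publish_logs",
    schedule_interval="@daily",
    start_date=datetime(2010, 1, 1),
)

# Placeholder tasks; a real pipeline would use BashOperator, PythonOperator, ...
extract = DummyOperator(task_id="extract", dag=dag)
transform = DummyOperator(task_id="transform", dag=dag)
load = DummyOperator(task_id="load", dag=dag)

# Same dependencies as with set_downstream()/set_upstream(), written with >>.
extract >> transform >> load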
Airflow’s BashOperator
Executes bash commands.
Airflow adds logging, retry options and metrics on top of simply running these commands yourself.
from airflow.operators.bash_operator import BashOperator
bash_task = BashOperator(
    task_id='greet_world',
    dag=dag,
    bash_command='echo "Hello, world!"',  # the shell command this task runs
)
from airflow.operators.python_operator import PythonOperator
python_task = PythonOperator(
    dag=dag,
    task_id='perform_magic',
    python_callable=my_magic_function,  # a regular Python function defined elsewhere
)
A Spark job can be triggered from Airflow by building the spark-submit command as a string and handing it to a BashOperator:

# spark_master holds the cluster URL, e.g. "spark://<ip>:7077"
command = "spark-submit --master {master} /path/to/app.py".format(master=spark_master)
BashOperator(bash_command=command, …)
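For context, here is a sketch of how the BashOperator and PythonOperator examples above could be wired together in a single DAG. The DAG id, the schedule and the body of my_magic_function are illustrative assumptions; the import paths again assume Airflow 1.x.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def my_magic_function():
    # Illustrative placeholder for the real work of the Python task.
    print("performing magic")

dag = DAG(dag_id="operator_demo", schedule_interval="@daily",
          start_date=datetime(2010, 1, 1))

bash_task = BashOperator(task_id="greet_world",
                         bash_command='echo "Hello, world!"',
                         dag=dag)
python_task = PythonOperator(task_id="perform_magic",
                             python_callable=my_magic_function,
                             dag=dag)

# The greeting runs before the Python task.
bash_task >> python_task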
Installing and configuring Airflow
export AIRFLOW_HOME=~/airflow
pip install apache-airflow
airflow initdb
Running Airflow for the first time also generates a configuration file, airflow.cfg, inside AIRFLOW_HOME. Its [core] section holds the main settings, such as where Airflow looks for DAG files:

[core]
# lots of other configuration settings
# …
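As a small sketch (not from the original slides), the values in that file can also be read programmatically, which helps when checking which settings are actually in effect:

from airflow.configuration import conf

# dags_folder, under [core], is where Airflow scans for DAG definition files.
print(conf.get("core", "dags_folder"))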
from airflow.models import DagBag

def test_dagbag_import():
    """Verify that Airflow will be able to import all DAGs in the repository."""
    dagbag = DagBag()
    number_of_failures = len(dagbag.import_errors)
    assert number_of_failures == 0, \
        "There should be no DAG failures. Got: %s" % dagbag.import_errors
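The same DagBag-based approach can carry further sanity checks. The test below is a sketch that is not part of the original slides; it assumes the same test module and verifies that every DAG that imports cleanly also defines at least one task:

def test_dags_contain_tasks():
    """Every successfully imported DAG should define at least one task."""
    dagbag = DagBag()
    for dag_id, dag in dagbag.dags.items():
        assert len(dag.tasks) > 0, "DAG %s has no tasks" % dag_id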
What you learned
Define the purpose of the components of data platforms