Building Data Pipelines - 4

Modern workflow management systems like Apache Airflow address the limitations of older cron-based systems. Airflow allows users to (1) create and visualize complex workflows as directed acyclic graphs (DAGs), (2) monitor and log workflows, and (3) scale workflows horizontally across multiple machines. Tasks are represented by operator classes such as BashOperator and PythonOperator, which run shell commands and Python code as part of a workflow. Dependencies between tasks are defined by setting one task downstream or upstream of another.

Modern day workflow management

Oliver Willekens
Data Engineer at Data Minded
What is a workflow?

A workflow:

Sequence of tasks

Scheduled at a time or triggered by an event

Orchestrates data processing pipelines


Scheduling with cron

Cron reads “crontab” files:

tabulate tasks to be executed at certain times

one task per line

*/15 9-17 * * 1-3,5 log_my_activity


Scheduling with cron

# Minutes  Hours  Days  Months  Days of the week  Command
  */15     9-17   *     *       1-3,5             log_my_activity

This entry runs log_my_activity every 15 minutes, during the hours 9 through 17, on Monday to Wednesday and on Friday.

Cron is a dinosaur.

Modern workflow managers:

Luigi (Spotify, 2011, Python-based)

Azkaban (LinkedIn, 2009, Java-based)

Airflow (Airbnb, 2015, Python-based)


Apache Airflow fulfills modern engineering needs

1. Create and visualize complex workflows,

2. Monitor and log workflows,

3. Scale workflows horizontally.


The Directed Acyclic Graph (DAG)

Directed: dependencies between tasks have a direction

Acyclic: no cycles, so a task can never (even indirectly) depend on itself

Graph: tasks are nodes, dependencies are edges


The Directed Acyclic Graph in code

from datetime import datetime
from airflow import DAG

my_dag = DAG(
    dag_id="publish_logs",
    schedule_interval="* * * * *",
    start_date=datetime(2010, 1, 1)
)
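Not on the slide, but worth noting: the same DAG can also be declared with a context manager, and schedule_interval accepts presets such as "@daily" in place of a cron string. A minimal sketch:

from datetime import datetime
from airflow import DAG

with DAG(dag_id="publish_logs",
         schedule_interval="@daily",
         start_date=datetime(2010, 1, 1)) as my_dag:
    # Operators instantiated inside this block are attached to my_dag automatically.
    pass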


Classes of operators

The Airflow task:

An instance of an Operator class

Inherits from BaseOperator -> must implement an execute() method

Performs a specific action (delegation):

BashOperator -> run a bash command/script

PythonOperator -> run a Python callable

SparkSubmitOperator -> submit a Spark job to a cluster
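The slides don't show a custom operator, but a minimal sketch illustrates the BaseOperator contract, assuming the Airflow 1.x import paths used in this course and a hypothetical GreetOperator:

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class GreetOperator(BaseOperator):
    """Hypothetical operator that logs a greeting when its task runs."""

    @apply_defaults
    def __init__(self, name, *args, **kwargs):
        super(GreetOperator, self).__init__(*args, **kwargs)
        self.name = name

    def execute(self, context):
        # execute() is called when the scheduler runs a task instance of this operator.
        self.log.info("Hello, %s!", self.name)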


Expressing dependencies between operators
dag = DAG(…)
task1 = BashOperator(…)
task2 = PythonOperator(…)
task3 = PythonOperator(…)

task1.set_downstream(task2)
task3.set_upstream(task2)

# equivalent, but shorter:


# task1 >> task2
# task3 << task2

# Even clearer:
# task1 >> task2 >> task3

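As a side note not on the slide: on recent Airflow versions the bitshift operators also accept lists, which keeps fan-out patterns readable. Assuming the tasks defined above:

# task1 runs first; task2 and task3 can then run in parallel.
task1 >> [task2, task3]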


Let’s practice!


Building a data pipeline with Airflow

Oliver Willekens
Data Engineer at Data Minded
Airflow’s BashOperator

Executes bash commands

Airflow adds logging, retry options and metrics over running this yourself.

from airflow.operators.bash_operator import BashOperator

bash_task = BashOperator(
    task_id='greet_world',
    dag=dag,
    bash_command='echo "Hello, world!"'
)
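Not covered on the slide: bash_command is a templated field, so Jinja macros such as the execution date can be injected. A minimal sketch, assuming the dag object from above and a hypothetical script path:

templated_task = BashOperator(
    task_id='ingest_for_date',
    dag=dag,
    bash_command='/path/to/ingest.sh {{ ds }}'  # {{ ds }} renders as the execution date, YYYY-MM-DD
)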


Airflow’s PythonOperator

Executes Python callables

from airflow.operators.python_operator import PythonOperator
from my_library import my_magic_function

python_task = PythonOperator(
    dag=dag,
    task_id='perform_magic',
    python_callable=my_magic_function,
    op_kwargs={"snowflake": "*", "amount": 42}
)
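The slide leaves my_magic_function to the imagination; a hypothetical definition shows how the keys in op_kwargs arrive as keyword arguments on the callable:

def my_magic_function(snowflake, amount):
    """Hypothetical callable: print the snowflake character `amount` times."""
    print(snowflake * amount)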


Running PySpark from Airflow

With the BashOperator:

spark_master = (
    "spark://"
    "spark_standalone_cluster_ip"
    ":7077")

command = (
    "spark-submit "
    "--master {master} "
    "--py-files package1.zip "
    "/path/to/app.py"
).format(master=spark_master)

BashOperator(bash_command=command, …)

With the SSHOperator:

from airflow.contrib.operators.ssh_operator import SSHOperator

task = SSHOperator(
    task_id='ssh_spark_submit',
    dag=dag,
    command=command,
    ssh_conn_id='spark_master_ssh'
)


Running PySpark from Airflow

With the SparkSubmitOperator:

from airflow.contrib.operators.spark_submit_operator \
    import SparkSubmitOperator

spark_task = SparkSubmitOperator(
    task_id='spark_submit_id',
    dag=dag,
    application="/path/to/app.py",
    py_files="package1.zip",
    conn_id='spark_default'
)

With the SSHOperator:

from airflow.contrib.operators.ssh_operator import SSHOperator

task = SSHOperator(
    task_id='ssh_spark_submit',
    dag=dag,
    command=command,
    ssh_conn_id='spark_master_ssh'
)


Let’s practice!


Deploying Airflow

Oliver Willekens
Data Engineer at Data Minded
Installing and configuring Airflow

export AIRFLOW_HOME=~/airflow
pip install apache-airflow
airflow initdb

[core]
# lots of other configuration settings
# …

# The executor class that airflow should use.
# Choices include SequentialExecutor,
# LocalExecutor, CeleryExecutor, DaskExecutor,
# KubernetesExecutor
executor = SequentialExecutor


Setting up for production

dags: place to store the dags (configurable)

tests: unit test the possible deployment, possibly ensure consistency across DAGs

plugins: store custom operators and hooks

connections, pools, variables: provide a location for various configuration files you can import into Airflow


Example Airflow deployment test

from airflow.models import DagBag

def test_dagbag_import():
    """Verify that Airflow will be able to import all DAGs in the repository."""
    dagbag = DagBag()

    number_of_failures = len(dagbag.import_errors)
    assert number_of_failures == 0, \
        "There should be no DAG failures. Got: %s" % dagbag.import_errors
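Not from the slides: a follow-up check you could add to the same test module, asserting that every DAG that did import actually contains at least one task:

from airflow.models import DagBag

def test_dags_have_tasks():
    dagbag = DagBag()
    for dag_id, dag in dagbag.dags.items():
        assert len(dag.tasks) > 0, "DAG %s has no tasks" % dag_id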


Transferring DAGs and plugins

[Figures illustrating how DAGs and plugins are transferred to the Airflow server]


Let’s practice!


Final thoughts

Oliver Willekens
Data Engineer at Data Minded
What you learned

Define the purpose of components of data platforms

Write an ingestion pipeline using Singer

Create and deploy pipelines for big data in Spark

Configure automated testing using CircleCI

Manage and deploy a full data pipeline with Airflow


Additional resources

External resources:

Singer: https://www.singer.io/

Apache Spark: https://spark.apache.org/

Pytest: https://pytest.org/en/latest/

Flake8: http://flake8.pycqa.org/en/latest/

Circle CI: https://circleci.com/

Apache Airflow: https://airflow.apache.org/

DataCamp courses:

Software engineering: https://www.datacamp.com/courses/software-engineering-for-data-scientists-in-python

Spark: https://www.datacamp.com/courses/cleaning-data-with-apache-spark-in-python (and other courses)

Unit testing: link yet to be revealed


Congratulations!