Building Data Pipelines in Python
Marco Bonzanini
ETL
Extract
Transform
Load
Clean
Augment
Join
Good Data Pipelines
Easy to
Reproduce
Productise
Towards Good Data Pipelines
Towards Good Data Pipelines (a)
$ ./do_something.sh
$ ./do_something_else.sh
$ ./extract_some_data.sh
$ ./join_some_other_data.sh
...
Script soups kill replicability
Anti-pattern: the master script
$ cat ./run_everything.sh
./do_something.sh
./do_something_else.sh
./extract_some_data.sh
./join_some_other_data.sh
$ ./run_everything.sh
Towards Good Data Pipelines (d)
Break it Down
setup.py and conda
Towards Good Data Pipelines (e)
Automated Testing
i.e. why scientists don’t write unit tests
Intermezzo
f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
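The check above tests a property rather than a single value: the F1 score, being a harmonic mean, must lie between precision and recall. A minimal runnable sketch follows. The body of fscore is an assumption (the slide only shows the call), as are the zero-division convention and the small tolerance for floating-point rounding.

```python
def fscore(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * precision * recall / (precision + recall)


def check_bounds(p, r, eps=1e-12):
    # Same property as the snippet above, with a small tolerance
    # for floating-point rounding.
    f1 = fscore(p, r)
    min_bound, max_bound = sorted([p, r])
    assert min_bound - eps <= f1 <= max_bound + eps


# Exercise the property over a coarse grid of values in [0, 1].
grid = [i / 10 for i in range(11)]
for p in grid:
    for r in grid:
        check_bounds(p, r)
```

A library such as hypothesis can generate the inputs instead of a hand-written grid.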
Testing: I’m almost done
• Unit tests vs Defensive Programming
• Say no to tautologies
• Say no to vanity tests
• The Python ecosystem is rich:
py.test, nosetests, hypothesis, coverage.py, …
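One possible reading of "say no to tautologies": a test that recomputes the expected value with the same logic as the code under test cannot catch a bug in that logic. A sketch, where normalise is a hypothetical function invented for illustration:

```python
def normalise(values):
    """Scale values so they sum to 1 (hypothetical function under test)."""
    total = sum(values)
    return [v / total for v in values]


def test_tautology():
    # Anti-pattern: the expected value repeats the implementation,
    # so a bug in the formula is copied into the test as well.
    values = [1, 1, 2]
    total = sum(values)
    assert normalise(values) == [v / total for v in values]


def test_known_answer():
    # Better: compare against a value worked out by hand.
    assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]
```

Both tests pass today, but only the second would fail if the formula were wrong.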
</rant>
Towards Good Data Pipelines (f)
Orchestration
Don’t re-invent the wheel
You need a workflow manager
Think:
GNU Make + Unix pipes + Steroids
Intro to Luigi
class MyTask(luigi.Task):

    def requires(self):
        return [SomeTask()]

    def output(self):
        return luigi.LocalTarget(…)

    def run(self):
        mylib.run()
Luigi Target: output of a task
class MyTarget(luigi.Target):

    def exists(self):
        ...  # return bool
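To make the requires() / output() / run() / exists() contract concrete without depending on Luigi, here is a stdlib-only sketch of the "Make with dependencies" idea. It is not Luigi's scheduler: FileTarget, Task, build, Extract and SortNumbers are hypothetical names, and the skip-if-output-exists rule is a simplification.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical stand-in for a target: it only knows whether it exists."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class Task:
    """Hypothetical base class mirroring requires() / output() / run()."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log):
    """Run dependencies first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


workdir = tempfile.mkdtemp()


class Extract(Task):
    def output(self):
        return FileTarget(os.path.join(workdir, "raw.txt"))

    def run(self):
        with open(self.output().path, "w") as f:
            f.write("3\n1\n2\n")


class SortNumbers(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return FileTarget(os.path.join(workdir, "sorted.txt"))

    def run(self):
        with open(self.requires()[0].output().path) as f:
            numbers = sorted(int(line) for line in f)
        with open(self.output().path, "w") as f:
            f.write("\n".join(map(str, numbers)) + "\n")


first_run, second_run = [], []
build(SortNumbers(), first_run)   # runs Extract, then SortNumbers
build(SortNumbers(), second_run)  # nothing to do: output already exists
```

Because each task declares its output, re-running the pipeline only redoes the missing steps, which is what a master script cannot offer.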
Static Analysis
The Joy of Duck Typing
If it looks like a duck,
swims like a duck,
and quacks like a duck,
then it probably is a duck.
— somebody on the Web
>>> 1.0 == 1 == True
True
>>> 1 + True
2
>>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly
PEP 3107 — Function Annotations
(since Python 3.0)
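A small sketch of what function annotations look like and what they do not do. The function scale is a hypothetical example; the point is that annotations are stored as metadata and are not checked at call time.

```python
def scale(value: float, factor: int = 2) -> float:
    """Multiply value by factor; the annotations are metadata only."""
    return value * factor


# Annotations are stored on the function object...
annotations = scale.__annotations__

# ...but nothing enforces them at call time: a str is accepted,
# and duck typing decides what '*' means.
repeated = scale("ab")
```

Catching the mismatch in the last call before running the code is the job of a separate static analysis tool, not of the interpreter.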
• speakerdeck.com/marcobonzanini
• github.com/bonzanini
• marcobonzanini.com
• @MarcoBonzanini