Building Data Pipelines - 3

The importance of tests
BUILDING DATA ENGINEERING PIPELINES IN PYTHON
Oliver Willekens
Data Engineer at Data Minded
Software tends to change
Common reasons for change: new functionality is requested, bugs need fixing, performance must improve.
To make such changes with confidence, this lesson covers:
writing tests
running tests
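As a minimal sketch of what writing and running a test can look like with pytest (the module, function, and file names below are hypothetical, not from the course): any function whose name starts with test_ is collected and executed automatically.

# test_conversion.py -- a hypothetical example module
def to_euro(amount, exchange_rate_to_euro):
    return amount * exchange_rate_to_euro

def test_to_euro():
    # pytest collects functions named test_* and runs their assertions
    assert to_euro(10.0, 0.5) == 5.0

# Run it from the shell:
# pytest test_conversion.py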
Our earlier Spark application is an ETL pipeline
unit_prices_with_ratings = (prices_with_ratings
    .join(…)        # transform
    .withColumn(…)) # transform
from pyspark.sql.functions import col

def calculate_unit_price_in_euro(df):
    return df.withColumn(
        "unit_price_in_euro",
        col("price") / col("quantity") * col("exchange_rate_to_euro"))
unit_prices_with_ratings = (
    calculate_unit_price_in_euro(
        link_with_exchange_rates(prices, exchange_rates)
    )
)
assertDataFrameEqual(result, expected)
3. Creating small and well-named functions leads to more reusability and easier testing.
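To make point 3 concrete, here is a minimal sketch of a unit test for calculate_unit_price_in_euro. It assumes the function is importable from the pipeline module, uses a small local SparkSession provided by a pytest fixture, and relies on assertDataFrameEqual from pyspark.testing (available since Spark 3.5); the sample rows are illustrative only.

import pytest
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual

# Assumed importable from the pipeline module, e.g.:
# from pipeline import calculate_unit_price_in_euro

@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for unit-testing transformations.
    return SparkSession.builder.master("local[1]").getOrCreate()

def test_calculate_unit_price_in_euro(spark):
    # Hypothetical input: price, quantity and exchange rate to euro.
    df = spark.createDataFrame(
        [(10.0, 5, 2.0)],
        ["price", "quantity", "exchange_rate_to_euro"])

    result = calculate_unit_price_in_euro(df)

    # 10.0 / 5 * 2.0 == 4.0 euro per unit.
    expected = spark.createDataFrame(
        [(10.0, 5, 2.0, 4.0)],
        ["price", "quantity", "exchange_rate_to_euro", "unit_price_in_euro"])
    assertDataFrameEqual(result, expected)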
Running a test suite
Execute tests in Python with one of: unittest, pytest, doctest, nose
Examples:
cd ~/workspace/my_good_python_project
pytest .
# Lots of output…
== 19 passed, 2 warnings in 36.80 seconds ==
cd ~/workspace/my_bad_python_project
pytest .
# Lots of output…
== 3 failed, 1 passed in 6.72 seconds ==
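As an illustration of another tool from the list above, doctest embeds the tests in a function's docstring; the module and function below are hypothetical, not part of the course pipeline.

# pricing.py -- a hypothetical example module
def unit_price(price, quantity):
    """Return the price of a single unit.

    >>> unit_price(10.0, 5)
    2.0
    """
    return price / quantity

# Run the embedded examples; no output means they all passed:
# python -m doctest pricing.py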
Solution:
Automation
How:
Continuous Integration: merge and test code changes frequently, so that every change is verified against the existing code base.
Continuous Delivery: create “artifacts” (deliverables like documentation, but also programs) that can be deployed into production without breaking things.
Example (a CircleCI configuration, typically stored as .circleci/config.yml):

jobs:
  test:
    docker:
      - image: circleci/python:3.6.4
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: pytest .

Often, such a pipeline runs these steps:
1. checkout code
2. install test & build requirements
3. run tests
4. package/build the software artefacts
5. deploy the artefacts (update docs / install app / …)