Building Data Pipelines - 3


On the importance of tests

Oliver Willekens
Data Engineer at Data Minded
Software tends to change
Common reasons for change:
- new functionality is desired
- bugs need to get squashed
- performance needs to be improved

Core functionality rarely evolves.

How to ensure stability in light of changes?



Rationale behind testing
- improves the chance of code being correct in the future
- prevents introducing breaking changes
- raises confidence (not a guarantee) that code is correct now:
  assert that actuals match expectations (see the sketch below)
- serves as the most up-to-date documentation:
  a form of documentation that is always in sync with what’s running
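
As a minimal sketch of what “assert actuals match expectations” looks like (the function and numbers here are hypothetical, not course code):

# A hypothetical unit under test, checked expectation-style.
def unit_price(price, quantity):
    return price / quantity

def test_unit_price():
    computed = unit_price(10, 5)
    expected = 2
    assert computed == expected  # fails loudly when behavior drifts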



The test pyramid: where to invest your efforts
Testing takes time:
- thinking about what to test
- writing tests
- running tests

Testing has a high return on investment:
- when targeted at the correct layer
- when testing the non-trivial parts, e.g. the distance between 2 coordinates, rather than trivial ones like uppercasing a first name (sketched below)

(Image: “TestPyramid”, © Martin Fowler)
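
As an illustration of such a non-trivial unit (a sketch; the function, cities, and tolerance are assumptions, not course code):

import math

# Great-circle distance between two coordinates: non-trivial, so worth a test.
def haversine_km(lat1, lon1, lat2, lon2):
    earth_radius_km = 6371.0
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def test_haversine_paris_to_brussels():
    # Paris to Brussels is roughly 264 km as the crow flies.
    assert abs(haversine_km(48.86, 2.35, 50.85, 4.35) - 264) < 5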


Let’s have this sink in!
Writing unit tests for PySpark

Oliver Willekens
Data Engineer at Data Minded
Our earlier Spark application is an ETL pipeline



Separate transform from extract and load
prices_with_ratings = spark.read.csv(…)  # extract
exchange_rates = spark.read.csv(…)       # extract

unit_prices_with_ratings = (prices_with_ratings
    .join(…)          # transform
    .withColumn(…))   # transform
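
The load step, the remaining I/O boundary, then comes last (a sketch; the output path and format are assumptions, not course code):

# Load - the remaining I/O boundary (output path and format are assumptions)
unit_prices_with_ratings.write.mode("overwrite").parquet("/tmp/unit_prices")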



Solution: construct DataFrames in-memory

# Extract the data
df = spark.read.csv(path_to_file)

- depends on input/output (network access, filesystem permissions, …)
- unclear how big the data is
- unclear what data goes in

from pyspark.sql import Row

purchase = Row("price",
               "quantity",
               "product")
record = purchase(12.99, 1, "cake")
df = spark.createDataFrame((record,))

+ inputs are clear
+ data is close to where it is being used (“code-proximity”)
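
These snippets assume an active SparkSession named spark; in a pytest suite it is commonly supplied by a session-scoped fixture (a sketch of an assumed setup, not course code):

# conftest.py - a sketch of a shared SparkSession fixture (assumed setup)
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")       # small local runtime keeps tests fast
               .appName("pipeline-tests")
               .getOrCreate())
    yield session
    session.stop()  # release resources once the whole test session ends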



Create small, reusable and well-named functions
One long, chained expression:

unit_prices_with_ratings = (prices_with_ratings
    .join(exchange_rates, ["currency", "date"])
    .withColumn("unit_price_in_euro",
                col("price") / col("quantity")
                * col("exchange_rate_to_euro")))

can be split into small, named functions:

def link_with_exchange_rates(prices, rates):
    return prices.join(rates, ["currency", "date"])

def calculate_unit_price_in_euro(df):
    return df.withColumn(
        "unit_price_in_euro",
        col("price") / col("quantity") * col("exchange_rate_to_euro"))


These functions then compose at the call site:

unit_prices_with_ratings = calculate_unit_price_in_euro(
    link_with_exchange_rates(prices, exchange_rates))


Testing a single unit

from pyspark.sql import Row
from pyspark.testing import assertDataFrameEqual  # shipped with PySpark since 3.5

def test_calculate_unit_price_in_euro():
    record = dict(price=10,
                  quantity=5,
                  exchange_rate_to_euro=2.)
    df = spark.createDataFrame([Row(**record)])
    result = calculate_unit_price_in_euro(df)

    expected_record = Row(**record, unit_price_in_euro=4.)
    expected = spark.createDataFrame([expected_record])

    assertDataFrameEqual(result, expected)
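
(With a fixture-based setup like the conftest.py sketch earlier, spark would arrive as a test argument, i.e. def test_calculate_unit_price_in_euro(spark):, rather than as a global.)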



Take home messages
1. Interacting with external data sources is costly.
2. Creating in-memory DataFrames makes testing easier:
   - the data is in plain sight,
   - the focus is on just a small number of examples.
3. Creating small and well-named functions leads to more reusability and easier testing.



Let’s practice!
Continuous testing

Oliver Willekens
Data Engineer at Data Minded
Running a test suite
Execute tests in Python with one of:

- in stdlib: unittest, doctest
- 3rd party: pytest, nose

Core task: assert or raise

Examples:

assert computed == expected

with pytest.raises(ValueError):  # pytest specific; completed in the sketch below
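
For example (a hypothetical helper, not course code), the context-manager form asserts that the right exception is raised:

import pytest

# Hypothetical helper: quantities must be positive integers.
def parse_quantity(raw):
    quantity = int(raw)
    if quantity <= 0:
        raise ValueError(f"quantity must be positive, got {quantity}")
    return quantity

def test_parse_quantity_rejects_zero():
    # The test passes only if ValueError is raised inside the block.
    with pytest.raises(ValueError):
        parse_quantity("0")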



Manually triggering tests
In a Unix shell:

cd ~/workspace/my_good_python_project
pytest .
# Lots of output…
== 19 passed, 2 warnings in 36.80 seconds ==

cd ~/workspace/my_bad_python_project
pytest .
# Lots of output…
== 3 failed, 1 passed in 6.72 seconds ==

Note: Spark increases time to run unit tests.



Automating tests
Problem:
- forgetting to run the unit tests when making changes

Solution:
- automation

How:
- Git: configure hooks (sketched below)
- configure a CI/CD pipeline to run tests automatically
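
As one sketch of such a Git hook (hooks can be any executable; the path and contents here are assumptions, not course code), an executable .git/hooks/pre-push can run the suite and abort the push when it fails:

#!/usr/bin/env python
# .git/hooks/pre-push - a sketch: run the test suite, abort the push on failure
import subprocess
import sys

# A non-zero exit code makes Git cancel the push.
sys.exit(subprocess.call(["pytest", "."]))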



CI/CD
Continuous Integration:
- get code changes integrated with the master branch regularly

Continuous Delivery:
- create “artifacts” (deliverables like documentation, but also programs) that can be deployed into production without breaking things



Configuring a CI/CD tool
CircleCI looks for .circleci/config.yml.

Example:

jobs:
  test:
    docker:
      - image: circleci/python:3.6.4
    steps:
      - checkout
      - run: pip install -r requirements.txt
      - run: pytest .

Often:
1. checkout code
2. install test & build requirements
3. run tests
4. package/build the software artefacts
5. deploy the artefacts (update docs / install app / …)



Let’s practice!
