Building Data Pipelines - 1

This document discusses components of a data platform, including ingesting data using Singer, applying data cleaning operations, gaining insights with PySpark, testing and deploying pipelines. It covers Singer concepts like specifying schemas and streaming data as records and state messages. Examples demonstrate describing data schemas, serializing to JSON, running ingestion pipelines by chaining taps and targets, and using state messages to track progress.

Components of a data platform
BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Oliver Willekens
Data Engineer at Data Minded
Course contents
ingest data using Singer

apply common data cleaning operations

gain insights by combining data with PySpark

test your code automatically

deploy Spark transformation pipelines

=> intro to data engineering pipelines


Data is valuable


Democratizing data increases insights


Genesis of the data


Operational data is stored in the landing zone


Cleaned data prevents rework


The business layer provides most insights


Pipelines move data from one zone to another


Let’s reason!
Introduction to data ingestion with Singer


Singer’s core concepts
Aim: “The open-source standard for writing scripts
that move data”

Singer is a specification

data exchange format: JSON

extract and load with taps and targets


=> language independent

communicate over streams:


schema (metadata)

state (process metadata)

record (data)



Describing the data through its schema
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

json_schema = {
    "properties": {"age": {"maximum": 130,
                           "minimum": 1,
                           "type": "integer"},
                   "has_children": {"type": "boolean"},
                   "id": {"type": "integer"},
                   "name": {"type": "string"}},
    "$id": "http://yourdomain.com/schemas/my_user_schema.json",
    "$schema": "http://json-schema.org/draft-07/schema#"}
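The constraints in such a schema can also be enforced in code before records are written. Below is a minimal stdlib-only sketch that checks a record against the age rules above; `age_is_valid` is a made-up helper for illustration, and a complete solution would typically use a dedicated validator such as the jsonschema package (not assumed here):

```python
# Minimal sketch: check one record against the "age" constraints of the
# JSON Schema shown above, using only the standard library.
json_schema = {
    "properties": {"age": {"maximum": 130,
                           "minimum": 1,
                           "type": "integer"},
                   "has_children": {"type": "boolean"},
                   "id": {"type": "integer"},
                   "name": {"type": "string"}}}

def age_is_valid(record):
    # True only when "age" is an integer within [minimum, maximum].
    rules = json_schema["properties"]["age"]
    age = record.get("age")
    return (isinstance(age, int)
            and rules["minimum"] <= age <= rules["maximum"])

print(age_is_valid({"id": 1, "name": "Adrian", "age": 32}))    # True
print(age_is_valid({"id": 4, "name": "Bogus", "age": 300}))    # False
```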


Describing the data through its schema
import singer

singer.write_schema(schema=json_schema,
                    stream_name='DC_employees',
                    key_properties=["id"])

{"type": "SCHEMA", "stream": "DC_employees", "schema": {"properties": {"age": {"maximum": 130, "minimum": 1, "type": "integer"}, "has_children": {"type": "boolean"}, "id": {"type": "integer"}, "name": {"type": "string"}}, "$id": "http://yourdomain.com/schemas/my_user_schema.json", "$schema": "http://json-schema.org/draft-07/schema#"}, "key_properties": ["id"]}


Serializing JSON
import json

json.dumps(json_schema["properties"]["age"])

'{"maximum": 130, "minimum": 1, "type": "integer"}'

with open("foo.json", mode="w") as fh:
    json.dump(obj=json_schema, fp=fh)  # writes the JSON-serialized object
                                       # to the open file handle
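As a sanity check, json.loads inverts json.dumps: a serialized schema parses back into an equal Python object, which is what makes JSON a safe exchange format between taps and targets.

```python
import json

# Round trip: json.loads parses the string that json.dumps produced,
# yielding an object equal to the original.
age_schema = {"maximum": 130, "minimum": 1, "type": "integer"}
text = json.dumps(age_schema)
restored = json.loads(text)
print(restored == age_schema)  # True
```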


Let’s practice!
Running an ingestion pipeline with Singer
Streaming record messages
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

singer.write_record(stream_name="DC_employees",
                    record=dict(zip(columns, users.pop())))

{"type": "RECORD", "stream": "DC_employees", "record": {"id": 1, "name": "Adrian", "age": 32, "has_children": false}}

fixed_dict = {"type": "RECORD", "stream": "DC_employees"}
record_msg = {**fixed_dict, "record": dict(zip(columns, users.pop()))}
print(json.dumps(record_msg))
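All users can be streamed in one go by looping over them. The sketch below builds the same RECORD messages by hand, without requiring the singer package; it uses a list rather than a set so the output order is deterministic, and `record_messages` is a made-up helper name:

```python
import json

columns = ("id", "name", "age", "has_children")
users = [(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)]

def record_messages(stream_name, rows):
    # Yield one serialized RECORD message per row, mirroring the shape
    # that singer.write_record() prints to stdout.
    for row in rows:
        yield json.dumps({"type": "RECORD",
                          "stream": stream_name,
                          "record": dict(zip(columns, row))})

for line in record_messages("DC_employees", users):
    print(line)
```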


Chaining taps and targets
# Module: my_tap.py
import singer

singer.write_schema(stream_name="foo", schema=…)

singer.write_records(stream_name="foo", records=…)

Ingestion pipeline: pipe the tap’s output into a Singer target using the | symbol (Linux & macOS)

python my_tap.py | target-csv


python my_tap.py | target-csv --config userconfig.cfg
my-packaged-tap | target-csv --config userconfig.cfg
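What a target does with the piped stream can itself be sketched in a few lines. The toy target below (extract_records is a made-up name, not part of Singer or target-csv) keeps only the RECORD payloads; a real target also interprets SCHEMA and STATE messages:

```python
import json

def extract_records(lines):
    # Parse each incoming Singer message and keep only the "record"
    # payloads of RECORD messages.
    records = []
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.append(msg["record"])
    return records

# In a real pipeline the lines would arrive on sys.stdin, e.g.:
#   python my_tap.py | python my_toy_target.py
sample = ['{"type": "SCHEMA", "stream": "foo", "schema": {}}',
          '{"type": "RECORD", "stream": "foo", "record": {"id": 1}}']
print(extract_records(sample))  # [{'id': 1}]
```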


Modular ingestion pipelines
my-packaged-tap | target-csv
my-packaged-tap | target-google-sheets
my-packaged-tap | target-postgresql --config conf.json

tap-custom-google-scraper | target-postgresql --config headlines.json


Keeping track with state messages
id  name     last_updated_on
1   Adrian   2019-06-14T14:00:04.000+02:00
2   Ruanne   2019-06-16T18:33:21.000+02:00
3   Hillary  2019-06-14T10:05:12.000+02:00

singer.write_state(value={"max-last-updated-on": some_variable})

Running this tap-mydelta on 2019-06-14 at 12:00:00.000+02:00 (the 2nd row wasn’t yet present then) would end by emitting:

{"type": "STATE", "value": {"max-last-updated-on": "2019-06-14T10:05:12.000+02:00"}}
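On its next run, the tap can use that bookmark to emit only rows updated since the previous run. A sketch, assuming all timestamps share the same UTC offset so that plain string comparison orders them chronologically (the variable names are illustrative):

```python
# Rows as (id, name, last_updated_on), matching the table above.
rows = [(1, "Adrian",  "2019-06-14T14:00:04.000+02:00"),
        (2, "Ruanne",  "2019-06-16T18:33:21.000+02:00"),
        (3, "Hillary", "2019-06-14T10:05:12.000+02:00")]

# Bookmark recovered from the last STATE message.
bookmark = "2019-06-14T10:05:12.000+02:00"

# Emit only rows strictly newer than the bookmark.
new_rows = [r for r in rows if r[2] > bookmark]
print([name for _, name, _ in new_rows])  # ['Adrian', 'Ruanne']
```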


Let’s practice!
