Building Data Pipelines - 1

This document discusses components of a data platform, including ingesting data using Singer, applying data cleaning operations, gaining insights with PySpark, testing and deploying pipelines. It covers Singer concepts like specifying schemas and streaming data as records and state messages. Examples demonstrate describing data schemas, serializing to JSON, running ingestion pipelines by chaining taps and targets, and using state messages to track progress.

Components of a data platform
BUILDING DATA ENGINEERING PIPELINES IN PYTHON

Oliver Willekens
Data Engineer at Data Minded
Course contents
ingest data using Singer

apply common data cleaning operations

gain insights by combining data with PySpark

test your code automatically

deploy Spark transformation pipelines

=> intro to data engineering pipelines


Data is valuable


Democratizing data increases insights


Genesis of the data


Operational data is stored in the landing zone


Cleaned data prevents rework


The business layer provides most insights


Pipelines move data from one zone to another


Let’s reason!
Introduction to data ingestion with Singer


Singer’s core concepts
Aim: “The open-source standard for writing scripts
that move data”

Singer is a specification

data exchange format: JSON

extract and load with taps and targets


=> language independent

communicate over streams:


schema (metadata)

state (process metadata)

record (data)



Describing the data through its schema
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

json_schema = {
    "properties": {"age": {"maximum": 130,
                           "minimum": 1,
                           "type": "integer"},
                   "has_children": {"type": "boolean"},
                   "id": {"type": "integer"},
                   "name": {"type": "string"}},
    "$id": "http://yourdomain.com/schemas/my_user_schema.json",
    "$schema": "http://json-schema.org/draft-07/schema#"}
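The constraints in such a schema can also be enforced in code before records are written. Below is a minimal stdlib-only sketch that checks a record against the age rules above; `age_is_valid` is a made-up helper for illustration, and a complete solution would typically use a dedicated validator such as the jsonschema package (not assumed here):

```python
# Minimal sketch: check one record against the "age" constraints of the
# JSON Schema shown above, using only the standard library.
json_schema = {
    "properties": {"age": {"maximum": 130,
                           "minimum": 1,
                           "type": "integer"},
                   "has_children": {"type": "boolean"},
                   "id": {"type": "integer"},
                   "name": {"type": "string"}}}

def age_is_valid(record):
    # True only when "age" is an integer within [minimum, maximum].
    rules = json_schema["properties"]["age"]
    age = record.get("age")
    return (isinstance(age, int)
            and rules["minimum"] <= age <= rules["maximum"])

print(age_is_valid({"id": 1, "name": "Adrian", "age": 32}))    # True
print(age_is_valid({"id": 4, "name": "Bogus", "age": 300}))    # False
```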


Describing the data through its schema
import singer

singer.write_schema(schema=json_schema,
                    stream_name='DC_employees',
                    key_properties=["id"])

{"type": "SCHEMA", "stream": "DC_employees", "schema": {"properties": {"age": {"maximum": 130, "minimum": 1, "type": "integer"}, "has_children": {"type": "boolean"}, "id": {"type": "integer"}, "name": {"type": "string"}}, "$id": "http://yourdomain.com/schemas/my_user_schema.json", "$schema": "http://json-schema.org/draft-07/schema#"}, "key_properties": ["id"]}


Serializing JSON
import json

json.dumps(json_schema["properties"]["age"])

'{"maximum": 130, "minimum": 1, "type": "integer"}'

with open("foo.json", mode="w") as fh:
    json.dump(obj=json_schema, fp=fh)  # writes the JSON-serialized object
                                       # to the open file handle
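As a sanity check, json.loads inverts json.dumps: a serialized schema parses back into an equal Python object, which is what makes JSON a safe exchange format between taps and targets.

```python
import json

# Round trip: json.loads parses the string that json.dumps produced,
# yielding an object equal to the original.
age_schema = {"maximum": 130, "minimum": 1, "type": "integer"}
text = json.dumps(age_schema)
restored = json.loads(text)
print(restored == age_schema)  # True
```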


Let’s practice!
Running an ingestion pipeline with Singer
Streaming record messages
columns = ("id", "name", "age", "has_children")
users = {(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)}

singer.write_record(stream_name="DC_employees",
                    record=dict(zip(columns, users.pop())))

{"type": "RECORD", "stream": "DC_employees", "record": {"id": 1, "name": "Adrian", "age": 32, "has_children": false}}

fixed_dict = {"type": "RECORD", "stream": "DC_employees"}
record_msg = {**fixed_dict, "record": dict(zip(columns, users.pop()))}
print(json.dumps(record_msg))
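All users can be streamed in one go by looping over them. The sketch below builds the same RECORD messages by hand, without requiring the singer package; it uses a list rather than a set so the output order is deterministic, and `record_messages` is a made-up helper name:

```python
import json

columns = ("id", "name", "age", "has_children")
users = [(1, "Adrian", 32, False),
         (2, "Ruanne", 28, False),
         (3, "Hillary", 29, True)]

def record_messages(stream_name, rows):
    # Yield one serialized RECORD message per row, mirroring the shape
    # that singer.write_record() prints to stdout.
    for row in rows:
        yield json.dumps({"type": "RECORD",
                          "stream": stream_name,
                          "record": dict(zip(columns, row))})

for line in record_messages("DC_employees", users):
    print(line)
```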


Chaining taps and targets
# Module: my_tap.py
import singer

singer.write_schema(stream_name="foo", schema=…)

singer.write_records(stream_name="foo", records=…)

Ingestion pipeline: pipe the tap’s output into a Singer target using the | symbol (Linux & macOS)

python my_tap.py | target-csv


python my_tap.py | target-csv --config userconfig.cfg
my-packaged-tap | target-csv --config userconfig.cfg
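What a target does with the piped stream can itself be sketched in a few lines. The toy target below (extract_records is a made-up name, not part of Singer or target-csv) keeps only the RECORD payloads; a real target also interprets SCHEMA and STATE messages:

```python
import json

def extract_records(lines):
    # Parse each incoming Singer message and keep only the "record"
    # payloads of RECORD messages.
    records = []
    for line in lines:
        msg = json.loads(line)
        if msg.get("type") == "RECORD":
            records.append(msg["record"])
    return records

# In a real pipeline the lines would arrive on sys.stdin, e.g.:
#   python my_tap.py | python my_toy_target.py
sample = ['{"type": "SCHEMA", "stream": "foo", "schema": {}}',
          '{"type": "RECORD", "stream": "foo", "record": {"id": 1}}']
print(extract_records(sample))  # [{'id': 1}]
```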


Modular ingestion pipelines
my-packaged-tap | target-csv
my-packaged-tap | target-google-sheets
my-packaged-tap | target-postgresql --config conf.json

tap-custom-google-scraper | target-postgresql --config headlines.json


Keeping track with state messages
id  name     last_updated_on
1   Adrian   2019-06-14T14:00:04.000+02:00
2   Ruanne   2019-06-16T18:33:21.000+02:00
3   Hillary  2019-06-14T10:05:12.000+02:00

singer.write_state(value={"max-last-updated-on": some_variable})

Running this tap-mydelta on 2019-06-14 at 12:00:00.000+02:00 (the 2nd row wasn’t yet present then) would end by emitting:

{"type": "STATE", "value": {"max-last-updated-on": "2019-06-14T10:05:12.000+02:00"}}
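On its next run, the tap can use that bookmark to emit only rows updated since the previous run. A sketch, assuming all timestamps share the same UTC offset so that plain string comparison orders them chronologically (the variable names are illustrative):

```python
# Rows as (id, name, last_updated_on), matching the table above.
rows = [(1, "Adrian",  "2019-06-14T14:00:04.000+02:00"),
        (2, "Ruanne",  "2019-06-16T18:33:21.000+02:00"),
        (3, "Hillary", "2019-06-14T10:05:12.000+02:00")]

# Bookmark recovered from the last STATE message.
bookmark = "2019-06-14T10:05:12.000+02:00"

# Emit only rows strictly newer than the bookmark.
new_rows = [r for r in rows if r[2] > bookmark]
print([name for _, name, _ in new_rows])  # ['Adrian', 'Ruanne']
```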


Let’s practice!
