Databases
Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
What are databases?
Holds data
Organizes data
Retrieves and searches data through a DBMS (database management system)

"A usually large collection of data organized especially for rapid search and retrieval."
Databases and file storage
Databases: very organized; functionality like search, replication, ...
File systems: less organized; simple, with less added functionality
Structured and unstructured data
Structured: database schema
Relational database
Semi-structured: JSON, e.g. { "key": "value" }
Unstructured: schemaless, more like files
Videos, photos
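The difference between the three is easiest to see with semi-structured data: each JSON record carries its own structure, and records need not share fields. A small sketch using Python's standard json module (the record contents are made up for illustration):

```python
import json

# Two valid JSON records that do not share the same fields --
# there is no single rigid schema to enforce upfront.
records = [
    '{"name": "Vincent", "role": "Data Engineer"}',
    '{"name": "Alice", "skills": ["SQL", "Spark"], "address": {"city": "Leuven"}}',
]

parsed = [json.loads(r) for r in records]
print(parsed[0]["name"])             # Vincent
print(parsed[1]["address"]["city"])  # Leuven
```

A relational database would reject the second record unless every column existed in the schema; JSON happily nests and omits fields per record.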
SQL and NoSQL
SQL:
Tables
Database schema
Relational databases

NoSQL:
Non-relational databases
Structured or unstructured
Key-value stores (e.g. caching)
Document DB (e.g. JSON objects)
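A key-value store maps a single key to an opaque value, which is why it fits caching so well. A minimal in-memory sketch in Python (the class and method names are illustrative, not a real client API such as Redis):

```python
class KeyValueCache:
    """Toy key-value store: values are looked up by a single key."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)


cache = KeyValueCache()
cache.set("session:42", {"user": "vincent"})
print(cache.get("session:42"))  # {'user': 'vincent'}
print(cache.get("missing"))     # None
```

Real key-value stores add persistence, expiry, and networking, but the access pattern — no joins, no schema, just keys — is the same.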
SQL: The database schema
-- Create Customer table
CREATE TABLE "Customer" (
  "id" SERIAL NOT NULL,
  "first_name" varchar,
  "last_name" varchar,
  PRIMARY KEY ("id")
);

-- Create Order table
CREATE TABLE "Order" (
  "id" SERIAL NOT NULL,
  -- Foreign key referencing Customer
  "customer_id" integer REFERENCES "Customer",
  "product_name" varchar,
  "product_price" integer,
  PRIMARY KEY ("id")
);

-- Join both tables on the foreign key
SELECT * FROM "Customer"
INNER JOIN "Order"
ON "Order"."customer_id" = "Customer"."id";

id | first_name | ... | product_price
 1 | Vincent    | ... | 10
SQL: Star schema
The star schema consists of one or more fact tables referencing any number of dimension tables.
Facts: things that happened (e.g. product orders)
Dimensions: information on the world (e.g. customer information)
1 Wikipedia: https://en.wikipedia.org/wiki/Star_schema
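A minimal star-schema sketch, runnable with Python's built-in sqlite3 (the table and column names are made up for illustration): one fact table of orders referencing a customer dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension: information on the world (customers)
conn.execute("""CREATE TABLE dim_customer (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    country TEXT
)""")

# Fact: things that happened (product orders), referencing the dimension
conn.execute("""CREATE TABLE fact_orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(id),
    product_name TEXT,
    product_price INTEGER
)""")

conn.execute("INSERT INTO dim_customer VALUES (1, 'Vincent', 'BE')")
conn.execute("INSERT INTO fact_orders VALUES (1, 1, 'shirt', 10)")

# A query joins the fact back to its dimension
row = conn.execute("""
    SELECT c.first_name, o.product_name, o.product_price
    FROM fact_orders o
    JOIN dim_customer c ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Vincent', 'shirt', 10)
```

In a real warehouse the fact table is typically huge and append-only, while dimensions stay small and descriptive.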
Let's practice!
What is parallel computing
Idea behind parallel computing
Basis of modern data processing tools
Memory
Processing power
Idea
Split task into subtasks
Distribute subtasks over several computers
Work together to finish the task
The tailor shop
Running a tailor shop
Goal: 100 shirts
Best tailor finishes a shirt every 20 minutes
Other tailors finish a shirt every hour
Multiple tailors working together outperform the best tailor working alone
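A back-of-the-envelope check of that claim, assuming the best tailor works with four ordinary tailors (the headcount is an assumption for illustration):

```python
shirts = 100

# Best tailor alone: 1 shirt per 20 minutes
best_alone = shirts * 20            # 2000 minutes

# Best tailor + 4 ordinary tailors: combined rate in shirts/minute
rate = 1 / 20 + 4 * (1 / 60)        # 7/60 shirts per minute
together = shirts / rate            # ~857 minutes

print(best_alone, round(together))  # 2000 857
```

Even though each helper is three times slower than the best tailor, the team finishes in well under half the time.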
Benefits of parallel computing
Processing power
Memory: partition the dataset
RAM memory chip
Risks of parallel computing
Overhead due to communication
Parallel slowdown: adding more processing units can slow the task down
Task needs to be large
Need several processing units
An example
multiprocessing.Pool
import pandas as pd
from multiprocessing import Pool

def take_mean_age(year_and_group):
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))

result_df = pd.concat(results)
dask
import dask.dataframe as dd
# Partition dataframe into 4
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)
# Run parallel computations on each partition
result_df = athlete_events_dask.groupby('Year').Age.mean().compute()
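The partition-then-combine idea dask applies here can be sketched without any framework: compute a partial (sum, count) pair per partition, then merge the partials into the global mean. A plain-Python sketch (the data is made up):

```python
ages = [23, 25, 31, 24, 28, 30, 27, 22]

# Split the dataset into 4 partitions
n = 4
partitions = [ages[i::n] for i in range(n)]

# Per-partition partial results -- each could run on a different machine
partials = [(sum(p), len(p)) for p in partitions]

# Combine the partials into the global mean
total, count = map(sum, zip(*partials))
print(total / count)  # 26.25
```

The key point is that each partition only needs to fit in one worker's memory, and only the tiny (sum, count) pairs travel between machines.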
Let's practice!
Parallel computation frameworks
HDFS
MapReduce
Hive
Runs on Hadoop
SQL-like query language: Hive SQL
Initially ran on MapReduce, now integrates with other data processing tools
Hive: an example
SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year
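The query reads like standard SQL, which is the point of Hive: the same statement runs almost unchanged against any SQL engine. Here it is executed against SQLite, with a tiny made-up table standing in for views.athlete_events:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athlete_events (year INTEGER, age REAL)")
conn.executemany(
    "INSERT INTO athlete_events VALUES (?, ?)",
    [(2012, 24), (2012, 28), (2016, 30)],
)

rows = sorted(conn.execute(
    "SELECT year, AVG(age) FROM athlete_events GROUP BY year"
).fetchall())
print(rows)  # [(2012, 26.0), (2016, 30.0)]
```

The difference with Hive is not the query but the engine behind it: Hive compiles this into distributed jobs over a cluster instead of scanning a local file.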
Apache Spark
Avoid disk writes
Maintained by Apache Software Foundation
Resilient distributed datasets (RDD)
Spark relies on them
Similar to list of tuples
Transformations: .map() or .filter()
Actions: .count() or .first()
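The transformation/action split can be mimicked on a plain Python list of tuples. This is a sketch of the semantics, not the real Spark API: like Spark transformations, Python's built-in map and filter are lazy and nothing is computed until a result is demanded.

```python
# An "RDD" as a list of (year, age) tuples
rdd = [(2012, 24), (2012, 28), (2016, 30), (2016, 22)]

# Transformations: describe a new dataset, nothing runs yet
adults = filter(lambda t: t[1] >= 23, rdd)   # like .filter()
ages = map(lambda t: t[1], adults)           # like .map()

# Actions: actually force a result
materialized = list(ages)
print(len(materialized))   # like .count() -> 3
print(materialized[0])     # like .first() -> 24
```

On a real RDD the transformations also record lineage, so lost partitions can be recomputed — hence "resilient".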
PySpark
Python interface to Spark
DataFrame abstraction
Looks similar to Pandas
PySpark: an example
# Load the dataset into athlete_events_spark first

(athlete_events_spark
    .groupBy('Year')
    .mean('Age')
    .show())

# The equivalent Hive query:
# SELECT year, AVG(age)
# FROM views.athlete_events
# GROUP BY year
Let's practice!
Workflow scheduling frameworks
An example pipeline
How to schedule?
Manually
cron scheduling tool
What about dependencies?
DAGs
Directed Acyclic Graph
Set of nodes
Directed edges
No cycles
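The "no cycles" property is what guarantees the tasks can actually be scheduled: a valid run order exists exactly when the graph is acyclic. A short cycle check based on Kahn's algorithm (node names are illustrative):

```python
def has_cycle(edges):
    """edges: dict mapping a node to the list of its downstream nodes."""
    nodes = set(edges) | {d for ds in edges.values() for d in ds}
    # Count incoming edges per node
    indegree = {n: 0 for n in nodes}
    for ds in edges.values():
        for d in ds:
            indegree[d] += 1
    # Repeatedly remove nodes that have no remaining incoming edges
    queue = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for d in edges.get(n, []):
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    return seen < len(nodes)  # leftover nodes => a cycle


print(has_cycle({"a": ["b", "c"], "b": ["d"], "c": ["d"]}))  # False
print(has_cycle({"a": ["b"], "b": ["a"]}))                   # True
```

Schedulers run the same removal order forward: a node becomes runnable once all its incoming edges are satisfied.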
The tools for the job
Linux's cron
Spotify's Luigi
Apache Airflow
Apache Airflow
Created at Airbnb
DAGs
Python
Airflow: an example DAG
Airflow: an example in code
# Create the DAG object
dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")
# Define operations
start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag)
ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag)
ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag)
enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag=dag)
# Set up dependency flow
start_cluster.set_downstream(ingest_customer_data)
ingest_customer_data.set_downstream(enrich_customer_data)
ingest_product_data.set_downstream(enrich_customer_data)
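How those set_downstream calls turn into an execution order can be sketched with a toy Task class — these are not the real Airflow classes, just the dependency idea: a task may run only once all of its upstream tasks have run.

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def set_downstream(self, other):
        # "self must run before other"
        other.upstream.append(self)


def run_order(tasks):
    """Return task_ids in an order that respects all dependencies."""
    done, order = set(), []
    while len(order) < len(tasks):
        for t in tasks:
            if t.task_id not in done and all(u.task_id in done for u in t.upstream):
                done.add(t.task_id)
                order.append(t.task_id)
    return order


start = Task("start_cluster")
ingest_c = Task("ingest_customer_data")
ingest_p = Task("ingest_product_data")
enrich = Task("enrich_customer_data")

start.set_downstream(ingest_c)
ingest_c.set_downstream(enrich)
ingest_p.set_downstream(enrich)

print(run_order([start, ingest_c, ingest_p, enrich]))
```

Note that enrich_customer_data comes last: it waits for both ingest tasks, while the two ingests themselves could run in parallel.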
Let's practice!