Data Engineering Toolbox

Databases

INTRODUCTION TO DATA ENGINEERING

Vincent Vankrunkelsven
Data Engineer @ DataCamp
What are databases?

Holds data

Organizes data

Retrieve/search data through a DBMS (database management system)

A database is "a usually large collection of data organized especially for rapid search and retrieval."

INTRODUCTION TO DATA ENGINEERING


Databases and file storage
Databases

Very organized

Functionality like search, replication, ...

File systems

Less organized

Simple, less added functionality



Structured and unstructured data
Structured: database schema

Relational database

Semi-structured: JSON, e.g. {"key": "value"}

Unstructured: schemaless, more like files

Videos, photos
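A quick sketch of why JSON counts as semi-structured: each record carries its own field names, so the structure can vary per document (the document and its fields below are made up for illustration):

```python
import json

# A semi-structured document: field names travel with the data (hypothetical record)
doc = '{"key": "value", "tags": ["a", "b"]}'

record = json.loads(doc)
print(record["key"])        # -> value
print(len(record["tags"]))  # -> 2
```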



SQL and NoSQL
SQL

Tables

Database schema

Relational databases

NoSQL

Non-relational databases

Structured or unstructured

Key-value stores (e.g. caching)

Document DB (e.g. JSON objects)



SQL: The database schema
-- Create Customer table
CREATE TABLE "Customer" (
    "id" SERIAL NOT NULL,
    "first_name" varchar,
    "last_name" varchar,
    PRIMARY KEY ("id")
);

-- Create Order table
CREATE TABLE "Order" (
    "id" SERIAL NOT NULL,
    "customer_id" integer REFERENCES "Customer",
    "product_name" varchar,
    "product_price" integer,
    PRIMARY KEY ("id")
);

-- Join both tables on the foreign key
SELECT * FROM "Customer"
INNER JOIN "Order"
ON "Order"."customer_id" = "Customer"."id";

id | first_name | ... | product_price
 1 | Vincent    | ... | 10



SQL: Star schema
The star schema consists of one or more fact tables referencing any number of dimension
tables.

Facts: things that happened (e.g. Product Orders)

Dimensions: information on the world (e.g. Customer Information)

1 Wikipedia: https://en.wikipedia.org/wiki/Star_schema
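As a minimal sketch of the fact/dimension split, here is the star-schema join expressed with pandas (the tables and rows are hypothetical, echoing the Customer/Order example above):

```python
import pandas as pd

# Fact table: things that happened (one row per product order)
fact_orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_price": [10, 25, 10],
})

# Dimension table: information on the world (customer details)
dim_customer = pd.DataFrame({
    "id": [1, 2],
    "first_name": ["Vincent", "Ada"],
})

# The fact table references the dimension table through a foreign key
report = fact_orders.merge(dim_customer, left_on="customer_id", right_on="id")
print(report[["first_name", "product_price"]])
```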



Let's practice!
What is parallel computing

Vincent Vankrunkelsven
Data Engineer @ DataCamp
Idea behind parallel computing
Basis of modern data processing tools

Memory

Processing power

Idea

Split task into subtasks

Distribute subtasks over several computers

Work together to finish task



The tailor shop
Running a tailor shop

Goal: 100 shirts

Best tailor finishes a shirt every 20 minutes

Other tailors finish a shirt every hour

Multiple tailors working together outperform the best tailor working alone
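A back-of-the-envelope check of the tailor-shop numbers (the team makeup of four average tailors plus the best one is an assumption; the slide only fixes the individual rates):

```python
shirts = 100

# Best tailor alone: one shirt per 20 minutes
best_alone_hours = shirts * 20 / 60

# Assumed team: 4 average tailors (1 shirt/hour) plus the best (3 shirts/hour)
team_rate = 4 * 1 + 3  # shirts per hour
team_hours = shirts / team_rate

print(round(best_alone_hours, 1))  # -> 33.3
print(round(team_hours, 1))        # -> 14.3
```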



Benefits of parallel computing
Processing power

Memory: partition the dataset so each chunk fits into a single machine's RAM



Risks of parallel computing
Overhead due to communication between processing units

Parallel slowdown: adding more units does not always speed things up

Task needs to be large enough to outweigh the overhead

Need several processing units
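Amdahl's law (not named on the slide, but the standard way to quantify this trade-off) shows why the serial part of a task and its overhead cap the achievable speed-up:

```python
def speedup(parallel_fraction, n_units):
    """Amdahl's law: the serial fraction limits the overall speed-up."""
    serial_fraction = 1 - parallel_fraction
    return 1 / (serial_fraction + parallel_fraction / n_units)

# With 95% of the work parallelizable, more units give diminishing returns
print(round(speedup(0.95, 4), 2))    # -> 3.48
print(round(speedup(0.95, 100), 2))  # -> 16.81, far below 100x
```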



An example: mean athlete age per Olympic year (athlete_events dataset)


multiprocessing.Pool

from multiprocessing import Pool

import pandas as pd

def take_mean_age(year_and_group):
    # Each subtask receives one (year, group) pair from the groupby
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

# Distribute the groups over 4 worker processes
with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))

result_df = pd.concat(results)



dask

import dask.dataframe as dd

# Partition dataframe into 4
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)

# Run parallel computations on each partition
result_df = athlete_events_dask.groupby('Year').Age.mean().compute()



Let's practice!
Parallel computation frameworks

Vincent Vankrunkelsven
Data Engineer @ DataCamp
HDFS

The Hadoop Distributed File System: spreads data storage over a cluster of machines



MapReduce

A programming model that splits a job into subtasks, distributing the map and reduce steps over the cluster



Hive

Runs on Hadoop

Structured Query Language: Hive SQL

Initially built on MapReduce, now integrates with other data processing tools



Hive: an example

SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year



Apache Spark

Avoids disk writes by keeping as much processing as possible in memory

Maintained by Apache Software Foundation



Resilient distributed datasets (RDD)

Spark relies on them

Similar to list of tuples

Transformations: .map() or .filter()

Actions: .count() or .first()



PySpark

Python interface to Spark

DataFrame abstraction

Looks similar to Pandas



PySpark: an example
# Equivalent to the Hive query:
#   SELECT year, AVG(age)
#   FROM views.athlete_events
#   GROUP BY year

# Load the dataset into athlete_events_spark first
(athlete_events_spark
    .groupBy('Year')
    .mean('Age')
    .show())



Let's practice!
Workflow scheduling frameworks

Vincent Vankrunkelsven
Data Engineer @ DataCamp
An example pipeline

How to schedule?

Manually

With the cron scheduling tool

What about dependencies between jobs?



DAGs
Directed Acyclic Graph

Set of nodes

Directed edges

No cycles
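The three DAG properties can be checked mechanically; a minimal sketch using Python's standard-library graphlib, with task names borrowed from the Airflow example later in the deck:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Nodes and directed edges: each task maps to the set of tasks it depends on
dag = {
    "ingest_customer_data": {"start_cluster"},
    "ingest_product_data": {"start_cluster"},
    "enrich_customer_data": {"ingest_customer_data", "ingest_product_data"},
}

# static_order() succeeds only because the graph has no cycles
order = list(TopologicalSorter(dag).static_order())
print(order[0])   # -> start_cluster
print(order[-1])  # -> enrich_customer_data
```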



The tools for the job

Linux's cron

Spotify's Luigi

Apache Airflow



Apache Airflow

Created at Airbnb

Workflows as DAGs

Defined in Python



Airflow: an example DAG



Airflow: an example in code
# Create the DAG object
dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")

# Define operations
start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag)
ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag)
ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag)
enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag=dag)

# Set up dependency flow
start_cluster.set_downstream(ingest_customer_data)
ingest_customer_data.set_downstream(enrich_customer_data)
ingest_product_data.set_downstream(enrich_customer_data)



Let's practice!