Databases
Introduction to Data Engineering
Vincent Vankrunkelsven
Data Engineer @ DataCamp
What are databases?
Holds data
Organizes data
Retrieves and searches data through a DBMS (database management system)

"A usually large collection of data organized especially for rapid search and retrieval."
Databases and file storage
Databases: very organized; functionality like search, replication, ...
File systems: less organized; simple, with less added functionality
Structured and unstructured data
Structured: database schema
Relational database
Semi-structured: JSON, e.g. { "key": "value" }
Unstructured: schemaless, more like files
Videos, photos
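The difference between the three is easiest to see with semi-structured data: each JSON record carries its own structure, and records need not share fields. A small sketch using Python's standard json module (the record contents are made up for illustration):

```python
import json

# Two valid JSON records that do not share the same fields --
# there is no single rigid schema to enforce upfront.
records = [
    '{"name": "Vincent", "role": "Data Engineer"}',
    '{"name": "Alice", "skills": ["SQL", "Spark"], "address": {"city": "Leuven"}}',
]

parsed = [json.loads(r) for r in records]
print(parsed[0]["name"])             # Vincent
print(parsed[1]["address"]["city"])  # Leuven
```

A relational database would reject the second record unless every column existed in the schema; JSON happily nests and omits fields per record.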
SQL and NoSQL
SQL:
Tables
Database schema
Relational databases

NoSQL:
Non-relational databases
Structured or unstructured
Key-value stores (e.g. caching)
Document DB (e.g. JSON objects)
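A key-value store maps a single key to an opaque value, which is why it fits caching so well. A minimal in-memory sketch in Python (the class and method names are illustrative, not a real client API such as Redis):

```python
class KeyValueCache:
    """Toy key-value store: values are looked up by a single key."""

    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)


cache = KeyValueCache()
cache.set("session:42", {"user": "vincent"})
print(cache.get("session:42"))  # {'user': 'vincent'}
print(cache.get("missing"))     # None
```

Real key-value stores add persistence, expiry, and networking, but the access pattern — no joins, no schema, just keys — is the same.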
SQL: The database schema
-- Create Customer table
CREATE TABLE "Customer" (
  "id" SERIAL NOT NULL,
  "first_name" varchar,
  "last_name" varchar,
  PRIMARY KEY ("id")
);

-- Create Order table
CREATE TABLE "Order" (
  "id" SERIAL NOT NULL,
  -- Foreign key referencing Customer
  "customer_id" integer REFERENCES "Customer",
  "product_name" varchar,
  "product_price" integer,
  PRIMARY KEY ("id")
);

-- Join both tables on the foreign key
SELECT * FROM "Customer"
INNER JOIN "Order"
ON "Order"."customer_id" = "Customer"."id";

id | first_name | ... | product_price
 1 | Vincent    | ... | 10
SQL: Star schema
The star schema consists of one or more fact tables referencing any number of dimension tables.
Facts: things that happened (e.g. product orders)
Dimensions: information on the world (e.g. customer information)
1 Wikipedia: https://en.wikipedia.org/wiki/Star_schema
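A minimal star-schema sketch, runnable with Python's built-in sqlite3 (the table and column names are made up for illustration): one fact table of orders referencing a customer dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension: information on the world (customers)
conn.execute("""CREATE TABLE dim_customer (
    id INTEGER PRIMARY KEY,
    first_name TEXT,
    country TEXT
)""")

# Fact: things that happened (product orders), referencing the dimension
conn.execute("""CREATE TABLE fact_orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(id),
    product_name TEXT,
    product_price INTEGER
)""")

conn.execute("INSERT INTO dim_customer VALUES (1, 'Vincent', 'BE')")
conn.execute("INSERT INTO fact_orders VALUES (1, 1, 'shirt', 10)")

# A query joins the fact back to its dimension
row = conn.execute("""
    SELECT c.first_name, o.product_name, o.product_price
    FROM fact_orders o
    JOIN dim_customer c ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Vincent', 'shirt', 10)
```

In a real warehouse the fact table is typically huge and append-only, while dimensions stay small and descriptive.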
Let's practice!
What is parallel computing
Idea behind parallel computing
Basis of modern data processing tools
Memory
Processing power
Idea
Split task into subtasks
Distribute subtasks over several computers
Work together to finish the task
The tailor shop
Running a tailor shop
Goal: 100 shirts
Best tailor finishes a shirt every 20 minutes
Other tailors finish a shirt every hour
Multiple tailors working together outperform the best tailor working alone
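A back-of-the-envelope check of that claim, assuming the best tailor works with four ordinary tailors (the headcount is an assumption for illustration):

```python
shirts = 100

# Best tailor alone: 1 shirt per 20 minutes
best_alone = shirts * 20            # 2000 minutes

# Best tailor + 4 ordinary tailors: combined rate in shirts/minute
rate = 1 / 20 + 4 * (1 / 60)        # 7/60 shirts per minute
together = shirts / rate            # ~857 minutes

print(best_alone, round(together))  # 2000 857
```

Even though each helper is three times slower than the best tailor, the team finishes in well under half the time.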
Benefits of parallel computing
Processing power
Memory: partition the dataset
RAM memory chip
Risks of parallel computing
Overhead due to communication
Parallel slowdown: adding more processing units can slow the task down
Task needs to be large
Need several processing units
An example
multiprocessing.Pool
import pandas as pd
from multiprocessing import Pool

def take_mean_age(year_and_group):
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))

result_df = pd.concat(results)
dask
import dask.dataframe as dd
# Partition dataframe into 4
athlete_events_dask = dd.from_pandas(athlete_events, npartitions=4)
# Run parallel computations on each partition
result_df = athlete_events_dask.groupby('Year').Age.mean().compute()
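The partition-then-combine idea dask applies here can be sketched without any framework: compute a partial (sum, count) pair per partition, then merge the partials into the global mean. A plain-Python sketch (the data is made up):

```python
ages = [23, 25, 31, 24, 28, 30, 27, 22]

# Split the dataset into 4 partitions
n = 4
partitions = [ages[i::n] for i in range(n)]

# Per-partition partial results -- each could run on a different machine
partials = [(sum(p), len(p)) for p in partitions]

# Combine the partials into the global mean
total, count = map(sum, zip(*partials))
print(total / count)  # 26.25
```

The key point is that each partition only needs to fit in one worker's memory, and only the tiny (sum, count) pairs travel between machines.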
Let's practice!
Parallel computation frameworks
HDFS
MapReduce
Hive
Runs on Hadoop
SQL-like query language: Hive SQL
Initially ran on MapReduce, now integrates with other data processing tools
Hive: an example
SELECT year, AVG(age)
FROM views.athlete_events
GROUP BY year
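The query reads like standard SQL, which is the point of Hive: the same statement runs almost unchanged against any SQL engine. Here it is executed against SQLite, with a tiny made-up table standing in for views.athlete_events:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE athlete_events (year INTEGER, age REAL)")
conn.executemany(
    "INSERT INTO athlete_events VALUES (?, ?)",
    [(2012, 24), (2012, 28), (2016, 30)],
)

rows = sorted(conn.execute(
    "SELECT year, AVG(age) FROM athlete_events GROUP BY year"
).fetchall())
print(rows)  # [(2012, 26.0), (2016, 30.0)]
```

The difference with Hive is not the query but the engine behind it: Hive compiles this into distributed jobs over a cluster instead of scanning a local file.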
Apache Spark
Avoid disk writes
Maintained by Apache Software Foundation
Resilient distributed datasets (RDD)
Spark relies on them
Similar to list of tuples
Transformations: .map() or .filter()
Actions: .count() or .first()
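The transformation/action split can be mimicked on a plain Python list of tuples. This is a sketch of the semantics, not the real Spark API: like Spark transformations, Python's built-in map and filter are lazy and nothing is computed until a result is demanded.

```python
# An "RDD" as a list of (year, age) tuples
rdd = [(2012, 24), (2012, 28), (2016, 30), (2016, 22)]

# Transformations: describe a new dataset, nothing runs yet
adults = filter(lambda t: t[1] >= 23, rdd)   # like .filter()
ages = map(lambda t: t[1], adults)           # like .map()

# Actions: actually force a result
materialized = list(ages)
print(len(materialized))   # like .count() -> 3
print(materialized[0])     # like .first() -> 24
```

On a real RDD the transformations also record lineage, so lost partitions can be recomputed — hence "resilient".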
PySpark
Python interface to Spark
DataFrame abstraction
Looks similar to Pandas
PySpark: an example
# Load the dataset into athlete_events_spark first

(athlete_events_spark
    .groupBy('Year')
    .mean('Age')
    .show())

# The equivalent Hive query:
# SELECT year, AVG(age)
# FROM views.athlete_events
# GROUP BY year
Let's practice!
Workflow scheduling frameworks
An example pipeline
How to schedule?
Manually
cron scheduling tool
What about dependencies?
DAGs
Directed Acyclic Graph
Set of nodes
Directed edges
No cycles
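The "no cycles" property is what guarantees the tasks can actually be scheduled: a valid run order exists exactly when the graph is acyclic. A short cycle check based on Kahn's algorithm (node names are illustrative):

```python
def has_cycle(edges):
    """edges: dict mapping a node to the list of its downstream nodes."""
    nodes = set(edges) | {d for ds in edges.values() for d in ds}
    # Count incoming edges per node
    indegree = {n: 0 for n in nodes}
    for ds in edges.values():
        for d in ds:
            indegree[d] += 1
    # Repeatedly remove nodes that have no remaining incoming edges
    queue = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for d in edges.get(n, []):
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    return seen < len(nodes)  # leftover nodes => a cycle


print(has_cycle({"a": ["b", "c"], "b": ["d"], "c": ["d"]}))  # False
print(has_cycle({"a": ["b"], "b": ["a"]}))                   # True
```

Schedulers run the same removal order forward: a node becomes runnable once all its incoming edges are satisfied.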
The tools for the job
Linux's cron
Spotify's Luigi
Apache Airflow
Apache Airflow
Created at Airbnb
DAGs
Python
Airflow: an example DAG
Airflow: an example in code
# Create the DAG object
dag = DAG(dag_id="example_dag", ..., schedule_interval="0 * * * *")
# Define operations
start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag)
ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag)
ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag)
enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag=dag)
# Set up dependency flow
start_cluster.set_downstream(ingest_customer_data)
ingest_customer_data.set_downstream(enrich_customer_data)
ingest_product_data.set_downstream(enrich_customer_data)
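How those set_downstream calls turn into an execution order can be sketched with a toy Task class — these are not the real Airflow classes, just the dependency idea: a task may run only once all of its upstream tasks have run.

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def set_downstream(self, other):
        # "self must run before other"
        other.upstream.append(self)


def run_order(tasks):
    """Return task_ids in an order that respects all dependencies."""
    done, order = set(), []
    while len(order) < len(tasks):
        for t in tasks:
            if t.task_id not in done and all(u.task_id in done for u in t.upstream):
                done.add(t.task_id)
                order.append(t.task_id)
    return order


start = Task("start_cluster")
ingest_c = Task("ingest_customer_data")
ingest_p = Task("ingest_product_data")
enrich = Task("enrich_customer_data")

start.set_downstream(ingest_c)
ingest_c.set_downstream(enrich)
ingest_p.set_downstream(enrich)

print(run_order([start, ingest_c, ingest_p, enrich]))
```

Note that enrich_customer_data comes last: it waits for both ingest tasks, while the two ingests themselves could run in parallel.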
Let's practice!