Data Engineering Toolbox
INTRODUCTION TO DATA ENGINEERING
Vincent Vankrunkelsven
Data Engineer @ DataCamp
What are databases?
Holds data
Organizes data
Structured: relational database
Semi-structured: JSON
Unstructured: videos, photos
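The difference between a relational database and JSON data can be sketched with Python's standard library. This is only an illustration: the `customers` table, its columns, and the sample records are made up for the example and do not come from the slides.

```python
import json
import sqlite3

# Structured: a relational table with a fixed schema.
# The table and column names here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Grace", "New York")],
)
london_names = [
    row[0]
    for row in conn.execute("SELECT name FROM customers WHERE city = ?", ("London",))
]

# Semi-structured: JSON documents, where each record may carry different fields.
raw = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace", "tags": ["vip"]}]'
documents = json.loads(raw)
tagged_ids = [doc["id"] for doc in documents if "tags" in doc]

print(london_names)
print(tagged_ids)
```

The table rejects rows that do not fit its columns, while the JSON records can differ in shape, so the code has to check whether a field such as `tags` is present.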
Idea behind parallel computing
Basis of modern data processing tools
Memory
Processing power
Idea
Parallel slowdown:
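One way to picture parallel slowdown is a toy cost model. The sketch below assumes the work splits evenly across workers and that every additional worker adds a fixed coordination overhead. Both the formula and the numbers are illustrative assumptions, not measurements.

```python
def estimated_runtime(work_seconds, workers, overhead_per_worker):
    """Toy model: even split of the work plus a fixed overhead per extra worker."""
    return work_seconds / workers + overhead_per_worker * (workers - 1)


# Assumed figures: 60 seconds of serial work, 5 seconds of overhead per extra worker.
runtimes = {n: estimated_runtime(60.0, n, 5.0) for n in (1, 2, 4, 8, 16)}
best = min(runtimes, key=runtimes.get)

for n, seconds in runtimes.items():
    print(f"{n:>2} workers: {seconds:.2f} s")
print("fastest worker count:", best)
```

Under these assumptions the runtime falls at first and then rises again. Beyond a certain number of workers the overhead outweighs the gain from splitting the task.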
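import pandas as pd
from multiprocessing import Pool

# Compute the mean age for one (year, group) pair produced by groupby
def take_mean_age(year_and_group):
    year, group = year_and_group
    return pd.DataFrame({"Age": group["Age"].mean()}, index=[year])

# Spread the per-year groups over 4 worker processes, then combine the results
with Pool(4) as p:
    results = p.map(take_mean_age, athlete_events.groupby("Year"))
result_df = pd.concat(results)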
import dask.dataframe as dd
Hadoop: HDFS
Hive: runs on Hadoop
Spark: DataFrame abstraction
An example pipeline
How to schedule?
Manually
Set of nodes
Directed edges
No cycles
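The three properties above (nodes, directed edges, no cycles) can be sketched with Python's standard `graphlib` module. The task names are taken from the pipeline example in this deck. The dependencies between them are an assumption made for illustration, and `has_cycle` is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps to the set of tasks it depends on.
# Task names come from the example pipeline; the dependencies are assumed.
dependencies = {
    "start_cluster": set(),
    "ingest_customer_data": {"start_cluster"},
    "ingest_product_data": {"start_cluster"},
    "enrich_customer_data": {"ingest_customer_data", "ingest_product_data"},
}

# A valid run order exists only because the graph has no cycles.
order = list(TopologicalSorter(dependencies).static_order())
print(order)


def has_cycle(graph):
    """Return True if no valid run order exists for the graph (hypothetical helper)."""
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False
```

A scheduler can run tasks in any order that respects the edges. If two tasks depended on each other, no such order would exist, which is why a workflow graph must stay free of cycles.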
Linux's cron
Spotify's Luigi
Apache Airflow
DAGs
Python
# Define operations
start_cluster = StartClusterOperator(task_id="start_cluster", dag=dag)
ingest_customer_data = SparkJobOperator(task_id="ingest_customer_data", dag=dag)
ingest_product_data = SparkJobOperator(task_id="ingest_product_data", dag=dag)
enrich_customer_data = PythonOperator(task_id="enrich_customer_data", ..., dag=dag)