Lab_01 - Data Engineering Practice

The document outlines a three-part process for data ingestion, processing, and orchestration using real-world datasets, specifically the New York Taxi Trips Data. It includes tasks for downloading the dataset, loading it into a database, transforming the data with Pandas and SQL, and automating the process with Apache Airflow. Additional resources for datasets and tutorials are provided, along with suggestions for further exploration in cloud deployment and real-time data ingestion.


Part 1: Data Ingestion & Storage

Task 1: Download a Real-world Dataset

Dataset: New York Taxi Trips Data


Download: NYC Taxi Data (Parquet format); a quick-read snippet follows below.
Alternative: Kaggle Datasets (CSV format).
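
For a quick start, the dataset can be pulled straight into Pandas. A minimal sketch, assuming the pyarrow engine is installed; the file URL below is illustrative, so check the TLC page for the current monthly links:

import pandas as pd

# Illustrative URL -- the TLC page lists one Parquet file per month.
url = ("https://d37ci6vzurychx.cloudfront.net/trip-data/"
       "yellow_tripdata_2023-01.parquet")

df = pd.read_parquet(url)  # requires pyarrow: pip install pandas pyarrow
print(df.shape)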

Task 2: Load Data into a Local Database

• Install and use PostgreSQL (or SQLite) as the database.

• Write a Python script to load data into the database.

Resources:

• PostgreSQL Installation Guide

• Pandas to PostgreSQL (Tutorial)

• SQLite Quickstart

Practice Steps:
1. Install PostgreSQL or SQLite.
2. Use Pandas to read the dataset.
3. Write a Python script to insert the data into the database (a sketch follows below).
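
A minimal sketch of such a script, assuming a downloaded yellow-taxi Parquet file and SQLite; for PostgreSQL, swap in a connection string such as postgresql://user:password@localhost:5432/taxi:

import pandas as pd
from sqlalchemy import create_engine

# File name is illustrative -- use whichever month you downloaded.
df = pd.read_parquet("yellow_tripdata_2023-01.parquet")

# SQLite keeps the lab self-contained; SQLAlchemy makes the engine swappable.
engine = create_engine("sqlite:///taxi.db")

# chunksize keeps memory use bounded when inserting a large month.
df.to_sql("trips", engine, if_exists="replace", index=False, chunksize=50_000)
print(f"Loaded {len(df):,} rows into table 'trips'")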
Part 2: Data Processing & Transformation
Task 3: Transform Data Using Pandas & SQL

• Filter out invalid data (e.g., negative trip distances).

• Convert datetime columns into proper formats.

• Aggregate data (e.g., average fare per trip).

Resources:

• SQL Basics (W3Schools)

• Pandas Data Transformations

Practice Steps:
1. Write SQL queries to clean the data.
2. Perform aggregations using Pandas (see the sketch below).
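
A minimal sketch of both steps, assuming the 'trips' table from Task 2 and the standard yellow-taxi column names (trip_distance, fare_amount, tpep_pickup_datetime):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///taxi.db")

# SQL handles the cleaning: drop trips with non-positive distance or fare.
clean = pd.read_sql(
    "SELECT * FROM trips WHERE trip_distance > 0 AND fare_amount > 0",
    engine,
)

# Pandas handles typing and aggregation: average fare per pickup day.
clean["tpep_pickup_datetime"] = pd.to_datetime(clean["tpep_pickup_datetime"])
avg_fare = clean.groupby(clean["tpep_pickup_datetime"].dt.date)["fare_amount"].mean()
print(avg_fare.head())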
Part 3: Data Orchestration with Apache Airflow
Task 4: Automate Data Processing with Airflow

• Install Apache Airflow (pip install apache-airflow).

• Create an Airflow DAG (Directed Acyclic Graph) to automate:

  - Ingesting data from the dataset.

  - Transforming data using SQL.

  - Storing results in a database.

Resources:

• Airflow Quickstart Guide

• Airflow DAGs Tutorial

Practice Steps:
1. Install and configure Airflow.
2. Write a DAG to automate data ingestion & transformation (a sketch follows below).
3. Schedule the DAG to run at a fixed interval (e.g., every 5 minutes or every hour).
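
A minimal sketch of such a DAG, written against Airflow 2.x (import paths and the schedule parameter differ slightly across major versions); the task bodies here are placeholders for the ingest and transform logic from the scripts above:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    # Placeholder: e.g. pd.read_parquet(...) followed by df.to_sql("trips", ...)
    pass

def transform():
    # Placeholder: e.g. the SQL cleanup and aggregation from Part 2
    pass

with DAG(
    dag_id="nyc_taxi_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(hours=1),  # or timedelta(minutes=5); Airflow <2.4 uses schedule_interval
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    ingest_task >> transform_task  # transform runs only after ingest succeeds
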
Additional Resources for Downloading Notebooks & Datasets
Open Datasets

1. Kaggle – https://www.kaggle.com/datasets

2. Google Dataset Search – https://datasetsearch.research.google.com/

3. AWS Open Data – https://registry.opendata.aws/

4. NYC Taxi Data – https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Jupyter Notebooks & Tutorials

1. DataTalksClub Data Engineering Zoomcamp – https://github.com/DataTalksClub/data-engineering-zoomcamp

2. Awesome Public Datasets (GitHub) – https://github.com/awesomedata/awesome-public-datasets

3. Pandas & SQL Practice Notebooks – https://github.com/jakevdp/Pandas-Tutorial

4. Apache Airflow Example DAGs – https://github.com/apache/airflow/tree/main/airflow/example_dags

What You Will Have Built in the Three Parts Above:

• Ingested a real dataset into a database (PostgreSQL or SQLite).

• Transformed & cleaned data using Pandas & SQL.

• Automated data processing with Apache Airflow.

• Created a reproducible data pipeline for ML.

📌 What's Next?
If you have more time, try these:

• Deploy your pipeline on the cloud (AWS/GCP/Azure).

• Use Kafka for real-time data ingestion.

• Implement a Feature Store with Feast.
