
Data Engineering Lab

{BCSG: 0034}

Submitted by: Archit Agrawal
Branch/Sec: B. Tech CS(AIML) - CC
Roll No.: 12
University Roll No.: 2415500096

Submitted to: Dr. Sachin Kumar Yadav (Assistant Professor)
List of Experiments (Serial Number, Experiment Name, Date, Signature):

1. Data modeling using ER diagram
2. Data preprocessing - Cleaning data, noise elimination, feature selection and dimension reduction
3. Schema design for OLAP and OLTP
4. ETL Pipeline implementation - Extract data from various sources. Transform data using cleaning, aggregation, and enrichment techniques. Load data into different storage solutions.
5. Building ETL Pipelines - Extracting data from API and databases
6. Data Transformation
7. Setting up Airflow
8. Creating Directed Acyclic Graph using Airflow
9. Data cleaning
10. Data storage using NoSQL
Experiment 1
Data modeling using ER diagram

● Entities & Attributes (Enhanced Version)

1. Student
   student_id: int «PK»
   name: string
   gender: enum {Male, Female, Other}
   dob: date
   email: string
   phone: string
   address: string
   admission_year: int
   department_id: int «FK»

2. Course
   course_id: int «PK»
   title: string
   credits: int
   semester_offered: string
   department_id: int «FK»

3. Faculty
   faculty_id: int «PK»
   name: string
   email: string
   phone: string
   designation: string
   department_id: int «FK»

4. Department
   department_id: int «PK»
   name: string
   building: string
   hod_id: int «FK» (optional, can be a faculty_id)

5. Enrollment
   enrollment_id: int «PK»
   student_id: int «FK»
   course_id: int «FK»
   semester: string
   year: int
   grade: string

6. Teaches
   faculty_id: int «FK»
   course_id: int «FK»
   semester: string
   year: int

● Relationships

A Student belongs to one Department.
A Faculty belongs to one Department.
A Course is offered by one Department.
A Student enrolls in many Courses via Enrollment.
A Faculty teaches many Courses via Teaches.
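
To make the model concrete, the sketch below shows how two of the entities above (Department and Student) could be declared as SQLAlchemy models in Python. This is only an illustrative sketch: SQLAlchemy is not used elsewhere in these experiments, and the remaining entities (Course, Faculty, Enrollment, Teaches) would follow the same pattern.

from sqlalchemy import Column, Integer, String, Date, Enum, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Department(Base):
    __tablename__ = 'department'
    department_id = Column(Integer, primary_key=True)  # «PK»
    name = Column(String)
    building = Column(String)
    hod_id = Column(Integer)  # optional FK to faculty.faculty_id (Faculty omitted in this sketch)

class Student(Base):
    __tablename__ = 'student'
    student_id = Column(Integer, primary_key=True)  # «PK»
    name = Column(String)
    gender = Column(Enum('Male', 'Female', 'Other', name='gender_enum'))
    dob = Column(Date)
    email = Column(String)
    phone = Column(String)
    address = Column(String)
    admission_year = Column(Integer)
    department_id = Column(Integer, ForeignKey('department.department_id'))  # «FK»
    department = relationship('Department')  # a Student belongs to one Department

The M:N relationships (enrolls, teaches) map to the Enrollment and Teaches tables, each carrying foreign keys to both sides, exactly as listed above.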

● Improvements Over the Given Diagram

Added a year attribute in Enrollment and Teaches for better tracking.
Used an enum for gender for data validation.
Added admission_year to Student for analytics.
hod_id in Department allows tracking the Head of Department.
Added designation to Faculty to make the schema more informative.

⬛ How to Create in StarUML

1. Open StarUML > File > New > ERD Model.
2. Use Entity from the palette to create the six entities above.
3. Define attributes for each entity (right-click > Add Attribute).
4. Use Relationship to connect entities with proper cardinality (1:1, 1:N).
5. Label relationships like enrolls, teaches, belongs to, etc.
6. Use Note elements to explain certain design choices.
📊 Steps for Data Preprocessing (with Examples)

1. Data Cleaning

● Handling Missing Values

Example used: SimpleImputer(strategy='mean') to replace nulls with mean values.

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
data_cleaned = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

● Checking Missing Data

data.isnull().sum()

● Initial Data Check

data.head()
data.tail()
data.info()
data.describe()

2. Noise Elimination

● Visual Inspection for Noise & Outliers

import matplotlib.pyplot as plt

columns_to_visualize = ['CRIM', 'ZN', 'INDUS', 'AGE', 'LSTAT', 'MEDV']

fig, axes = plt.subplots(len(columns_to_visualize), 2, figsize=(12, 20))
fig.suptitle('Data Visualization: Uncleaned vs Cleaned Data', fontsize=16)

for i, col in enumerate(columns_to_visualize):
    axes[i, 0].hist(data[col].dropna(), bins=30, color='skyblue', edgecolor='black')
    axes[i, 0].set_title(f'Uncleaned: {col}')
    axes[i, 0].set_xlabel(col)
    axes[i, 0].set_ylabel('Frequency')

    axes[i, 1].hist(data_cleaned[col], bins=30, color='salmon', edgecolor='black')
    axes[i, 1].set_title(f'Cleaned: {col}')
    axes[i, 1].set_xlabel(col)
    axes[i, 1].set_ylabel('Frequency')

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
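
The plots above only inspect the noise visually. One possible way to actually eliminate the outliers (not shown in the original listing) is an interquartile-range filter over the same columns; the 1.5 factor and the choice of columns are assumptions in this sketch.

# Minimal sketch: drop rows outside the IQR fences for the selected columns
def remove_outliers_iqr(df, columns, k=1.5):
    cleaned = df.copy()
    for col in columns:
        q1, q3 = cleaned[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        cleaned = cleaned[cleaned[col].between(lower, upper)]
    return cleaned

data_no_noise = remove_outliers_iqr(data_cleaned, columns_to_visualize)
print(f"Rows before: {len(data_cleaned)}, after outlier removal: {len(data_no_noise)}")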

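The experiment title also mentions feature selection and dimension reduction, which the steps above do not cover. Below is a minimal sketch using scikit-learn, assuming the cleaned housing-style frame from above and treating 'MEDV' as the target column; the variance threshold and the number of components are assumptions.

from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Feature selection: drop near-constant features
X = data_cleaned.drop(columns=['MEDV'])  # 'MEDV' assumed to be the prediction target
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Dimension reduction: standardize, then project onto a few principal components
X_scaled = StandardScaler().fit_transform(X_selected)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
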
1. Extract Data from API

Step 1: Set Up the API URL and Headers

import requests

# API URL and headers
api_url = "https://api.example.com/data"
headers = {'Authorization': 'Bearer your_token'}
Step 2: Make the API Request

def get_data_from_api(api_url, headers=None, params=None):
    response = requests.get(api_url, headers=headers, params=params)
    return response.json()  # or response.text for non-JSON data
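
In practice the request above can hang or fail silently on HTTP errors. A hedged variant (the function name and timeout value are illustrative, not part of the original) adds a timeout and raises on error status codes:

def get_data_from_api_safe(api_url, headers=None, params=None, timeout=30):
    response = requests.get(api_url, headers=headers, params=params, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.json()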

Step 3: Handle Pagination (if needed)

def fetch_all_pages(api_url, headers=None):
    data = []
    page = 1
    while True:
        params = {'page': page}
        response = requests.get(api_url, headers=headers, params=params)
        result = response.json()
        data.extend(result['data'])  # Adjust depending on API response structure
        if 'next' not in result or result['next'] is None:
            break
        page += 1
    return data
Step 4: Store the Extracted Data Temporarily

import pandas as pd

data = fetch_all_pages(api_url, headers)
df = pd.DataFrame(data)
df.to_csv('extracted_data.csv', index=False)  # Store as CSV temporarily

2. Extract Data from Database

Step 1: Set Up the Database Connection

import psycopg2

def connect_to_postgres():
    conn = psycopg2.connect(
        dbname="your_db",
        user="your_user",
        password="your_password",
        host="localhost"
    )
    return conn
Step 2: Write the SQL Query

SELECT * FROM sales WHERE sale_date >= '2023-01-01';

Step 3: Execute the Query

def extract_data_from_db(query, conn):
    cur = conn.cursor()
    cur.execute(query)
    data = cur.fetchall()
    cur.close()
    return data

Step 4: Store the Data Temporarily

query = "SELECT * FROM sales WHERE sale_date >= '2023-01-01';"
conn = connect_to_postgres()
data = extract_data_from_db(query, conn)
df = pd.DataFrame(data, columns=['SaleID', 'Product', 'Amount', 'SaleDate'])
df.to_csv('db_extracted_data.csv', index=False)  # Store as CSV temporarily
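
As an alternative sketch (not part of the original steps), pandas can run the query directly against the open connection so the column names come from the database instead of being hard-coded; recent pandas versions warn about raw DBAPI connections, but the call works for a quick extract.

df = pd.read_sql("SELECT * FROM sales WHERE sale_date >= '2023-01-01';", conn)
df.to_csv('db_extracted_data.csv', index=False)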
Step 5: Handle Large Datasets (Optional, with Pagination)

def extract_in_chunks(query, conn, chunk_size=1000):
    offset = 0
    while True:
        paginated_query = f"{query} LIMIT {chunk_size} OFFSET {offset}"
        data = extract_data_from_db(paginated_query, conn)
        if not data:
            break
        yield data  # Use 'yield' to return data in chunks
        offset += chunk_size
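
The generator above is defined but never called in the original listing. A usage sketch follows (column names assumed from the earlier sales example); note that the base query must not end with a semicolon, because LIMIT/OFFSET is appended to it.

# Collect the chunks into a single DataFrame
conn = connect_to_postgres()
all_rows = []
for chunk in extract_in_chunks("SELECT * FROM sales WHERE sale_date >= '2023-01-01'",
                               conn, chunk_size=1000):
    all_rows.extend(chunk)
conn.close()

chunked_df = pd.DataFrame(all_rows, columns=['SaleID', 'Product', 'Amount', 'SaleDate'])
chunked_df.to_csv('db_extracted_data_chunked.csv', index=False)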

# Extract from API
api_data = fetch_all_pages(api_url, headers)
api_df = pd.DataFrame(api_data)
api_df.to_csv('api_extracted_data.csv', index=False)

# Extract from Database
db_conn = connect_to_postgres()
db_query = "SELECT * FROM sales WHERE sale_date >= '2023-01-01';"
db_data = extract_data_from_db(db_query, db_conn)
db_df = pd.DataFrame(db_data, columns=['SaleID', 'Product', 'Amount', 'SaleDate'])
db_df.to_csv('db_extracted_data.csv', index=False)
Example of a basic Airflow DAG for an ETL pipeline:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def extract_data():
    # Put the extraction code here
    pass

def transform_data():
    # Put the transformation code here
    pass

def load_data():
    # Put the loading code here
    pass

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'start_date': datetime(2023, 1, 1),
}

dag = DAG('etl_pipeline', default_args=default_args, schedule_interval='@daily')

extract_task = PythonOperator(task_id='extract_data', python_callable=extract_data, dag=dag)
transform_task = PythonOperator(task_id='transform_data', python_callable=transform_data, dag=dag)
load_task = PythonOperator(task_id='load_data', python_callable=load_data, dag=dag)

extract_task >> transform_task >> load_task  # Define task dependencies

Final Example: Full ETL Flow

import requests
import psycopg2
import pandas as pd

# Extract Data from API
def fetch_data_from_api(api_url, headers=None):
    response = requests.get(api_url, headers=headers)
    return response.json()

# Extract Data from Database
def connect_to_postgres():
    return psycopg2.connect(dbname="your_db", user="your_user",
                            password="your_password", host="localhost")

def extract_data_from_db(query, conn):
    cur = conn.cursor()
    cur.execute(query)
    return cur.fetchall()

# Main ETL Extraction
def main_etl():
    # API Extraction
    api_data = fetch_data_from_api("https://api.example.com/data",
                                   headers={'Authorization': 'Bearer your_token'})
    api_df = pd.DataFrame(api_data)
    api_df.to_csv('api_extracted_data.csv', index=False)

    # Database Extraction
    db_conn = connect_to_postgres()
    db_query = "SELECT * FROM sales WHERE sale_date >= '2023-01-01';"
    db_data = extract_data_from_db(db_query, db_conn)
    db_df = pd.DataFrame(db_data, columns=['SaleID', 'Product', 'Amount', 'SaleDate'])
    db_df.to_csv('db_extracted_data.csv', index=False)

if __name__ == "__main__":
    main_etl()

# Example: Enrich with external data (join with product categories)
external_data = pd.read_csv('product_categories.csv')
df = df.merge(external_data, on='ProductID', how='left')
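
The experiment list also names aggregation as a transformation technique. A minimal sketch, assuming the sales columns ('Product', 'Amount') used in the extraction examples above:

# Aggregate sales per product (column names assumed from the earlier examples)
sales_summary = (
    df.groupby('Product', as_index=False)
      .agg(total_amount=('Amount', 'sum'), num_sales=('Amount', 'count'))
)
print(sales_summary.head())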

Create an admin user for the Airflow UI:

airflow users create \
    --username admin \
    --firstname First \
    --lastname Last \
    --email [email protected] \
    --role Admin \
    --password adminpassword
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Function for extracting data (e.g., from API or database)
def extract_data():
    # Put the extraction code here (e.g., API call or database query)
    pass

# Function for transforming data
def transform_data():
    # Put the data transformation code here (e.g., cleaning, aggregation)
    pass

# Function for loading data (e.g., store transformed data)
def load_data():
    # Put the loading code here (e.g., save to database, file)
    pass

# Define default arguments for the DAG
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'start_date': datetime(2023, 1, 1),  # Start date for the DAG
}

# Define the DAG
dag = DAG(
    'etl_pipeline',
    default_args=default_args,
    schedule_interval='@daily',  # Run the DAG daily
    catchup=False  # Avoid backfilling if the DAG is triggered late
)

# Define the tasks and their dependencies
extract_task = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data,
    dag=dag
)

load_task = PythonOperator(
    task_id='load_data',
    python_callable=load_data,
    dag=dag
)

# Set task dependencies
extract_task >> transform_task >> load_task
Step 5: Run the Airflow Web Server

Start the Airflow web server to access the UI for monitoring the pipeline:

airflow webserver --port 8080

Visit http://localhost:8080 to access the Airflow UI.

Step 6: Run the Airflow Scheduler

Start the scheduler that will trigger the DAG based on the schedule interval:

airflow scheduler

Step 7: Trigger the DAG Manually (optional)

You can trigger the DAG manually from the Airflow UI or by using the CLI:

airflow dags trigger etl_pipeline

Step 8: Monitor the DAG

Monitor task execution in the Airflow UI (http://localhost:8080). View logs, track task status, and check for errors.

Final Example of DAG Structure

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Example of ETL steps as functions
def extract_data():
    print("Extracting data...")  # Replace with actual extraction code

def transform_data():
    print("Transforming data...")  # Replace with actual transformation code

def load_data():
    print("Loading data...")  # Replace with actual loading code

# Default args for the DAG
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'start_date': datetime(2023, 1, 1),
}

# Define the DAG
dag = DAG(
    'etl_pipeline',
    default_args=default_args,
    schedule_interval='@daily',  # Run daily
    catchup=False
)

# Define tasks
extract_task = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data,
    dag=dag
)

load_task = PythonOperator(
    task_id='load_data',
    python_callable=load_data,
    dag=dag
)

# Set task dependencies
extract_task >> transform_task >> load_task
1. Set Up Apache Airflow

Before creating a DAG, make sure Apache Airflow is installed and configured properly:

1. Install Apache Airflow:

pip install apache-airflow

2. Initialize the Airflow database (Airflow uses a metadata database to track tasks):

airflow db init

3. Create a user for accessing the Airflow UI:

airflow users create \
    --username admin \
    --firstname First \
    --lastname Last \
    --email [email protected] \
    --role Admin \
    --password adminpassword

Step 1: Define the DAG

Create a Python file to define your DAG. The file should be located in the Airflow DAG folder (e.g., ~/airflow/dags/).

Example DAG file: etl_dag.py

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import pandas as pd
import numpy as np

# Data cleaning function
def clean_data():
    # Example data for cleaning
    data = {
        'Name': ['Alice', 'Bob', 'Charlie', np.nan, 'Eve'],
        'Age': [25, np.nan, 30, 29, 22],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']  # last two values assumed; the original listing was truncated here
    }
    df = pd.DataFrame(data)

    # Remove rows where any column has NaN values
    df_cleaned = df.dropna()

    # Alternatively, replace any missing value with a default value (example)
    # df_cleaned = df.fillna({'Age': 0, 'City': 'Unknown'})

    # Print the cleaned data
    print("Cleaned Data:")
    print(df_cleaned)
    return df_cleaned

# Default args for the DAG
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'start_date': datetime(2023, 1, 1),
}

# Define the DAG
dag = DAG(
    'etl_pipeline_with_cleaning',
    default_args=default_args,
    description='A simple ETL pipeline with data cleaning',
    schedule_interval='@daily',  # Run daily
    catchup=False,
)

# Task: Data Cleaning
clean_task = PythonOperator(
    task_id='clean_data',
    python_callable=clean_data,
    dag=dag
)

4. Running the DAG

Start the Airflow web server:

airflow webserver --port 8080

Open your browser and visit http://localhost:8080 to access the Airflow UI.

Extract Data (example API or database extraction):

def extract_data():
    # Code to extract data from an API or database
    print("Extracting data...")
    return "Extracted Data"

extract_task = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag
)

Load Data (example loading into a database or CSV file):

def load_data():
    # Code to load the cleaned data to a target location (e.g., database, CSV)
    print("Loading data...")

load_task = PythonOperator(
    task_id='load_data',
    python_callable=load_data,
    dag=dag
)

Set Task Dependencies:

extract_task >> clean_task >> load_task
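
In the DAG above, clean_data builds its own example data rather than receiving anything from an upstream task. If the tasks should actually pass data to each other, one common option is XCom: the return value of a PythonOperator callable is pushed to XCom automatically, and a downstream task can pull it. The sketch below assumes Airflow 2.x, reuses the imports from etl_dag.py above, and is illustrative only; only small, JSON-serializable payloads belong in XCom.

def extract_data():
    # The return value is pushed to XCom automatically
    return [{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': None}]  # placeholder payload

def clean_data(ti):
    # Airflow passes the TaskInstance ('ti') because the parameter name matches a context key
    rows = ti.xcom_pull(task_ids='extract_data')
    df = pd.DataFrame(rows).dropna()
    print("Cleaned Data:")
    print(df)
    return df.to_dict('records')

For larger datasets, it is more common to write intermediate results to a file or staging table and pass only the path or key through XCom.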
