Document 4
{Bcsg : 0034}
Submitted by: Archit Agrawal
Branch/Sec: B. Tech CS (AIML) - CC
Roll No.: 12
University Roll No.: 2415500096

Submitted to: Dr. Sachin Kumar Yadav (Assistant Professor)
Serial Number   Experiment Name                                                             Date    Signature
1   Data modeling using ER diagram
2   Data preprocessing - cleaning data, noise elimination, feature selection and dimension reduction
3   Schema design for OLAP and OLTP
4   ETL pipeline implementation - extract data from various sources, transform data using cleaning, aggregation, and enrichment techniques, and load data into different storage solutions
5   Building ETL pipelines - extracting data from APIs and databases
6   Data transformation
7   Setting up Airflow
8   Creating a Directed Acyclic Graph using Airflow
9   Data cleaning
10  Data storage using NoSQL
Experiment 1
Data modeling using ER diagram
[ER diagram created in StarUML. The entity attributes shown include: email: string, phone: string, address: string, admission_year: int; credits: int, semester_offered: string, department_id: int «FK»; designation: string; building: string; and semester: string, year: int, grade: string.]

A Course is offered by one Department.

⬛ How to Create in StarUML: Open StarUML > File > New > ERD Model, add each entity with its attributes, connect the entities with relationships, and document the design choices.
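As a cross-check of the design choices, the entities suggested by the attributes above (students, courses, departments) can be sketched as relational tables. The following is a minimal sketch using SQLAlchemy; the class names, primary keys, and any attributes not listed above are assumptions for illustration only.

from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Department(Base):
    __tablename__ = 'department'
    department_id = Column(Integer, primary_key=True)
    name = Column(String)        # assumed attribute
    building = Column(String)

class Course(Base):
    __tablename__ = 'course'
    course_id = Column(Integer, primary_key=True)   # assumed key
    credits = Column(Integer)
    semester_offered = Column(String)
    # A Course is offered by one Department (the FK shown in the diagram)
    department_id = Column(Integer, ForeignKey('department.department_id'))
    department = relationship('Department')

class Student(Base):
    __tablename__ = 'student'
    student_id = Column(Integer, primary_key=True)   # assumed key
    email = Column(String)
    phone = Column(String)
    address = Column(String)
    admission_year = Column(Integer)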
Experiment 2
Data preprocessing - Cleaning data, noise elimination, feature selection and dimension reduction

📊 Steps for Data Preprocessing (with Examples)
1. Data Cleaning

import pandas as pd

data = pd.read_csv('dataset.csv')   # hypothetical input file

data.isnull().sum()    # count missing values per column
data.head()            # inspect the first few rows
data.tail()            # inspect the last few rows
data.info()            # column types and non-null counts
data.describe()        # summary statistics for numeric columns
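The inspection calls above only reveal problems; a typical next step is to actually handle missing values and duplicates. A minimal sketch, assuming hypothetical column names:

data = data.dropna(how='all')                            # drop rows where every value is missing
data['age'] = data['age'].fillna(data['age'].median())   # numeric column: fill with the median
data['city'] = data['city'].fillna('Unknown')            # categorical column: fill with a placeholder
data = data.drop_duplicates()                            # remove exact duplicate rows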
2. Noise Elimination

import matplotlib.pyplot as plt

# Compare each selected column before and after cleaning, side by side
columns_to_visualize = ['col1', 'col2', 'col3']   # hypothetical column names
fig, axes = plt.subplots(len(columns_to_visualize), 2, figsize=(12, 20))
fig.suptitle('Data Visualization: Uncleaned vs Cleaned Data', fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
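The noise elimination itself happens before the comparison plot. One common approach (an illustration, not necessarily the method used here) is to drop values that fall outside the interquartile range of each numeric column:

def remove_outliers_iqr(df, column):
    # Keep only rows whose value lies within 1.5 * IQR of the quartiles
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

for col in data.select_dtypes(include='number').columns:
    data = remove_outliers_iqr(data, col)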
1. Extract Data from API

import requests

def get_data_from_api(api_url, headers=None, params=None):
    response = requests.get(api_url, headers=headers, params=params)
    return response.json()  # or response.text for non-JSON data

def fetch_all_pages(api_url, headers=None):
    data = []
    page = 1
    while True:
        params = {'page': page}
        response = requests.get(api_url, headers=headers, params=params)
        result = response.json()
        data.extend(result['data'])  # adjust depending on the API response structure
        if 'next' not in result or result['next'] is None:
            break
        page += 1
    return data
C. Step 4: Store the Extracted Data Temporarily

import pandas as pd

data = fetch_all_pages(api_url, headers)
df = pd.DataFrame(data)
df.to_csv('extracted_data.csv', index=False)  # store as CSV temporarily
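The CSV is only a staging area; the same DataFrame can also be loaded into a database, matching the "load data into different storage solutions" step. A minimal sketch using SQLAlchemy, where the connection string and table name are placeholders:

from sqlalchemy import create_engine

# Placeholder connection string - replace with real credentials
engine = create_engine('postgresql://your_user:your_password@localhost/your_db')
df.to_sql('extracted_data', engine, if_exists='replace', index=False)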
Step 1: Connect to the Database

import psycopg2

def connect_to_postgres():
    conn = psycopg2.connect(
        dbname="your_db",
        user="your_user",
        password="your_password",
        host="localhost"
    )
    return conn
Step 2: Write SQL Query
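For example, the query result can be pulled straight into pandas through the connection defined above; the table name below is a placeholder:

conn = connect_to_postgres()
db_df = pd.read_sql("SELECT * FROM your_table;", conn)   # placeholder table name
conn.close()
db_df.to_csv('db_extracted_data.csv', index=False)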
Extract from API:

api_data = fetch_all_pages(api_url, headers)
api_df = pd.DataFrame(api_data)
api_df.to_csv('api_extracted_data.csv', index=False)
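With both sources staged as DataFrames, a simple transformation step can clean, aggregate, and enrich them before loading, as described in the experiment list. A sketch in which the column names and join key are hypothetical:

# Cleaning: normalise column names and drop duplicates
api_df.columns = [c.strip().lower() for c in api_df.columns]
api_df = api_df.drop_duplicates()

# Aggregation: e.g. total amount per customer (hypothetical columns)
summary_df = api_df.groupby('customer_id', as_index=False)['amount'].sum()

# Enrichment: join in attributes from the database extract (db_df above)
enriched_df = summary_df.merge(db_df, on='customer_id', how='left')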
Example of a basic Airflow DAG for an ETL pipeline:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def extract_data():
    # Put the extraction code here
    pass

def transform_data():
    # Put the transformation code here
    pass

def load_data():
    # Put the loading code here
    pass

default_args = {
    'owner': 'airflow',
    'retries': 1,
    'start_date': datetime(2023, 1, 1),
}

dag = DAG('etl_pipeline', default_args=default_args, schedule_interval='@daily')

extract_task = PythonOperator(task_id='extract_data', python_callable=extract_data, dag=dag)
transform_task = PythonOperator(task_id='transform_data', python_callable=transform_data, dag=dag)
load_task = PythonOperator(task_id='load_data', python_callable=load_data, dag=dag)
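The three tasks also need an execution order. Chaining them with Airflow's >> operator, the same pattern used in the "Set Task Dependencies" step later, completes the pipeline:

extract_task >> transform_task >> load_task   # extract, then transform, then load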
Create an admin user for the Airflow web interface:

airflow users create \
    --username admin \
    --firstname First \
    --lastname Last \
    --email [email protected] \
    --role Admin \
    --password adminpassword
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# Function for extracting data (e.g., from an API or database)
def extract_data():
    # Put the extraction code here (e.g., API call or database query)
    pass

def transform_data():
    print("Transforming data...")  # Replace with actual transformation code

def load_data():
    print("Loading data...")  # Replace with actual loading code

# Default arguments (as defined in the earlier example)
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'start_date': datetime(2023, 1, 1),
}

# Define the DAG
dag = DAG(
    'etl_pipeline',
    default_args=default_args,
    schedule_interval='@daily',  # Run the DAG daily
    catchup=False  # Avoid backfilling if the DAG is triggered late
)

# Define tasks
extract_task = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag
)

transform_task = PythonOperator(
    task_id='transform_data',
    python_callable=transform_data,
    dag=dag
)

load_task = PythonOperator(
    task_id='load_data',
    python_callable=load_data,
    dag=dag
)
Initialize the Airflow metadata database:

airflow db init
extract_task = PythonOperator(
    task_id='extract_data',
    python_callable=extract_data,
    dag=dag
)

def load_data():
    # Code to load the cleaned data to a target location (e.g., database, CSV)
    print("Loading data...")

load_task = PythonOperator(
    task_id='load_data',
    python_callable=load_data,
    dag=dag
)
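The dependency line below also references a clean_task that is not defined above; a minimal sketch of how it would look, assuming a clean_data callable:

def clean_data():
    # Code to clean the extracted data (e.g., drop nulls and duplicates)
    print("Cleaning data...")

clean_task = PythonOperator(
    task_id='clean_data',
    python_callable=clean_data,
    dag=dag
)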
Set Task Dependencies:
extract_task >> clean_task >> load_task