Data Engineering Assignment Report
Background
The healthcare analytics project applies data engineering techniques to data from the US Health
Insurance Marketplace. The aim is to understand and analyze the health and dental plans available
to consumers. The data comes from the Health Insurance Marketplace Public Use Files, which
provide a rich source of insight into the plans offered through the Marketplace.
Objectives
DATA SOURCE
Dataset Information
The Health Insurance Marketplace Public Use Files, published by the Centers for Medicare &
Medicaid Services (CMS), contain information on the health and dental plans offered to individuals
and small businesses through the US Marketplace. They are a valuable resource for understanding
the insurance options available.
The processed version of the data includes six CSV files with the following components:
BenefitsCostSharing.csv
BusinessRules.csv
Network.csv
PlanAttributes.csv
Rate.csv
ServiceArea.csv
Operating System:
Apache Airflow runs on Linux and macOS; on Windows it is typically run inside Docker or the
Windows Subsystem for Linux (WSL).
Python Virtual Environment (Optional but Recommended):
It's good practice to create a virtual environment to isolate the dependencies of your project.
Bash command
python3 -m venv myenv
source myenv/bin/activate  # On Unix
Additional Tools (if needed for specific tasks):
Database:
If you plan to use a specific database backend (e.g., PostgreSQL, MySQL, SQLite), you
may need to install the corresponding database software and Python drivers.
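For example, if PostgreSQL is chosen as the database backend, a quick connectivity check can
confirm that the driver and SQLAlchemy are set up correctly. The sketch below is illustrative only;
the host, database name, and credentials are placeholders, not values from this project.
Python script
# Minimal sketch: verify that SQLAlchemy can reach a PostgreSQL database.
# Requires the psycopg2 driver (pip install psycopg2-binary).
# The connection string below is a placeholder.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://airflow_user:airflow_pass@localhost:5432/marketplace"
)

with engine.connect() as conn:
    # Simple round-trip query to confirm the connection works.
    print(conn.execute(text("SELECT 1")).scalar())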
Additional Python Libraries:
Depending on the specific data processing tasks in your ETL process, you may need
to install additional Python libraries. For example:
Bash command
pip install pandas numpy sqlalchemy
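As an illustration, pandas can be used to load and inspect one of the processed CSV files. The
sketch below assumes Rate.csv has been downloaded to a local data/ directory; the path is a
placeholder.
Python script
# Minimal sketch: load one of the Marketplace CSV files and report its shape.
# The path "data/Rate.csv" is an assumed local location, not part of the project.
import pandas as pd

rates = pd.read_csv("data/Rate.csv")
print(rates.shape)              # number of rows and columns
print(rates.columns.tolist())   # column names for a quick sanity check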
Other Dependencies:
Some tasks might require additional tools or libraries. Ensure you install them
based on your project requirements.
Docker Integration:
1. Install Docker:
Ensure Docker is installed on your system. You can download and install Docker from
the official Docker website.
2. Create Dockerfile:
Create a Dockerfile in your project directory to define the Docker image for your Apache
Airflow environment. An example Dockerfile might look like this:
FROM apache/airflow:2.1.2
# Per the official image guidance, extra Python packages are installed as the
# airflow user rather than root.
USER airflow
RUN pip install pandas numpy sqlalchemy
3. Build Docker Image:
In the same directory as your Dockerfile, run the following command to build the Docker
image:
Bash command
docker build -t my_airflow_image .
4. Run Apache Airflow in Docker Container:
Start Apache Airflow within a Docker container using the built image:
Bash command
docker run -p 8080:8080 my_airflow_image
Access the Airflow UI at http://localhost:8080.
Logging:
Integrate logging in Python scripts to record important events, errors, or relevant
information. This facilitates troubleshooting and monitoring of the ETL process.
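A minimal sketch of how logging might be wired into an ETL script is shown below; the logger
name, function, and messages are illustrative rather than taken from the project code.
Python script
# Minimal sketch: configure logging for an ETL step.
# The logger name and messages are illustrative placeholders.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("marketplace_etl")

def transform_rates():
    logger.info("Starting transformation of Rate.csv")
    try:
        pass  # transformation logic would go here
    except Exception:
        logger.exception("Transformation failed")
        raise
    logger.info("Transformation completed successfully")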
Monitoring:
Set up monitoring tools within Apache Airflow to track the progress of the workflow. This
includes checking for successful task completion, identifying failures, and triggering alerts
if needed.
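Alerting can also be attached to the DAG itself through Airflow's default_args, as in the sketch
below; the callback body and email address are placeholders.
Python script
# Minimal sketch: failure alerting via default_args.
# The email address and callback behaviour are placeholders.
from datetime import timedelta

def notify_failure(context):
    # Airflow passes a context dict with details of the failed task instance.
    print(f"Task {context['task_instance'].task_id} failed")

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["alerts@example.com"],
    "on_failure_callback": notify_failure,
}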
APACHE AIRFLOW
Apache Airflow is an open-source platform that orchestrates complex workflows represented as
Directed Acyclic Graphs (DAGs). Primarily employed for managing and scheduling ETL (Extract,
Transform, Load) processes, it streamlines the flow of data operations.
Key Concepts:
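The deployment steps below refer to a DAG script. A minimal sketch of such a script is given
here; the DAG ID, schedule, and task callables are hypothetical placeholders rather than the
project's actual pipeline.
Python script
# Minimal sketch of a DAG definition for an ETL workflow.
# The DAG ID, schedule, and task bodies are hypothetical placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # e.g. read the Marketplace CSV files

def transform():
    pass  # e.g. clean and reshape the data

def load():
    pass  # e.g. write results to the target database

with DAG(
    dag_id="marketplace_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task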
Save the script and place it in the DAGs directory configured in your Airflow installation.
Airflow will automatically detect and schedule the DAG based on the defined
schedule_interval.
Monitor the progress and logs through the Airflow web UI (http://localhost:8080 by
default).
LEARNING OUTCOMES
1. ETL Processes Understanding:
Gain in-depth knowledge of Extract, Transform, Load (ETL) processes and the crucial role
of orchestrating workflows in data engineering.
2. Apache Airflow Mastery:
Develop proficiency in utilizing Apache Airflow as a powerful tool for workflow
orchestration.
Learn to define Directed Acyclic Graphs (DAGs) for complex workflows.
3. Python Scripting for Data Processing:
Hone skills in crafting Python scripts tailored for data processing tasks within the Apache
Airflow framework.
Explore the use of Python operators for diverse data manipulation operations.
4. Data Quality Assurance:
Implement robust data quality checks to uphold the accuracy and reliability of processed
data.
5. Framework Integration Expertise:
Gain hands-on experience in seamlessly integrating Apache Airflow with other frameworks,
tools, or services, enhancing the overall ETL pipeline.
6. Docker Integration Proficiency:
Learn to use Docker for containerization, ensuring consistent and reproducible
environments.
7. Version Control Competence:
Utilize version control tools like Git for efficient code management and collaborative
teamwork.
8. Documentation Skills:
Practice creating thorough project documentation, including README files and reports, to
effectively communicate project details.
9. Project Management Aptitude:
Acquire project management skills by effectively organizing and overseeing the
development, testing, and deployment phases of ETL projects.
10. Troubleshooting and Debugging Skills:
Develop expertise in identifying and resolving issues within Apache Airflow workflows,
Python scripts, and during the ETL process.
11. Collaboration and Communication Excellence:
Enhance collaborative and communication skills through interactions with team members,
stakeholders, and the broader data engineering community.
12. Practical Experience Focus:
Obtain hands-on, practical experience in constructing end-to-end ETL workflows, providing
a realistic grasp of data engineering challenges and effective solutions.