
Airflow Assignment: GCP Dataproc PySpark Job

Objective: Automate a workflow using Apache Airflow to process daily incoming CSV files from a GCP bucket using a Dataproc PySpark job and save the transformed data into a Hive table.

Tasks:

1. Setup:

● Create a Google Cloud Platform (GCP) bucket to store the daily CSV files.
● Set up an Apache Airflow environment and ensure the GCP and Dataproc plugins/hooks are available.

2. DAG Configuration:
● Create a new DAG named gcp_dataproc_pyspark_dag.
● Schedule the DAG to run once a day.
● Ensure catchup is set to False: catchup=False (see the sketch below).
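
A minimal sketch of the DAG shell, assuming the apache-airflow-providers-google package is installed; the start date is illustrative, and on Airflow 2.4+ the parameter schedule can be used in place of schedule_interval:

```python
# Minimal DAG shell; start_date is illustrative and should be adjusted.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="gcp_dataproc_pyspark_dag",
    schedule_interval="@daily",       # run once a day
    start_date=datetime(2024, 1, 1),  # illustrative start date
    catchup=False,                    # do not backfill missed runs
) as dag:
    ...  # tasks from the following sections go here
```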

3. File Sensor Task:


● Add a GCSObjectExistenceSensor task to check for the presence of the daily CSV file in the GCP bucket.
● Configure the task to poke for the file every 5 minutes for a maximum of 12 hours (a sketch follows below).
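
A hedged sketch of the sensor task; the bucket and object names are placeholders, and the connection ID assumes the default Google Cloud connection configured in the Airflow UI:

```python
# Sensor sketch: waits for the day's CSV to land in GCS.
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

wait_for_csv = GCSObjectExistenceSensor(
    task_id="wait_for_daily_csv",
    bucket="your-gcs-bucket",                    # placeholder bucket name
    object="incoming/data_{{ ds_nodash }}.csv",  # placeholder path, templated per run date
    poke_interval=5 * 60,                        # poke every 5 minutes
    timeout=12 * 60 * 60,                        # give up after 12 hours
    google_cloud_conn_id="google_cloud_default",
)
```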
G

4. Dataproc Cluster Creation Task:

● Use the DataprocClusterCreateOperator to create a new Dataproc cluster.
● Define and configure the cluster specifications as needed (an example configuration is sketched below).
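
In current releases of the Google provider this operator is exposed as DataprocCreateClusterOperator; the project, region, and machine settings below are placeholders:

```python
# Cluster creation sketch; all names and sizes are placeholders.
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

PROJECT_ID = "your-gcp-project"           # placeholder
REGION = "us-central1"                    # placeholder
CLUSTER_NAME = "airflow-pyspark-cluster"  # placeholder

CLUSTER_CONFIG = {
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_size_gb": 50},
    },
    "worker_config": {
        "num_instances": 2,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_size_gb": 50},
    },
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name=CLUSTER_NAME,
    cluster_config=CLUSTER_CONFIG,
)
```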

5. PySpark Job Execution Task:


● Upload your PySpark script to a Cloud Storage (GCS) bucket.
● Use the DataProcPySparkOperator to execute the PySpark script on the created Dataproc cluster (see the sketches after this list).
● The PySpark script should:
○ Read the daily CSV file from the GCP bucket.
○ Perform some logical transformations on the data.
○ Write the transformed data into a Hive table.
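
One way to submit the job from Airflow, assuming the current Google provider where PySpark jobs are submitted through DataprocSubmitJobOperator; the script URI is a placeholder and the project, region, and cluster names reuse the placeholders defined earlier:

```python
# Job submission sketch; the GCS script path is a placeholder.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": "gs://your-gcs-bucket/scripts/transform_daily_csv.py"
    },
}

submit_pyspark = DataprocSubmitJobOperator(
    task_id="run_pyspark_transform",
    project_id=PROJECT_ID,
    region=REGION,
    job=PYSPARK_JOB,
)
```

And a minimal sketch of what the PySpark script itself might contain; the bucket path, transformation, and Hive table name are all placeholders to be replaced with your own:

```python
# transform_daily_csv.py -- illustrative only; replace paths, transformations,
# and the Hive table name with your own.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily_csv_to_hive")
    .enableHiveSupport()  # required to write managed Hive tables
    .getOrCreate()
)

# Read the daily CSV file from the GCS bucket (placeholder path).
df = spark.read.option("header", True).csv("gs://your-gcs-bucket/incoming/")

# Example transformation: drop duplicates and add a load timestamp.
transformed = df.dropDuplicates().withColumn("load_ts", F.current_timestamp())

# Write the transformed data into a Hive table (placeholder table name).
transformed.write.mode("append").saveAsTable("default.daily_csv_data")

spark.stop()
```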

6. Dataproc Cluster Deletion Task:

● Use the DataprocClusterDeleteOperator to delete the Dataproc cluster once the PySpark job is successfully completed (sketched below).
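
In current provider releases the operator is named DataprocDeleteClusterOperator; a sketch:

```python
# Cluster teardown sketch; names reuse the placeholders defined earlier.
from airflow.providers.google.cloud.operators.dataproc import DataprocDeleteClusterOperator

delete_cluster = DataprocDeleteClusterOperator(
    task_id="delete_dataproc_cluster",
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name=CLUSTER_NAME,
)
```

The default trigger rule (all upstream tasks successful) matches the assignment's wording; setting trigger_rule="all_done" on this task instead would tear the cluster down even when the job fails, which avoids leaving orphaned clusters running.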
7. DAG Dependency Configuration:

● Set the task dependencies using the set_upstream and set_downstream methods or the bitshift operators (>> and <<).
● Ensure that the DAG tasks run in the correct sequence (see the sketch below).
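
Using the task names from the earlier sketches, one possible ordering wired with bitshift operators:

```python
# Sensor -> create cluster -> run PySpark job -> delete cluster
wait_for_csv >> create_cluster >> submit_pyspark >> delete_cluster
```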
Evaluation Criteria:

● Proper configuration and structuring of the Airflow DAG.
● Successful execution and scheduling of the DAG.
● Correct sensing of the daily CSV file.
● Successful creation and deletion of the Dataproc cluster.
● Successful execution of the PySpark job with the desired transformation.
● Proper writing of the transformed data to the Hive table.

Tips:
● Remember to configure the necessary GCP connection in the Airflow web UI.
● Ensure you handle exceptions and potential issues in the workflow, such as cluster creation failures or script execution errors (see the retry sketch below).
● Log important steps and outputs for easier debugging.
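
One simple way to add basic resilience is through task retries in the DAG's default_args; the values here are illustrative:

```python
# Illustrative default_args: retry each failed task once after a short delay.
from datetime import timedelta

default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}
# Pass default_args=default_args to the DAG constructor shown earlier.
```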

Submission:

● Submit the DAG Python file (gcp_dataproc_pyspark_dag.py).
● Provide a brief report explaining the workflow, any challenges faced, and their solutions.
● Include screenshots of the successful DAG runs and the resulting data in the Hive table.