
Airflow Assignment: GCP Dataproc PySpark Job

Objective: Automate a workflow using Apache Airflow to process daily incoming CSV files from a GCP bucket using a Dataproc PySpark job and save the transformed data into a Hive table.

Tasks:

1. Setup:

● Create a Google Cloud Platform (GCP) bucket to store the daily CSV files.
● Set up an Apache Airflow environment and ensure the GCP and Dataproc plugins/hooks are available.

2. DAG Configuration:
● Create a new DAG named gcp_dataproc_pyspark_dag.
● Schedule the DAG to run once a day.
● Ensure catchup is set to False: catchup=False (see the sketch below).
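
A minimal sketch of the DAG shell, assuming the apache-airflow-providers-google package is installed; the start date is illustrative, and on Airflow 2.4+ the parameter schedule can be used in place of schedule_interval:

```python
# Minimal DAG shell; start_date is illustrative and should be adjusted.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="gcp_dataproc_pyspark_dag",
    schedule_interval="@daily",       # run once a day
    start_date=datetime(2024, 1, 1),  # illustrative start date
    catchup=False,                    # do not backfill missed runs
) as dag:
    ...  # tasks from the following sections go here
```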

3. File Sensor Task:


● Add a GCSObjectExistenceSensor task to check for the presence of the daily CSV file in the GCP bucket.
● Configure the task to poke for the file every 5 minutes for a maximum of 12 hours (a sketch follows below).
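
A hedged sketch of the sensor task; the bucket and object names are placeholders, and the connection ID assumes the default Google Cloud connection configured in the Airflow UI:

```python
# Sensor sketch: waits for the day's CSV to land in GCS.
from airflow.providers.google.cloud.sensors.gcs import GCSObjectExistenceSensor

wait_for_csv = GCSObjectExistenceSensor(
    task_id="wait_for_daily_csv",
    bucket="your-gcs-bucket",                    # placeholder bucket name
    object="incoming/data_{{ ds_nodash }}.csv",  # placeholder path, templated per run date
    poke_interval=5 * 60,                        # poke every 5 minutes
    timeout=12 * 60 * 60,                        # give up after 12 hours
    google_cloud_conn_id="google_cloud_default",
)
```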
G

4. Dataproc Cluster Creation Task:

● Use the DataprocClusterCreateOperator to create a new Dataproc cluster.
● Define and configure the cluster specifications as needed (an example configuration is sketched below).
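
In current releases of the Google provider this operator is exposed as DataprocCreateClusterOperator; the project, region, and machine settings below are placeholders:

```python
# Cluster creation sketch; all names and sizes are placeholders.
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

PROJECT_ID = "your-gcp-project"           # placeholder
REGION = "us-central1"                    # placeholder
CLUSTER_NAME = "airflow-pyspark-cluster"  # placeholder

CLUSTER_CONFIG = {
    "master_config": {
        "num_instances": 1,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_size_gb": 50},
    },
    "worker_config": {
        "num_instances": 2,
        "machine_type_uri": "n1-standard-2",
        "disk_config": {"boot_disk_size_gb": 50},
    },
}

create_cluster = DataprocCreateClusterOperator(
    task_id="create_dataproc_cluster",
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name=CLUSTER_NAME,
    cluster_config=CLUSTER_CONFIG,
)
```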

5. PySpark Job Execution Task:


● Upload your PySpark script to a Cloud Storage (GCS) bucket.
● Use the DataProcPySparkOperator to execute the PySpark script on the created Dataproc cluster (see the sketches after this list).
● The PySpark script should:
○ Read the daily CSV file from the GCP bucket.
○ Perform some logical transformations on the data.
○ Write the transformed data into a Hive table.
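
One way to submit the job from Airflow, assuming the current Google provider where PySpark jobs are submitted through DataprocSubmitJobOperator; the script URI is a placeholder and the project, region, and cluster names reuse the placeholders defined earlier:

```python
# Job submission sketch; the GCS script path is a placeholder.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": "gs://your-gcs-bucket/scripts/transform_daily_csv.py"
    },
}

submit_pyspark = DataprocSubmitJobOperator(
    task_id="run_pyspark_transform",
    project_id=PROJECT_ID,
    region=REGION,
    job=PYSPARK_JOB,
)
```

And a minimal sketch of what the PySpark script itself might contain; the bucket path, transformation, and Hive table name are all placeholders to be replaced with your own:

```python
# transform_daily_csv.py -- illustrative only; replace paths, transformations,
# and the Hive table name with your own.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily_csv_to_hive")
    .enableHiveSupport()  # required to write managed Hive tables
    .getOrCreate()
)

# Read the daily CSV file from the GCS bucket (placeholder path).
df = spark.read.option("header", True).csv("gs://your-gcs-bucket/incoming/")

# Example transformation: drop duplicates and add a load timestamp.
transformed = df.dropDuplicates().withColumn("load_ts", F.current_timestamp())

# Write the transformed data into a Hive table (placeholder table name).
transformed.write.mode("append").saveAsTable("default.daily_csv_data")

spark.stop()
```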

6. Dataproc Cluster Deletion Task:

● Use the DataprocClusterDeleteOperator to delete the Dataproc cluster once the PySpark job is successfully completed (sketched below).
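
In current provider releases the operator is named DataprocDeleteClusterOperator; a sketch:

```python
# Cluster teardown sketch; names reuse the placeholders defined earlier.
from airflow.providers.google.cloud.operators.dataproc import DataprocDeleteClusterOperator

delete_cluster = DataprocDeleteClusterOperator(
    task_id="delete_dataproc_cluster",
    project_id=PROJECT_ID,
    region=REGION,
    cluster_name=CLUSTER_NAME,
)
```

The default trigger rule (all upstream tasks successful) matches the assignment's wording; setting trigger_rule="all_done" on this task instead would tear the cluster down even when the job fails, which avoids leaving orphaned clusters running.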
7. DAG Dependency Configuration:

● Set the task dependencies using the set_upstream and set_downstream methods or the bitshift operators (>> and <<).
● Ensure that the DAG tasks run in the correct sequence (see the sketch below).
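
Using the task names from the earlier sketches, one possible ordering wired with bitshift operators:

```python
# Sensor -> create cluster -> run PySpark job -> delete cluster
wait_for_csv >> create_cluster >> submit_pyspark >> delete_cluster
```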
Evaluation Criteria:

● Proper configuration and structuring of the Airflow DAG.
● Successful execution and scheduling of the DAG.
● Correct sensing of the daily CSV file.
● Successful creation and deletion of the Dataproc cluster.
● Successful execution of the PySpark job with the desired transformation.
● Proper writing of the transformed data to the Hive table.

Tips:
● Remember to configure the necessary GCP connection in the Airflow web UI.
● Ensure you handle exceptions and potential issues in the workflow, such as cluster creation failures or script execution errors (see the retry sketch below).
● Log important steps and outputs for easier debugging.
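
One simple way to add basic resilience is through task retries in the DAG's default_args; the values here are illustrative:

```python
# Illustrative default_args: retry each failed task once after a short delay.
from datetime import timedelta

default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=10),
}
# Pass default_args=default_args to the DAG constructor shown earlier.
```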

Submission:

● Submit the DAG Python file (gcp_dataproc_pyspark_dag.py).
● Provide a brief report explaining the workflow, any challenges faced, and their solutions.
● Include screenshots of the successful DAG runs and the resulting data in the Hive table.