Airflow_Assignment_1
Tasks:
1. Setup:
● Create a Google Cloud Platform (GCP) bucket to store the daily CSV files.
● Set up an Apache Airflow environment and ensure the GCP and Dataproc plugins/hooks are available (see the setup sketch below).
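A minimal setup sketch, assuming the google-cloud-storage client library, already-configured GCP credentials, and placeholder project, bucket, and region names (none of these are prescribed by the assignment):

    # Hypothetical one-off setup, run outside the DAG.
    # The Airflow environment also needs the Google provider package,
    # e.g.: pip install apache-airflow-providers-google
    from google.cloud import storage

    client = storage.Client(project="my-gcp-project")   # placeholder project id
    bucket = client.create_bucket(
        "daily-csv-landing-bucket",                      # placeholder bucket name
        location="us-central1",                          # placeholder region
    )
    print(f"Created bucket: {bucket.name}")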
2. DAG Configuration:
● Create a new DAG named gcp_dataproc_pyspark_dag.
● Schedule the DAG to run once a day.
● Ensure catchup is set to False: catchup=False (see the DAG sketch below).
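A minimal DAG-definition sketch; the start_date and default_args values are placeholders rather than part of the assignment:

    from datetime import datetime, timedelta
    from airflow import DAG

    default_args = {
        "owner": "airflow",                  # placeholder owner
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="gcp_dataproc_pyspark_dag",
        default_args=default_args,
        schedule_interval="@daily",          # run once a day
        start_date=datetime(2024, 1, 1),     # placeholder start date
        catchup=False,                       # do not backfill missed runs
    )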
6. Dataproc Cluster Deletion Task:
● Use the DataprocClusterDeleteOperator to delete the Dataproc cluster once the PySpark job has completed successfully (see the sketch below).
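A minimal sketch of the deletion task, with placeholder project, region, and cluster names. Note that recent releases of the Airflow Google provider expose this operator as DataprocDeleteClusterOperator; older contrib releases name it DataprocClusterDeleteOperator as in the task description above:

    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocDeleteClusterOperator,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_dataproc_cluster",
        project_id="my-gcp-project",           # placeholder project id
        region="us-central1",                  # placeholder region
        cluster_name="pyspark-daily-cluster",  # must match the cluster created earlier
        trigger_rule="all_success",            # delete only after the PySpark job succeeds
        dag=dag,
    )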
7. DAG Dependency Configuration:
● Set the task dependencies using the set_upstream and set_downstream methods or the bitshift operators (>> and <<).
● Ensure that the DAG tasks run in the correct sequence (see the sketch below).
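A minimal dependency sketch, assuming hypothetical task names create_cluster, submit_pyspark_job, and delete_cluster:

    # Bitshift style: create the cluster, run the PySpark job, then delete the cluster.
    create_cluster >> submit_pyspark_job >> delete_cluster

    # Equivalent explicit style using set_upstream / set_downstream:
    submit_pyspark_job.set_upstream(create_cluster)
    submit_pyspark_job.set_downstream(delete_cluster)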
Evaluation Criteria:
Tips:
● Remember to configure the necessary GCP connection in
the Airflow web UI.
● Ensure you handle exceptions and potential issues in the
workflow, such as cluster creation failures or script execution
errors.
● Log important steps and outputs for easier debugging (see the sketch below).
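A minimal sketch of retry and failure-handling settings, assuming a hypothetical notify_failure callback; the GCP connection itself (for example the default google_cloud_default connection id) is configured in the Airflow web UI rather than in code:

    import logging
    from datetime import timedelta

    def notify_failure(context):
        # Hypothetical failure callback: log the failing task so issues such as
        # cluster-creation failures or PySpark script errors are easy to trace.
        task_id = context["task_instance"].task_id
        logging.error("Task %s failed; check its logs for details.", task_id)

    default_args = {
        "retries": 2,                           # retry transient failures
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_failure,  # log failures for easier debugging
    }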
Submission:
● Provide a brief report explaining the workflow, any
challenges faced, and their solutions.
● Include screenshots of the successful DAG runs and the resulting data in the Hive table.