Simple AWS ETL Project
Goal: To extract customer order data from a CSV source, transform it, and load it into a
Redshift data warehouse for analysis.
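For context, the job below assumes the source CSV carries at least order_date, quantity, and price columns (the identifier columns are illustrative guesses, not confirmed by the original). A hypothetical sample:

order_id,customer_id,order_date,quantity,price
1001,C-042,2023-06-14,3,19.99
1002,C-017,2023-06-15,1,249.00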
The following AWS Glue job (PySpark) implements the pipeline:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col, to_date

# Initialize the Spark and Glue contexts and the Glue job.
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Extract: read the orders table registered in the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="orders_db",
    table_name="orders",
    transformation_ctx="datasource0")

# Transform: cast order_date to a date type and derive total_amount.
df = datasource0.toDF()
transformed_df = df.withColumn("order_date", to_date(col("order_date"))) \
                   .withColumn("total_amount", col("quantity") * col("price"))

# Convert the Spark DataFrame back to a DynamicFrame for the Glue writer.
transformed_dynamic_frame = DynamicFrame.fromDF(
    transformed_df, glueContext, "transformed_dynamic_frame")

# Load: write the transformed data to S3 as Parquet.
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://your-project-transformed-bucket/"},
    format="parquet")

job.commit()
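The job above stages the transformed data in S3 rather than writing directly to Redshift. A minimal sketch of the final load step, assuming a pre-configured Glue catalog connection named "redshift-conn" and a target table "public.orders_transformed" (both placeholder names, not from the original project):

# Sketch: push the transformed DynamicFrame into Redshift through a
# Glue JDBC connection. The connection name, database, table, and
# temp-dir path below are assumed placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed_dynamic_frame,
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "public.orders_transformed",
                        "database": "dev"},
    redshift_tmp_dir="s3://your-project-temp-bucket/redshift-staging/",
    transformation_ctx="redshift_sink")

Equally common is leaving the Parquet files in S3 and issuing a Redshift COPY command against them, which keeps the warehouse load decoupled from the Glue job.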
By walking through this project and highlighting these testing points, you'll demonstrate a solid
understanding of ETL testing principles and AWS services. Good luck!