Simple AWS ETL Project

Project: Customer Order Data ETL Pipeline

Goal: To extract customer order data from a CSV source, transform it, and load it into a
Redshift data warehouse for analysis.

1. Source System (CSV):


● Scenario: We'll simulate a source system by creating a simple CSV file named orders.csv.
● Content:
Code snippet
order_id,customer_id,order_date,product_id,quantity,price
1,101,2023-10-26,A123,2,25.00
2,102,2023-10-27,B456,1,50.00
3,101,2023-10-28,A123,3,25.00
4,103,2023-10-29,C789,1,100.00

2. AWS Lambda (Extraction) / S3 (Raw CSVs):


● Lambda Function (Optional):
○ For simplicity, we'll manually upload the orders.csv to an S3 bucket.
○ However, in a real-world scenario, you could create a Lambda function triggered by
events (e.g., file upload to a specific S3 location) to automate the extraction process.
○ The Lambda function would read the CSV from the source (e.g., another S3 bucket or a database) and store it in the "Raw" S3 bucket; a minimal handler is sketched after this list.
● S3 "Raw" Bucket:
○ Create an S3 bucket named your-project-raw-bucket.
○ Upload the orders.csv file to this bucket.
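
A minimal sketch of such a Lambda handler, assuming the function is triggered by S3 "ObjectCreated" events on the source bucket and copies each new file into your-project-raw-bucket (both bucket names are placeholders to adjust):
Python
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

# Target "Raw" bucket; replace with your own bucket name.
RAW_BUCKET = "your-project-raw-bucket"

def lambda_handler(event, context):
    """Copy each newly uploaded CSV from the source bucket into the raw bucket."""
    for record in event.get("Records", []):
        source_bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # Server-side copy; the file never has to pass through the Lambda runtime.
        s3.copy_object(
            Bucket=RAW_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
    return {"status": "ok"}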

3. AWS Glue Crawlers (Schema Discovery):


● Crawler Configuration:
○ Create a Glue crawler.
○ Configure it to crawl the your-project-raw-bucket and point to the orders.csv file.
○ Specify a Glue database (e.g., orders_db) to store the discovered schema.
○ Run the crawler (a boto3 sketch for creating and starting it follows this section).
● Outcome:
○ The crawler will analyze the orders.csv file and create a table schema in the Glue Data
Catalog, defining the columns and data types.
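
If you prefer to script the crawler rather than use the console, a boto3 sketch along these lines would work; the crawler name and IAM role ARN are placeholders you would replace:
Python
import boto3

glue = boto3.client("glue")

# Names mirror the walkthrough; supply your own IAM role with S3 and Glue permissions.
glue.create_crawler(
    Name="orders-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="orders_db",
    Targets={"S3Targets": [{"Path": "s3://your-project-raw-bucket/"}]},
)

# Kick off the crawl; the discovered schema lands in the Glue Data Catalog.
glue.start_crawler(Name="orders-raw-crawler")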

4. AWS Glue Data Catalog:


● Verification:
○ Go to the Glue Data Catalog and verify that the orders_db database and the orders
table (created by the crawler) are present.
○ Inspect the table schema to ensure it matches the structure of your orders.csv file.
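
A quick programmatic check of the discovered schema, using boto3 with the database and table names from the walkthrough:
Python
import boto3

glue = boto3.client("glue")

# Fetch the table the crawler created and print its columns and types.
table = glue.get_table(DatabaseName="orders_db", Name="orders")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], column["Type"])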

5. AWS Glue Jobs (Transformation):


● Job Creation:
○ Create a Glue job (Spark or Python Shell).
○ Use the Glue Data Catalog table (orders_db.orders) as the source.
○ Transformation Logic (Example):
■ Convert the order_date column to a date data type.
■ Calculate the total_amount column (quantity * price).
■ Example PySpark code for the Glue job:
Python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.sql.functions import col, to_date

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job.init(args['JOB_NAME'], args)

# Read the source table registered by the crawler in the Glue Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="orders_db",
    table_name="orders",
    transformation_ctx="datasource0",
)
df = datasource0.toDF()

# Cast order_date to a date type and derive total_amount = quantity * price.
transformed_df = df.withColumn("order_date", to_date(col("order_date"))) \
    .withColumn("total_amount", col("quantity") * col("price"))

# Convert back to a DynamicFrame before writing with the Glue writers.
transformed_dynamic_frame = DynamicFrame.fromDF(
    transformed_df, glueContext, "transformed_dynamic_frame"
)

# Write the result as Parquet to the "Transformed" bucket.
glueContext.write_dynamic_frame.from_options(
    frame=transformed_dynamic_frame,
    connection_type="s3",
    connection_options={"path": "s3://your-project-transformed-bucket/"},
    format="parquet",
)

job.commit()

● S3 "Transformed" Bucket (Parquet):


○ Create an S3 bucket named your-project-transformed-bucket.
○ Run the Glue job.
○ The transformed data (in Parquet format) will be stored in this bucket.

6. AWS Glue Jobs (Redshift Load):


● Job Creation:
○ Create another Glue job.
○ Use the transformed Parquet data in your-project-transformed-bucket as the source.
○ Redshift Connection:
■ Configure a connection to your Redshift cluster.
■ Specify the target Redshift table (e.g., orders_transformed).
○ Load Logic:
■ Use the Glue Redshift connector to load the Parquet data into the Redshift table (a job sketch follows this section).
● Redshift (Data Warehouse):
○ Verify that the orders_transformed table is created in your Redshift cluster and
contains the transformed data.
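
A sketch of the load job, assuming a preconfigured Glue connection named redshift-connection, a Redshift database named dev, and a --TempDir job argument for the staging location (all of these names are placeholders):
Python
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the transformed Parquet files directly from S3.
parquet_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-project-transformed-bucket/"]},
    format="parquet",
)

# Load into Redshift through the Glue connection; table and database are placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=parquet_frame,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "orders_transformed", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()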

7. AWS Athena / BI Tools (Querying and Analysis):


● AWS Athena:
○ Create an external table in Athena that points to your-project-transformed-bucket (Parquet data).
○ Run SQL queries to analyze the data, e.g., total sales per customer (a boto3 sketch follows this section).
● BI Tools (Optional):
○ Connect a BI tool (e.g., Tableau, Power BI) to Redshift or Athena to create
visualizations and dashboards.
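
A sketch of running the per-customer sales query through the Athena API with boto3, assuming the external table orders_transformed already exists in orders_db and that the results bucket is a placeholder:
Python
import boto3

athena = boto3.client("athena")

# Total sales per customer over the transformed data.
query = """
    SELECT customer_id, SUM(total_amount) AS total_sales
    FROM orders_transformed
    GROUP BY customer_id
    ORDER BY total_sales DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://your-project-athena-results/"},
)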

ETL Testing Interview Points:


● Source Data Validation: Explain how you would validate the orders.csv data (e.g., data
types, completeness, consistency).
● Data Quality Checks: Describe the data quality checks you would perform during the
transformation process (e.g., handling null values, data type conversions, business rule
validation).
● Schema Validation: Discuss how you would validate the schema discovered by the Glue
crawler and the schema of the transformed data.
● Data Reconciliation: Explain how you would reconcile the data between the source orders.csv and the Redshift orders_transformed table (e.g., row counts, data sampling, checksums). A row-count comparison is sketched after this list.
● Performance Testing: Discuss how you would test the performance of the Glue jobs and
Redshift queries.
● Error Handling: Describe how you would handle errors during the ETL process (e.g.,
logging, alerting, retry mechanisms).
● Incremental Loads: Explain how you would implement incremental loads for new order
data.
● Data Lineage: How would you track data as it moves through the system?
● Security: How would you secure the data at rest and in transit?
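
As one concrete reconciliation check, a row-count comparison between the source CSV and the Redshift table might look like the sketch below; the Redshift connection details are placeholders, and psycopg2 is just one possible driver:
Python
import csv
import io

import boto3
import psycopg2  # any PostgreSQL driver works against Redshift

# Row count from the source CSV in the raw bucket (DictReader skips the header row).
s3 = boto3.client("s3")
body = s3.get_object(Bucket="your-project-raw-bucket", Key="orders.csv")["Body"].read()
source_rows = sum(1 for _ in csv.DictReader(io.StringIO(body.decode("utf-8"))))

# Row count from the target table in Redshift; connection details are placeholders.
conn = psycopg2.connect(
    host="your-cluster.example.redshift.amazonaws.com",
    port=5439, dbname="dev", user="etl_tester", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders_transformed")
    target_rows = cur.fetchone()[0]

assert source_rows == target_rows, f"Row count mismatch: {source_rows} vs {target_rows}"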

By walking through this project and highlighting these testing points, you'll demonstrate a solid
understanding of ETL testing principles and AWS services. Good luck!
