Hackathon Retail
1. Introduction:
In this hackathon, your challenge is to build a complete end-to-end batch data
pipeline that integrates multiple datasets from an online retail business. You will ingest,
cleanse, and enrich historical batch data using PySpark, then load it into a data warehouse
designed for fast, efficient analytics using SQL. The entire solution should be containerized
with Docker to ensure a reproducible deployment. If real-world data is not available, you may use the Python Faker module to generate realistic simulated datasets.
2. Objectives:
- Data Ingestion:
Ingest multiple batch datasets such as customer records, transactions, products,
and reviews.
- Data Transformation:
Use PySpark to clean, join, and enrich the datasets, addressing anomalies and missing values (a PySpark sketch follows this list).
- Data Warehouse Modeling:
Design a relational data model (e.g., star schema) that supports efficient analytics.
- Data Loading:
Load the transformed data into a relational database for further analysis.
- Containerized Deployment:
Package the entire pipeline using Docker to guarantee reproducible execution in any environment.
- Simulated Data Generation:
Utilize the Python Faker module to generate simulated data for testing and
demonstration purposes.
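As a rough illustration of the ingestion and transformation objectives, here is a minimal PySpark sketch, assuming the CSV files and column names listed in the dataset section below; the file paths, cleaning rules, and output location are illustrative choices, not prescribed parts of the challenge.

```python
# Minimal PySpark sketch: ingest, clean, and join the batch datasets.
# Paths, column names, and cleaning rules are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("retail-batch-etl").getOrCreate()

# Ingestion: read the raw CSV extracts with headers and schema inference.
customers = spark.read.csv("data/customers.csv", header=True, inferSchema=True)
transactions = spark.read.csv("data/transactions.csv", header=True, inferSchema=True)
products = spark.read.csv("data/products.csv", header=True, inferSchema=True)

# Cleaning: drop duplicates, discard rows missing keys, normalize types.
transactions_clean = (
    transactions.dropDuplicates(["transaction_id"])
    .dropna(subset=["customer_id", "product_id"])
    .withColumn("transaction_date", F.to_date("transaction_date"))
    .withColumn("amount", F.col("amount").cast("double"))
)

# Enrichment: join transactions with customer and product attributes.
enriched = (
    transactions_clean
    .join(customers.select("customer_id", "location"), "customer_id", "left")
    .join(products.select("product_id", "category", "price"), "product_id", "left")
)

# Persist the enriched output for the loading step.
enriched.write.mode("overwrite").parquet("output/enriched_transactions")
```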
- Loading Process:
- Implement the process to load the cleansed and transformed data into a relational database (PostgreSQL, MySQL, or SQLite); see the loading sketch after this list.
- Optimize table structures (indices, constraints) to ensure fast query performance.
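As one possible approach to the loading step, the following minimal sketch writes the transformed data into PostgreSQL via Spark's JDBC writer and then adds indexes with psycopg2. The hostname, credentials, database, and table names are placeholder assumptions, and the PostgreSQL JDBC driver must be available on the Spark classpath.

```python
# Minimal loading sketch: write the transformed data into PostgreSQL over
# JDBC and add indexes for common query patterns. The hostname, credentials,
# database, and table names are placeholder assumptions.
import psycopg2
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retail-load").getOrCreate()

# Read the output of the transformation step (see the earlier sketch).
enriched = spark.read.parquet("output/enriched_transactions")

jdbc_url = "jdbc:postgresql://warehouse:5432/retail"
jdbc_props = {"user": "retail", "password": "retail", "driver": "org.postgresql.Driver"}

# Load the fact table; dimension tables can be loaded the same way.
enriched.write.jdbc(jdbc_url, table="fact_transactions", mode="overwrite", properties=jdbc_props)

# Indexes and constraints are plain SQL DDL, applied here with psycopg2.
with psycopg2.connect(host="warehouse", dbname="retail", user="retail", password="retail") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE INDEX IF NOT EXISTS idx_fact_customer ON fact_transactions (customer_id);")
        cur.execute("CREATE INDEX IF NOT EXISTS idx_fact_date ON fact_transactions (transaction_date);")
```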
8. Dataset (Example):
You may work with or simulate the following datasets. If real data is not available, you are
encouraged to use the Python Faker module to generate realistic fake data for testing:
- customers.csv:
- `customer_id` (unique identifier)
- `name`
- `email`
- `signup_date`
- `location`
- transactions.csv:
- `transaction_id`
- `customer_id`
- `product_id`
- `transaction_date`
- `amount`
- products.csv:
- `product_id`
- `product_name`
- `category`
- `price`
- reviews.csv:
- `review_id`
- `customer_id`
- `product_id`
- `rating`
- `review_text`
- `review_date`
*(In addition to the provided fields, feel free to extend these datasets with more realistic attributes using Faker; a generation sketch follows below.)*
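If you take the simulated-data route, the following minimal Faker sketch generates a customers.csv matching the fields above; the row count, date range, and seed are illustrative assumptions, and the other files can be produced in the same way.

```python
# Minimal Faker sketch: generate a simulated customers.csv with the fields
# listed above. Row count, date range, and seed are illustrative assumptions.
import csv
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible output for testing

with open("data/customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "name", "email", "signup_date", "location"])
    for customer_id in range(1, 1001):
        writer.writerow([
            customer_id,
            fake.name(),
            fake.email(),
            fake.date_between(start_date="-3y", end_date="today").isoformat(),
            fake.city(),
        ])
```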
9. Evaluation Criteria:
- Data Pipeline Quality:
- Accuracy and efficiency in data ingestion, cleaning, and transformation.
- Data Warehouse Design:
- Logical and efficient relational model design with well-implemented joins and aggregations.
- Querying and Insights:
- Depth and relevance of SQL queries and the actionable insights derived.
- Code Quality & Documentation:
- Clean, modular code with clear comments, proper error handling, and comprehensive documentation.
- Containerization:
- Ease of deployment via Docker and clarity in the container setup process.
- Simulated Data Approach:
- Effective use of the Faker module to generate simulated data that thoroughly tests the pipeline.
10. Deliverables:
- Source Code Repository:
A public GitHub repository containing:
- Python/PySpark scripts for data ingestion, transformation, and loading.
- SQL scripts for creating the relational schema and running queries.
- Docker Artifacts:
- A Dockerfile and, optionally, a docker-compose configuration to orchestrate the services (the ETL pipeline and the relational database).
- README File:
- Detailed instructions on setup, usage, and replication of the environment.
- Presentation:
- A slide deck or video summarizing your solution, architectural decisions, and key insights.
- Docker (with Docker Compose for multi-container setups).
- Simulated Data Generation:
- Python Faker module for generating realistic, simulated datasets.
- Analytics & Visualization:
- SQL for querying, and libraries like Pandas, Matplotlib, or Seaborn for analysis (see the sketch below).
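As an example of the analytics layer, here is a minimal sketch that runs a SQL aggregation against the warehouse and plots the result with Pandas and Matplotlib; the connection string and table/column names follow the placeholder assumptions used in the earlier sketches.

```python
# Minimal analytics sketch: run a SQL aggregation against the warehouse and
# plot the result. Connection details and table names are placeholder
# assumptions carried over from the loading sketch.
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://retail:retail@warehouse:5432/retail")

query = """
    SELECT category, SUM(amount) AS total_revenue
    FROM fact_transactions
    GROUP BY category
    ORDER BY total_revenue DESC;
"""
revenue_by_category = pd.read_sql(query, engine)

# Simple bar chart of revenue per product category.
revenue_by_category.plot(kind="bar", x="category", y="total_revenue", legend=False)
plt.ylabel("Total revenue")
plt.title("Revenue by product category")
plt.tight_layout()
plt.savefig("revenue_by_category.png")
```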
13. Judging:
- Integration & Cohesion:
- Effectiveness of the end-to-end ETL pipeline and overall data model design.
- Technical Execution:
- Robustness of data processing and the performance of the relational database.
- Analytical Depth:
- Insightfulness of SQL queries and the overall relevance of the business insights.
- Documentation & Deployment:
- Clarity of documentation, code quality, and ease of replication using Docker.
- Bonus Challenge Implementation:
- Additional credit for innovative approaches, performance optimizations, effective simulated data generation with Faker, or completion of the interactive visualization bonus (e.g., Streamlit).
Good luck, and enjoy building a robust batch data pipeline that turns raw data into actionable
business insights using both real and simulated datasets!