
Retail Sales Dataset Project Document

1. Project Overview
This project focuses on processing and analyzing retail sales data using Azure Data Lake
Storage (ADLS) and Databricks. The goal is to ingest, transform, and organize data
efficiently across different storage layers (Landing, Bronze, Silver, and Gold) for insightful
analysis.

2. Data Source
 Source: Kaggle Retail Sales Dataset
 Data Type: CSV files

3. Data Storage Structure

3.1 Folder Structure in ADLS

 Landing: Raw data files uploaded from Kaggle.
 Bronze: Cleansed and standardized raw data.
 Silver: Transformed data with necessary enrichments.
 Gold: Final processed data for analysis and reporting.
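The four-layer layout above can be captured as a small path map for use in notebooks. This is only a sketch: the storage-account name ("retailsa") and container name ("retail") are hypothetical placeholders, not names from the project.

```python
# Sketch of the ADLS folder layout as a path map.
# "retailsa" and "retail" are hypothetical account/container names.
ACCOUNT = "retailsa"
CONTAINER = "retail"
BASE = f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net"

LAYER_PATHS = {
    "landing": f"{BASE}/landing",  # raw Kaggle CSV files
    "bronze": f"{BASE}/bronze",    # cleansed, standardized data
    "silver": f"{BASE}/silver",    # enriched, transformed data
    "gold": f"{BASE}/gold",        # aggregated data for reporting
}
```

Keeping the layer paths in one place means a notebook can reference `LAYER_PATHS["bronze"]` instead of repeating hard-coded URIs.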

4. Data Processing Workflow
1. Data Ingestion
o Download data from Kaggle.
o Split data into four files based on predefined logic.
o Upload files to the Landing folder in ADLS.
2. Data Movement and Transformation
o Move data from Landing → Bronze.
o Apply basic cleansing (removing duplicates, handling missing values).
o Move data from Bronze → Silver.
o Perform transformations (date formatting, category standardization, revenue
calculations).
o Move data from Silver → Gold.
o Aggregate and optimize data for reporting and analytics.
3. Databricks Integration
o Connect Databricks with ADLS using an access key.
o Use Databricks notebooks to automate the ingestion and transformation
pipeline.
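As a minimal illustration of the cleansing, transformation, and aggregation steps above, here is a plain-Python sketch. In Databricks the same logic would be written in PySpark; the column names (transaction_id, date, category, quantity, price) are assumptions based on a typical Kaggle retail sales schema, not confirmed fields from the project.

```python
from datetime import datetime

def cleanse(rows):
    """Bronze step: drop duplicate rows and rows with missing key fields."""
    seen, out = set(), []
    for r in rows:
        key = (r.get("transaction_id"), r.get("date"))
        if None in key or "" in key or key in seen:
            continue
        seen.add(key)
        out.append(r)
    return out

def transform(rows):
    """Silver step: standardize dates and categories, compute revenue."""
    for r in rows:
        r["date"] = datetime.strptime(r["date"], "%m/%d/%Y").strftime("%Y-%m-%d")
        r["category"] = r["category"].strip().title()
        r["revenue"] = float(r["quantity"]) * float(r["price"])
    return rows

def aggregate(rows):
    """Gold step: total revenue per category for reporting."""
    totals = {}
    for r in rows:
        totals[r["category"]] = totals.get(r["category"], 0.0) + r["revenue"]
    return totals
```

In PySpark the equivalents would be `dropDuplicates()`, `withColumn()` with `to_date`/`initcap`, and `groupBy().agg(sum(...))` over DataFrames read from the layer folders.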

5. Technologies Used
 Cloud Storage: Azure Data Lake Storage (ADLS)
 Processing Framework: Databricks (Apache Spark)
 Data Source: Kaggle CSV files
 Authentication: Access key-based authentication
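Access key-based authentication from a Databricks notebook typically amounts to a single Spark configuration call. The sketch below is a configuration fragment that only runs inside a Databricks runtime (where `spark` and `dbutils` are provided); the storage-account name and secret scope are hypothetical. Fetching the key from a secret scope rather than pasting it in plaintext is a common hardening step, not something stated in this project.

```python
# Sketch: access-key authentication from a Databricks notebook.
# "retailsa" (storage account) and "retail-scope" (secret scope) are
# hypothetical names; spark and dbutils come from the Databricks runtime.
storage_account = "retailsa"
access_key = dbutils.secrets.get(scope="retail-scope", key="adls-access-key")

spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key,
)

# After this, ADLS paths are readable directly:
df = spark.read.csv(
    f"abfss://retail@{storage_account}.dfs.core.windows.net/landing/",
    header=True,
)
```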

6. Expected Outcomes
 Well-organized data across Landing, Bronze, Silver, and Gold layers.
 Cleaned and transformed data ready for analysis.
 Efficient data pipeline for future scalability and automation.

7. Future Enhancements
 Implement Delta Lake for enhanced data reliability.
 Automate the ingestion pipeline using Azure Data Factory (ADF).
 Apply Machine Learning models for demand forecasting.
