Data Engineering Project - Retail Store, Part 2: Loading the Data
Bar Dadon · Published in Dev Genius · Mar 13, 2022
Introduction
This is the second part of the “Data Engineering Project — Retail Store”
series. After acquiring the whisky data for the retail store in part 1, the
next step is to continue the ETL process and load the data into the
organization’s central database.
Project Steps
The first part of the series focused on extracting the product data. At
this point, I have a dataframe with all the whisky data I could
scrape from https://www.thewhiskyexchange.com/.
First, I’ll load the CSV file that contains the products I extracted in part 1
into a dataframe.
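A minimal sketch of this step is shown below; the file name whisky_products.csv is an assumption for illustration, not necessarily the exact path produced in part 1.

```python
import pandas as pd

# Load the products scraped in part 1 (the file name is assumed for illustration)
products = pd.read_csv("whisky_products.csv")
print(products.head())
```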
Next, I’ll generate a column of unique product IDs. This will act as the
primary key for the products table.
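One straightforward way to do this, sketched below, is to assign a sequential integer to each row; the exact scheme used here may differ.

```python
# Assign a sequential integer ID to each product; it will serve as the primary key
products.insert(0, "product_id", range(1, len(products) + 1))

# A primary key must be unique, so verify that before moving on
assert products["product_id"].is_unique
```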
After generating the random data, I can start with the design of the central
database. The database will act as the organization’s primary source of
data, and thus we should strive to maximize the consistency and
integrity of the data within it.
I’ll normalize each table to meet the requirements of the first, second,
and then third normal form (3NF).
I’ll start by normalizing the customers table. Let’s begin by checking
whether this table meets the requirements for 1NF.
1NF
Right now, each row is unique, and each cell contains exactly one
value, so this table already meets the requirements for the first normal
form (1NF).
2NF
I’ll create three separate tables from this table to meet the
requirements for 2NF.
After creating the new table, I need to connect it to the customers table.
To do that, let’s create a foreign key called country_id.
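Here is a rough sketch of that split in pandas, assuming the customers dataframe has a plain-text country column (the column names are assumptions):

```python
# Build a countries lookup table from the unique country values in customers
countries = (
    customers[["country"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
countries.insert(0, "country_id", range(1, len(countries) + 1))

# Replace the country text in customers with a foreign key to the new table
customers = (
    customers.merge(countries, on="country", how="left")
             .drop(columns=["country"])
)
```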
And now the customers table is in 2NF. Let’s see how the database
looks now:
3NF
Now let’s check whether the customers table meets the third normal form.
The requirement for 3NF is that each non-key column depends only on the
primary key. In this case, the primary key is customer_id. Now, to
understand whether a column depends only on the primary key, we
should ask ourselves two questions:
Q1. “Given a primary key, can I tell the value of that column?”
Q2. “Given any other column, can I tell the value of that column?”
If the answer to the first question is yes and the answer to the second is
no, we can safely say that the column depends only on the primary key.
Q1. Given a customer id, can I know the name of that customer? Yes.
Q2. Given any other column, can I know the name of the customer? No.
The other columns pass the same check, so the customers table meets 3NF.
Next, let’s run the employees table through the same process.
1NF
Same as before, each row is unique, and each cell contains exactly one
value, so this table is already in the first normal form (1NF).
2NF
To check the 2NF requirements, let’s ask: does this table have values
that apply to multiple rows? The department information repeats across
employees, so I’ll move it into a separate departments table:
departments table
That takes care of 2NF. Let’s check the requirements for 3NF:
3NF
As you can see, each column is dependent only on the primary key,
which is employee_id, so this table meets the 3NF.
With the schema normalized, the next step is to create the tables in MySQL.
First, I’ll start with the tables without a foreign key: countries,
customer_cc, products, and departments.
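For illustration, here is roughly what the DDL for one of these parent tables might look like when executed from Python with SQLAlchemy; the connection string, column names, and types are assumptions.

```python
from sqlalchemy import create_engine, text

# Illustrative connection string; substitute real credentials and database name
engine = create_engine("mysql+pymysql://user:password@localhost/retail_store")

# Parent table with no foreign keys (column names and types are assumptions)
create_countries = text("""
    CREATE TABLE IF NOT EXISTS countries (
        country_id INT PRIMARY KEY,
        country    VARCHAR(100) NOT NULL
    )
""")

with engine.begin() as conn:
    conn.execute(create_countries)
```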
Countries
Customer_cc
Products
Departments
Now I can create the rest of the tables: customers, employees, and
payments. These tables reference the parent tables through foreign keys.
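Below is a sketch of what one of these child tables might look like, reusing the engine from the previous sketch; the column names and types are again assumptions.

```python
from sqlalchemy import text

# Child table that references the countries table through country_id
create_customers = text("""
    CREATE TABLE IF NOT EXISTS customers (
        customer_id INT PRIMARY KEY,
        first_name  VARCHAR(100),
        last_name   VARCHAR(100),
        country_id  INT,
        FOREIGN KEY (country_id) REFERENCES countries (country_id)
    )
""")

with engine.begin() as conn:
    conn.execute(create_customers)
```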
Customers
Employees
Payments
Countries
Before populating the rest of the tables, I would like to show how this
looks in MySQL.
Let’s run a simple SELECT query to show that this data has now moved
from my Python Jupyter notebook into MySQL.
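A sketch of how that might look with pandas and SQLAlchemy, assuming the countries dataframe and the engine from the earlier sketches:

```python
import pandas as pd

# Write the countries dataframe into the MySQL table created earlier
countries.to_sql("countries", engine, if_exists="append", index=False)

# Read it back with a simple SELECT to confirm the data now lives in MySQL
print(pd.read_sql("SELECT * FROM countries LIMIT 5", engine))
```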
After populating the rest of the tables, I’ll use only MySQL when I work
with the data. This is the whole point of this part of the project: to store
the data in a tool designed for querying data, such as MySQL.
Let’s continue populating the tables, following the same steps for each of
the remaining ones.
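Sketched below is one way to do this in a single loop; the dataframe names simply mirror the table names above and are assumptions.

```python
# Map each remaining MySQL table to the dataframe that holds its data
# (the dataframe names are assumptions that mirror the table names)
tables = {
    "customer_cc": customer_cc,
    "products": products,
    "departments": departments,
    "customers": customers,
    "employees": employees,
    "payments": payments,
}

for name, df in tables.items():
    df.to_sql(name, engine, if_exists="append", index=False)
```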
Customer_cc
Products
Departments
Customers
Employees
Payments
And that’s it. Now all the data is stored in MySQL in a normalized
schema that can be used to query and extract data. Let’s create an ERD
in MySQL to see the finished schema:
That’s it for part 2 of the series.
In part 3, I’ll design a data warehouse that the organization will use as
a single source of data for analytical purposes, BI-related decision
making, and reporting.