
https://blog.devgenius.io/data-engineering-project-retail-store-part-2-loading-the-data-7c15c9c387e4

Data Engineering Project — Retail Store Part 2 — Loading the Data

Bar Dadon · Dev Genius · Mar 13, 2022
Introduction

This is the second part of the “Data Engineering Project — Retail Store”
series. After acquiring the whisky data for the retail store in part 1, the
next step is to continue the ETL process and load the data into the
organization’s central database.

Project Steps

1. Generate Random Data. In this part, I will use Python to generate random data about various parts of the organization.

2. Design a Central RDBMS and apply normalization.

3. Load the data into the central RDBMS.

The first part of the series focused on extracting the product data. At
this point, I have a dataframe with all the data about whisky I could
scrape from https://www.thewhiskyexchange.com/.

You can view the first part of the series here:
https://medium.com/@bdadon50/data-engineering-project-retail-store-part-1-web-scraping-a99ac5d6d44c.

As a reminder, this is where I stopped last time.


Step #1 — Generate Random Data

First, I will generate random data about employees, customers, payments, and products.

Imports and Functions
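A minimal sketch of what the imports might look like; Faker (for fake personal data) and uuid (for unique ids) are assumptions here, not necessarily the exact libraries used:

    import random
    import uuid

    import pandas as pd
    from faker import Faker  # assumed for generating fake names and values

    fake = Faker()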

1. Generating Product Data

First, I’ll load the CSV file that contains the products I extracted in part 1
into a dataframe.

Next, I’ll generate a column of unique product ids. This will act as the
primary key for the products table.

Let’s see a sample of the dataframe.
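A minimal sketch of these three steps (the CSV file name is an assumption):

    # load the whisky products scraped in part 1 (file name assumed)
    products = pd.read_csv("whisky_products.csv")

    # one unique id per row, acting as the primary key of the products table
    products["product_id"] = [str(uuid.uuid4()) for _ in range(len(products))]

    products.head()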

I will quickly generate completely random data about employees,
customers, and payments using Python in the following steps. If you
are not interested in the code, feel free to skip this part and
move straight to Step #2 — Designing the Central RDBMS
and Normalizing the Data.

2. Generating Employee Data
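A sketch of what this could look like; the department column is inferred from the normalization discussion below, while the department names and table size are placeholders:

    n_employees = 50  # assumed size

    # four departments exist in the organization (per the 2NF discussion
    # below); the names themselves are placeholders
    department_names = ["Sales", "Marketing", "Finance", "HR"]

    employees = pd.DataFrame({
        "employee_id": range(1, n_employees + 1),
        "first_name": [fake.first_name() for _ in range(n_employees)],
        "last_name": [fake.last_name() for _ in range(n_employees)],
        "department": [random.choice(department_names) for _ in range(n_employees)],
    })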

3. Generating Customer Data
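A sketch along the same lines; the columns follow the normalization discussion below (country, country_code, credit_provider), and the sample values are placeholders:

    n_customers = 200  # assumed size

    # a small country -> dialing-code map keeps country_code functionally
    # dependent on country, which the 2NF step below relies on (sample values)
    country_map = {"United Kingdom": "+44", "United States": "+1", "France": "+33"}
    chosen = [random.choice(list(country_map)) for _ in range(n_customers)]

    customers = pd.DataFrame({
        "customer_id": range(1, n_customers + 1),
        "first_name": [fake.first_name() for _ in range(n_customers)],
        "last_name": [fake.last_name() for _ in range(n_customers)],
        "country": chosen,
        "country_code": [country_map[c] for c in chosen],
        "credit_provider": [random.choice(["Visa", "Mastercard", "Amex"])
                            for _ in range(n_customers)],
    })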

4. Generating Payments Data
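A sketch of the payments table, which ties customers, products, and employees together; the exact columns are assumptions:

    n_payments = 1000  # assumed size

    payments = pd.DataFrame({
        "payment_id": range(1, n_payments + 1),
        "customer_id": random.choices(customers["customer_id"].tolist(), k=n_payments),
        "product_id": random.choices(products["product_id"].tolist(), k=n_payments),
        "employee_id": random.choices(employees["employee_id"].tolist(), k=n_payments),
        "payment_date": [fake.date_this_year() for _ in range(n_payments)],
        "amount": [round(random.uniform(20, 500), 2) for _ in range(n_payments)],
    })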


Step #2 — Designing the Central RDBMS and Normalizing the Data

After generating random data, I can start with the design of the central
database. The database will act as the organization’s primary source of
data, and thus, we should strive to maximize the consistency and
integrity of the data within this database.

To do that, I will normalize each table, minimizing redundancy and
maximizing the integrity of the database.

I’ll normalize each table to meet the requirements of the first, second,
and then third normal form (3NF).

Also, I will normalize the tables first in Python in a Jupyter Notebook,
and only when everything is done will I load the entire schema into the
database.

The current database looks like this:


Normalizing Tables

Now that I have generated all the de-normalized tables, I can


normalize them to meet the demands for the third normalized
form(3NF).

Let’s quickly go over the requirements for each normal form:

First Normal Form (1NF)

1. Each row needs to be unique.

2. Each cell may contain only one value.

Second Normal Form (2NF)

1. The table must already meet the requirements for 1NF.

2. Groups of repeated values that apply to multiple rows should be separated into their own tables.

Third Normal Form (3NF)

1. The table must already meet the requirements for 2NF.

2. Each attribute needs to depend only on the primary key.

1. Normalizing the Customers Table

I’ll start by normalizing the customers table. First, let’s check whether
this table meets the requirements for 1NF.

1NF
Right now, each row is unique, and each cell contains exactly one
value, so this table already meets the requirements for the first normal
form (1NF).

2NF

The requirement for the second normal form is to separate groups of
values that apply to multiple rows.

For example, country, country_code, and credit_provider do not have
a lot of cardinality and apply to multiple rows. This is an example of
redundancy in the table.

I’ll split this table into three separate tables (customers, countries, and
customer_cc) to meet the requirements for 2NF.

First Table — countries

After creating the new table, I need to connect it to the customers table.
To do that, let’s create a foreign key called country_ids.

Now, I can drop the columns country and country_code from
customers without losing data, since the data is kept in the separate
countries table.
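A minimal sketch of this whole step in pandas (the key name country_ids follows the text; the exact code is an assumption):

    # build the countries lookup from the unique country / country_code pairs
    countries = (customers[["country", "country_code"]]
                 .drop_duplicates()
                 .reset_index(drop=True))
    countries["country_ids"] = countries.index + 1

    # attach the foreign key to customers, then drop the redundant columns
    customers = customers.merge(countries, on=["country", "country_code"])
    customers = customers.drop(columns=["country", "country_code"])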
Done. Let’s continue separating the customers table. This time, I’ll
create a table called customer_cc, which will hold the information
about customers’ credit card providers.

Second Table — Customer_cc

Same as before, I need to connect it to customers. Let’s create a foreign
key called credit_provider_id.

Now, I can drop the column credit_provider from customers.
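The same pattern as a sketch (again, the exact code is an assumption):

    # lookup table of the distinct credit card providers
    customer_cc = (customers[["credit_provider"]]
                   .drop_duplicates()
                   .reset_index(drop=True))
    customer_cc["credit_provider_id"] = customer_cc.index + 1

    # attach the foreign key, then drop the redundant column
    customers = customers.merge(customer_cc, on="credit_provider")
    customers = customers.drop(columns=["credit_provider"])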

And now the customers table is in 2NF. Let’s see how the database
looks now:

3NF

Now let’s check if the customers table meets the third normal form.

The requirement for 3NF is that each column is dependent only on the
primary key. In this case, the primary key is customer_id. Now, to
understand if a column is dependent only on the primary key, we
should ask ourselves two questions:

Q1. “Given a primary key, can I tell the value of that column?”

If the answer is yes, then we move to the second question.

Q2. “Given any other column, can I tell the value of that column?”

If the answer is no, we can safely say that this column is dependent
only on the primary key.

For example, let’s check if the column first_name is dependent only on
the primary key. Let’s ask:

Q1. Given a customer id, can I know the name of that customer?

Answer: Yes. Knowing the customer id will pinpoint the customer’s name.

Q2. Given any other column, can I know the name of the customer?

Answer: No. By looking at the last name, country, or any other
column, I cannot pinpoint the customer’s name. Multiple people may
have the same last name and the same country.

So, in conclusion, the column first_name does belong in that table,
provided I intend to normalize the customers table up to 3NF.

2. Normalizing the Employees Table

Let’s quickly do the same process for the employees table. As a
reminder, the table looks like this:

1NF

Same as before, each row is unique, and each cell contains exactly one
value, so this table is already in the first normal form (1NF).

2NF

To check the 2NF requirements, let’s ask: does this table have values
that apply to multiple rows?

Yes. Department is a repeated value.

There are only four departments in the organization, so this column
causes data redundancy. Let’s divide this table into two tables: an
employees table and a departments table.

departments table

The new table looks like this:


After creating the new table, I need to connect it to employees. Same as
before, I’ll make a foreign key to join them.

Now, I can drop the column department from employees.
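A sketch of this step, following the same pattern as the customers table (the exact code is an assumption):

    # lookup table of the four departments
    departments = (employees[["department"]]
                   .drop_duplicates()
                   .reset_index(drop=True))
    departments["department_id"] = departments.index + 1

    # attach the foreign key, then drop the redundant column
    employees = employees.merge(departments, on="department")
    employees = employees.drop(columns=["department"])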

That’s it for 2NF. Let’s check the requirements for 3NF:

3NF

The employee table now looks like this:

As you can see, each column is dependent only on the primary key,
which is employee_id, so this table meets the 3NF.

I normalized the rest of the database using the same principles. This is
what the new normalized database looks like:

Reminder: The steps up till now were done solely in Python. Now
that the tables are normalized, I’ll connect to the database and load the
data according to my design. The tool I used as my database is MySQL.

Step #3 — Loading the Data into the Central RDBMS
1. Connecting Python to MySQL
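A minimal sketch using the mysql-connector-python driver (the driver choice and the credentials are assumptions):

    import mysql.connector

    # credentials are placeholders
    conn = mysql.connector.connect(
        host="localhost",
        user="root",
        password="my_password",
    )
    cursor = conn.cursor()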

2. Creating a new Schema
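For example (the schema name retail_store is an assumption):

    # create the schema and switch to it
    cursor.execute("CREATE DATABASE IF NOT EXISTS retail_store")
    cursor.execute("USE retail_store")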

3. Generating empty tables

First, I’ll start with the tables without a foreign key: countries,
customer_cc, products, and departments.

Countries
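A sketch of the DDL for countries; the other tables without foreign keys follow the same pattern (column names follow the text, types are assumptions):

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS countries (
            country_ids INT PRIMARY KEY,
            country VARCHAR(100),
            country_code VARCHAR(10)
        )
    """)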

Customer_cc

Products

Departments

Now I can create the rest of the tables: customers, employees, and
payments.

Customers
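A sketch of a table that carries foreign keys (types and constraint details are assumptions):

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS customers (
            customer_id INT PRIMARY KEY,
            first_name VARCHAR(100),
            last_name VARCHAR(100),
            country_ids INT,
            credit_provider_id INT,
            FOREIGN KEY (country_ids) REFERENCES countries (country_ids),
            FOREIGN KEY (credit_provider_id)
                REFERENCES customer_cc (credit_provider_id)
        )
    """)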

Employees

Payments

4. Populating the tables


To insert the dataframes I created in Python into MySQL, I need to
transform them. I’ll convert each row of each table into a tuple and
insert the rows one by one using a for loop, committing the entire
transaction when the loop ends.

Countries
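A sketch of the insertion loop for countries (the exact statement is an assumption):

    # cast each dataframe row to a tuple of native Python types,
    # which the connector expects
    rows = [(str(r.country), str(r.country_code), int(r.country_ids))
            for r in countries.itertuples(index=False)]

    # insert row by row, committing the whole transaction at the end
    for row in rows:
        cursor.execute(
            "INSERT INTO countries (country, country_code, country_ids) "
            "VALUES (%s, %s, %s)",
            row,
        )
    conn.commit()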

Before populating the rest of the tables, I would like to show how this
looks in MySQL.

Let’s do a simple select query to show that this data has now moved
from my Python Jupyter notebook into MySQL.
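For example:

    cursor.execute("SELECT * FROM countries LIMIT 5")
    for row in cursor.fetchall():
        print(row)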
After populating the rest of the tables, I’ll use only MySQL when I work
with the data. This is the whole point of this part of the project: to store
the data in a tool designed to query data, such as MySQL.

Let’s continue with the population of the tables. I’ll do the same for the
rest of the tables.

Customer_cc

Products

Departments

Customers

Employees

Payments

And that’s it. Now all the data is stored in MySQL in a normalized
schema that can be used to query and extract data. Let’s create an ERD
using MySQL to see the finished schema:

That’s it for part 2 of the series.
In part 3, I’ll design a data warehouse that the organization will use as
a single source of data for analytical purposes and BI-related
decision-making and reporting.
