
https://blog.devgenius.io/data-engineering-project-retail-store-part-2-loading-the-data-7c15c9c387e4

Data Engineering Project — Retail Store Part 2 — Loading the Data

Bar Dadon · Dev Genius · Mar 13, 2022
Introduction

This is the second part of the “Data Engineering Project — Retail Store”
series. After acquiring the whisky data for the retail store in part 1, the
next step is to continue the ETL process and load the data into the
organization’s central database.

Project Steps

1. Generate Random Data. In this part, I will use Python to generate random data about various parts of the organization.

2. Design a Central RDBMS and apply normalization.

3. Load the data into the central RDBMS.

The first part of the series focused on extracting the product data. At
this point, I have a dataframe with all the data about whisky I could
scrape from https://www.thewhiskyexchange.com/.

You can view the first part of the series here:
https://medium.com/@bdadon50/data-engineering-project-retail-store-part-1-web-scraping-a99ac5d6d44c.

As a reminder, this is where I stopped last time.


Step #1 — Generate Random Data

First, I will generate random data about employees, customers, payments, and products.

Imports and Functions
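A minimal sketch of what the imports might look like; Faker (for fake personal data) and uuid (for unique ids) are assumptions here, not necessarily the exact libraries used:

    import random
    import uuid

    import pandas as pd
    from faker import Faker  # assumed for generating fake names and values

    fake = Faker()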

1. Generating Product Data

First, I’ll load the CSV file that contains the products I extracted in part 1
into a dataframe.

Next, I’ll generate a column of unique product ids. This will act as the
primary key for the products table.

Let’s see a sample of the dataframe.
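A minimal sketch of these three steps (the CSV file name is an assumption):

    # load the whisky products scraped in part 1 (file name assumed)
    products = pd.read_csv("whisky_products.csv")

    # one unique id per row, acting as the primary key of the products table
    products["product_id"] = [str(uuid.uuid4()) for _ in range(len(products))]

    products.head()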

I will quickly generate completely random data about employees,
customers, and payments using Python in the following steps. If you
are not interested in the code, feel free to skip this part and
move straight to Step #2 — Designing the Central RDBMS
and Normalizing the Data.

2. Generating Employee Data
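A sketch of what this could look like; the department column is inferred from the normalization discussion below, while the department names and table size are placeholders:

    n_employees = 50  # assumed size

    # four departments exist in the organization (per the 2NF discussion
    # below); the names themselves are placeholders
    department_names = ["Sales", "Marketing", "Finance", "HR"]

    employees = pd.DataFrame({
        "employee_id": range(1, n_employees + 1),
        "first_name": [fake.first_name() for _ in range(n_employees)],
        "last_name": [fake.last_name() for _ in range(n_employees)],
        "department": [random.choice(department_names) for _ in range(n_employees)],
    })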

3. Generating Customer Data
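A sketch along the same lines; the columns follow the normalization discussion below (country, country_code, credit_provider), and the sample values are placeholders:

    n_customers = 200  # assumed size

    # a small country -> dialing-code map keeps country_code functionally
    # dependent on country, which the 2NF step below relies on (sample values)
    country_map = {"United Kingdom": "+44", "United States": "+1", "France": "+33"}
    chosen = [random.choice(list(country_map)) for _ in range(n_customers)]

    customers = pd.DataFrame({
        "customer_id": range(1, n_customers + 1),
        "first_name": [fake.first_name() for _ in range(n_customers)],
        "last_name": [fake.last_name() for _ in range(n_customers)],
        "country": chosen,
        "country_code": [country_map[c] for c in chosen],
        "credit_provider": [random.choice(["Visa", "Mastercard", "Amex"])
                            for _ in range(n_customers)],
    })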

4. Generating Payments Data
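A sketch of the payments table, which ties customers, products, and employees together; the exact columns are assumptions:

    n_payments = 1000  # assumed size

    payments = pd.DataFrame({
        "payment_id": range(1, n_payments + 1),
        "customer_id": random.choices(customers["customer_id"].tolist(), k=n_payments),
        "product_id": random.choices(products["product_id"].tolist(), k=n_payments),
        "employee_id": random.choices(employees["employee_id"].tolist(), k=n_payments),
        "payment_date": [fake.date_this_year() for _ in range(n_payments)],
        "amount": [round(random.uniform(20, 500), 2) for _ in range(n_payments)],
    })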


Step #2 — Designing the Central RDBMS and Normalizing the Data

After generating random data, I can start with the design of the central
database. The database will act as the organization’s primary source of
data, and thus, we should strive to maximize the consistency and
integrity of the data within this database.

To do that, I will normalize each table, minimizing redundancy and
maximizing the integrity of the database.

I’ll normalize each table to meet the requirements of the first, second,
and then third normal form (3NF).

Also, I will normalize the tables first in Python in a Jupyter Notebook,
and only when everything is done will I load the entire schema into the
database.

The current database looks like this:


Normalizing Tables

Now that I have generated all the de-normalized tables, I can


normalize them to meet the demands for the third normalized
form(3NF).

Let’s quickly go over the requirements for each normal form:

First Normal Form (1NF)

1. Each row needs to be unique.

2. Each cell may contain only one value.

Second Normal Form (2NF)

1. The table must already meet the requirements for 1NF.

2. Groups of repeated values that apply to multiple rows should be separated into their own tables.

Third Normal Form (3NF)

1. The table must already meet the requirements for 2NF.

2. Each attribute needs to depend only on the primary key.

1. Normalizing the Customers Table

I’ll start by normalizing the customers table. First, let’s check whether
this table meets the requirements for 1NF.

1NF
Right now, each row is unique, and each cell contains exactly one
value, so this table already meets the requirements for the first normal
form (1NF).

2NF

The requirement for the second normal form is to separate groups of
values that apply to multiple rows.

For example, country, country_code, and credit_provider do not have
a lot of cardinality and apply to multiple rows. This is an example of
redundancy in the table.

I’ll split this table into three separate tables (customers, countries, and
customer_cc) to meet the requirements for 2NF.

First Table — countries

After creating the new table, I need to connect it to the customers table.
To do that, let’s create a foreign key called country_ids.

Now, I can drop the columns country and country_code from
customers without losing data, since the data is kept in the separate
countries table.
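A minimal sketch of this whole step in pandas (the key name country_ids follows the text; the exact code is an assumption):

    # build the countries lookup from the unique country / country_code pairs
    countries = (customers[["country", "country_code"]]
                 .drop_duplicates()
                 .reset_index(drop=True))
    countries["country_ids"] = countries.index + 1

    # attach the foreign key to customers, then drop the redundant columns
    customers = customers.merge(countries, on=["country", "country_code"])
    customers = customers.drop(columns=["country", "country_code"])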
Done. Let’s continue separating the customers table. This time, I’ll
create a table called customer_cc, which will hold the information
about customers’ credit card providers.

Second Table — Customer_cc

Same as before, I need to connect it to customers. Let’s create a foreign
key called credit_provider_id.

Now, I can drop the column credit_provider from customers.
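The same pattern as a sketch (again, the exact code is an assumption):

    # lookup table of the distinct credit card providers
    customer_cc = (customers[["credit_provider"]]
                   .drop_duplicates()
                   .reset_index(drop=True))
    customer_cc["credit_provider_id"] = customer_cc.index + 1

    # attach the foreign key, then drop the redundant column
    customers = customers.merge(customer_cc, on="credit_provider")
    customers = customers.drop(columns=["credit_provider"])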

And now the customers table is in 2NF. Let’s see how the database
looks now:

3NF

Now let’s check if the customers table meets the third normal form.

The requirement for 3NF is that each column is dependent only on the
primary key. In this case, the primary key is customer_id. Now, to
understand if a column is dependent only on the primary key, we
should ask ourselves two questions:

Q1. “Given a primary key, can I tell the value of that column?”

If the answer is yes, then we move to the second question.

Q2. “Given any other column, can I tell the value of that column?”

If the answer is no, we can safely say that this column is dependent
only on the primary key.

For example, let’s check if the column first_name is dependent only on
the primary key. Let’s ask:

Q1. Given a customer id, can I know the name of that customer?

Answer: Yes. Knowing the customer id will pinpoint the customer’s name.

Q2. Given any other column, can I know the name of the customer?

Answer: No. By looking at the last name, country, or any other
column, I cannot pinpoint the customer’s name. Multiple people may
have the same last name and the same country.

So, in conclusion, the column first_name does belong in that table,
provided I intend to normalize the customers table up to 3NF.

2. Normalizing the Employees Table

Let’s quickly do the same process for the employees table. As a
reminder, the table looks like this:

1NF

Same as before, each row is unique, and each cell contains exactly one
value, so this table is already in the first normal form (1NF).

2NF

To check the 2NF requirements, let’s ask: does this table have values
that apply to multiple rows?

Yes. Department is a repeated value.

There are only four departments in the organization, so this column
causes data redundancy. Let’s divide this table into two tables: an
employees table and a departments table.

departments table

The new table looks like this:


After creating the new table, I need to connect it to employees. Same as
before, I’ll make a foreign key to join them.

Now, I can drop the column department from employees.
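A sketch of this step, following the same pattern as the customers table (the exact code is an assumption):

    # lookup table of the four departments
    departments = (employees[["department"]]
                   .drop_duplicates()
                   .reset_index(drop=True))
    departments["department_id"] = departments.index + 1

    # attach the foreign key, then drop the redundant column
    employees = employees.merge(departments, on="department")
    employees = employees.drop(columns=["department"])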

That’s it for 2NF. Let’s check the requirements for 3NF:

3NF

The employee table now looks like this:

As you can see, each column is dependent only on the primary key,
which is employee_id, so this table meets the 3NF.

I normalized the rest of the database using the same principles. This is
what the new normalized database looks like:

Reminder: The steps up till now were done solely in Python. Now
that the tables are normalized, I’ll connect to the database and load the
data according to my design. The tool I used as my database is MySQL.

Step #3 — Loading the Data into the Central RDBMS
1. Connecting Python to MySQL
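A minimal sketch using the mysql-connector-python driver (the driver choice and the credentials are assumptions):

    import mysql.connector

    # credentials are placeholders
    conn = mysql.connector.connect(
        host="localhost",
        user="root",
        password="my_password",
    )
    cursor = conn.cursor()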

2. Creating a new Schema
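For example (the schema name retail_store is an assumption):

    # create the schema and switch to it
    cursor.execute("CREATE DATABASE IF NOT EXISTS retail_store")
    cursor.execute("USE retail_store")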

3. Generating empty tables

First, I’ll start with the tables without a foreign key: countries,
customer_cc, products, and departments.

Countries
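A sketch of the DDL for countries; the other tables without foreign keys follow the same pattern (column names follow the text, types are assumptions):

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS countries (
            country_ids INT PRIMARY KEY,
            country VARCHAR(100),
            country_code VARCHAR(10)
        )
    """)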

Customer_cc

Products

Departments

Now I can create the rest of the tables: customers, employees, and
payments.

Customers
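A sketch of a table that carries foreign keys (types and constraint details are assumptions):

    cursor.execute("""
        CREATE TABLE IF NOT EXISTS customers (
            customer_id INT PRIMARY KEY,
            first_name VARCHAR(100),
            last_name VARCHAR(100),
            country_ids INT,
            credit_provider_id INT,
            FOREIGN KEY (country_ids) REFERENCES countries (country_ids),
            FOREIGN KEY (credit_provider_id)
                REFERENCES customer_cc (credit_provider_id)
        )
    """)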

Employees

Payments

4. Populating the tables


To insert the dataframes I created in Python into MySQL, I need to
transform them. I’ll convert each row of each table into a tuple and
insert the rows one by one using a for loop, committing the entire
transaction when the loop ends.

Countries
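A sketch of the insertion loop for countries (the exact statement is an assumption):

    # cast each dataframe row to a tuple of native Python types,
    # which the connector expects
    rows = [(str(r.country), str(r.country_code), int(r.country_ids))
            for r in countries.itertuples(index=False)]

    # insert row by row, committing the whole transaction at the end
    for row in rows:
        cursor.execute(
            "INSERT INTO countries (country, country_code, country_ids) "
            "VALUES (%s, %s, %s)",
            row,
        )
    conn.commit()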

Before populating the rest of the tables, I would like to show how this
looks in MySQL.

Let’s do a simple select query to show that this data has now moved
from my Python Jupyter notebook into MySQL.
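For example:

    cursor.execute("SELECT * FROM countries LIMIT 5")
    for row in cursor.fetchall():
        print(row)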
After populating the rest of the tables, I’ll use only MySQL when I work
with the data. This is the whole point of this part of the project: to store
the data in a tool designed to query data, such as MySQL.

Let’s continue with the population of the tables. I’ll do the same for the
rest of the tables.

Customer_cc

Products

Departments

Customers

Employees

Payments

And that’s it. Now all the data is stored in MySQL in a normalized
schema that can be used to query and extract data. Let’s create an ERD
using MySQL to see the finished schema:

That’s it for part 2 of the series.
In part 3, I’ll design a data warehouse that the organization will use as
a single source of data for analytical purposes and BI-related
decision-making and reporting.
