
DATA 201

ASSIGNMENT # 3
CLEANING DATA
Sydney Pratte
Lab 07
Shahood Farooq Ranjha
Process of cleaning the 1st of the 5 types of dirty entities

Figure 1.1 - Dirty Data

Figure 1.2 – Process of cleaning the data


Figure 1.3 – Clean Data

The data quality issue being cleaned here is missing data. The name column is redundant: it is entirely empty and we do not need it, as shown in Figure 1.1. Keeping it in the dataset increases the time needed to analyze the data and could lead to inaccurate results or conclusions. Since it holds no actual data, removing this column makes the dataset higher quality and easier and quicker to understand and analyze. I used Excel to clean this data by deleting the entire name column (Figure 1.2); the cleaned data is shown in Figure 1.3. (We can do the same for the "keywords" and "language" columns.)
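As a rough equivalent of the Excel step in Figure 1.2, the pandas sketch below drops the empty name column (and, optionally, any other fully empty columns such as keywords and language). This is an alternative approach, not the one shown in the figures, and the file name menus.csv is a placeholder for the lab dataset.

```python
import pandas as pd

# Placeholder file name; substitute the actual CSV used in the lab.
df = pd.read_csv("menus.csv")

# Drop the empty "name" column; errors="ignore" keeps this safe if it is already gone.
df = df.drop(columns=["name"], errors="ignore")

# Alternatively, drop every column that is completely empty
# (this would also cover "keywords" and "language").
df = df.dropna(axis=1, how="all")

df.to_csv("menus_clean.csv", index=False)
```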

Process of cleaning the 2nd of the 5 types of dirty entities

Figure 2.1 – Dirty Data


Figure 2.2 – Process of Cleaning the data

Sort the data alphabetically.

Delete the duplicates.

Figure 2.3 – Clean Data

The data quality issue being cleaned here is entity resolution. The sponsor column had many duplicate entries, in other words, more than one entry for the same thing. Cleaning this matters because cleaning takes up most of the time in the whole data collection process, and if the duplicates are not removed we waste time analyzing the same data repeatedly, which lowers the data's quality and keeps us from reaching our goal. Two factors matter most when presenting data: accuracy, and the time it takes to analyze the data. With duplicates present, the data is not accurate and is hard to understand clearly, and extra time must be spent during analysis working around the repeated entries. It is therefore crucial to eliminate duplicates during cleaning so that the data is accurate and analysis is quicker and more efficient. To clean the data, I first highlighted the duplicates using conditional formatting in Excel, then eliminated them as shown in Figure 2.2; the resulting data is shown in Figure 2.3. This column is still not entirely clean (a question mark remains in it), but the dirty entity addressed here was duplicates, and those were eliminated.
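For comparison with the Excel steps above (sort, then delete), here is a small pandas sketch that removes rows repeating a sponsor value. It is a simplification of the manual process in Figure 2.2, not the method actually used, and the file name is a placeholder.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Sort sponsors alphabetically, mirroring the manual Excel step.
df = df.sort_values("sponsor")

# Inspect which rows repeat an earlier sponsor value before removing anything.
repeats = df[df.duplicated(subset=["sponsor"], keep="first")]
print(repeats["sponsor"].head())

# Keep only the first occurrence of each sponsor value.
# Note: this drops whole rows, which may be stricter than the manual cleanup.
df = df.drop_duplicates(subset=["sponsor"], keep="first")
```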

Process of cleaning the 3rd of the 5 types of dirty entities

Figure 3.1 – Dirty Data

Figure 3.1.1
Figure 3.2 – Process of cleaning the data

Figure 3.3 – Clean Data

The data quality issue being cleaned here is data integration. In the date column (Figure 3.1) we can see differences in the format in which the dates are written, as shown in Figure 3.1.1. This likely comes from collecting data from different websites, each presenting dates in its own way, which leads to inconsistent formats when the sources are integrated. Left uncleaned, this becomes chaotic: the person cleaning the data might not be the one analyzing it, so the analyst would have to spend considerable time interpreting the dates to get accurate results. It also causes problems when sorting the data. For example, if we want to find which date occurred first but the format differs from row to row (some put the year first, some the day, some the month), the analyst will not get correct results, leading to poor decisions and a lot of frustration. For these reasons, and to save time and money, the column needs to be cleaned; I used OpenRefine for this, as shown in Figure 3.2. Figure 3.3 shows a screenshot of the cleaned date data: all dates now share the same format, making analysis more effective and improving the data's quality and accuracy.
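A pandas sketch of the same idea, standardizing the mixed date formats into a single format, is shown below. It is not the OpenRefine transform from Figure 3.2; the file name is a placeholder, and the format="mixed" option assumes pandas 2.0 or newer.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Parse the mixed date strings; format="mixed" (pandas >= 2.0) parses each entry on its own,
# and errors="coerce" turns anything unparseable into NaT so it can be reviewed separately.
df["date"] = pd.to_datetime(df["date"], format="mixed", errors="coerce")

# Write every date back in one consistent format (ISO 8601: year-month-day).
df["date"] = df["date"].dt.strftime("%Y-%m-%d")
```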
Process of cleaning the 4th of the 5 types of dirty entities

Figure 4.1 – Dirty Data

Figure 4.2 – Process of cleaning the data

Figure 4.3 – Clean Data

The data quality issue being cleaned here is type conversion. The dish_count column contains numeric data, but it is stored in text format, as shown in Figure 4.1. This can lead to large errors when analyzing the data; for example, if we are asked to find the mean dish count, we cannot, because the values must be in number format before any formulas can be run on them. To draw accurate conclusions and save time during analysis, it is critical to clean the data and convert its format. The cleaning process is shown in Figure 4.2 and the final cleaned data in Figure 4.3. Now that the data is in numeric format, we can run formulas on it and analyze it properly. (The same applies to the page_count column.)
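As a rough pandas equivalent of the conversion in Figure 4.2, the sketch below turns the text-formatted counts into numbers so that formulas such as the mean work. The file name is a placeholder, and pandas is an alternative tool rather than the one used in the figures.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Convert text-formatted counts to numbers; errors="coerce" marks non-numeric entries as NaN.
df["dish_count"] = pd.to_numeric(df["dish_count"], errors="coerce")
df["page_count"] = pd.to_numeric(df["page_count"], errors="coerce")

# Numeric formulas now work, e.g. the mean dish count (NaN entries are skipped).
print(df["dish_count"].mean())
```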

Process of cleaning the 5th of the 5 types of dirty entities

Figure 5.1 – Dirty Data

Figure 5.2 – Process of cleaning the data

Select Cluster and Edit.


Cluster the data to make it more understandable.

Figure 5.3 – Clean Data

The data quality issue being cleaned here is erroneous values. The event column showed a lot of variation in how events were recorded; for example, some rows had DINNER whereas others had [DINNER], as shown in Figure 5.1. Recording the same event in inconsistent formats can cause major confusion during analysis, making the process time-consuming and more costly. It can also produce inaccurate results when running a formula: for example, if we want to find the most frequently occurring event, the inconsistent labels will give erroneous counts, lowering the survey's quality and credibility. It is therefore very important to clean this error so we can save time and get accurate results. The process used to clean this data is shown in Figure 5.2 and the cleaned data in Figure 5.3. I used OpenRefine, as its clustering feature is very efficient and useful for this task.
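OpenRefine's cluster-and-edit handles this interactively; as an alternative sketch, the pandas snippet below normalizes the event labels so that variants such as DINNER and [DINNER] collapse into one value. The file name and the exact set of punctuation characters stripped are assumptions, not taken from the lab.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Normalize event labels: trim whitespace, strip surrounding punctuation such as brackets,
# and unify letter case, so variants like "[DINNER]" and "Dinner" become "DINNER".
df["event"] = (
    df["event"]
    .astype("string")
    .str.strip()
    .str.strip("[](),.;:")
    .str.upper()
)

# The most frequently occurring event can now be counted accurately.
print(df["event"].value_counts().head())
```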
