
DATA 201

ASSIGNMENT # 3
CLEANING DATA
Sydney Pratte
Lab 07
Shahood Farooq Ranjha
Process of cleaning the 1st of the 5 types of dirty entities

Figure 1.1 - Dirty Data

Figure 1.2 – Process of cleaning the data


Figure 1.3 – Clean Data

The data quality issue being cleaned here is missing data. The name column is redundant: it is entirely empty and we do not need it, as shown in Figure 1.1. Keeping it in the dataset increases the time needed to analyze the data and could lead to inaccurate results or conclusions. Since it holds no actual data, removing this column makes the dataset higher quality and easier and quicker to understand and analyze. I used Excel to clean this data by deleting the entire name column (Figure 1.2); the cleaned data is shown in Figure 1.3. (We can do the same for the "keywords" and "language" columns.)
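As a rough equivalent of the Excel step in Figure 1.2, the pandas sketch below drops the empty name column (and, optionally, any other fully empty columns such as keywords and language). This is an alternative approach, not the one shown in the figures, and the file name menus.csv is a placeholder for the lab dataset.

```python
import pandas as pd

# Placeholder file name; substitute the actual CSV used in the lab.
df = pd.read_csv("menus.csv")

# Drop the empty "name" column; errors="ignore" keeps this safe if it is already gone.
df = df.drop(columns=["name"], errors="ignore")

# Alternatively, drop every column that is completely empty
# (this would also cover "keywords" and "language").
df = df.dropna(axis=1, how="all")

df.to_csv("menus_clean.csv", index=False)
```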

Process of cleaning the 2nd of the 5 types of dirty entities

Figure 2.1 – Dirty Data


Figure 2.2 – Process of Cleaning the data

Sort the data alphabetically.

Delete the duplicates.

Figure 2.3 – Clean Data

The data quality issue being cleaned here is entity resolution. The sponsor column had many duplicate entries, in other words, more than one entry for the same thing. Cleaning this matters because cleaning takes up most of the time in the whole data collection process, and if the duplicates are not removed we waste time analyzing the same data repeatedly, which lowers the data's quality and keeps us from reaching our goal. Two factors matter most when presenting data: accuracy, and the time it takes to analyze the data. With duplicates present, the data is not accurate and is hard to understand clearly, and extra time must be spent during analysis working around the repeated entries. It is therefore crucial to eliminate duplicates during cleaning so that the data is accurate and analysis is quicker and more efficient. To clean the data, I first highlighted the duplicates using conditional formatting in Excel, then eliminated them as shown in Figure 2.2; the resulting data is shown in Figure 2.3. This column is still not entirely clean (a question mark remains in it), but the dirty entity addressed here was duplicates, and those were eliminated.
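For comparison with the Excel steps above (sort, then delete), here is a small pandas sketch that removes rows repeating a sponsor value. It is a simplification of the manual process in Figure 2.2, not the method actually used, and the file name is a placeholder.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Sort sponsors alphabetically, mirroring the manual Excel step.
df = df.sort_values("sponsor")

# Inspect which rows repeat an earlier sponsor value before removing anything.
repeats = df[df.duplicated(subset=["sponsor"], keep="first")]
print(repeats["sponsor"].head())

# Keep only the first occurrence of each sponsor value.
# Note: this drops whole rows, which may be stricter than the manual cleanup.
df = df.drop_duplicates(subset=["sponsor"], keep="first")
```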

Process of cleaning the 3rd of the 5 types of dirty entities

Figure 3.1 – Dirty Data

Figure 3.1.1
Figure 3.2 – Process of cleaning the data

Figure 3.3 – Clean Data

The data quality issue being cleaned here is data integration. In the date column (Figure 3.1) we can see differences in the format in which the dates are written, as shown in Figure 3.1.1. This likely comes from collecting data from different websites, each presenting dates in its own way, which leads to inconsistent formats when the sources are integrated. Left uncleaned, this becomes chaotic: the person cleaning the data might not be the one analyzing it, so the analyst would have to spend considerable time interpreting the dates to get accurate results. It also causes problems when sorting the data. For example, if we want to find which date occurred first but the format differs from row to row (some put the year first, some the day, some the month), the analyst will not get correct results, leading to poor decisions and a lot of frustration. For these reasons, and to save time and money, the column needs to be cleaned; I used OpenRefine for this, as shown in Figure 3.2. Figure 3.3 shows a screenshot of the cleaned date data: all dates now share the same format, making analysis more effective and improving the data's quality and accuracy.
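A pandas sketch of the same idea, standardizing the mixed date formats into a single format, is shown below. It is not the OpenRefine transform from Figure 3.2; the file name is a placeholder, and the format="mixed" option assumes pandas 2.0 or newer.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Parse the mixed date strings; format="mixed" (pandas >= 2.0) parses each entry on its own,
# and errors="coerce" turns anything unparseable into NaT so it can be reviewed separately.
df["date"] = pd.to_datetime(df["date"], format="mixed", errors="coerce")

# Write every date back in one consistent format (ISO 8601: year-month-day).
df["date"] = df["date"].dt.strftime("%Y-%m-%d")
```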
Process of cleaning the 4th of the 5 types of dirty entities

Figure 4.1 – Dirty Data

Figure 4.2 – Process of cleaning the data

Figure 4.3 – Clean Data

The data quality issue being cleaned here is type conversion. The dish_count column contains numeric data, but it is stored in text format, as shown in Figure 4.1. This can lead to large errors when analyzing the data; for example, if we are asked to find the mean dish count, we cannot, because the values must be in number format before any formulas can be run on them. To draw accurate conclusions and save time during analysis, it is critical to clean the data and convert its format. The cleaning process is shown in Figure 4.2 and the final cleaned data in Figure 4.3. Now that the data is in numeric format, we can run formulas on it and analyze it properly. (The same applies to the page_count column.)
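As a rough pandas equivalent of the conversion in Figure 4.2, the sketch below turns the text-formatted counts into numbers so that formulas such as the mean work. The file name is a placeholder, and pandas is an alternative tool rather than the one used in the figures.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Convert text-formatted counts to numbers; errors="coerce" marks non-numeric entries as NaN.
df["dish_count"] = pd.to_numeric(df["dish_count"], errors="coerce")
df["page_count"] = pd.to_numeric(df["page_count"], errors="coerce")

# Numeric formulas now work, e.g. the mean dish count (NaN entries are skipped).
print(df["dish_count"].mean())
```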

Process of cleaning the 5th of the 5 types of dirty entities

Figure 5.1 – Dirty Data

Figure 5.2 – Process of cleaning the data

Select Cluster and Edit.


Cluster the data to make it more understandable.

Figure 5.3 – Clean Data

The data quality issue being cleaned here is erroneous values. The event column showed a lot of variation in how events were recorded; for example, some rows had DINNER whereas others had [DINNER], as shown in Figure 5.1. Recording the same event in inconsistent formats can cause major confusion during analysis, making the process time-consuming and more costly. It can also produce inaccurate results when running a formula: for example, if we want to find the most frequently occurring event, the inconsistent labels will give erroneous counts, lowering the survey's quality and credibility. It is therefore very important to clean this error so we can save time and get accurate results. The process used to clean this data is shown in Figure 5.2 and the cleaned data in Figure 5.3. I used OpenRefine, as its clustering feature is very efficient and useful for this task.
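OpenRefine's cluster-and-edit handles this interactively; as an alternative sketch, the pandas snippet below normalizes the event labels so that variants such as DINNER and [DINNER] collapse into one value. The file name and the exact set of punctuation characters stripped are assumptions, not taken from the lab.

```python
import pandas as pd

df = pd.read_csv("menus.csv")  # placeholder file name

# Normalize event labels: trim whitespace, strip surrounding punctuation such as brackets,
# and unify letter case, so variants like "[DINNER]" and "Dinner" become "DINNER".
df["event"] = (
    df["event"]
    .astype("string")
    .str.strip()
    .str.strip("[](),.;:")
    .str.upper()
)

# The most frequently occurring event can now be counted accurately.
print(df["event"].value_counts().head())
```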
