Data Cleaning

Data Cleaning
• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
• Data cleansing improves data quality and helps provide more accurate, consistent and reliable information for decision-making in an organization.
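• A minimal sketch of such cleaning in Python with pandas (the customer table below is made up for illustration):

import pandas as pd

# Made-up customer records containing a duplicate, a missing name and a missing age.
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Bob", "Eve", None],
    "email": ["alice@example.com", "bob@example.com", "bob@example.com",
              "eve@example.com", "noname@example.com"],
    "age":   [34, 29, 29, None, 41],
})

df = df.drop_duplicates()                          # remove exact duplicate rows
df = df.dropna(subset=["name"])                    # drop rows missing a mandatory field
df["age"] = df["age"].fillna(df["age"].median())   # fill remaining missing ages
print(df)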
Why Data Cleaning is Necessary
• Data cleaning might seem uninteresting, but it is one of the most important tasks you will do as a data science professional. Wrong or poor-quality data can be detrimental to your processes and analysis, and poor data can cause even a stellar algorithm to fail.
• On the other hand, high-quality data can make a simple algorithm give outstanding results. There are many data cleaning techniques, and you should become familiar with them to improve your data quality. Not all data is useful, which is another major factor that affects data quality. Poor-quality data can come from many sources.
Cont..
• Usually these problems are the result of human error, but they can also arise when a lot of data is combined from different sources. Multichannel data is not only important, it is the norm, so as a data scientist you can expect errors in this type of data. They can cause incorrect insights in your project and sidetrack your data analysis process. This is why data cleaning methods are so important in data analysis.
Reasons why data cleaning is essential
• Efficiency
• Having clean data (free from wrong and inconsistent values) helps you perform your analysis much faster. You save a considerable amount of time by doing this task beforehand, and by cleaning your data before using it you avoid many errors. If you use data containing false values, your results won't be accurate. In practice, a data scientist spends significantly more time cleaning and purifying data than analyzing it.
Error Margin
• When you don't use accurate data for analysis, you will surely make mistakes. Suppose you've put a lot of effort and time into analyzing a specific group of datasets. You are eager to show the results to your superior, but in the meeting your superior points out several mistakes, and the situation gets embarrassing and painful.
• Wouldn't you want to avoid such mistakes? Not only do they cause embarrassment, they also waste resources. Data cleansing helps you in that regard. It is a widespread practice, and you should learn the methods used to clean your data.
Determining Data Quality
Is The Data Valid? (Validity)
• The validity of your data is the degree to which it follows the rules of your particular requirements. For example, you had to import the phone numbers of different customers, but in some places email addresses were entered instead. Because your requirement was explicitly for phone numbers, the email addresses would be invalid.
• Validity errors occur when the input method isn't properly inspected. You might be collecting your data in spreadsheets, and the wrong information might be entered in the cells of the spreadsheet.
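• As a rough sketch of such a validity check in pandas (the column and values are hypothetical), you can flag entries in a phone-number column that look like email addresses:

import pandas as pd

# Hypothetical column that should contain only phone numbers.
contacts = pd.DataFrame({"phone": ["9876543210", "alice@example.com", "9123456780"]})

# Entries containing '@' look like email addresses and violate the "phone numbers only" rule.
invalid = contacts["phone"].str.contains("@", na=False)
print(contacts[invalid])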
Range
• Some types of numbers have to fall within a specific range. For example, the number of products you can transport in a day must have a minimum and maximum value. There is a particular range for such data, with a defined starting point and end point.
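• A small sketch of a range check in pandas (the 0 to 500 limit is an assumed example):

import pandas as pd

# Hypothetical daily shipment counts; assume the valid range is 0 to 500.
shipments = pd.Series([120, 480, -3, 950, 200], name="products_per_day")

out_of_range = ~shipments.between(0, 500)   # True where a value violates the range
print(shipments[out_of_range])              # -3 and 950 fall outside the allowed range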
Data-Type
• Some data cells might require a specific kind
of data, such as numeric, Boolean, etc. For
example, in a Boolean section, you wouldn’t
add a numerical value.
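• One way to sketch a data-type check in pandas (the values are made up) is to coerce a column to the expected type and flag whatever fails:

import pandas as pd

# A column that should be numeric but contains a text entry.
qty = pd.Series(["10", "25", "seven", "3"], name="quantity")

# Coerce to numbers; anything that isn't numeric becomes NaN and can be flagged.
as_numbers = pd.to_numeric(qty, errors="coerce")
print(qty[as_numbers.isna()])   # "seven" fails the numeric data-type check

• The same idea applies to Boolean or date columns: convert to the expected type and inspect whatever does not convert.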
Compulsory constraints
• In every scenario, there are some mandatory constraints your data should follow. The compulsory restrictions depend on your specific needs. Certain columns of your data simply shouldn't be empty; for example, in the list of your clients, the 'name' column can't be empty.
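• A minimal sketch of checking such a constraint in pandas (the client table is invented):

import pandas as pd

# Client records where the 'name' column must never be empty.
clients = pd.DataFrame({
    "name": ["Acme Ltd", None, "Globex", ""],
    "city": ["Pune", "Mumbai", None, "Delhi"],
})

# Treat both missing values and empty strings as violations of the constraint.
missing_name = clients["name"].isna() | (clients["name"].str.strip() == "")
print(clients[missing_name])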
Cross-field examination
• There are certain conditions that affect multiple fields of data in a particular form. For example, the arrival time of a flight can't be earlier than its departure time. In a balance sheet, the sum of the debits and the sum of the credits of a client must be the same; they can't differ.
• These values are related to each other, which is why you might need to perform a cross-field examination.
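• A brief sketch of a cross-field check in pandas (the flight data is hypothetical):

import pandas as pd

# Flight records: the arrival time should never be earlier than the departure time.
flights = pd.DataFrame({
    "flight":    ["AI101", "AI202"],
    "departure": pd.to_datetime(["2024-05-01 09:00", "2024-05-01 14:00"]),
    "arrival":   pd.to_datetime(["2024-05-01 11:30", "2024-05-01 13:00"]),
})

bad_rows = flights[flights["arrival"] < flights["departure"]]
print(bad_rows)   # the second flight violates the cross-field rule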
Unique Requirements
• Particular types of data have uniqueness restrictions. Two customers can't have the same customer support ticket. Such data must be unique to a particular record and can't be shared by multiple ones.
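• A small sketch of a uniqueness check in pandas (the ticket table is made up):

import pandas as pd

# Support tickets: each ticket id must belong to exactly one customer.
tickets = pd.DataFrame({
    "ticket_id": [1001, 1002, 1002, 1003],
    "customer":  ["Ana", "Ben", "Cara", "Ben"],
})

# Ticket ids that appear more than once break the uniqueness requirement.
print(tickets[tickets["ticket_id"].duplicated(keep=False)])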
Set-Membership Restrictions
• Some values are restricted to a particular set. For example, gender can only be Male, Female or Unknown.
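• A minimal sketch of a set-membership check in pandas (the values are invented):

import pandas as pd

# Gender is restricted to a fixed set of allowed values.
allowed = {"Male", "Female", "Unknown"}
gender = pd.Series(["Male", "female", "Unknown", "N/A"], name="gender")

outside_set = ~gender.isin(allowed)
print(gender[outside_set])   # "female" (wrong case) and "N/A" are not in the allowed set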
Regular Patterns
• Some pieces of data follow a specific format. For example, email addresses follow a 'name@domain.com' pattern, and phone numbers have ten digits.
• If the data isn't in the required format, it is also invalid. If a person omits the '@' while entering an email address, the email address would be invalid, wouldn't it?
• Checking the validity of your data is the first step to determining its quality. Most of the time, invalid entries are caused by human error.
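• A rough sketch of format checks with regular expressions in pandas (the patterns below are simplified illustrations, not production-grade validators):

import pandas as pd

# Hypothetical contact data with one malformed email and one short phone number.
contacts = pd.DataFrame({
    "email": ["ravi@example.com", "meena.example.com"],
    "phone": ["9876543210", "12345"],
})

# Simple illustrative patterns: 'text@text.text' for emails, exactly ten digits for phones.
valid_email = contacts["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
valid_phone = contacts["phone"].str.match(r"^\d{10}$")
print(contacts[~valid_email | ~valid_phone])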

Cont..
• Getting rid of invalid data helps you streamline your process and avoid useless data values beforehand.
Consistency
• You can measure consistency by comparing two similar systems, or you can check the data values within the same dataset to see whether they are consistent. Consistency can be relational. For example, a customer's age might be recorded as 15, which is a valid value and could be accurate, but the same customer might also be marked as a senior citizen in the same system.
Next
• In such cases, you'll need to cross-check the data, much as you would when measuring accuracy, and see which value is true. Is the client 15 years old, or a senior citizen? Only one of these values can be true.
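• A short sketch of such a consistency check in pandas (the 60-and-over senior-citizen threshold is an assumption for illustration):

import pandas as pd

# Customer records where 'age' and the 'senior_citizen' flag should agree.
customers = pd.DataFrame({
    "customer_id":    [1, 2, 3],
    "age":            [15, 67, 40],
    "senior_citizen": [True, True, False],
})

# Rows where the two fields contradict each other need cross-checking.
inconsistent = customers[(customers["age"] < 60) & customers["senior_citizen"]]
print(inconsistent)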
There are multiple ways to make your data
consistent
• Check different systems
• You can look at another similar system to find out whether the value you have is real or not. If two of your systems contradict each other, it can help to check a third one.
• In our previous example, suppose you check a third system and find that the customer's age is 65. This shows that the second system, which said the customer is a senior citizen, is the one that holds.
Check the latest data
• Another way to improve the consistency of your data is to check the more recent value, which can be more useful in specific scenarios. You might have two different contact numbers for a customer in your records; the most recent one is probably more reliable, because it's possible the customer switched numbers.
Check the source
• The most fool-proof way to check the reliability of the data is to simply contact the source. In our example of the customer's age, you could contact the customer directly and ask their age. However, this isn't possible in every scenario, and directly contacting the source can be tricky: maybe the customer doesn't respond, or their contact information isn't available.
Uniformity
• You should ensure that all the values you've entered in your dataset are in the same units. If you're entering SI units for measurements, you can't switch to the Imperial system in some places. Likewise, if you've entered time in seconds in one place, you should use that format across the whole dataset.
• The same applies to dates. Make sure to use the same date format for all your entries: if you are using the DD/MM/YYYY format, stick to it and do not switch to MM/DD/YYYY for some entries, as this will contaminate the data and create problems.
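• A small sketch of standardising a date column in pandas (assuming the stray entries were written as MM/DD/YYYY):

import pandas as pd

# A column entered as MM/DD/YYYY that must be rewritten as DD/MM/YYYY to match the rest of the dataset.
us_dates = pd.Series(["12/25/2023", "01/07/2024"], name="order_date")

parsed = pd.to_datetime(us_dates, format="%m/%d/%Y")   # parse using the format the data was entered in
print(parsed.dt.strftime("%d/%m/%Y"))                  # re-emit in the dataset's standard DD/MM/YYYY format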
Cont..
• Checking the uniformity of your records is
quite easy. A simple inspection can reveal
whether a particular value is in the required
unit or not. The units you use for entering
your data depend on your specific
requirements. Checking for uniformity across
datasets is one of the most important factors
of data cleaning in data analysis.
Heterogeneous data
• Heterogeneous data is any data with a high variability of data types and formats. It is often ambiguous and of low quality due to missing values, high redundancy, and untruthfulness, and it is difficult to integrate heterogeneous data to meet business information demands.
Example
• Heterogeneous data structures are data structures that contain diverse types of data, such as integers, doubles, and floats. Linked lists and ordered lists are good examples of these data structures.
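• A quick way to spot heterogeneous values in pandas (the readings below are invented):

import pandas as pd

# A column read from a messy source: integers, floats and strings mixed together.
values = pd.Series([10, 12.5, "15", 20], name="reading")

print(values.dtype)                     # 'object' dtype is a common sign of heterogeneous data
print(values.map(type).value_counts())  # how many entries of each Python type the column holds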
Missing Data
• Missing data, or missing values, occur when
you don't have data stored for certain
variables or participants. Data can go missing
due to incomplete data entry, equipment
malfunctions, lost files, and many other
reasons. In any dataset, there are usually
some missing data.
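• A minimal sketch of detecting and handling missing values in pandas (the sensor table is made up):

import pandas as pd
import numpy as np

# A small table with gaps caused by incomplete data entry.
readings = pd.DataFrame({
    "sensor": ["A", "B", "C", "D"],
    "value":  [12.1, np.nan, 11.8, np.nan],
})

print(readings["value"].isna().sum())                           # count the missing values
filled  = readings.fillna({"value": readings["value"].mean()})  # one option: fill with the mean
dropped = readings.dropna(subset=["value"])                     # another option: drop incomplete rows
print(filled)
print(dropped)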
Data Transformation
• Data transformation is the process of
converting and structuring data into
a usable format that can be analyzed
to support decision making
processes, and to propel the growth
of an organization. Data
transformation is used when data
needs to be converted to match that
of the destination system.
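• A brief sketch of a simple transformation in pandas (the sales records and target shape are assumptions):

import pandas as pd

# Raw sales records transformed into the shape a reporting system expects.
sales = pd.DataFrame({
    "customer": ["ana", "BEN", "ana"],
    "amount":   ["1,200", "350", "2,050"],
})

sales["customer"] = sales["customer"].str.title()                       # standardise text casing
sales["amount"]   = sales["amount"].str.replace(",", "").astype(float)  # convert strings to numbers
print(sales)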
Data Segmentation
• Data segmentation is the process of taking the data you hold, dividing it up, and grouping similar data together based on chosen parameters so that you can use it more efficiently within marketing and operations.
Example
• A company might segment customers into groups based on age, gender, customer loyalty, geographic location, or the products and services customers use most.
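• A small sketch of such segmentation in pandas (the customer data and age bands are invented for illustration):

import pandas as pd

# Segment customers by age band and region, then summarise spend per segment.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age":         [22, 37, 64, 29],
    "region":      ["North", "South", "North", "South"],
    "spend":       [120.0, 340.0, 560.0, 90.0],
})

customers["age_band"] = pd.cut(customers["age"], bins=[0, 30, 60, 120],
                               labels=["under 30", "30-60", "over 60"])
segments = customers.groupby(["region", "age_band"], observed=True)["spend"].sum()
print(segments)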
