
Why data cleaning is critical

Clean data is incredibly important for effective analysis. If a piece of data is entered
into a spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank,
or if data formats are inconsistent, the result is dirty data. Small mistakes can lead to
big consequences in the long run.

Dirty data is incomplete, incorrect, or irrelevant to the problem you're trying to solve.
It can't be used in a meaningful way, which makes analysis very difficult, if not
impossible.

Clean data is complete, correct, and relevant to the problem you're trying to solve.
This allows you to understand and analyze information and identify important
patterns, connect related information, and draw useful conclusions. Then you can
apply what you learn to make effective decisions.

In some cases, you won't have to do a lot of work to clean data. For example, when
you use internal data that's been verified and cared for by your company's data
engineers and data warehouse team, it's more likely to be clean.

Data engineers transform data into a useful format for analysis and give it a reliable
infrastructure. This means they develop, maintain, and test databases, data
processors and related systems.

Data warehousing specialists develop processes and procedures to effectively store and organize data. They make sure that data is available, secure, and backed up to prevent loss.

When you become a data analyst, you can learn a lot by working with the person who maintains your databases to understand their systems. If data passes through the hands of a data engineer or a data warehousing specialist first, you know you're off to a good start on your project.

There are a lot of great career opportunities as a data engineer or a data warehousing specialist. If this kind of work sounds interesting to you, maybe your career path will involve helping organizations save lots of time, effort, and money by making sure their data is sparkling clean. But even if you go in a different direction with your data analytics career and have the advantage of working with data engineers and warehousing specialists, you're still likely to have to clean your own data.

It's important to remember: no dataset is perfect. It's always a good idea to examine
and clean data before beginning analysis. Here's an example. Let's say you're working
on a project where you need to figure out how many people use your company's
software program. You have a spreadsheet that was created internally and verified by
a data engineer and a data warehousing specialist. Check out the column labeled
"Username." It might seem logical that you can just scroll down and count the rows
to figure out how many users you have.

But that won't work because one person sometimes has more than one username.

Maybe they registered from different email addresses, or maybe they have a work
and personal account. In situations like this, you would need to clean the data by
eliminating any rows that are duplicates.
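The deduplication step described above can be sketched in a few lines of pandas. This is a minimal illustration only; the usernames and emails are made up, not taken from the spreadsheet in the example:

```python
import pandas as pd

# Hypothetical user records: "ana" and "ben" each appear more than once.
users = pd.DataFrame({
    "Username": ["ana", "ben", "ana", "cho", "ben"],
    "Email": ["ana@work.com", "ben@mail.com", "ana@home.com",
              "cho@mail.com", "ben@mail.com"],
})

# Simply counting rows overstates the user total because of duplicates.
total_rows = len(users)  # 5 rows, but not 5 users

# Dropping duplicate usernames (keeping the first occurrence of each)
# gives the true user count.
unique_users = users.drop_duplicates(subset="Username")
user_count = len(unique_users)  # 3

print(total_rows, user_count)
```

Here `drop_duplicates(subset="Username")` treats two rows as duplicates whenever the username matches, even if other fields differ; in real cleaning work you would first decide which columns define a "duplicate."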

Once you've done that, there won't be any more duplicate entries. Then your
spreadsheet is ready to be put to work. So far we've discussed working with internal
data. But data cleaning becomes even more important when working with external
data, especially if it comes from multiple sources. Let's say the software company
from our example surveyed its customers to learn how satisfied they are with its
software product. But when you review the survey data, you find that you have
several nulls.

A null is an indication that a value does not exist in a data set. Note that it's not the
same as a zero. In the case of a survey, a null would mean the customers skipped that
question. A zero would mean they provided zero as their response. To do your
analysis, you would first need to clean this data. Step one would be to decide what to
do with those nulls. You could either filter them out and communicate that you now have a smaller sample size, or keep them in and learn from the fact that the customers did not provide responses. There are lots of reasons why this could have happened. Maybe your survey questions weren't written as well as they could be.
Maybe they were confusing or biased, something we learned about earlier. We've
touched on the basics of cleaning internal and external data, but there's lots more to
come.
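The two options for handling nulls can be sketched in pandas. The survey column and responses below are hypothetical, invented for illustration:

```python
import pandas as pd

# Hypothetical survey responses. None marks a skipped question (a null),
# while 0 is an actual answer of zero -- the two are not the same.
survey = pd.DataFrame({
    "Customer": ["A", "B", "C", "D", "E"],
    "Satisfaction": [4, None, 5, 0, None],
})

null_count = survey["Satisfaction"].isna().sum()  # 2 skipped questions

# Option 1: filter out the nulls and report the smaller sample size.
answered = survey.dropna(subset=["Satisfaction"])
print(len(answered))  # 3 responses remain

# Option 2: keep the nulls and study the non-response rate instead.
non_response_rate = null_count / len(survey)
print(non_response_rate)  # 0.4
```

Note that customer D's response of 0 survives the `dropna` call: a zero is a real value, so only the true nulls are removed.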

What is dirty data?

Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant
to the problem you are trying to solve. This reading summarizes:

 Types of dirty data you may encounter
 What may have caused the data to become dirty
 How dirty data is harmful to businesses

Types of dirty data

Duplicate data

Description: Any data record that shows up more than once.
Possible causes: Manual data entry, batch data imports, or data migration.
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval.
Outdated data

Description: Any old data that should be replaced with newer, more accurate information.
Possible causes: People changing roles or companies, or software and systems becoming obsolete.
Potential harm to businesses: Inaccurate insights, decision-making, and analytics.
Incomplete data

Description: Any data that is missing important fields.
Possible causes: Improper data collection or incorrect data entry.
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services.
Incorrect/inaccurate data

Description: Any data that is complete but inaccurate.
Possible causes: Human error during data input, fake information, or mock data.
Potential harm to businesses: Inaccurate insights or decision-making based on bad information, resulting in revenue loss.
Inconsistent data

Description: Any data that uses different formats to represent the same thing.
Possible causes: Data stored incorrectly or errors inserted during data transfer.
Potential harm to businesses: Contradictory data points leading to confusion, or inability to classify or segment customers.
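A common form of inconsistency is the same value recorded in several formats. Here is a minimal, hypothetical sketch of standardizing such values in pandas; the state variants and the mapping are invented for illustration:

```python
import pandas as pd

# Hypothetical orders where the same state appears in three formats.
orders = pd.DataFrame({"state": ["CA", "California", "calif.", "NY"]})

# Map every known variant (lowercased) to one canonical spelling.
canonical = {"ca": "CA", "california": "CA", "calif.": "CA", "ny": "NY"}
orders["state"] = orders["state"].str.lower().map(canonical)

print(orders["state"].tolist())  # ['CA', 'CA', 'CA', 'NY']
```

With one consistent representation, the rows can now be grouped, classified, or segmented without contradictory counts.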
Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data”
into your preferred browser’s search bar to bring up numerous articles on the topic.
Here are a few impacts cited for certain industries from a previous search:

 Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
 Digital commerce: Up to 25% of B2B database contacts contain inaccuracies (source).
 Marketing and sales: 99% of companies are actively tackling data quality in some way (source).
 Healthcare: Duplicate records can make up 10%, and even up to 20%, of a hospital's electronic health records (source).

Key takeaways

Dirty data includes duplicate data, outdated data, incomplete data, incorrect or
inaccurate data, and inconsistent data. Each type of dirty data can have a significant
impact on analyses, leading to inaccurate insights, poor decision-making, and
revenue loss. There are a number of causes of dirty data, including manual data entry
errors, batch data imports, data migration, software obsolescence, improper data
collection, and human errors during data input. As a data professional, you can take
steps to mitigate the impact of dirty data by implementing effective data quality
processes.